# Tech News Article Scraper
A flexible and powerful Apify Actor for scraping articles from tech news websites. This scraper can work with any tech news site - either from predefined presets or custom URLs provided by you.
## Features
- Universal Scraping: Generic scraper works with most tech news sites
- Preset Sites: Pre-configured settings for popular tech news sources
- Custom URLs: Scrape any tech news site by providing URLs
- Smart Extraction: Automatically detects article content, titles, authors, dates, images, and tags
- Advanced Filtering: Filter by keywords, title text, or description content
- Flexible Configuration: Control article count, pagination, and content inclusion
- Error Handling: Robust retry logic and graceful error handling
- Rate Limiting: Built-in delays to respect website resources
## Supported Preset Sites
The following sites are pre-configured for easy scraping:
- The Verge - Tech news and media
- CNET - Tech product reviews and news
- Wired - Technology and culture
- TechCrunch - Startup and technology news
- Ars Technica - Technology news and analysis
- Engadget - Consumer electronics and gadgets
- The Guardian Tech - Technology news from The Guardian
- The Next Web - International technology news
## How It Works

This actor uses a generic scraper that adapts to different news site structures:

1. **Detects Article Links**: Uses multiple strategies to find article URLs on listing pages:
   - Looks for `<article>` tags
   - Searches headings (h1, h2, h3) for links
   - Finds elements with article/post/story classes
   - Identifies URLs with date patterns (/2024/, /2025/)
2. **Extracts Content**: Uses fallback strategies for each field:
   - Title: h1 tag → og:title → twitter:title → title tag
   - Author: rel="author" → author classes → itemprop="author" → meta tags
   - Date: time[datetime] → datePublished → meta tags
   - Content: article tag → articleBody → content classes
   - Summary: og:description → meta description → intro paragraph
   - Image: og:image → twitter:image → first article image
   - Tags: rel="tag" → category links → tag classes
3. **Handles Edge Cases**: Normalizes URLs, filters non-articles, removes duplicates
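The title fallback chain above can be sketched as follows. This is a simplified illustration using only the standard library, not the actor's actual code, which may use a full HTML parser:

```python
import re

def extract_title(html):
    """Try each title source in priority order:
    h1 -> og:title -> twitter:title -> <title> tag."""
    patterns = [
        r"<h1[^>]*>(.*?)</h1>",
        r'<meta[^>]+property="og:title"[^>]+content="([^"]*)"',
        r'<meta[^>]+name="twitter:title"[^>]+content="([^"]*)"',
        r"<title[^>]*>(.*?)</title>",
    ]
    for pattern in patterns:
        match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
        if match and match.group(1).strip():
            return match.group(1).strip()
    return None  # no usable title found in any fallback

page = '<head><meta property="og:title" content="Hello"><title>Fallback</title></head>'
print(extract_title(page))  # og:title wins because there is no <h1>
```

The same pattern (an ordered list of candidate sources, first non-empty match wins) applies to the author, date, content, summary, image, and tag fields.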
## Usage Examples
### Example 1: Scrape Latest Articles from The Verge

```json
{
  "usePresets": true,
  "presetSources": ["verge"],
  "maxArticlesPerSource": 20,
  "maxPages": 1,
  "includeContent": true
}
```
### Example 2: Scrape Multiple Tech Sites (Light Mode)

Get summaries without full content to save time:

```json
{
  "usePresets": true,
  "presetSources": ["verge", "techcrunch", "wired", "arstechnica"],
  "maxArticlesPerSource": 10,
  "maxPages": 1,
  "includeContent": false
}
```
### Example 3: Scrape Custom Tech Blogs

```json
{
  "usePresets": false,
  "customUrls": [
    "https://www.theverge.com",
    "https://news.ycombinator.com",
    "https://9to5mac.com"
  ],
  "maxArticlesPerSource": 15,
  "maxPages": 2,
  "includeContent": true
}
```
### Example 4: Deep Scrape a Single Site

Get many articles from one source:

```json
{
  "usePresets": false,
  "customUrls": ["https://techcrunch.com"],
  "maxArticlesPerSource": 50,
  "maxPages": 5,
  "includeContent": true
}
```
### Example 5: Search for Specific Topics (AI Articles)

Filter articles by keywords in title or description:

```json
{
  "usePresets": true,
  "presetSources": ["verge", "techcrunch", "wired"],
  "maxArticlesPerSource": 20,
  "searchKeywords": ["AI", "artificial intelligence", "ChatGPT", "GPT-4"]
}
```
### Example 6: Filter by Title

Only scrape articles with "iPhone" in the title:

```json
{
  "usePresets": true,
  "presetSources": ["verge", "cnet"],
  "maxArticlesPerSource": 15,
  "titleContains": "iPhone"
}
```
### Example 7: Filter by Description

Only scrape articles about a specific topic in the description:

```json
{
  "usePresets": false,
  "customUrls": ["https://techcrunch.com"],
  "maxArticlesPerSource": 25,
  "descriptionContains": "startup funding"
}
```
### Example 8: Combine Multiple Filters

Search for AI articles with "OpenAI" in the title:

```json
{
  "usePresets": true,
  "presetSources": ["verge", "techcrunch", "arstechnica"],
  "maxArticlesPerSource": 30,
  "searchKeywords": ["AI", "ChatGPT"],
  "titleContains": "OpenAI"
}
```
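Any of these inputs can also be submitted programmatically. The sketch below builds the Example 1 input in Python; the commented-out section shows how a run could be started with the `apify-client` package (the actor ID is a placeholder, and a valid API token is assumed):

```python
import json

# Input matching Example 1 above.
run_input = {
    "usePresets": True,
    "presetSources": ["verge"],
    "maxArticlesPerSource": 20,
    "maxPages": 1,
    "includeContent": True,
}

print(json.dumps(run_input, indent=2))

# With the apify-client package installed, the run could be started like this
# (placeholder actor ID; requires a valid API token):
#
#   from apify_client import ApifyClient
#   client = ApifyClient("<YOUR_API_TOKEN>")
#   run = client.actor("<username>/article-scrapper").call(run_input=run_input)
#   for item in client.dataset(run["defaultDatasetId"]).iterate_items():
#       print(item["title"])
```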
Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
usePresets | boolean | true | Use predefined sites or custom URLs |
presetSources | array | ["verge", "techcrunch", "wired"] | List of preset site keys |
customUrls | array | [] | Custom URLs to scrape (when usePresets is false) |
maxArticlesPerSource | integer | 10 | Max articles per source (1-100) |
maxPages | integer | 1 | Max listing pages to check (1-10) |
includeContent | boolean | true | Extract full article text |
searchKeywords | array | [] | Filter articles by keywords (matches ANY keyword) |
titleContains | string | "" | Only scrape articles with this text in title |
descriptionContains | string | "" | Only scrape articles with this text in description |
## Filtering Behavior

- `searchKeywords`: Articles matching ANY of the keywords in the title OR description are included
- `titleContains`: Only articles whose title contains this text (case-insensitive substring match)
- `descriptionContains`: Only articles whose summary/description contains this text (case-insensitive substring match)
- Combined filters: All specified filters must match (AND logic)
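The AND/OR semantics above can be sketched as follows. This is a simplified illustration of the filtering rules, not the actor's actual code:

```python
def matches_filters(article, search_keywords=(), title_contains="", description_contains=""):
    """Return True if the article passes all configured filters (AND logic).

    search_keywords itself uses OR logic: ANY keyword may match the
    title or summary. All checks are case-insensitive substring matches.
    """
    title = article.get("title", "").lower()
    summary = article.get("summary", "").lower()

    if search_keywords and not any(
        kw.lower() in title or kw.lower() in summary for kw in search_keywords
    ):
        return False  # no keyword matched either field
    if title_contains and title_contains.lower() not in title:
        return False
    if description_contains and description_contains.lower() not in summary:
        return False
    return True  # every configured filter matched

article = {"title": "OpenAI ships a new ChatGPT model", "summary": "AI news"}
print(matches_filters(article, search_keywords=["AI"], title_contains="OpenAI"))
```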
## Output Format

Articles are saved to the dataset in the following format:

```json
{
  "title": "Article Title",
  "url": "https://example.com/article",
  "author": "Author Name",
  "published_date": "2025-11-08T12:00:00+00:00",
  "content": "Full article text...",
  "summary": "Brief summary or excerpt",
  "source": "Site Name",
  "tags": ["tag1", "tag2"],
  "image_url": "https://example.com/image.jpg",
  "scraped_at": "2025-11-08T13:45:00+00:00"
}
```
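Since the timestamps are ISO 8601 strings, an exported record can be consumed with the Python standard library alone; a minimal sketch:

```python
import json
from datetime import datetime

# A trimmed-down dataset record in the output format shown above.
record = json.loads("""
{"title": "Article Title",
 "url": "https://example.com/article",
 "published_date": "2025-11-08T12:00:00+00:00",
 "tags": ["tag1", "tag2"]}
""")

# ISO 8601 timestamps parse directly with datetime.fromisoformat.
published = datetime.fromisoformat(record["published_date"])
print(published.year, record["tags"])
```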
## Viewing Results
After a successful run:
- Dataset Tab: View all scraped articles in a table
- Export Options: Download as JSON, CSV, XML, or Excel
- API Access: Access results programmatically via Apify API
- Schedule: Set up periodic scraping (hourly, daily, weekly)
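For the API access route, Apify exposes dataset items over a plain HTTP endpoint. A minimal sketch using only the standard library (the dataset ID is a placeholder, and the endpoint shape is stated here as my understanding of Apify's v2 API rather than verified against this actor):

```python
from urllib.parse import urlencode

def dataset_items_url(dataset_id, fmt="json", limit=None):
    """Build the Apify API URL for downloading dataset items."""
    params = {"format": fmt}
    if limit is not None:
        params["limit"] = limit
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{urlencode(params)}"

# Fetching would then be a plain GET, e.g. urllib.request.urlopen(url).
print(dataset_items_url("<DATASET_ID>", fmt="csv", limit=100))
```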
## Best Practices

- Start Small: Test with a few articles before scaling up
- Respect robots.txt: The scraper respects robots.txt automatically
- Rate Limiting: Built-in delays prevent overwhelming servers
- Monitor Logs: Check logs for warnings about missing content
- Adjust Parameters: If scraping fails, try reducing `maxArticlesPerSource`
## Limitations
- Some sites may use JavaScript-heavy rendering (not supported by this scraper)
- Paywalled content cannot be extracted
- Sites with anti-scraping measures may block requests
- Content structure varies; generic extraction may miss some fields
## Troubleshooting

### No articles found

- Check if the URL is correct and accessible in a browser
- Verify the site allows scraping (check robots.txt)
- Try increasing the `maxPages` parameter
- Some sites require JavaScript; this scraper uses static HTML only
### Missing content fields
- Some sites have unique structures that the generic scraper might miss
- This is normal - not all sites have all fields
- The scraper uses multiple fallback strategies, but some data may be unavailable
### Connection timeouts
- Check your internet connection
- Some sites may be blocking automated requests
- Try reducing `maxArticlesPerSource` to avoid overwhelming the target site
### Rate limiting errors

- Reduce `maxArticlesPerSource` (try 5-10 instead of 50+)
- Scrape fewer sources per run
- Wait between runs
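When rate limiting persists, spacing retries with exponential backoff usually helps. A generic sketch of the idea (not a feature of this actor's configuration):

```python
def backoff_delays(base=1.0, factor=2.0, retries=4, cap=30.0):
    """Yield increasing wait times: base, base*factor, ... capped at `cap` seconds."""
    delay = base
    for _ in range(retries):
        yield min(delay, cap)
        delay *= factor

for attempt, delay in enumerate(backoff_delays(), start=1):
    print(f"attempt {attempt}: wait {delay:.0f}s")
    # In real use: import time; time.sleep(delay) before retrying the request.
```

Doubling the wait after each failure gives an overloaded or rate-limiting site progressively more breathing room, while the cap keeps a single retry from stalling the run.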
## FAQ
**Q: Can I scrape paywalled content?**
A: No, this scraper only accesses publicly available content. It respects the same limitations as a regular web browser.

**Q: How fast is the scraper?**
A: Approximately 1-2 articles per second, depending on site response time and content size. Built-in delays ensure polite scraping.

**Q: Can I scrape sites not in the preset list?**
A: Yes! Use the `customUrls` option to scrape any tech news site.

**Q: Will this work with JavaScript-heavy sites?**
A: This scraper uses static HTML parsing. For JavaScript-heavy sites (React, Vue, etc.), you may need a browser-based solution.

**Q: How do I schedule automated scraping?**
A: Use Apify's scheduler feature to run hourly, daily, or weekly.

**Q: Can I export data to a database?**
A: Yes, the scraper outputs JSON, which can be easily imported into databases. On Apify, you can use integrations to automatically push data to various services.
**Q: Is this legal?**
A: Web scraping legality depends on the website's terms of service and your jurisdiction. Always:

- Check the site's `robots.txt`
- Review their Terms of Service
- Don't overload their servers
- Use data responsibly
This tool is for educational and legitimate use cases only.
## License
This project is provided as-is for educational and legitimate scraping purposes. Always respect website terms of service and robots.txt files.
