# Tech News Article Scraper
A flexible and powerful Apify Actor for scraping articles from tech news websites. This scraper can work with any tech news site - either from predefined presets or custom URLs provided by you.
## Features
- Universal Scraping: Generic scraper works with most tech news sites
- Preset Sites: Pre-configured settings for popular tech news sources
- Custom URLs: Scrape any tech news site by providing URLs
- Smart Extraction: Automatically detects article content, titles, authors, dates, images, and tags
- Advanced Filtering: Filter by keywords, title text, or description content
- Flexible Configuration: Control article count, pagination, and content inclusion
- Error Handling: Robust retry logic and graceful error handling
- Rate Limiting: Built-in delays to respect website resources
## Supported Preset Sites
The following sites are pre-configured for easy scraping:
- The Verge - Tech news and media
- CNET - Tech product reviews and news
- Wired - Technology and culture
- TechCrunch - Startup and technology news
- Ars Technica - Technology news and analysis
- Engadget - Consumer electronics and gadgets
- The Guardian Tech - Technology news from The Guardian
- The Next Web - International technology news
## How It Works

This actor uses a generic scraper that adapts to different news site structures:

1. **Detects Article Links**: Uses multiple strategies to find article URLs on listing pages:
   - Looks for `<article>` tags
   - Searches headings (h1, h2, h3) for links
   - Finds elements with article/post/story classes
   - Identifies URLs with date patterns (/2024/, /2025/)
2. **Extracts Content**: Uses fallback strategies for each field:
   - Title: h1 tag → og:title → twitter:title → title tag
   - Author: rel="author" → author classes → itemprop="author" → meta tags
   - Date: time[datetime] → datePublished → meta tags
   - Content: article tag → articleBody → content classes
   - Summary: og:description → meta description → intro paragraph
   - Image: og:image → twitter:image → first article image
   - Tags: rel="tag" → category links → tag classes
3. **Handles Edge Cases**: Normalizes URLs, filters non-articles, removes duplicates
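The title fallback chain above can be sketched as follows. This is a simplified illustration using only the standard library, not the actor's actual code, which may use a full HTML parser:

```python
import re

def extract_title(html):
    """Try each title source in priority order:
    h1 -> og:title -> twitter:title -> <title> tag."""
    patterns = [
        r"<h1[^>]*>(.*?)</h1>",
        r'<meta[^>]+property="og:title"[^>]+content="([^"]*)"',
        r'<meta[^>]+name="twitter:title"[^>]+content="([^"]*)"',
        r"<title[^>]*>(.*?)</title>",
    ]
    for pattern in patterns:
        match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
        if match and match.group(1).strip():
            return match.group(1).strip()
    return None  # no usable title found in any fallback

page = '<head><meta property="og:title" content="Hello"><title>Fallback</title></head>'
print(extract_title(page))  # og:title wins because there is no <h1>
```

The same pattern (an ordered list of candidate sources, first non-empty match wins) applies to the author, date, content, summary, image, and tag fields.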
## Usage Examples
### Example 1: Scrape Latest Articles from The Verge

```json
{
  "usePresets": true,
  "presetSources": ["verge"],
  "maxArticlesPerSource": 20,
  "maxPages": 1,
  "includeContent": true
}
```
### Example 2: Scrape Multiple Tech Sites (Light Mode)

Get summaries without full content to save time:

```json
{
  "usePresets": true,
  "presetSources": ["verge", "techcrunch", "wired", "arstechnica"],
  "maxArticlesPerSource": 10,
  "maxPages": 1,
  "includeContent": false
}
```
### Example 3: Scrape Custom Tech Blogs

```json
{
  "usePresets": false,
  "customUrls": [
    "https://www.theverge.com",
    "https://news.ycombinator.com",
    "https://9to5mac.com"
  ],
  "maxArticlesPerSource": 15,
  "maxPages": 2,
  "includeContent": true
}
```
### Example 4: Deep Scrape a Single Site

Get many articles from one source:

```json
{
  "usePresets": false,
  "customUrls": ["https://techcrunch.com"],
  "maxArticlesPerSource": 50,
  "maxPages": 5,
  "includeContent": true
}
```
### Example 5: Search for Specific Topics (AI Articles)

Filter articles by keywords in title or description:

```json
{
  "usePresets": true,
  "presetSources": ["verge", "techcrunch", "wired"],
  "maxArticlesPerSource": 20,
  "searchKeywords": ["AI", "artificial intelligence", "ChatGPT", "GPT-4"]
}
```
### Example 6: Filter by Title

Only scrape articles with "iPhone" in the title:

```json
{
  "usePresets": true,
  "presetSources": ["verge", "cnet"],
  "maxArticlesPerSource": 15,
  "titleContains": "iPhone"
}
```
### Example 7: Filter by Description

Only scrape articles about a specific topic in the description:

```json
{
  "usePresets": false,
  "customUrls": ["https://techcrunch.com"],
  "maxArticlesPerSource": 25,
  "descriptionContains": "startup funding"
}
```
### Example 8: Combine Multiple Filters

Search for AI articles with "OpenAI" in the title:

```json
{
  "usePresets": true,
  "presetSources": ["verge", "techcrunch", "arstechnica"],
  "maxArticlesPerSource": 30,
  "searchKeywords": ["AI", "ChatGPT"],
  "titleContains": "OpenAI"
}
```
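Any of these inputs can also be submitted programmatically. The sketch below builds the Example 1 input in Python; the commented-out section shows how a run could be started with the `apify-client` package (the actor ID is a placeholder, and a valid API token is assumed):

```python
import json

# Input matching Example 1 above.
run_input = {
    "usePresets": True,
    "presetSources": ["verge"],
    "maxArticlesPerSource": 20,
    "maxPages": 1,
    "includeContent": True,
}

print(json.dumps(run_input, indent=2))

# With the apify-client package installed, the run could be started like this
# (placeholder actor ID; requires a valid API token):
#
#   from apify_client import ApifyClient
#   client = ApifyClient("<YOUR_API_TOKEN>")
#   run = client.actor("<username>/article-scrapper").call(run_input=run_input)
#   for item in client.dataset(run["defaultDatasetId"]).iterate_items():
#       print(item["title"])
```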
Input Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
usePresets | boolean | true | Use predefined sites or custom URLs |
presetSources | array | ["verge", "techcrunch", "wired"] | List of preset site keys |
customUrls | array | [] | Custom URLs to scrape (when usePresets is false) |
maxArticlesPerSource | integer | 10 | Max articles per source (1-100) |
maxPages | integer | 1 | Max listing pages to check (1-10) |
includeContent | boolean | true | Extract full article text |
searchKeywords | array | [] | Filter articles by keywords (matches ANY keyword) |
titleContains | string | "" | Only scrape articles with this text in title |
descriptionContains | string | "" | Only scrape articles with this text in description |
## Filtering Behavior

- `searchKeywords`: Articles matching ANY of the keywords in the title OR description are included
- `titleContains`: Only articles whose title contains this text (case-insensitive substring match)
- `descriptionContains`: Only articles whose summary/description contains this text (case-insensitive substring match)
- Combined filters: All specified filters must match (AND logic)
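The AND/OR semantics above can be sketched as follows. This is a simplified illustration of the filtering rules, not the actor's actual code:

```python
def matches_filters(article, search_keywords=(), title_contains="", description_contains=""):
    """Return True if the article passes all configured filters (AND logic).

    search_keywords itself uses OR logic: ANY keyword may match the
    title or summary. All checks are case-insensitive substring matches.
    """
    title = article.get("title", "").lower()
    summary = article.get("summary", "").lower()

    if search_keywords and not any(
        kw.lower() in title or kw.lower() in summary for kw in search_keywords
    ):
        return False  # no keyword matched either field
    if title_contains and title_contains.lower() not in title:
        return False
    if description_contains and description_contains.lower() not in summary:
        return False
    return True  # every configured filter matched

article = {"title": "OpenAI ships a new ChatGPT model", "summary": "AI news"}
print(matches_filters(article, search_keywords=["AI"], title_contains="OpenAI"))
```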
## Output Format

Articles are saved to the dataset in the following format:

```json
{
  "title": "Article Title",
  "url": "https://example.com/article",
  "author": "Author Name",
  "published_date": "2025-11-08T12:00:00+00:00",
  "content": "Full article text...",
  "summary": "Brief summary or excerpt",
  "source": "Site Name",
  "tags": ["tag1", "tag2"],
  "image_url": "https://example.com/image.jpg",
  "scraped_at": "2025-11-08T13:45:00+00:00"
}
```
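Since the timestamps are ISO 8601 strings, an exported record can be consumed with the Python standard library alone; a minimal sketch:

```python
import json
from datetime import datetime

# A trimmed-down dataset record in the output format shown above.
record = json.loads("""
{"title": "Article Title",
 "url": "https://example.com/article",
 "published_date": "2025-11-08T12:00:00+00:00",
 "tags": ["tag1", "tag2"]}
""")

# ISO 8601 timestamps parse directly with datetime.fromisoformat.
published = datetime.fromisoformat(record["published_date"])
print(published.year, record["tags"])
```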
## Viewing Results
After a successful run:
- Dataset Tab: View all scraped articles in a table
- Export Options: Download as JSON, CSV, XML, or Excel
- API Access: Access results programmatically via Apify API
- Schedule: Set up periodic scraping (hourly, daily, weekly)
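For the API access route, Apify exposes dataset items over a plain HTTP endpoint. A minimal sketch using only the standard library (the dataset ID is a placeholder, and the endpoint shape is stated here as my understanding of Apify's v2 API rather than verified against this actor):

```python
from urllib.parse import urlencode

def dataset_items_url(dataset_id, fmt="json", limit=None):
    """Build the Apify API URL for downloading dataset items."""
    params = {"format": fmt}
    if limit is not None:
        params["limit"] = limit
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{urlencode(params)}"

# Fetching would then be a plain GET, e.g. urllib.request.urlopen(url).
print(dataset_items_url("<DATASET_ID>", fmt="csv", limit=100))
```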
## Best Practices

- Start Small: Test with a few articles before scaling up
- Respect robots.txt: The scraper respects robots.txt automatically
- Rate Limiting: Built-in delays prevent overwhelming servers
- Monitor Logs: Check logs for warnings about missing content
- Adjust Parameters: If scraping fails, try reducing `maxArticlesPerSource`
## Limitations
- Some sites may use JavaScript-heavy rendering (not supported by this scraper)
- Paywalled content cannot be extracted
- Sites with anti-scraping measures may block requests
- Content structure varies; generic extraction may miss some fields
## Troubleshooting

### No articles found

- Check if the URL is correct and accessible in a browser
- Verify the site allows scraping (check robots.txt)
- Try increasing the `maxPages` parameter
- Some sites require JavaScript; this scraper uses static HTML only
### Missing content fields
- Some sites have unique structures that the generic scraper might miss
- This is normal - not all sites have all fields
- The scraper uses multiple fallback strategies, but some data may be unavailable
### Connection timeouts
- Check your internet connection
- Some sites may be blocking automated requests
- Try reducing `maxArticlesPerSource` to avoid overwhelming the target site
### Rate limiting errors

- Reduce `maxArticlesPerSource` (try 5-10 instead of 50+)
- Scrape fewer sources per run
- Wait between runs
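When rate limiting persists, spacing retries with exponential backoff usually helps. A generic sketch of the idea (not a feature of this actor's configuration):

```python
def backoff_delays(base=1.0, factor=2.0, retries=4, cap=30.0):
    """Yield increasing wait times: base, base*factor, ... capped at `cap` seconds."""
    delay = base
    for _ in range(retries):
        yield min(delay, cap)
        delay *= factor

for attempt, delay in enumerate(backoff_delays(), start=1):
    print(f"attempt {attempt}: wait {delay:.0f}s")
    # In real use: import time; time.sleep(delay) before retrying the request.
```

Doubling the wait after each failure gives an overloaded or rate-limiting site progressively more breathing room, while the cap keeps a single retry from stalling the run.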
## FAQ
**Q: Can I scrape paywalled content?**
A: No, this scraper only accesses publicly available content. It respects the same limitations as a regular web browser.

**Q: How fast is the scraper?**
A: Approximately 1-2 articles per second, depending on site response time and content size. Built-in delays ensure polite scraping.

**Q: Can I scrape sites not in the preset list?**
A: Yes! Use the `customUrls` option to scrape any tech news site.

**Q: Will this work with JavaScript-heavy sites?**
A: This scraper uses static HTML parsing. For JavaScript-heavy sites (React, Vue, etc.), you may need a browser-based solution.

**Q: How do I schedule automated scraping?**
A: Use Apify's scheduler feature to run hourly, daily, or weekly.

**Q: Can I export data to a database?**
A: Yes, the scraper outputs JSON, which can be easily imported into databases. On Apify, you can use integrations to automatically push data to various services.
**Q: Is this legal?**
A: Web scraping legality depends on the website's terms of service and your jurisdiction. Always:

- Check the site's `robots.txt`
- Review their Terms of Service
- Don't overload their servers
- Use data responsibly
This tool is for educational and legitimate use cases only.
## License
This project is provided as-is for educational and legitimate scraping purposes. Always respect website terms of service and robots.txt files.
