# Tech News Article Scraper
A powerful Apify Actor for scraping articles from 14 popular tech news sources or any custom URL. Extracts full article content, metadata, sentiment, word count, and reading time. Supports RSS feeds for fast and reliable scraping.
## Features
- 14 Preset Sources — The Verge, TechCrunch, Wired, Hacker News, BBC Tech, MIT Technology Review, VentureBeat, Dev.to, Product Hunt, and more
- RSS Feed Support — Faster and more reliable than HTML scraping; automatically used for all preset sources
- Custom URLs — Scrape any news site by providing your own URLs
- Smart Extraction — Automatically detects titles, authors, dates, content, images, and tags
- Sentiment Analysis — Labels each article as positive, neutral, or negative with a confidence score
- Date Range Filtering — Filter by date range or relative window (e.g. last 7 days)
- Flexible Filters — Filter by keywords, title text, or description with AND/OR logic
- Word Count & Reading Time — Automatically calculated for every article
- Robust Error Handling — Retry logic, graceful fallbacks, and detailed logging
## Preset Sources

| Key | Source | Focus |
|---|---|---|
| `verge` | The Verge | Tech news and media |
| `techcrunch` | TechCrunch | Startups and technology |
| `wired` | Wired | Technology and culture |
| `cnet` | CNET | Product reviews and news |
| `arstechnica` | Ars Technica | In-depth tech analysis |
| `engadget` | Engadget | Consumer electronics |
| `theguardian-tech` | The Guardian Tech | Tech news from The Guardian |
| `thenextweb` | The Next Web | International tech news |
| `hackernews` | Hacker News | Developer and startup community |
| `bbc-tech` | BBC Technology | Global tech news |
| `mit-tech-review` | MIT Technology Review | Deep tech and AI research |
| `venturebeat` | VentureBeat | AI, startups, and enterprise tech |
| `devto` | Dev.to | Developer articles and tutorials |
| `producthunt` | Product Hunt | Daily startup and product launches |
## Input Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `usePresets` | boolean | `true` | Use preset sources instead of custom URLs |
| `presetSources` | array | `["verge", "techcrunch", "wired"]` | List of preset source keys |
| `customUrls` | array | `[]` | Custom URLs to scrape (when `usePresets` is `false`) |
| `maxArticlesPerSource` | integer | `10` | Max articles per source (1–100) |
| `maxPages` | integer | `1` | Max listing pages to check per source (1–10) |
| `includeContent` | boolean | `true` | Extract full article text |
| `useRssFeeds` | boolean | `true` | Use RSS feeds for preset sources (faster, more reliable) |
| `includeSentiment` | boolean | `true` | Add sentiment analysis to each article |
| `sentimentFilter` | string | `"all"` | Only return articles with this sentiment: `all`, `positive`, `neutral`, `negative` |
| `filterMode` | string | `"any"` | `any`: article passes if it matches at least one filter (OR). `all`: article must pass every filter (AND) |
| `searchKeywords` | array | `[]` | Filter articles containing any of these keywords in the title or summary |
| `titleContains` | string | `""` | Only include articles with this text in the title |
| `descriptionContains` | string | `""` | Only include articles with this text in the summary or content |
| `dateFrom` | string | `""` | Only include articles published on or after this date (YYYY-MM-DD) |
| `dateTo` | string | `""` | Only include articles published on or before this date (YYYY-MM-DD) |
| `publishedWithin` | string | `""` | Only include articles published within this window, e.g. `24h`, `7d`, `30d`, `2w` |
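For reference, a relative window such as `24h`, `7d`, or `2w` can be converted into a cutoff timestamp roughly as below. This is an illustrative sketch, not the actor's actual code; the `window_to_cutoff` helper and its unit table (`h` = hours, `d` = days, `w` = weeks) are assumptions based on the documented examples.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed unit suffixes: h(ours), d(ays), w(eeks).
_UNITS = {"h": "hours", "d": "days", "w": "weeks"}

def window_to_cutoff(window, now=None):
    """Turn a window like '7d' into the earliest allowed publish time."""
    match = re.fullmatch(r"(\d+)([hdw])", window.strip().lower())
    if not match:
        raise ValueError(f"Unsupported window: {window!r}")
    amount, unit = int(match.group(1)), match.group(2)
    now = now or datetime.now(timezone.utc)
    return now - timedelta(**{_UNITS[unit]: amount})
```

An article then passes the filter when its `published_date` is at or after the returned cutoff.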
## Output Format

Each article is saved with the following fields:

```json
{
  "title": "OpenAI Releases GPT-5",
  "url": "https://techcrunch.com/2025/01/15/openai-gpt5",
  "author": "Jane Doe",
  "published_date": "2025-01-15T10:30:00+00:00",
  "content": "Full article text...",
  "summary": "OpenAI today announced GPT-5, its most capable model yet...",
  "source": "TechCrunch",
  "tags": ["AI", "OpenAI", "GPT"],
  "image_url": "https://techcrunch.com/images/gpt5.jpg",
  "word_count": 847,
  "reading_time_minutes": 4,
  "sentiment": { "label": "positive", "score": 0.6249 },
  "scraped_at": "2025-01-15T11:00:00+00:00"
}
```

`sentiment` is only present when `includeSentiment` is `true`.
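The derived fields can be understood with a small sketch. The ~200-words-per-minute reading rate below is an assumption for illustration (it is consistent with the sample above, where 847 words yields 4 minutes), not necessarily what the actor uses.

```python
# Illustrative derivation of word_count and reading_time_minutes.
WORDS_PER_MINUTE = 200  # assumed average reading speed

def word_count(content):
    # Whitespace-separated token count.
    return len(content.split())

def reading_time_minutes(content):
    # Round to the nearest minute, never reporting less than 1.
    return max(1, round(word_count(content) / WORDS_PER_MINUTE))
```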
## Usage Examples
### 1. Morning AI News Briefing

Get the latest AI articles from the last 24 hours across major sources:

```json
{
  "usePresets": true,
  "presetSources": ["techcrunch", "verge", "wired", "mit-tech-review"],
  "maxArticlesPerSource": 20,
  "publishedWithin": "24h",
  "searchKeywords": ["AI", "artificial intelligence", "ChatGPT", "LLM"],
  "filterMode": "any",
  "useRssFeeds": true,
  "includeSentiment": true
}
```
### 2. Brand Monitoring — Negative Press Only

Track negative coverage about a topic:

```json
{
  "usePresets": true,
  "presetSources": ["techcrunch", "verge", "arstechnica", "bbc-tech"],
  "maxArticlesPerSource": 30,
  "searchKeywords": ["Apple", "iPhone"],
  "filterMode": "any",
  "includeSentiment": true,
  "sentimentFilter": "negative",
  "publishedWithin": "7d"
}
```
### 3. Startup Launch Tracker

Track new product launches and funding news:

```json
{
  "usePresets": true,
  "presetSources": ["producthunt", "techcrunch", "venturebeat"],
  "maxArticlesPerSource": 20,
  "searchKeywords": ["launch", "funding", "raises", "Series A"],
  "filterMode": "any",
  "dateFrom": "2025-01-01"
}
```
### 4. Developer Community Digest

Pull community articles from Dev.to and Hacker News:

```json
{
  "usePresets": true,
  "presetSources": ["devto", "hackernews"],
  "maxArticlesPerSource": 50,
  "includeContent": true,
  "useRssFeeds": true
}
```
### 5. Custom Site Scraping

Scrape any site not in the preset list:

```json
{
  "usePresets": false,
  "customUrls": ["https://9to5mac.com", "https://9to5google.com"],
  "maxArticlesPerSource": 15,
  "maxPages": 2,
  "includeContent": true
}
```
### 6. Long-form Articles Only

Use word count to filter out short posts and stubs:

```json
{
  "usePresets": true,
  "presetSources": ["wired", "arstechnica", "mit-tech-review"],
  "maxArticlesPerSource": 20,
  "includeContent": true
}
```

Tip: after export, filter `word_count > 800` in your spreadsheet to keep long-form articles only.
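The same post-export filter can be done in a few lines of Python. A minimal sketch; the `long_form` helper and the `articles.json` path are illustrative:

```python
import json

def long_form(articles, min_words=800):
    """Keep only articles whose word_count exceeds min_words."""
    return [a for a in articles if a.get("word_count", 0) > min_words]

# Usage, assuming you exported the dataset as JSON:
# with open("articles.json") as f:
#     articles = json.load(f)
# print(len(long_form(articles)))
```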
## Filtering Guide

Three filter types are available — `filterMode` controls how they combine:

| filterMode | Behaviour |
|---|---|
| `any` (recommended) | Article passes if it matches at least one active filter |
| `all` | Article must pass every active filter simultaneously |
Example with `filterMode: "any"`:
- If `searchKeywords: ["AI"]` matches, the article is included regardless of the other filters

Example with `filterMode: "all"`:
- The article must match `searchKeywords` AND `titleContains` AND `descriptionContains` — very strict
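The AND/OR combination described above can be sketched in a few lines. Illustrative only — `passes` and `keyword_filter` are hypothetical helpers, not the actor's code; only active (non-empty) filters would participate:

```python
def passes(article, filters, mode="any"):
    """Combine active filter predicates with OR ('any') or AND ('all')."""
    if not filters:
        return True  # no active filters: everything passes
    combine = any if mode == "any" else all
    return combine(f(article) for f in filters)

def keyword_filter(keywords):
    """Predicate: any keyword appears in the title or summary."""
    kws = [k.lower() for k in keywords]
    def check(article):
        text = (article.get("title", "") + " " + article.get("summary", "")).lower()
        return any(k in text for k in kws)
    return check
```

With `mode="any"`, one matching predicate is enough; with `mode="all"`, every predicate must match.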
## Viewing Results
- Dataset Tab — View all articles in a table after the run completes
- Export — Download as JSON, CSV, XML, or Excel
- API — Access results programmatically via the Apify API
- Schedule — Set up automated daily or weekly runs via Apify Scheduler
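For the API route, results can be fetched from Apify's dataset-items endpoint. A minimal sketch — `DATASET_ID` and the token are placeholders you supply from your own run:

```python
def dataset_items_url(dataset_id, fmt="json"):
    """Build the Apify v2 URL for downloading a run's dataset items."""
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?format={fmt}"

# Usage (requires the `requests` package and your Apify API token):
# import requests
# items = requests.get(dataset_items_url("DATASET_ID"),
#                      params={"token": "YOUR_APIFY_TOKEN"}).json()
```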
## Troubleshooting

### Getting 0 articles with multiple filters set
- Switch `filterMode` to `"any"` — AND logic (`"all"`) is very strict
- Increase `maxArticlesPerSource` so more articles are available to filter
- Check the logs — each rejected article shows which filter removed it
### Missing content or short word counts
- RSS feeds provide summaries, not full articles — set `useRssFeeds: false` for full HTML scraping
- Some sites have unique page structures the generic scraper may miss
### Connection timeouts or blocked requests
- Reduce `maxArticlesPerSource` to avoid rate limiting
- Some sites block automated requests — try a different source
### JavaScript-heavy sites not working
- This scraper uses static HTML parsing — JS-rendered pages are not supported
- Use RSS mode (`useRssFeeds: true`) for preset sources to avoid this entirely
## FAQ

**Can I scrape paywalled content?**
No — only publicly accessible content is scraped.

**How fast is it?**
RSS mode: ~10 articles per second. HTML mode: ~1–2 articles per second due to polite rate limiting.

**Can I add my own sources permanently?**
Use `customUrls` for one-off scraping, or open a GitHub issue to request a new preset source.

**Is this legal?**
Web scraping legality depends on the site's terms of service and your jurisdiction. Always check `robots.txt`, avoid overloading servers, and use data responsibly. This tool is for legitimate use cases only.
## License
Provided as-is for educational and legitimate scraping purposes. Always respect website terms of service and robots.txt files.