article-scrapper

A flexible and powerful Apify Actor for scraping articles from tech news websites. It works with any tech news site, using either predefined presets or custom URLs.

Pricing: $3.00 / 1,000 result_sets
Rating: 0.0 (0)
Developer: RK K (Maintained by Community)
Actor stats: 1 bookmarked · 25 total users · 1 monthly active user · last modified 4 days ago

Tech News Article Scraper

A powerful Apify Actor for scraping articles from 14 popular tech news sources or any custom URL. Extracts full article content, metadata, sentiment, word count, and reading time. Supports RSS feeds for fast and reliable scraping.

Features

  • 14 Preset Sources — The Verge, TechCrunch, Wired, Hacker News, BBC Tech, MIT Technology Review, VentureBeat, Dev.to, Product Hunt, and more
  • RSS Feed Support — Faster and more reliable than HTML scraping; automatically used for all preset sources
  • Custom URLs — Scrape any news site by providing your own URLs
  • Smart Extraction — Automatically detects titles, authors, dates, content, images, and tags
  • Sentiment Analysis — Labels each article as positive, neutral, or negative with a confidence score
  • Date Range Filtering — Filter by date range or relative window (e.g. last 7 days)
  • Flexible Filters — Filter by keywords, title text, or description with AND/OR logic
  • Word Count & Reading Time — Automatically calculated for every article
  • Robust Error Handling — Retry logic, graceful fallbacks, and detailed logging
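The sentiment labels described above could be approximated with a tiny lexicon-based scorer. This is a sketch only; the actor's actual sentiment model is not documented, and the word lists here are placeholders:

```python
# Naive lexicon-based sentiment scorer -- illustrative only; the actor's
# real model and word lists are not published.
POSITIVE = {"great", "powerful", "breakthrough", "improved", "success"}
NEGATIVE = {"lawsuit", "breach", "failure", "decline", "criticized"}

def score_sentiment(text: str) -> dict:
    """Return a label and confidence score in the output's sentiment shape."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    if total == 0:
        return {"label": "neutral", "score": 0.0}
    score = (pos - neg) / total
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {"label": label, "score": round(abs(score), 4)}
```

A real implementation would use a trained model rather than keyword counts, but the output shape matches the `sentiment` field shown in the Output Format section.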

Preset Sources

Key                Source                  Focus
verge              The Verge               Tech news and media
techcrunch         TechCrunch              Startups and technology
wired              Wired                   Technology and culture
cnet               CNET                    Product reviews and news
arstechnica        Ars Technica            In-depth tech analysis
engadget           Engadget                Consumer electronics
theguardian-tech   The Guardian Tech       Tech news from The Guardian
thenextweb         The Next Web            International tech news
hackernews         Hacker News             Developer and startup community
bbc-tech           BBC Technology          Global tech news
mit-tech-review    MIT Technology Review   Deep tech and AI research
venturebeat        VentureBeat             AI, startups, and enterprise tech
devto              Dev.to                  Developer articles and tutorials
producthunt        Product Hunt            Daily startup and product launches

Input Parameters

Parameter             Type     Default                            Description
usePresets            boolean  true                               Use preset sources or custom URLs
presetSources         array    ["verge", "techcrunch", "wired"]   List of preset source keys
customUrls            array    []                                 Custom URLs to scrape (when usePresets is false)
maxArticlesPerSource  integer  10                                 Max articles per source (1–100)
maxPages              integer  1                                  Max listing pages to check per source (1–10)
includeContent        boolean  true                               Extract full article text
useRssFeeds           boolean  true                               Use RSS feeds for preset sources (faster, more reliable)
includeSentiment      boolean  true                               Add sentiment analysis to each article
sentimentFilter       string   "all"                              Only return articles with this sentiment: all, positive, neutral, negative
filterMode            string   "any"                              any = article passes if it matches at least one filter (OR); all = article must pass every filter (AND)
searchKeywords        array    []                                 Filter articles containing any of these keywords in title or summary
titleContains         string   ""                                 Only include articles with this text in the title
descriptionContains   string   ""                                 Only include articles with this text in the summary or content
dateFrom              string   ""                                 Only include articles published on or after this date (YYYY-MM-DD)
dateTo                string   ""                                 Only include articles published on or before this date (YYYY-MM-DD)
publishedWithin       string   ""                                 Only include articles published within this window, e.g. 24h, 7d, 30d, 2w
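A relative window like `24h` or `2w` resolves to a cutoff timestamp. A sketch of that parsing, using only the unit letters shown in the examples above (the actor's real parser may accept more):

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Unit letters taken from the documented examples: 24h, 7d, 30d, 2w.
_UNITS = {"h": "hours", "d": "days", "w": "weeks"}

def window_cutoff(spec: str, now: Optional[datetime] = None) -> datetime:
    """Return the earliest publish date allowed by a publishedWithin value."""
    m = re.fullmatch(r"(\d+)([hdw])", spec)
    if not m:
        raise ValueError(f"Bad publishedWithin value: {spec!r}")
    amount, unit = int(m.group(1)), m.group(2)
    now = now or datetime.now(timezone.utc)
    return now - timedelta(**{_UNITS[unit]: amount})
```

Articles whose `published_date` falls before `window_cutoff("7d")` would then be dropped.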

Output Format

Each article is saved with the following fields:

{
  "title": "OpenAI Releases GPT-5",
  "url": "https://techcrunch.com/2025/01/15/openai-gpt5",
  "author": "Jane Doe",
  "published_date": "2025-01-15T10:30:00+00:00",
  "content": "Full article text...",
  "summary": "OpenAI today announced GPT-5, its most capable model yet...",
  "source": "TechCrunch",
  "tags": ["AI", "OpenAI", "GPT"],
  "image_url": "https://techcrunch.com/images/gpt5.jpg",
  "word_count": 847,
  "reading_time_minutes": 4,
  "sentiment": {
    "label": "positive",
    "score": 0.6249
  },
  "scraped_at": "2025-01-15T11:00:00+00:00"
}

sentiment is only present when includeSentiment is true.
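The `word_count` and `reading_time_minutes` fields plausibly relate via the common 200-words-per-minute reading convention, which is an assumption, not something the actor documents:

```python
# Sketch of the word_count / reading_time_minutes relationship, assuming
# the common 200-words-per-minute convention (not confirmed by the docs).
def reading_stats(content: str) -> tuple:
    words = len(content.split())
    minutes = max(1, round(words / 200))  # floor of 1 minute for stubs
    return words, minutes
```

Applied to the example above, 847 words rounds to 4 minutes, matching the sample output.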


Usage Examples

1. Morning AI News Briefing

Get the latest AI articles from the last 24 hours across major sources:

{
  "usePresets": true,
  "presetSources": ["techcrunch", "verge", "wired", "mit-tech-review"],
  "maxArticlesPerSource": 20,
  "publishedWithin": "24h",
  "searchKeywords": ["AI", "artificial intelligence", "ChatGPT", "LLM"],
  "filterMode": "any",
  "useRssFeeds": true,
  "includeSentiment": true
}

2. Brand Monitoring — Negative Press Only

Track negative coverage about a topic:

{
  "usePresets": true,
  "presetSources": ["techcrunch", "verge", "arstechnica", "bbc-tech"],
  "maxArticlesPerSource": 30,
  "searchKeywords": ["Apple", "iPhone"],
  "filterMode": "any",
  "includeSentiment": true,
  "sentimentFilter": "negative",
  "publishedWithin": "7d"
}

3. Startup Launch Tracker

Track new product launches and funding news:

{
  "usePresets": true,
  "presetSources": ["producthunt", "techcrunch", "venturebeat"],
  "maxArticlesPerSource": 20,
  "searchKeywords": ["launch", "funding", "raises", "Series A"],
  "filterMode": "any",
  "dateFrom": "2025-01-01"
}

4. Developer Community Digest

Pull community articles from Dev.to and Hacker News:

{
  "usePresets": true,
  "presetSources": ["devto", "hackernews"],
  "maxArticlesPerSource": 50,
  "includeContent": true,
  "useRssFeeds": true
}

5. Custom Site Scraping

Scrape any site not in the preset list:

{
  "usePresets": false,
  "customUrls": [
    "https://9to5mac.com",
    "https://9to5google.com"
  ],
  "maxArticlesPerSource": 15,
  "maxPages": 2,
  "includeContent": true
}

6. Long-form Articles Only

Use word count to filter out short posts and stubs:

{
  "usePresets": true,
  "presetSources": ["wired", "arstechnica", "mit-tech-review"],
  "maxArticlesPerSource": 20,
  "includeContent": true
}

Tip: After export, filter word_count > 800 in your spreadsheet to get long-form articles only.
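The same post-filtering can be done in a few lines of Python on an exported JSON dataset. A sketch; the filename is hypothetical, and the `word_count` field is the one documented in the Output Format section:

```python
import json

# Filter an exported dataset down to long-form articles (> min_words),
# mirroring the spreadsheet tip above.
def long_form(articles: list, min_words: int = 800) -> list:
    return [a for a in articles if a.get("word_count", 0) > min_words]

# Typical use with a hypothetical export file:
# with open("articles.json") as f:
#     print(len(long_form(json.load(f))))
```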


Filtering Guide

Three filter types are available — filterMode controls how they combine:

filterMode          Behaviour
any (recommended)   Article passes if it matches at least one active filter
all                 Article must pass every active filter simultaneously

Example with filterMode: "any":

  • searchKeywords: ["AI"] matches → article included, regardless of other filters

Example with filterMode: "all":

  • Article must match searchKeywords AND titleContains AND descriptionContains — very strict
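The combination logic above amounts to Python's `any`/`all` built-ins over the active filters. A sketch (the predicate names are illustrative, not the actor's internals; note the substring matching here is deliberately naive):

```python
# Combine active filter predicates with OR ("any") or AND ("all") logic,
# as the filterMode parameter describes. Inactive filters (empty values)
# are simply not included in the list.
def passes(article: dict, filters: list, mode: str = "any") -> bool:
    if not filters:
        return True  # no active filters -> everything passes
    results = [f(article) for f in filters]
    return any(results) if mode == "any" else all(results)

# Illustrative predicates mirroring searchKeywords and titleContains:
kw = lambda a: any(k.lower() in (a["title"] + " " + a["summary"]).lower()
                   for k in ["AI"])
title = lambda a: "OpenAI" in a["title"]
```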

Viewing Results

  1. Dataset Tab — View all articles in a table after the run completes
  2. Export — Download as JSON, CSV, XML, or Excel
  3. API — Access results programmatically via the Apify API
  4. Schedule — Set up automated daily or weekly runs via Apify Scheduler

Troubleshooting

Getting 0 articles with multiple filters set

  • Switch filterMode to "any" — AND logic ("all") is very strict
  • Increase maxArticlesPerSource so more articles are available to filter
  • Check the logs — each rejected article shows which filter removed it

Missing content or short word counts

  • RSS feeds provide summaries, not full articles — set useRssFeeds: false for full HTML scraping
  • Some sites have unique page structures the generic scraper may miss

Connection timeouts or blocked requests

  • Reduce maxArticlesPerSource to avoid rate limiting
  • Some sites block automated requests — try a different source

JavaScript-heavy sites not working

  • This scraper uses static HTML parsing — JS-rendered pages are not supported
  • Use RSS mode (useRssFeeds: true) for preset sources to avoid this entirely
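RSS mode sidesteps JavaScript rendering because feeds ship plain XML. A minimal standard-library sketch of what feed extraction involves; real feeds vary (RSS vs Atom, namespaces) and the actor's parser is not published:

```python
import xml.etree.ElementTree as ET

# Extract title/link/description from RSS 2.0 items -- the three fields
# that map onto the output's title, url, and summary.
def parse_rss(xml_text: str) -> list:
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title", ""),
            "url": item.findtext("link", ""),
            "summary": item.findtext("description", ""),
        }
        for item in root.iter("item")
    ]

# Inline sample feed for demonstration:
SAMPLE = """<rss version="2.0"><channel><title>Demo</title>
<item><title>Hello</title><link>https://example.com/a</link>
<description>First post</description></item>
</channel></rss>"""
```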

FAQ

Can I scrape paywalled content? No — only publicly accessible content is scraped.

How fast is it? RSS mode: ~10 articles per second. HTML mode: ~1–2 articles per second due to polite rate limiting.

Can I add my own sources permanently? Use customUrls for one-off scraping, or open a GitHub issue to request a new preset source.

Is this legal? Web scraping legality depends on the site's terms of service and your jurisdiction. Always check robots.txt, avoid overloading servers, and use data responsibly. This tool is for legitimate use cases only.


License

Provided as-is for educational and legitimate scraping purposes. Always respect website terms of service and robots.txt files.