News & Article Extractor

Auto-discover and extract articles from news sites, blogs, and publications. Finds RSS feeds and sitemaps automatically. Outputs title, author, date, full text, images, and metadata. No proxy needed.

Pricing: Pay per event
Developer: Stas Persiianenko
Actor stats: 0 bookmarks · 8 total users · 3 monthly active users · last modified 17 days ago
Extract articles from any news website, blog, or publication — automatically. Give it a URL and it discovers articles via RSS feeds, sitemaps, or HTML crawling, then pulls the full text using @mozilla/readability.
No API key needed. No browser overhead. Just pure HTTP extraction.
📰 What does News & Article Extractor do?
News & Article Extractor auto-discovers and extracts articles from news sites, blogs, and academic publications. Point it at any website — TechCrunch, BBC News, your company blog, or any RSS feed — and it returns structured article data: title, author, publish date, full text content, images, and more.
The extractor uses a three-tier discovery strategy:
- RSS auto-discovery — detects RSS/Atom feeds from `<link rel="alternate">` tags or common paths (`/feed`, `/rss.xml`)
- sitemap.xml — parses XML sitemaps, including news sitemaps, for systematic URL discovery
- HTML crawl — falls back to extracting article links from the homepage
Once articles are found, @mozilla/readability (the same engine Firefox Reader View uses) strips navigation, ads, and boilerplate to return clean article text.
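As an illustration, the first discovery tier can be sketched in a few lines of Python using only the standard library. The actor itself is a Node.js HTTP crawler, so this is a conceptual sketch of the technique, not its actual code; the function and class names are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}
COMMON_PATHS = ["/feed", "/rss.xml", "/atom.xml"]

class FeedLinkFinder(HTMLParser):
    """Collects <link rel="alternate"> hrefs that advertise an RSS/Atom feed."""
    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        if a.get("rel", "").lower() == "alternate" and a.get("type", "").lower() in FEED_TYPES:
            if a.get("href"):
                self.feeds.append(a["href"])

def discover_feeds(base_url, html):
    """Tier 1: parse homepage HTML for advertised feeds; fall back to common paths."""
    finder = FeedLinkFinder()
    finder.feed(html)
    if finder.feeds:
        return [urljoin(base_url, href) for href in finder.feeds]
    # No advertised feed: probe the conventional locations instead
    return [urljoin(base_url, p) for p in COMMON_PATHS]

html = '<html><head><link rel="alternate" type="application/rss+xml" href="/feed"></head></html>'
print(discover_feeds("https://example.com", html))  # ['https://example.com/feed']
```

If neither the advertised feeds nor the common paths return a valid feed, the actor moves on to the sitemap and HTML-crawl tiers.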
👤 Who is News & Article Extractor for?
Researchers and academics monitoring a beat or topic:
- Track publications across dozens of news sources daily
- Build training datasets from articles across multiple blogs
- Monitor academic preprint servers and publication feeds
Content marketers and SEO teams:
- Audit competitor blog content and publishing cadence
- Aggregate industry news for internal newsletters
- Monitor brand mentions across news publications
Data scientists and ML engineers:
- Build NLP training corpora from news articles
- Create RAG (Retrieval-Augmented Generation) knowledge bases
- Feed structured article data into analysis pipelines
Business intelligence teams:
- Monitor competitor press releases and announcements
- Track industry trends from multiple publications
- Export article data to Google Sheets, Airtable, or databases
✅ Why use News & Article Extractor?
- Automatic discovery — no need to manually find RSS feeds or sitemaps; the extractor tries all methods automatically
- Clean text extraction — @mozilla/readability removes ads, navigation, footers, and cookie banners
- RSS metadata included — when articles come from RSS, you get author, date, and description for free (no extra HTTP request)
- Metadata-only mode — set `extractFullContent: false` to get just titles, dates, and URLs blazing fast
- Date filtering — filter articles by publication date range to get only recent content
- No proxy needed — most news sites are publicly accessible; pure HTTP extraction
- Structured output — every article outputs the same fields: title, author, publishedDate, content, wordCount, imageUrl, images, sourceDomain
- Graceful error handling — failed articles are skipped and logged; the run continues
- No API key or login required
📊 What data can you extract?
| Field | Description | Type |
|---|---|---|
| `title` | Article headline | string |
| `author` | Byline / author name | string |
| `publishedDate` | ISO 8601 publication date | string |
| `description` | Article summary or meta description | string |
| `content` | Full article body text (plain text) | string |
| `wordCount` | Number of words in article content | number |
| `url` | Canonical article URL | string |
| `imageUrl` | Primary/OG image URL | string |
| `images` | All image URLs found in article | array |
| `sourceDomain` | Domain of the source site | string |
| `sourceUrl` | Root URL of the source site | string |
| `discoveryMethod` | How the article was found: `rss`, `sitemap`, or `html-crawl` | string |
| `extractedAt` | Timestamp of extraction | string |
| `success` | Whether extraction succeeded | boolean |
| `error` | Error message if extraction failed | string |
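Because every item follows this schema, downstream code can process results uniformly. A small illustrative sketch (field names come from the table above; the sample records and the `summarize` helper are made up for the example):

```python
def summarize(items):
    """Split extraction results into successes/failures and average the word counts."""
    ok = [a for a in items if a.get("success")]
    failed = [a for a in items if not a.get("success")]
    avg_words = sum(a.get("wordCount", 0) for a in ok) / len(ok) if ok else 0
    return {"extracted": len(ok), "failed": len(failed), "avgWordCount": round(avg_words, 1)}

# Hypothetical dataset items shaped like the actor's output
items = [
    {"title": "A", "wordCount": 500, "success": True},
    {"title": "B", "wordCount": 700, "success": True},
    {"title": "C", "wordCount": 0, "success": False, "error": "timeout"},
]
print(summarize(items))  # {'extracted': 2, 'failed': 1, 'avgWordCount': 600.0}
```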
💰 How much does it cost to extract news articles?
News & Article Extractor uses Pay-Per-Event (PPE) pricing — you pay only for results, not for compute time.
| Event | FREE tier | BRONZE | SILVER | GOLD |
|---|---|---|---|---|
| Actor start (one-time) | $0.005 | $0.005 | $0.005 | $0.005 |
| Per article extracted | $0.0023 | $0.002 | $0.00156 | $0.0012 |
Real-world cost examples (start fee plus articles at the $0.002 BRONZE rate):
- Extract 10 articles: ~$0.025 (pennies)
- Extract 100 articles: ~$0.205
- Extract 1,000 articles: ~$2.005
On the free plan ($5 Apify credits), you can extract roughly 2,100+ articles before spending a cent of your own money.
Metadata-only mode (`extractFullContent: false`) is charged the same — the savings come from faster runs, not lower per-article cost.
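The arithmetic behind these figures is simple; here is a tiny cost helper to sanity-check them (rates copied from the pricing table above; `run_cost` is an illustrative name, not part of the actor or the Apify API):

```python
START_FEE = 0.005  # one-time "Actor start" event per run
RATES = {"FREE": 0.0023, "BRONZE": 0.002, "SILVER": 0.00156, "GOLD": 0.0012}

def run_cost(articles, tier="BRONZE"):
    """Total cost of one run: start fee + per-article charge for the given tier."""
    return START_FEE + articles * RATES[tier]

print(round(run_cost(10), 3))    # 0.025
print(round(run_cost(100), 3))   # 0.205
print(round(run_cost(1000), 3))  # 2.005
```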
🚀 How to extract articles from a news website
- Go to the News & Article Extractor page on Apify Store
- Click Try for free
- In the Website URLs field, enter one or more website URLs (e.g., `https://techcrunch.com`, `https://bbc.com/news`)
- Set Max Articles per Site — start with 10-20 for a quick test
- Leave Extract Full Content enabled to get full article text, or disable it for metadata-only (faster)
- Click Start and wait for results (typically 30-90 seconds for 10-20 articles)
- Export results as JSON, CSV, or Excel from the dataset tab
Input JSON examples:
Extract recent articles from two sources:
```json
{
  "startUrls": ["https://techcrunch.com", "https://theverge.com"],
  "maxArticles": 20,
  "extractFullContent": true,
  "includeImages": true
}
```
Use an RSS feed URL directly:
```json
{
  "startUrls": ["https://feeds.bbci.co.uk/news/rss.xml"],
  "maxArticles": 50,
  "extractFullContent": true
}
```
Metadata-only, last 7 days:
```json
{
  "startUrls": ["https://blog.apify.com"],
  "maxArticles": 100,
  "extractFullContent": false,
  "dateFrom": "2026-04-01"
}
```
⚙️ Input parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `startUrls` | array | required | Website URLs or RSS/sitemap URLs to process |
| `maxArticles` | integer | 20 | Maximum articles to extract per site |
| `extractFullContent` | boolean | true | Fetch and extract full article body text |
| `includeImages` | boolean | true | Include image URLs in output |
| `dateFrom` | string | — | Only articles on/after this date (YYYY-MM-DD) |
| `dateTo` | string | — | Only articles on/before this date (YYYY-MM-DD) |
| `requestTimeout` | integer | 30 | HTTP request timeout in seconds |
| `maxRetries` | integer | 2 | Retry attempts per failed request |
Tips for `startUrls`:
- You can enter full domain URLs (`https://techcrunch.com`) or direct RSS feed URLs (`https://feeds.bbci.co.uk/news/rss.xml`)
- For sites with many sections, enter specific section URLs (e.g., `https://bbc.com/news/technology`)
- Academic sites like arXiv work via HTML crawl: `https://arxiv.org/list/cs.AI/recent`
📤 Output examples
Full content extraction:
```json
{
  "url": "https://techcrunch.com/2026/04/07/waymo-opens-robotaxi-service-in-nashville/",
  "title": "Waymo opens robotaxi service in Nashville, partners with Lyft",
  "author": "Kirsten Korosec",
  "publishedDate": "2026-04-07T14:00:00.000Z",
  "description": "Waymo is expanding its robotaxi service beyond its current markets...",
  "content": "Waymo is expanding its autonomous vehicle service to Nashville, Tennessee...",
  "wordCount": 537,
  "imageUrl": "https://techcrunch.com/wp-content/uploads/2026/04/waymo.jpg",
  "images": ["https://techcrunch.com/wp-content/uploads/2026/04/waymo.jpg"],
  "sourceDomain": "techcrunch.com",
  "sourceUrl": "https://techcrunch.com",
  "discoveryMethod": "rss",
  "extractedAt": "2026-04-07T14:05:22.000Z",
  "success": true,
  "error": null
}
```
Metadata-only mode (extractFullContent: false):
```json
{
  "url": "https://www.bbc.com/news/articles/cx23p6j5gxgo",
  "title": "Artemis II crew head for home after historic lunar flyby",
  "author": "Jonathan Amos",
  "publishedDate": "2026-04-07T13:00:04.000Z",
  "description": "The four astronauts flew closer to the Moon than any humans since Apollo 17 in 1972.",
  "content": null,
  "wordCount": 0,
  "imageUrl": "https://ichef.bbci.co.uk/news/1024/branded_news/...jpg",
  "images": [],
  "sourceDomain": "bbc.com",
  "discoveryMethod": "rss",
  "success": true,
  "error": null
}
```
💡 Tips for best results
- Start with RSS feed URLs — if you know a site's RSS feed (`/feed`, `/rss.xml`), enter it directly. RSS feeds include metadata (author, date, description) without an extra HTTP request per article.
- Use metadata-only mode for monitoring — when you just need to know what articles were published and when, disable `extractFullContent`. It's much faster and costs the same per article.
- Set date filters for recurring runs — schedule the actor daily and use `dateFrom` to avoid re-extracting old articles.
- Lower `maxArticles` for quick tests — start with 5-10 articles to verify the site works before scaling up.
- Paywalled sites — the extractor fetches pages as a regular browser would. Paywalled content that requires login won't be accessible.
- JavaScript-heavy sites — some modern sites render content via JavaScript. If the extractor returns empty content for a site that has articles, the site may require a browser-based approach.
- Multiple sections — for large news sites, add multiple section URLs to `startUrls` to cover more ground (e.g., both `https://nytimes.com/section/technology` and `https://nytimes.com/section/business`).
🔗 Integrations
News & Article Extractor → Google Sheets
Schedule the actor to run daily and automatically append new articles to a Google Sheet. Use the built-in Apify → Google Sheets integration to build a living content archive. Great for editorial teams tracking industry news.
News & Article Extractor → Slack/Discord alerts
Set up a webhook trigger: when the actor completes and finds new articles matching certain keywords, post a summary to your Slack channel. Perfect for brand monitoring or competitor tracking.
News & Article Extractor → Make/Zapier content pipeline
Connect via Apify's Make or Zapier integration to route new articles to your CMS, Notion database, or email newsletter tool. Build a fully automated content curation pipeline.
Scheduled monitoring runs
Schedule runs every hour or day using Apify's built-in scheduler. Combine with date filters to only extract articles published since the last run. No duplicates, no manual work.
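One way to implement the "only since the last run" pattern is to derive `dateFrom` from the previous run's date, which you would persist yourself (e.g. in a key-value store; that storage step is assumed here and `next_date_from` is an illustrative helper, not an actor feature):

```python
from datetime import date, timedelta

def next_date_from(last_run_date: date, overlap_days: int = 1) -> str:
    """Build the dateFrom input (YYYY-MM-DD) for the next scheduled run.

    A one-day overlap guards against articles published right around the
    cutoff; deduplicate downstream by URL if exactness matters.
    """
    return (last_run_date - timedelta(days=overlap_days)).isoformat()

print(next_date_from(date(2026, 4, 7)))  # '2026-04-06'
```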
News & Article Extractor → RAG / LLM pipeline
Export article content to your vector database (Pinecone, Weaviate, Chroma) for retrieval-augmented generation. The content field gives you clean plain text ready for embedding.
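Before embedding, long articles are usually split into chunks. A minimal sketch, assuming a fixed word-window chunker with overlap (real pipelines often use token-aware splitters tied to the embedding model):

```python
def chunk_words(text, size=200, overlap=40):
    """Split plain article text into overlapping word windows for embedding."""
    words = text.split()
    step = size - overlap  # advance by size minus overlap each iteration
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + size >= len(words):
            break  # final window already covers the tail
    return chunks

# Hypothetical article record shaped like the actor's output
article = {"title": "Example", "content": "word " * 500}
chunks = chunk_words(article["content"])
print(len(chunks))  # 3
```

Each chunk (plus metadata such as `url` and `publishedDate`) can then be embedded and upserted into your vector store of choice.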
🖥️ Using the Apify API
Node.js
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('automation-lab/news-article-extractor').call({
    startUrls: ['https://techcrunch.com', 'https://theverge.com'],
    maxArticles: 50,
    extractFullContent: true,
    includeImages: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Extracted ${items.length} articles`);
items.slice(0, 3).forEach(article => {
    console.log(`[${article.publishedDate}] ${article.title} — ${article.wordCount} words`);
});
```
Python
```python
from apify_client import ApifyClient

client = ApifyClient(token='YOUR_API_TOKEN')

run = client.actor('automation-lab/news-article-extractor').call(run_input={
    'startUrls': ['https://techcrunch.com', 'https://theverge.com'],
    'maxArticles': 50,
    'extractFullContent': True,
    'includeImages': True,
})

dataset = client.dataset(run['defaultDatasetId'])
articles = dataset.list_items().items
print(f'Extracted {len(articles)} articles')
for article in articles[:3]:
    print(f"[{article['publishedDate']}] {article['title']} — {article['wordCount']} words")
```
cURL
```bash
curl -X POST "https://api.apify.com/v2/acts/automation-lab~news-article-extractor/runs" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{
    "startUrls": ["https://techcrunch.com"],
    "maxArticles": 20,
    "extractFullContent": true
  }'
```
🤖 Use with AI agents via MCP
News & Article Extractor is available as a tool for AI assistants that support the Model Context Protocol (MCP).
Add the Apify MCP server to your AI client — this gives you access to all Apify actors, including this one:
Setup for Claude Code
```bash
claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/news-article-extractor"
```
Setup for Claude Desktop, Cursor, or VS Code
Add this to your MCP config file:
```json
{
  "mcpServers": {
    "apify": {
      "url": "https://mcp.apify.com?tools=automation-lab/news-article-extractor"
    }
  }
}
```
Your AI assistant will use OAuth to authenticate with your Apify account on first use.
Example prompts
Once connected, try asking your AI assistant:
- "Use automation-lab/news-article-extractor to extract the 20 most recent articles from TechCrunch and summarize the main themes"
- "Extract all articles published today from https://bbc.com/news and identify which topics appear most frequently"
- "Get metadata for the last 50 articles from https://blog.apify.com and tell me the average word count per post"
Learn more in the Apify MCP documentation.
⚖️ Is it legal to scrape news articles?
News & Article Extractor only accesses publicly available pages — the same content any web browser would see. It does not bypass authentication, circumvent paywalls, or access restricted content.
Best practices:
- Respect `robots.txt` guidelines for the sites you scrape
- Do not scrape personal data beyond what's in public article bylines
- Check each website's terms of service regarding automated access
- Use the data ethically — for research, monitoring, and analysis, not plagiarism or content theft
- The actor does not store or redistribute copyrighted content — it extracts it to your own Apify dataset
For more information, see Apify's web scraping guide on legality.
❓ FAQ
How does article discovery work?
The extractor tries three methods in order: (1) RSS/Atom feed detection via <link rel="alternate"> tags or common paths like /feed and /rss.xml; (2) sitemap.xml parsing; (3) HTML link extraction from the homepage. If one method fails, it falls back to the next automatically.
Why are some articles returning empty content?
Some sites use JavaScript to render their content (e.g., heavy React/Next.js sites). Since this extractor uses pure HTTP (no browser), JavaScript-rendered content won't be visible. If you see success: true but empty content, the site likely requires JavaScript rendering. Try extracting metadata only from the RSS feed instead.
How fast is extraction?
Metadata-only mode (RSS) processes 50-100 articles in under 10 seconds — it's just parsing a feed. Full content extraction takes 1-3 seconds per article, depending on page size and server speed. 20 articles typically complete in 30-60 seconds.
Can I use this with paywalled sites?
No — the extractor does not support authentication or login. It can only access content that's publicly visible without logging in. Some sites offer free articles up to a limit before showing a paywall.
Why does the actor find fewer articles than expected?
The maxArticles limit per site applies. Also, RSS feeds typically contain only the 20-50 most recent articles. For older content, use the sitemap method by entering the site URL (not the RSS URL) — sitemaps often contain thousands of URLs.
Why are images not showing up?
Some sites serve images with lazy-loading (no src attribute until JavaScript runs) or use CSS backgrounds instead of <img> tags. The OG image (og:image meta tag) is always captured when available. Enable includeImages: true and check the imageUrl field first.
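Pulling the OG image from a page's meta tags needs no JavaScript, which is why it survives even on JS-heavy sites. A minimal stdlib sketch of the idea (illustrative only; the actor's own extraction is done in Node.js):

```python
from html.parser import HTMLParser

class OGImageFinder(HTMLParser):
    """Grabs the first <meta property="og:image" content="..."> value."""
    def __init__(self):
        super().__init__()
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta" and self.og_image is None:
            a = dict(attrs)
            if a.get("property") == "og:image" and a.get("content"):
                self.og_image = a["content"]

html = '<head><meta property="og:image" content="https://example.com/hero.jpg"></head>'
finder = OGImageFinder()
finder.feed(html)
print(finder.og_image)  # https://example.com/hero.jpg
```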
🔗 Other content scrapers
- Google News Scraper — scrape Google News results by keyword
- Bing News Scraper — extract news from Bing News search
- HackerNews Scraper — scrape Hacker News posts and comments
- Webpage to Markdown Converter — convert any webpage to clean Markdown for LLMs
- ArXiv Scraper — extract academic papers from arXiv.org