News & Article Extractor

Auto-discover and extract articles from news sites, blogs, and publications. Finds RSS feeds and sitemaps automatically. Outputs title, author, date, full text, images, and metadata. No proxy needed.

Pricing: Pay per event

Rating: 0.0 (0)

Developer: Stas Persiianenko (Maintained by Community)

Actor stats: 0 bookmarked · 8 total users · 3 monthly active users · last modified 17 days ago

Extract articles from any news website, blog, or publication — automatically. Give it a URL and it discovers articles via RSS feeds, sitemaps, or HTML crawling, then pulls the full text using @mozilla/readability.

No API key needed. No browser overhead. Just pure HTTP extraction.

📰 What does News & Article Extractor do?

News & Article Extractor auto-discovers and extracts articles from news sites, blogs, and academic publications. Point it at any website — TechCrunch, BBC News, your company blog, or any RSS feed — and it returns structured article data: title, author, publish date, full text content, images, and more.

The extractor uses a three-tier discovery strategy:

  1. RSS auto-discovery — detects RSS/Atom feeds from <link rel="alternate"> tags or common paths (/feed, /rss.xml)
  2. sitemap.xml — parses XML sitemaps including news sitemaps for systematic URL discovery
  3. HTML crawl — falls back to extracting article links from the homepage

Once articles are found, @mozilla/readability (the same engine Firefox Reader View uses) strips navigation, ads, and boilerplate to return clean article text.
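The first discovery tier can be sketched in a few lines. This is an illustrative reimplementation, not the actor's actual code: it scans homepage HTML for feed `<link rel="alternate">` tags and falls back to guessing common feed paths (a real crawler would then HTTP-check each guess).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

COMMON_FEED_PATHS = ["/feed", "/rss.xml", "/atom.xml"]
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkParser(HTMLParser):
    """Collects hrefs from <link rel="alternate"> tags that advertise a feed."""
    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "alternate" and a.get("type") in FEED_TYPES:
            if a.get("href"):
                self.feeds.append(a["href"])

def discover_feeds(base_url: str, homepage_html: str) -> list[str]:
    """Tier 1 of discovery: feed <link> tags, then common paths as guesses."""
    parser = FeedLinkParser()
    parser.feed(homepage_html)
    found = [urljoin(base_url, href) for href in parser.feeds]
    # Fall back to well-known feed locations when the page advertises none.
    return found or [urljoin(base_url, p) for p in COMMON_FEED_PATHS]
```

For example, `discover_feeds("https://example.com", '<link rel="alternate" type="application/rss+xml" href="/blog/feed.xml">')` resolves the relative href against the site root.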

👤 Who is News & Article Extractor for?

Researchers and academics monitoring a beat or topic:

  • Track publications across dozens of news sources daily
  • Build training datasets from articles across multiple blogs
  • Monitor academic preprint servers and publication feeds

Content marketers and SEO teams:

  • Audit competitor blog content and publishing cadence
  • Aggregate industry news for internal newsletters
  • Monitor brand mentions across news publications

Data scientists and ML engineers:

  • Build NLP training corpora from news articles
  • Create RAG (Retrieval-Augmented Generation) knowledge bases
  • Feed structured article data into analysis pipelines

Business intelligence teams:

  • Monitor competitor press releases and announcements
  • Track industry trends from multiple publications
  • Export article data to Google Sheets, Airtable, or databases

✅ Why use News & Article Extractor?

  • Automatic discovery — no need to manually find RSS feeds or sitemaps; the extractor tries all methods automatically
  • Clean text extraction — @mozilla/readability removes ads, navigation, footers, and cookie banners
  • RSS metadata included — when articles come from RSS, you get author, date, and description for free (no extra HTTP request)
  • Metadata-only mode — set extractFullContent: false to get just titles, dates, and URLs blazing fast and at minimal cost
  • Date filtering — filter articles by publication date range to get only recent content
  • No proxy needed — most news sites are publicly accessible; pure HTTP extraction
  • Structured output — every article outputs the same fields: title, author, publishedDate, content, wordCount, imageUrl, images, sourceDomain
  • Graceful error handling — failed articles are skipped and logged; the run continues
  • No API key or login required

📊 What data can you extract?

| Field | Description | Type |
| --- | --- | --- |
| title | Article headline | string |
| author | Byline / author name | string |
| publishedDate | ISO 8601 publication date | string |
| description | Article summary or meta description | string |
| content | Full article body text (plain text) | string |
| wordCount | Number of words in article content | number |
| url | Canonical article URL | string |
| imageUrl | Primary/OG image URL | string |
| images | All image URLs found in article | array |
| sourceDomain | Domain of the source site | string |
| sourceUrl | Root URL of the source site | string |
| discoveryMethod | How the article was found: rss, sitemap, or html-crawl | string |
| extractedAt | Timestamp of extraction | string |
| success | Whether extraction succeeded | boolean |
| error | Error message if extraction failed | string |

💰 How much does it cost to extract news articles?

News & Article Extractor uses Pay-Per-Event (PPE) pricing — you pay only for results, not for compute time.

| Event | FREE tier | BRONZE | SILVER | GOLD |
| --- | --- | --- | --- | --- |
| Actor start (one-time) | $0.005 | $0.005 | $0.005 | $0.005 |
| Per article extracted | $0.0023 | $0.002 | $0.00156 | $0.0012 |

Real-world cost examples (free tier: $0.0023 per article plus the one-time $0.005 start fee):

  • Extract 10 articles: ~$0.028 (pennies)
  • Extract 100 articles: ~$0.235
  • Extract 1,000 articles: ~$2.31

On the free plan ($5 Apify credits), you can extract roughly 2,100 articles before spending a cent of your own money.

Metadata-only mode (extractFullContent: false) is charged the same — the savings come from faster runs, not lower per-article cost.
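The pricing model reduces to simple arithmetic. A minimal sketch (a hypothetical helper, mirroring the PPE table above):

```python
# Per-article rates from the pricing table; START_FEE is the one-time actor-start event.
RATES = {"FREE": 0.0023, "BRONZE": 0.002, "SILVER": 0.00156, "GOLD": 0.0012}
START_FEE = 0.005

def estimate_cost(articles: int, tier: str = "FREE") -> float:
    """Total charge for one run: start fee plus one event per extracted article."""
    return round(START_FEE + articles * RATES[tier], 4)
```

For example, `estimate_cost(100)` gives $0.235 on the free tier, and `estimate_cost(1000, "BRONZE")` gives $2.005.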

🚀 How to extract articles from a news website

  1. Go to the News & Article Extractor page on Apify Store
  2. Click Try for free
  3. In the Website URLs field, enter one or more website URLs (e.g., https://techcrunch.com, https://bbc.com/news)
  4. Set Max Articles per Site — start with 10-20 for a quick test
  5. Leave Extract Full Content enabled to get full article text, or disable it for metadata-only (faster)
  6. Click Start and wait for results (typically 30-90 seconds for 10-20 articles)
  7. Export results as JSON, CSV, or Excel from the dataset tab

Input JSON examples:

Extract recent articles from two sources:

{
  "startUrls": ["https://techcrunch.com", "https://theverge.com"],
  "maxArticles": 20,
  "extractFullContent": true,
  "includeImages": true
}

Use an RSS feed URL directly:

{
  "startUrls": ["https://feeds.bbci.co.uk/news/rss.xml"],
  "maxArticles": 50,
  "extractFullContent": true
}

Metadata-only, last 7 days:

{
  "startUrls": ["https://blog.apify.com"],
  "maxArticles": 100,
  "extractFullContent": false,
  "dateFrom": "2026-04-01"
}

⚙️ Input parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| startUrls | array | required | Website URLs or RSS/sitemap URLs to process |
| maxArticles | integer | 20 | Maximum articles to extract per site |
| extractFullContent | boolean | true | Fetch and extract full article body text |
| includeImages | boolean | true | Include image URLs in output |
| dateFrom | string | (none) | Only articles on/after this date (YYYY-MM-DD) |
| dateTo | string | (none) | Only articles on/before this date (YYYY-MM-DD) |
| requestTimeout | integer | 30 | HTTP request timeout in seconds |
| maxRetries | integer | 2 | Retry attempts per failed request |

Tips for startUrls:

  • You can enter full domain URLs (https://techcrunch.com) or direct RSS feed URLs (https://feeds.bbci.co.uk/news/rss.xml)
  • For sites with many sections, enter specific section URLs (e.g., https://bbc.com/news/technology)
  • Academic sites like arXiv work via HTML crawl: https://arxiv.org/list/cs.AI/recent

📤 Output examples

Full content extraction:

{
  "url": "https://techcrunch.com/2026/04/07/waymo-opens-robotaxi-service-in-nashville/",
  "title": "Waymo opens robotaxi service in Nashville, partners with Lyft",
  "author": "Kirsten Korosec",
  "publishedDate": "2026-04-07T14:00:00.000Z",
  "description": "Waymo is expanding its robotaxi service beyond its current markets...",
  "content": "Waymo is expanding its autonomous vehicle service to Nashville, Tennessee...",
  "wordCount": 537,
  "imageUrl": "https://techcrunch.com/wp-content/uploads/2026/04/waymo.jpg",
  "images": ["https://techcrunch.com/wp-content/uploads/2026/04/waymo.jpg"],
  "sourceDomain": "techcrunch.com",
  "sourceUrl": "https://techcrunch.com",
  "discoveryMethod": "rss",
  "extractedAt": "2026-04-07T14:05:22.000Z",
  "success": true,
  "error": null
}

Metadata-only mode (extractFullContent: false):

{
  "url": "https://www.bbc.com/news/articles/cx23p6j5gxgo",
  "title": "Artemis II crew head for home after historic lunar flyby",
  "author": "Jonathan Amos",
  "publishedDate": "2026-04-07T13:00:04.000Z",
  "description": "The four astronauts flew closer to the Moon than any humans since Apollo 17 in 1972.",
  "content": null,
  "wordCount": 0,
  "imageUrl": "https://ichef.bbci.co.uk/news/1024/branded_news/...jpg",
  "images": [],
  "sourceDomain": "bbc.com",
  "discoveryMethod": "rss",
  "success": true,
  "error": null
}
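Because every record carries `success` and `wordCount`, downstream filtering is straightforward. A sketch of a post-processing step (hypothetical helper, assuming dataset items shaped like these records):

```python
def usable_articles(items: list[dict]) -> list[dict]:
    """Keep items that extracted successfully with real body text, newest first.

    ISO 8601 date strings sort correctly as plain strings, so no parsing is needed.
    """
    good = [a for a in items if a.get("success") and (a.get("wordCount") or 0) > 0]
    return sorted(good, key=lambda a: a.get("publishedDate") or "", reverse=True)
```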

💡 Tips for best results

  • Start with RSS feed URLs — if you know a site's RSS feed (/feed, /rss.xml), enter it directly. RSS feeds include metadata (author, date, description) without an extra HTTP request per article.
  • Use metadata-only mode for monitoring — when you just need to know what articles were published and when, disable extractFullContent. It's much faster and costs the same per article.
  • Set date filters for recurring runs — schedule the actor daily and use dateFrom to avoid re-extracting old articles.
  • Lower maxArticles for quick tests — start with 5-10 articles to verify the site works before scaling up.
  • Paywalled sites — the extractor fetches pages as a regular browser would. Paywalled content that requires login won't be accessible.
  • JavaScript-heavy sites — some modern sites render content via JavaScript. If the extractor returns empty content for a site that has articles, the site may require a browser-based approach.
  • Multiple sections — for large news sites, add multiple section URLs to startUrls to cover more ground (e.g., both https://nytimes.com/section/technology and https://nytimes.com/section/business).

🔗 Integrations

News & Article Extractor → Google Sheets Schedule the actor to run daily and automatically append new articles to a Google Sheet. Use the built-in Apify → Google Sheets integration to build a living content archive. Great for editorial teams tracking industry news.

News & Article Extractor → Slack/Discord alerts Set up a webhook trigger: when the actor completes and finds new articles matching certain keywords, post a summary to your Slack channel. Perfect for brand monitoring or competitor tracking.

News & Article Extractor → Make/Zapier content pipeline Connect via Apify's Make or Zapier integration to route new articles to your CMS, Notion database, or email newsletter tool. Build a fully automated content curation pipeline.

Scheduled monitoring runs Schedule runs every hour or day using Apify's built-in scheduler. Combine with date filters to only extract articles published since the last run. No duplicates, no manual work.

News & Article Extractor → RAG / LLM pipeline Export article content to your vector database (Pinecone, Weaviate, Chroma) for retrieval-augmented generation. The content field gives you clean plain text ready for embedding.
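Before embedding, long articles usually need to be split into chunks. A minimal word-window chunker over the `content` field (an illustrative sketch; chunk sizes and overlap depend on your embedding model):

```python
def chunk_for_embedding(content: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split plain text into overlapping word-window chunks for embedding."""
    words = content.split()
    if not words:
        return []
    chunks, step = [], max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # last window already reaches the end of the text
    return chunks
```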

🖥️ Using the Apify API

Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('automation-lab/news-article-extractor').call({
    startUrls: ['https://techcrunch.com', 'https://theverge.com'],
    maxArticles: 50,
    extractFullContent: true,
    includeImages: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Extracted ${items.length} articles`);
items.slice(0, 3).forEach((article) => {
    console.log(`[${article.publishedDate}] ${article.title} (${article.wordCount} words)`);
});

Python

from apify_client import ApifyClient

client = ApifyClient(token='YOUR_API_TOKEN')

run = client.actor('automation-lab/news-article-extractor').call(run_input={
    'startUrls': ['https://techcrunch.com', 'https://theverge.com'],
    'maxArticles': 50,
    'extractFullContent': True,
    'includeImages': True,
})

dataset = client.dataset(run['defaultDatasetId'])
articles = dataset.list_items().items
print(f'Extracted {len(articles)} articles')
for article in articles[:3]:
    print(f"[{article['publishedDate']}] {article['title']} ({article['wordCount']} words)")

cURL

curl -X POST "https://api.apify.com/v2/acts/automation-lab~news-article-extractor/runs" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{
    "startUrls": ["https://techcrunch.com"],
    "maxArticles": 20,
    "extractFullContent": true
  }'

🤖 Use with AI agents via MCP

News & Article Extractor is available as a tool for AI assistants that support the Model Context Protocol (MCP).

Add the Apify MCP server to your AI client — this gives you access to all Apify actors, including this one:

Setup for Claude Code

claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/news-article-extractor"

Setup for Claude Desktop, Cursor, or VS Code

Add this to your MCP config file:

{
  "mcpServers": {
    "apify": {
      "url": "https://mcp.apify.com?tools=automation-lab/news-article-extractor"
    }
  }
}

Your AI assistant will use OAuth to authenticate with your Apify account on first use.

Example prompts

Once connected, try asking your AI assistant:

  • "Use automation-lab/news-article-extractor to extract the 20 most recent articles from TechCrunch and summarize the main themes"
  • "Extract all articles published today from https://bbc.com/news and identify which topics appear most frequently"
  • "Get metadata for the last 50 articles from https://blog.apify.com and tell me the average word count per post"

Learn more in the Apify MCP documentation.

⚖️ Is it legal to extract news articles?

News & Article Extractor only accesses publicly available pages — the same content any web browser would see. It does not bypass authentication, circumvent paywalls, or access restricted content.

Best practices:

  • Respect robots.txt guidelines for the sites you scrape
  • Do not scrape personal data beyond what's in public article bylines
  • Check each website's terms of service regarding automated access
  • Use the data ethically — for research, monitoring, and analysis, not plagiarism or content theft
  • The actor does not store or redistribute copyrighted content — it extracts it to your own Apify dataset

For more information, see Apify's web scraping guide on legality.

❓ FAQ

How does article discovery work? The extractor tries three methods in order: (1) RSS/Atom feed detection via <link rel="alternate"> tags or common paths like /feed and /rss.xml; (2) sitemap.xml parsing; (3) HTML link extraction from the homepage. If one method fails, it falls back to the next automatically.

Why are some articles returning empty content? Some sites use JavaScript to render their content (e.g., heavy React/Next.js sites). Since this extractor uses pure HTTP (no browser), JavaScript-rendered content won't be visible. If you see success: true but empty content, the site likely requires JavaScript rendering. Try extracting metadata only from the RSS feed instead.

How fast is extraction? Metadata-only mode (RSS) processes 50-100 articles in under 10 seconds — it's just parsing a feed. Full content extraction takes 1-3 seconds per article depending on page size and server speed. 20 articles typically complete in 30-60 seconds.

Can I use this with paywalled sites? No — the extractor does not support authentication or login. It can only access content that's publicly visible without logging in. Some sites offer free articles up to a limit before showing a paywall.

Why does the actor find fewer articles than expected? The maxArticles limit per site applies. Also, RSS feeds typically contain only the 20-50 most recent articles. For older content, use the sitemap method by entering the site URL (not the RSS URL) — sitemaps often contain thousands of URLs.

Why are images not showing up? Some sites serve images with lazy-loading (no src attribute until JavaScript runs) or use CSS backgrounds instead of <img> tags. The OG image (og:image meta tag) is always captured when available. Enable includeImages: true and check the imageUrl field first.

🔗 Other content scrapers