News & Article Extractor

Auto-discover and extract articles from news sites, blogs, and publications. Finds RSS feeds and sitemaps automatically. Outputs title, author, date, full text, images, and metadata. No proxy needed.

Pricing: Pay per event

Rating: 0.0 (0)

Developer: Stas Persiianenko (Maintained by Community)

Actor stats: 0 bookmarked · 8 total users · 3 monthly active users · last modified 17 days ago

Extract articles from any news website, blog, or publication — automatically. Give it a URL and it discovers articles via RSS feeds, sitemaps, or HTML crawling, then pulls the full text using @mozilla/readability.

No API key needed. No browser overhead. Just pure HTTP extraction.

📰 What does News & Article Extractor do?

News & Article Extractor auto-discovers and extracts articles from news sites, blogs, and academic publications. Point it at any website — TechCrunch, BBC News, your company blog, or any RSS feed — and it returns structured article data: title, author, publish date, full text content, images, and more.

The extractor uses a three-tier discovery strategy:

  1. RSS auto-discovery — detects RSS/Atom feeds from <link rel="alternate"> tags or common paths (/feed, /rss.xml)
  2. sitemap.xml — parses XML sitemaps including news sitemaps for systematic URL discovery
  3. HTML crawl — falls back to extracting article links from the homepage

Once articles are found, @mozilla/readability (the same engine Firefox Reader View uses) strips navigation, ads, and boilerplate to return clean article text.
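The first discovery tier can be sketched in a few lines. This is an illustrative reimplementation, not the actor's actual code: it scans homepage HTML for feed `<link rel="alternate">` tags and falls back to guessing common feed paths (a real crawler would then HTTP-check each guess).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

COMMON_FEED_PATHS = ["/feed", "/rss.xml", "/atom.xml"]
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

class FeedLinkParser(HTMLParser):
    """Collects hrefs from <link rel="alternate"> tags that advertise a feed."""
    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "alternate" and a.get("type") in FEED_TYPES:
            if a.get("href"):
                self.feeds.append(a["href"])

def discover_feeds(base_url: str, homepage_html: str) -> list[str]:
    """Tier 1 of discovery: feed <link> tags, then common paths as guesses."""
    parser = FeedLinkParser()
    parser.feed(homepage_html)
    found = [urljoin(base_url, href) for href in parser.feeds]
    # Fall back to well-known feed locations when the page advertises none.
    return found or [urljoin(base_url, p) for p in COMMON_FEED_PATHS]
```

For example, `discover_feeds("https://example.com", '<link rel="alternate" type="application/rss+xml" href="/blog/feed.xml">')` resolves the relative href against the site root.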

👤 Who is News & Article Extractor for?

Researchers and academics monitoring a beat or topic:

  • Track publications across dozens of news sources daily
  • Build training datasets from articles across multiple blogs
  • Monitor academic preprint servers and publication feeds

Content marketers and SEO teams:

  • Audit competitor blog content and publishing cadence
  • Aggregate industry news for internal newsletters
  • Monitor brand mentions across news publications

Data scientists and ML engineers:

  • Build NLP training corpora from news articles
  • Create RAG (Retrieval-Augmented Generation) knowledge bases
  • Feed structured article data into analysis pipelines

Business intelligence teams:

  • Monitor competitor press releases and announcements
  • Track industry trends from multiple publications
  • Export article data to Google Sheets, Airtable, or databases

✅ Why use News & Article Extractor?

  • Automatic discovery — no need to manually find RSS feeds or sitemaps; the extractor tries all methods automatically
  • Clean text extraction — @mozilla/readability removes ads, navigation, footers, and cookie banners
  • RSS metadata included — when articles come from RSS, you get author, date, and description for free (no extra HTTP request)
  • Metadata-only mode — set extractFullContent: false to get just titles, dates, and URLs blazing fast and at minimal cost
  • Date filtering — filter articles by publication date range to get only recent content
  • No proxy needed — most news sites are publicly accessible; pure HTTP extraction
  • Structured output — every article outputs the same fields: title, author, publishedDate, content, wordCount, imageUrl, images, sourceDomain
  • Graceful error handling — failed articles are skipped and logged; the run continues
  • No API key or login required

📊 What data can you extract?

| Field | Description | Type |
| --- | --- | --- |
| title | Article headline | string |
| author | Byline / author name | string |
| publishedDate | ISO 8601 publication date | string |
| description | Article summary or meta description | string |
| content | Full article body text (plain text) | string |
| wordCount | Number of words in article content | number |
| url | Canonical article URL | string |
| imageUrl | Primary/OG image URL | string |
| images | All image URLs found in article | array |
| sourceDomain | Domain of the source site | string |
| sourceUrl | Root URL of the source site | string |
| discoveryMethod | How the article was found: rss, sitemap, or html-crawl | string |
| extractedAt | Timestamp of extraction | string |
| success | Whether extraction succeeded | boolean |
| error | Error message if extraction failed | string |

💰 How much does it cost to extract news articles?

News & Article Extractor uses Pay-Per-Event (PPE) pricing — you pay only for results, not for compute time.

| Event | FREE tier | BRONZE | SILVER | GOLD |
| --- | --- | --- | --- | --- |
| Actor start (one-time) | $0.005 | $0.005 | $0.005 | $0.005 |
| Per article extracted | $0.0023 | $0.002 | $0.00156 | $0.0012 |

Real-world cost examples (free tier: $0.0023 per article plus the one-time $0.005 start fee):

  • Extract 10 articles: ~$0.028 (pennies)
  • Extract 100 articles: ~$0.235
  • Extract 1,000 articles: ~$2.31

On the free plan ($5 Apify credits), you can extract roughly 2,100 articles before spending a cent of your own money.

Metadata-only mode (extractFullContent: false) is charged the same — the savings come from faster runs, not lower per-article cost.
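The pricing model reduces to simple arithmetic. A minimal sketch (a hypothetical helper, mirroring the PPE table above):

```python
# Per-article rates from the pricing table; START_FEE is the one-time actor-start event.
RATES = {"FREE": 0.0023, "BRONZE": 0.002, "SILVER": 0.00156, "GOLD": 0.0012}
START_FEE = 0.005

def estimate_cost(articles: int, tier: str = "FREE") -> float:
    """Total charge for one run: start fee plus one event per extracted article."""
    return round(START_FEE + articles * RATES[tier], 4)
```

For example, `estimate_cost(100)` gives $0.235 on the free tier, and `estimate_cost(1000, "BRONZE")` gives $2.005.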

🚀 How to extract articles from a news website

  1. Go to the News & Article Extractor page on Apify Store
  2. Click Try for free
  3. In the Website URLs field, enter one or more website URLs (e.g., https://techcrunch.com, https://bbc.com/news)
  4. Set Max Articles per Site — start with 10-20 for a quick test
  5. Leave Extract Full Content enabled to get full article text, or disable it for metadata-only (faster)
  6. Click Start and wait for results (typically 30-90 seconds for 10-20 articles)
  7. Export results as JSON, CSV, or Excel from the dataset tab

Input JSON examples:

Extract recent articles from two sources:

{
  "startUrls": ["https://techcrunch.com", "https://theverge.com"],
  "maxArticles": 20,
  "extractFullContent": true,
  "includeImages": true
}

Use an RSS feed URL directly:

{
  "startUrls": ["https://feeds.bbci.co.uk/news/rss.xml"],
  "maxArticles": 50,
  "extractFullContent": true
}

Metadata-only, last 7 days:

{
  "startUrls": ["https://blog.apify.com"],
  "maxArticles": 100,
  "extractFullContent": false,
  "dateFrom": "2026-04-01"
}

⚙️ Input parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| startUrls | array | required | Website URLs or RSS/sitemap URLs to process |
| maxArticles | integer | 20 | Maximum articles to extract per site |
| extractFullContent | boolean | true | Fetch and extract full article body text |
| includeImages | boolean | true | Include image URLs in output |
| dateFrom | string | (none) | Only articles on/after this date (YYYY-MM-DD) |
| dateTo | string | (none) | Only articles on/before this date (YYYY-MM-DD) |
| requestTimeout | integer | 30 | HTTP request timeout in seconds |
| maxRetries | integer | 2 | Retry attempts per failed request |

Tips for startUrls:

  • You can enter full domain URLs (https://techcrunch.com) or direct RSS feed URLs (https://feeds.bbci.co.uk/news/rss.xml)
  • For sites with many sections, enter specific section URLs (e.g., https://bbc.com/news/technology)
  • Academic sites like arXiv work via HTML crawl: https://arxiv.org/list/cs.AI/recent

📤 Output examples

Full content extraction:

{
  "url": "https://techcrunch.com/2026/04/07/waymo-opens-robotaxi-service-in-nashville/",
  "title": "Waymo opens robotaxi service in Nashville, partners with Lyft",
  "author": "Kirsten Korosec",
  "publishedDate": "2026-04-07T14:00:00.000Z",
  "description": "Waymo is expanding its robotaxi service beyond its current markets...",
  "content": "Waymo is expanding its autonomous vehicle service to Nashville, Tennessee...",
  "wordCount": 537,
  "imageUrl": "https://techcrunch.com/wp-content/uploads/2026/04/waymo.jpg",
  "images": ["https://techcrunch.com/wp-content/uploads/2026/04/waymo.jpg"],
  "sourceDomain": "techcrunch.com",
  "sourceUrl": "https://techcrunch.com",
  "discoveryMethod": "rss",
  "extractedAt": "2026-04-07T14:05:22.000Z",
  "success": true,
  "error": null
}

Metadata-only mode (extractFullContent: false):

{
  "url": "https://www.bbc.com/news/articles/cx23p6j5gxgo",
  "title": "Artemis II crew head for home after historic lunar flyby",
  "author": "Jonathan Amos",
  "publishedDate": "2026-04-07T13:00:04.000Z",
  "description": "The four astronauts flew closer to the Moon than any humans since Apollo 17 in 1972.",
  "content": null,
  "wordCount": 0,
  "imageUrl": "https://ichef.bbci.co.uk/news/1024/branded_news/...jpg",
  "images": [],
  "sourceDomain": "bbc.com",
  "discoveryMethod": "rss",
  "success": true,
  "error": null
}
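Because every record carries `success` and `wordCount`, downstream filtering is straightforward. A sketch of a post-processing step (hypothetical helper, assuming dataset items shaped like these records):

```python
def usable_articles(items: list[dict]) -> list[dict]:
    """Keep items that extracted successfully with real body text, newest first.

    ISO 8601 date strings sort correctly as plain strings, so no parsing is needed.
    """
    good = [a for a in items if a.get("success") and (a.get("wordCount") or 0) > 0]
    return sorted(good, key=lambda a: a.get("publishedDate") or "", reverse=True)
```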

💡 Tips for best results

  • Start with RSS feed URLs — if you know a site's RSS feed (/feed, /rss.xml), enter it directly. RSS feeds include metadata (author, date, description) without an extra HTTP request per article.
  • Use metadata-only mode for monitoring — when you just need to know what articles were published and when, disable extractFullContent. It's much faster and costs the same per article.
  • Set date filters for recurring runs — schedule the actor daily and use dateFrom to avoid re-extracting old articles.
  • Lower maxArticles for quick tests — start with 5-10 articles to verify the site works before scaling up.
  • Paywalled sites — the extractor fetches pages as a regular browser would. Paywalled content that requires login won't be accessible.
  • JavaScript-heavy sites — some modern sites render content via JavaScript. If the extractor returns empty content for a site that has articles, the site may require a browser-based approach.
  • Multiple sections — for large news sites, add multiple section URLs to startUrls to cover more ground (e.g., both https://nytimes.com/section/technology and https://nytimes.com/section/business).

🔗 Integrations

News & Article Extractor → Google Sheets Schedule the actor to run daily and automatically append new articles to a Google Sheet. Use the built-in Apify → Google Sheets integration to build a living content archive. Great for editorial teams tracking industry news.

News & Article Extractor → Slack/Discord alerts Set up a webhook trigger: when the actor completes and finds new articles matching certain keywords, post a summary to your Slack channel. Perfect for brand monitoring or competitor tracking.

News & Article Extractor → Make/Zapier content pipeline Connect via Apify's Make or Zapier integration to route new articles to your CMS, Notion database, or email newsletter tool. Build a fully automated content curation pipeline.

Scheduled monitoring runs Schedule runs every hour or day using Apify's built-in scheduler. Combine with date filters to only extract articles published since the last run. No duplicates, no manual work.

News & Article Extractor → RAG / LLM pipeline Export article content to your vector database (Pinecone, Weaviate, Chroma) for retrieval-augmented generation. The content field gives you clean plain text ready for embedding.
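Before embedding, long articles usually need to be split into chunks. A minimal word-window chunker over the `content` field (an illustrative sketch; chunk sizes and overlap depend on your embedding model):

```python
def chunk_for_embedding(content: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split plain text into overlapping word-window chunks for embedding."""
    words = content.split()
    if not words:
        return []
    chunks, step = [], max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # last window already reaches the end of the text
    return chunks
```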

🖥️ Using the Apify API

Node.js

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('automation-lab/news-article-extractor').call({
    startUrls: ['https://techcrunch.com', 'https://theverge.com'],
    maxArticles: 50,
    extractFullContent: true,
    includeImages: true,
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Extracted ${items.length} articles`);
items.slice(0, 3).forEach((article) => {
    console.log(`[${article.publishedDate}] ${article.title} (${article.wordCount} words)`);
});

Python

from apify_client import ApifyClient

client = ApifyClient(token='YOUR_API_TOKEN')

run = client.actor('automation-lab/news-article-extractor').call(run_input={
    'startUrls': ['https://techcrunch.com', 'https://theverge.com'],
    'maxArticles': 50,
    'extractFullContent': True,
    'includeImages': True,
})

dataset = client.dataset(run['defaultDatasetId'])
articles = dataset.list_items().items
print(f'Extracted {len(articles)} articles')
for article in articles[:3]:
    print(f"[{article['publishedDate']}] {article['title']} ({article['wordCount']} words)")

cURL

curl -X POST "https://api.apify.com/v2/acts/automation-lab~news-article-extractor/runs" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{
    "startUrls": ["https://techcrunch.com"],
    "maxArticles": 20,
    "extractFullContent": true
  }'

🤖 Use with AI agents via MCP

News & Article Extractor is available as a tool for AI assistants that support the Model Context Protocol (MCP).

Add the Apify MCP server to your AI client — this gives you access to all Apify actors, including this one:

Setup for Claude Code

claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/news-article-extractor"

Setup for Claude Desktop, Cursor, or VS Code

Add this to your MCP config file:

{
  "mcpServers": {
    "apify": {
      "url": "https://mcp.apify.com?tools=automation-lab/news-article-extractor"
    }
  }
}

Your AI assistant will use OAuth to authenticate with your Apify account on first use.

Example prompts

Once connected, try asking your AI assistant:

  • "Use automation-lab/news-article-extractor to extract the 20 most recent articles from TechCrunch and summarize the main themes"
  • "Extract all articles published today from https://bbc.com/news and identify which topics appear most frequently"
  • "Get metadata for the last 50 articles from https://blog.apify.com and tell me the average word count per post"

Learn more in the Apify MCP documentation.

⚖️ Is it legal to extract news articles?

News & Article Extractor only accesses publicly available pages — the same content any web browser would see. It does not bypass authentication, circumvent paywalls, or access restricted content.

Best practices:

  • Respect robots.txt guidelines for the sites you scrape
  • Do not scrape personal data beyond what's in public article bylines
  • Check each website's terms of service regarding automated access
  • Use the data ethically — for research, monitoring, and analysis, not plagiarism or content theft
  • The actor does not store or redistribute copyrighted content — it extracts it to your own Apify dataset

For more information, see Apify's web scraping guide on legality.

❓ FAQ

How does article discovery work? The extractor tries three methods in order: (1) RSS/Atom feed detection via <link rel="alternate"> tags or common paths like /feed and /rss.xml; (2) sitemap.xml parsing; (3) HTML link extraction from the homepage. If one method fails, it falls back to the next automatically.

Why are some articles returning empty content? Some sites use JavaScript to render their content (e.g., heavy React/Next.js sites). Since this extractor uses pure HTTP (no browser), JavaScript-rendered content won't be visible. If you see success: true but empty content, the site likely requires JavaScript rendering. Try extracting metadata only from the RSS feed instead.

How fast is extraction? Metadata-only mode (RSS) processes 50-100 articles in under 10 seconds — it's just parsing a feed. Full content extraction takes 1-3 seconds per article depending on page size and server speed. 20 articles typically complete in 30-60 seconds.

Can I use this with paywalled sites? No — the extractor does not support authentication or login. It can only access content that's publicly visible without logging in. Some sites offer free articles up to a limit before showing a paywall.

Why does the actor find fewer articles than expected? The maxArticles limit per site applies. Also, RSS feeds typically contain only the 20-50 most recent articles. For older content, use the sitemap method by entering the site URL (not the RSS URL) — sitemaps often contain thousands of URLs.

Why are images not showing up? Some sites serve images with lazy-loading (no src attribute until JavaScript runs) or use CSS backgrounds instead of <img> tags. The OG image (og:image meta tag) is always captured when available. Enable includeImages: true and check the imageUrl field first.

🔗 Other content scrapers