Pricing

Pay per event

News Article Extractor for AI & RAG

Extract clean, structured JSON from any news article or blog post - title, authors, published date, full content, keywords, images. Perfect for LLM training data, RAG pipelines, content monitoring and news aggregation. Uses JSON-LD, Open Graph and readability heuristics.

Developer

Mohieldin Mohamed

Last modified

3 months ago

What does News Article Extractor for AI & RAG do?

This actor fetches any article URL and runs a layered extraction pipeline to get the cleanest possible text:

JSON-LD schemas - Most news sites publish NewsArticle / Article structured data. This is the highest-fidelity source for title, author, and publish date.
Open Graph + Twitter Cards - Fallback metadata used by virtually every modern site.
<article> and [itemprop="articleBody"] tags - Semantic HTML extraction.
Readability heuristics - Longest <p> cluster for sites that don't use any of the above.

Noise (ads, nav bars, share buttons, newsletter forms, related-article widgets, paywall prompts) is stripped before content extraction. The final output is a clean text body plus all the metadata an LLM or analytics pipeline needs.
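The first two layers of the pipeline described above can be sketched in a few lines of Python. This is only an illustration of the fallback idea, not the actor's actual code: the function name `extract_metadata`, the `source` labels, and the set of accepted `@type` values are assumptions made for the example, and it uses only the standard library.

```python
import json
from html.parser import HTMLParser

# Assumption: schema.org types treated as "an article" for this sketch.
ARTICLE_TYPES = {"NewsArticle", "Article", "BlogPosting"}


class _MetaCollector(HTMLParser):
    """Collects JSON-LD script bodies and Open Graph / article meta tags."""

    def __init__(self):
        super().__init__()
        self.jsonld_blocks = []
        self.open_graph = {}
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []
        elif tag == "meta":
            prop = attrs.get("property") or ""
            if prop.startswith(("og:", "article:")):
                self.open_graph[prop] = attrs.get("content") or ""

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.jsonld_blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_metadata(html):
    """Try JSON-LD first, then fall back to Open Graph. Returns None if both miss."""
    collector = _MetaCollector()
    collector.feed(html)

    # Layer 1: JSON-LD (top-level objects or arrays only; @graph is not handled here).
    for raw in collector.jsonld_blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        for node in data if isinstance(data, list) else [data]:
            if not isinstance(node, dict):
                continue
            types = node.get("@type")
            types = types if isinstance(types, list) else [types]
            if ARTICLE_TYPES.intersection(t for t in types if isinstance(t, str)):
                return {
                    "title": node.get("headline"),
                    "publishedAt": node.get("datePublished"),
                    "source": "jsonld",
                }

    # Layer 2: Open Graph fallback.
    og = collector.open_graph
    if "og:title" in og:
        return {
            "title": og["og:title"],
            "publishedAt": og.get("article:published_time"),
            "source": "open-graph",  # hypothetical label for this sketch
        }
    return None
```

A real extractor would continue with the `<article>` and readability layers when this function returns `None`.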

Why use News Article Extractor?

RAG pipelines - Ingest articles into vector databases without cleanup work. Every output already has a word count, reading time, and canonical URL.
LLM fine-tuning - Build high-quality training datasets of article bodies stripped of boilerplate.
Content monitoring - Track what a publisher is posting over time and pipe it into your analytics stack.
News aggregators - Build a Feedly clone or topic-tracking dashboard without scraping each site individually.
Sentiment analysis - Get clean text inputs for your NLP models without fighting site-specific HTML.
SEO research - Extract every competitor article on a topic and analyze their structure, word counts, and keywords.

Built on the Apify platform: scheduling, API access, proxy rotation, webhook integrations, and monitoring are included.

How to use News Article Extractor for AI & RAG

Click Try for free and sign in to Apify
Paste the article URLs you want to extract into the Article URLs field
(Optional) Set a Minimum word count to skip homepages and category listings
Click Start - the actor processes URLs in parallel
Open the Output tab to view or download results

You can also trigger the actor from your own code via the Apify API - pass a list of URLs in the JSON body and poll for results.
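As a rough sketch of that start-and-poll flow, the snippet below starts a run through the Apify API, polls the run until it reaches a terminal status, and then reads the dataset items. The actor ID and token are placeholders you must supply; the helper names (`run_extractor`, `_request_json`) and the polling interval are assumptions for illustration, and the official Apify client libraries are likely a more convenient choice in practice.

```python
import json
import time
import urllib.request

API_BASE = "https://api.apify.com/v2"
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}


def _request_json(url, token, payload=None):
    """POST when a payload is given, GET otherwise; returns the parsed JSON body."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    request = urllib.request.Request(
        url,
        data=data,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def run_extractor(actor_id, token, urls, poll_seconds=5,
                  request_json=_request_json, sleep=time.sleep):
    """Start a run, poll until it finishes, and return the dataset items."""
    run_input = {"startUrls": [{"url": url} for url in urls]}
    run = request_json(f"{API_BASE}/acts/{actor_id}/runs", token, run_input)["data"]

    while run["status"] not in TERMINAL_STATUSES:
        sleep(poll_seconds)
        run = request_json(f"{API_BASE}/actor-runs/{run['id']}", token)["data"]

    if run["status"] != "SUCCEEDED":
        raise RuntimeError(f"Run {run['id']} ended with status {run['status']}")

    return request_json(
        f"{API_BASE}/datasets/{run['defaultDatasetId']}/items", token
    )


# Usage (placeholders, not real values):
# items = run_extractor("<username>~<actor-name>", "<APIFY_TOKEN>",
#                       ["https://www.bbc.com/news/articles/cq8v4dqj9y7o"])
```

The `request_json` and `sleep` parameters exist only so the flow can be exercised without network access.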

Input

{
    "startUrls": [
        { "url": "https://www.bbc.com/news/articles/cq8v4dqj9y7o" },
        { "url": "https://techcrunch.com/2026/04/12/ai-roundup" }
    ],
    "minWordCount": 200,
    "includeHtml": false,
    "maxRequestsPerCrawl": 100
}

Field	Type	Description
`startUrls`	array	List of URLs to extract. Each entry is `{ "url": "..." }`. Required.
`minWordCount`	integer	Skip articles shorter than this. Default: 0 (accept all).
`includeHtml`	boolean	Also return raw HTML. Default: false.
`maxRequestsPerCrawl`	integer	Safety cap on requests. Default: 100, max: 5000.
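If you build the input programmatically, a small helper can apply the defaults and limits from the table above before the run starts. The function name `build_input` is hypothetical, and the lower bound of 1 for `maxRequestsPerCrawl` is an assumption; only the defaults and the 5000 maximum come from the table.

```python
MAX_REQUESTS_LIMIT = 5000  # documented maximum for maxRequestsPerCrawl


def build_input(urls, min_word_count=0, include_html=False, max_requests=100):
    """Return an input object shaped like the example above."""
    # Drop empty entries and duplicates while keeping the original order.
    unique_urls = list(dict.fromkeys(u.strip() for u in urls if u.strip()))
    if not unique_urls:
        raise ValueError("startUrls is required and must contain at least one URL")
    if min_word_count < 0:
        raise ValueError("minWordCount must be 0 or greater")
    if not 1 <= max_requests <= MAX_REQUESTS_LIMIT:
        raise ValueError(f"maxRequestsPerCrawl must be between 1 and {MAX_REQUESTS_LIMIT}")

    return {
        "startUrls": [{"url": url} for url in unique_urls],
        "minWordCount": min_word_count,
        "includeHtml": include_html,
        "maxRequestsPerCrawl": max_requests,
    }
```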

Output

{
    "url": "https://www.bbc.com/news/articles/cq8v4dqj9y7o",
    "statusCode": 200,
    "title": "Major AI breakthrough announced today",
    "description": "Researchers report new advances...",
    "authors": ["Jane Doe"],
    "publishedAt": "2026-04-13T08:00:00Z",
    "modifiedAt": "2026-04-13T10:15:00Z",
    "image": "https://ichef.bbci.co.uk/news/1024/...",
    "siteName": "BBC News",
    "language": "en",
    "content": "The full cleaned body of the article...",
    "wordCount": 842,
    "readingTimeMinutes": 4,
    "keywords": ["AI", "machine learning", "research"],
    "canonicalUrl": "https://www.bbc.com/news/articles/cq8v4dqj9y7o",
    "extractionMethod": "jsonld",
    "extractedAt": "2026-04-13T19:42:17.301Z"
}

You can download the dataset in various formats such as JSON, HTML, CSV, or Excel from the Output tab.

Output fields

Field	Type	Description
`url`	string	The article URL that was requested
`statusCode`	integer	HTTP status code of the response
`title`	string	Article headline
`description`	string	Summary / subtitle
`authors`	array	List of author names
`publishedAt`	string	ISO 8601 timestamp of publication
`modifiedAt`	string	ISO 8601 timestamp of last edit
`image`	string	Lead image URL
`siteName`	string	Publisher site name
`language`	string	ISO 639 language code
`content`	string	Clean body text with noise removed
`wordCount`	integer	Number of words in the content
`readingTimeMinutes`	integer	Estimated reading time at 220 wpm
`keywords`	array	Article tags and keywords
`canonicalUrl`	string	Canonical URL from `<link rel="canonical">`
`extractionMethod`	string	Which extraction strategy succeeded (`jsonld`, `article-tag`, `readability`)
`extractedAt`	string	ISO 8601 timestamp of when the extraction ran
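For reference, `wordCount` and `readingTimeMinutes` relate roughly as in the sketch below. The 220 wpm rate comes from the table; splitting on whitespace and rounding up are assumptions, although they are consistent with the sample output (842 words gives 842 / 220 ≈ 3.83, which rounds up to 4 minutes).

```python
import math

WORDS_PER_MINUTE = 220  # rate stated in the output fields table


def text_stats(content):
    """Count whitespace-separated words and estimate reading time in minutes."""
    word_count = len(content.split())
    # Assumption: partial minutes round up; empty content reads in 0 minutes.
    minutes = math.ceil(word_count / WORDS_PER_MINUTE) if word_count else 0
    return {"wordCount": word_count, "readingTimeMinutes": minutes}
```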

How much does it cost to extract news articles?

The actor uses a Cheerio crawler (no headless browser) with 8 concurrent requests. Extracting 100 articles typically consumes a few cents of platform credit on Apify. The free tier covers thousands of extractions per month.

Tips and advanced options

Feed a sitemap - Want every article from a publisher? Pass the sitemap URLs and the extractor will process each one.
Filter noise with minWordCount - Set it to 200 or 300 to automatically skip homepages, tag pages, and author pages.
Schedule incremental crawls - Use Apify Schedules to re-run daily against an RSS feed and push new articles to your RAG database.
Integrate with LLM APIs - Chain this actor with an LLM summarization actor or a vector database webhook.
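If you prefer to expand a sitemap into individual article URLs before the run (for example, to keep only one section of a site), a minimal sketch could look like this. It assumes a standard `urlset` sitemap that you have already downloaded; the function name `sitemap_to_start_urls` and the `must_contain` filter are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_to_start_urls(sitemap_xml, must_contain=""):
    """Turn sitemap XML into a startUrls list, optionally filtered by substring."""
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]
    return [{"url": url} for url in urls if must_contain in url]
```

The returned list can be used directly as the `startUrls` value in the input.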

FAQ

Does it handle paywalled content? No. It only extracts content that is served in the public HTML. Paywalled pages will either return the preview or nothing.

Which sites are supported? Anything that serves HTML. The extractor is site-agnostic. It has been tested against BBC, TechCrunch, The Verge, NYT (public pages), Medium, Substack, WordPress blogs, and more.

Is this legal? The actor fetches publicly served HTML, the same way your browser does. It does not bypass paywalls, log in, or circumvent any access controls. You are responsible for respecting the terms of service of the sites you scrape and for complying with copyright when using extracted content.

Why not use a headless browser? Headless browsers are typically 10-20x slower and cost 10-20x more. For news and blog content, HTTP + Cheerio works on the vast majority of sites. If you need to extract content from JS-heavy sites, consider pairing this actor with a dedicated browser-based one.

Support

Found an article that fails to extract cleanly? Open an issue with the URL and we will tune the extractor.
