News Article Extractor for AI & RAG
Pricing
Pay per event
Extract clean, structured JSON from any news article or blog post - title, authors, published date, full content, keywords, images. Perfect for LLM training data, RAG pipelines, content monitoring and news aggregation. Uses JSON-LD, Open Graph and readability heuristics.
Developer
Mohieldin Mohamed
3 days ago
Last modified
Turn any news article or blog post URL into clean, structured JSON in one API call. News Article Extractor pulls the title, authors, publish date, full content, keywords, and images from any news site or blog - ready to drop straight into your LLM training pipeline, RAG system, or content database.
No more writing custom CSS selectors for every new site. No more stripping ads, nav bars, and cookie banners by hand. Paste a URL, get a clean JSON payload.
What does News Article Extractor for AI & RAG do?
This actor fetches any article URL and runs a layered extraction pipeline to get the cleanest possible text:
- JSON-LD schemas - Most news sites publish `NewsArticle`/`Article` structured data. This is the highest-fidelity source for title, author, and publish date.
- Open Graph + Twitter Cards - Fallback metadata used by virtually every modern site.
- `<article>` and `[itemprop="articleBody"]` tags - Semantic HTML extraction.
- Readability heuristics - Longest `<p>` cluster for sites that don't use any of the above.
Noise (ads, nav bars, share buttons, newsletter forms, related-article widgets, paywall prompts) is stripped before content extraction. The final output is a clean text body plus all the metadata an LLM or analytics pipeline needs.
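The layered fallback described above can be illustrated with a small standard-library sketch. This is not the actor's actual code: the class and function names are hypothetical, the `<article>` layer is omitted for brevity, and the "readability" step is simplified to joining every `<p>` element rather than picking the longest cluster.

```python
import json
from html.parser import HTMLParser


class _Collector(HTMLParser):
    """Collects JSON-LD script bodies, Open Graph meta tags, and <p> text."""

    def __init__(self):
        super().__init__()
        self.jsonld_blocks = []  # raw text of <script type="application/ld+json">
        self.og = {}             # e.g. {"og:title": "..."}
        self.paragraphs = []     # text content of each <p>
        self._in_jsonld = False
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_jsonld, self._buf = True, []
        elif tag == "meta" and (attrs.get("property") or "").startswith("og:"):
            self.og[attrs["property"]] = attrs.get("content") or ""
        elif tag == "p":
            self._in_p, self._buf = True, []

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.jsonld_blocks.append("".join(self._buf))
            self._in_jsonld = False
        elif tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._buf).strip())
            self._in_p = False

    def handle_data(self, data):
        if self._in_jsonld or self._in_p:
            self._buf.append(data)


def _first_article(jsonld_blocks):
    """Return the first JSON-LD object typed NewsArticle or Article, if any."""
    for raw in jsonld_blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON-LD is common; skip it
        for item in data if isinstance(data, list) else [data]:
            if not isinstance(item, dict):
                continue
            types = item.get("@type")
            types = types if isinstance(types, list) else [types]
            if "NewsArticle" in types or "Article" in types:
                return item
    return None


def extract_article(html):
    """Try JSON-LD first, then Open Graph for the title, then <p> text for the body."""
    collector = _Collector()
    collector.feed(html)
    article = _first_article(collector.jsonld_blocks) or {}
    title = article.get("headline") or collector.og.get("og:title")
    content = article.get("articleBody")
    method = "jsonld"
    if not content:
        # Simplified stand-in for the readability layer: join all paragraphs.
        content = "\n\n".join(p for p in collector.paragraphs if p)
        method = "readability"
    return {"title": title, "content": content, "extractionMethod": method}
```

Ordering the layers from most to least structured means the noisier heuristics only run when the publisher's own metadata is missing.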
Why use News Article Extractor?
- RAG pipelines - Ingest articles into vector databases without cleanup work. Every output already has a word count, reading time, and canonical URL.
- LLM fine-tuning - Build high-quality training datasets of article bodies stripped of boilerplate.
- Content monitoring - Track what a publisher is posting over time and pipe it into your analytics stack.
- News aggregators - Build a Feedly clone or topic-tracking dashboard without scraping each site individually.
- Sentiment analysis - Get clean text inputs for your NLP models without fighting site-specific HTML.
- SEO research - Extract every competitor article on a topic and analyze their structure, word counts, and keywords.
Built on the Apify platform: scheduling, API access, proxy rotation, webhook integrations, and monitoring are included.
How to use News Article Extractor for AI & RAG
- Click Try for free and sign in to Apify
- Paste the article URLs you want to extract into the Article URLs field
- (Optional) Set a Minimum word count to skip homepages and category listings
- Click Start - the actor processes URLs in parallel
- Open the Output tab to view or download results
You can also trigger the actor from your own code via the Apify API - pass a list of URLs in the JSON body and poll for results.
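As a rough sketch of that flow, the snippet below starts a run, polls until it finishes, and reads the dataset, using only the Python standard library. The endpoint paths follow the Apify API v2 as I understand it, so check them against the official API reference before relying on them. The function names are illustrative, and the actor ID is not stated on this page, so it has to be supplied by the caller.

```python
import json
import time
import urllib.request

API_BASE = "https://api.apify.com/v2"
# Run statuses after which polling should stop.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}


def _request(method, url, token, payload=None):
    """Send an authenticated JSON request and return the decoded response."""
    body = None if payload is None else json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(url, data=body, method=method)
    req.add_header("Authorization", f"Bearer {token}")
    if body is not None:
        req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_url(actor_id):
    """URL for starting a run; actor_id looks like 'username~actor-name'."""
    return f"{API_BASE}/acts/{actor_id}/runs"


def extract_articles(actor_id, token, urls, poll_seconds=5):
    """Start a run with the given article URLs, wait for it, return dataset items."""
    run_input = {"startUrls": [{"url": u} for u in urls]}
    run = _request("POST", run_url(actor_id), token, run_input)["data"]
    while run["status"] not in TERMINAL_STATUSES:
        time.sleep(poll_seconds)
        run = _request("GET", f"{API_BASE}/actor-runs/{run['id']}", token)["data"]
    if run["status"] != "SUCCEEDED":
        raise RuntimeError(f"Run ended with status {run['status']}")
    items_url = f"{API_BASE}/datasets/{run['defaultDatasetId']}/items?format=json"
    return _request("GET", items_url, token)
```

Calling `extract_articles("<username>~<actor-name>", token, ["https://example.com/post"])` would return a list of records, one per extracted article. The official `apify-client` package wraps the same steps if you prefer not to poll by hand.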
Input
```json
{
  "startUrls": [
    { "url": "https://www.bbc.com/news/articles/cq8v4dqj9y7o" },
    { "url": "https://techcrunch.com/2026/04/12/ai-roundup" }
  ],
  "minWordCount": 200,
  "includeHtml": false,
  "maxRequestsPerCrawl": 100
}
```
| Field | Type | Description |
|---|---|---|
| `startUrls` | array | List of URLs to extract. Each entry is `{ "url": "..." }`. Required. |
| `minWordCount` | integer | Skip articles shorter than this. Default: 0 (accept all). |
| `includeHtml` | boolean | Also return raw HTML. Default: false. |
| `maxRequestsPerCrawl` | integer | Safety cap on requests. Default: 100, max: 5000. |
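If you build the input programmatically, a small helper can apply the documented defaults and catch out-of-range values before a run starts. This is a hypothetical client-side convenience, not part of the actor. The lower bound of 1 on `maxRequestsPerCrawl` and the rejection of negative word counts are assumptions; only the defaults and the 5000 maximum come from the table above.

```python
def build_input(urls, min_word_count=0, include_html=False, max_requests=100):
    """Build an actor input dict using the defaults listed in the input table."""
    if not urls:
        raise ValueError("startUrls is required and must not be empty")
    if min_word_count < 0:
        raise ValueError("minWordCount must be zero or greater")  # assumed rule
    if not 1 <= max_requests <= 5000:  # upper bound is documented; lower is assumed
        raise ValueError("maxRequestsPerCrawl must be between 1 and 5000")
    return {
        "startUrls": [{"url": u} for u in urls],
        "minWordCount": min_word_count,
        "includeHtml": include_html,
        "maxRequestsPerCrawl": max_requests,
    }
```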
Output
```json
{
  "url": "https://www.bbc.com/news/articles/cq8v4dqj9y7o",
  "statusCode": 200,
  "title": "Major AI breakthrough announced today",
  "description": "Researchers report new advances...",
  "authors": ["Jane Doe"],
  "publishedAt": "2026-04-13T08:00:00Z",
  "modifiedAt": "2026-04-13T10:15:00Z",
  "image": "https://ichef.bbci.co.uk/news/1024/...",
  "siteName": "BBC News",
  "language": "en",
  "content": "The full cleaned body of the article...",
  "wordCount": 842,
  "readingTimeMinutes": 4,
  "keywords": ["AI", "machine learning", "research"],
  "canonicalUrl": "https://www.bbc.com/news/articles/cq8v4dqj9y7o",
  "extractionMethod": "jsonld",
  "extractedAt": "2026-04-13T19:42:17.301Z"
}
```
You can download the dataset in various formats such as JSON, HTML, CSV, or Excel from the Output tab.
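For RAG use, a record shaped like the sample output is usually split into overlapping chunks before embedding. The sketch below is one way to do that. The function name, the chunk sizes, and the choice of metadata fields are illustrative assumptions, not something the actor does for you.

```python
def chunk_article(record, chunk_words=200, overlap=40):
    """Split record['content'] into overlapping word windows with metadata."""
    if not 0 <= overlap < chunk_words:
        raise ValueError("overlap must be non-negative and smaller than chunk_words")
    words = record["content"].split()
    step = chunk_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_words]
        chunks.append({
            "text": " ".join(piece),
            "metadata": {
                "url": record.get("canonicalUrl") or record.get("url"),
                "title": record.get("title"),
                "publishedAt": record.get("publishedAt"),
                "chunkIndex": len(chunks),
            },
        })
        if start + chunk_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Keeping the URL and publish date on every chunk lets a retriever cite the source article without a second lookup.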
Output fields
| Field | Type | Description |
|---|---|---|
| `url` | string | The canonical URL of the article |
| `statusCode` | integer | HTTP status code returned when the page was fetched |
| `title` | string | Article headline |
| `description` | string | Summary / subtitle |
| `authors` | array | List of author names |
| `publishedAt` | string | ISO timestamp of publication |
| `modifiedAt` | string | ISO timestamp of last edit |
| `image` | string | Lead image URL |
| `siteName` | string | Publisher site name |
| `language` | string | ISO 639 language code |
| `content` | string | Clean body text with noise removed |
| `wordCount` | integer | Number of words in the content |
| `readingTimeMinutes` | integer | Estimated reading time at 220 wpm |
| `keywords` | array | Article tags and keywords |
| `canonicalUrl` | string | Canonical URL from `<link rel="canonical">` |
| `extractionMethod` | string | Which extraction strategy succeeded (`jsonld`, `article-tag`, `readability`) |
| `extractedAt` | string | ISO timestamp of when the article was extracted |
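The relationship between `wordCount` and `readingTimeMinutes` can be reproduced in a few lines. This is a sketch under two assumptions: words are counted by splitting on whitespace, and the result is rounded up to a whole minute. The actor's exact counting and rounding rules are not documented here.

```python
import math

WORDS_PER_MINUTE = 220  # reading speed stated in the output fields table


def reading_stats(content):
    """Return word count and estimated reading time for a block of text."""
    word_count = len(content.split())
    minutes = math.ceil(word_count / WORDS_PER_MINUTE) if word_count else 0
    return {"wordCount": word_count, "readingTimeMinutes": minutes}
```

With the sample output above, 842 words at 220 wpm is about 3.8 minutes, which matches the reported value of 4 under either rounding up or rounding to nearest.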
How much does it cost to extract news articles?
The actor uses a Cheerio crawler (no headless browser) with 8 concurrent requests. Extracting 100 articles typically consumes a few cents of platform credit on Apify. The free tier covers thousands of extractions per month.
Tips and advanced options
- Feed a sitemap - Want every article from a publisher? Pass the sitemap URLs and the extractor will process each one.
- Filter noise with `minWordCount` - Set it to 200 or 300 to automatically skip homepages, tag pages, and author pages.
- Schedule incremental crawls - Use Apify Schedules to re-run daily against an RSS feed and push new articles to your RAG database.
- Integrate with LLM APIs - Chain this actor with an LLM summarization actor or a vector database webhook.
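The sitemap tip does not say whether the actor expands sitemap files itself. If it does not, one option is to expand the sitemap on your side and pass the article URLs as `startUrls`. The sketch below assumes a standard `urlset` sitemap that you have already downloaded. The function name and the `limit` parameter are illustrative.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_to_start_urls(sitemap_xml, limit=100):
    """Convert the <loc> entries of a urlset sitemap into startUrls entries."""
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]
    # Cap the list so it stays within the maxRequestsPerCrawl you plan to use.
    return [{"url": u} for u in urls[:limit]]
```

Sitemap index files (a `sitemapindex` root that points to other sitemaps) are not handled here and would need one more level of fetching.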
FAQ
Does it handle paywalled content? No. It only extracts content that is served in the public HTML. Paywalled pages will either return the preview or nothing.
Which sites are supported? Anything that serves HTML. The extractor is site-agnostic. It has been tested against BBC, TechCrunch, The Verge, NYT (public pages), Medium, Substack, WordPress blogs, and more.
Is this legal? The actor fetches publicly served HTML, the same way your browser does. It does not bypass paywalls, log in, or circumvent any access controls. You are responsible for respecting the terms of service of the sites you scrape and for complying with copyright when using extracted content.
Why not use a headless browser? Headless browsers are 10-20x slower and cost 10-20x more. For news and blog content, HTTP + Cheerio works on the vast majority of sites. If you need to extract from JavaScript-heavy sites that render their content client-side, consider pairing this actor with a dedicated browser-based one.
Support
Found an article that fails to extract cleanly? Open an issue with the URL and we will tune the extractor.