Smart Article & Blog Extractor
Extract clean text, author, title, and reading time from any news, blog, or article webpage. Perfect for AI/LLM training and RAG systems.
Pricing
from $0.50 / 1,000 results
Developer
Tan Yegen
🧠 Smart Article & Blog Extractor
The ultimate tool for LLMs, RAG pipelines, and Content Analyzers. Extract clean, ad-free text from any news site, blog, or article in seconds.
Why this Actor?
When you train AI models or build RAG (Retrieval-Augmented Generation) systems, you don't want menus, sidebars, cookie popups, or footer links ruining your dataset. You only want the Title, Author, and the actual Content.
This actor uses Mozilla's powerful Readability algorithm (the same engine that powers Firefox's Reader View) to automatically strip away all the junk and give you a beautifully clean text output.
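To make the idea of stripping page chrome concrete, here is a deliberately crude sketch using only Python's standard library. It is not Readability and not this Actor's code: Readability scores candidate content nodes, while this toy simply drops everything inside a fixed set of tags. The names `ToyArticleParser`, `toy_extract`, and `SKIP_TAGS` are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser

# Tags treated as page chrome in this toy example (an assumption, not Readability's rules).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class ToyArticleParser(HTMLParser):
    """Collects the <title> and the text of <p> elements outside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0    # > 0 while inside a skipped tag
        self.in_title = False
        self.title = ""
        self.paragraphs = []
        self._buf = None       # text pieces of the <p> currently being read

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "title":
            self.in_title = True
        elif tag == "p" and self.skip_depth == 0:
            self._buf = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "title":
            self.in_title = False
        elif tag == "p" and self._buf is not None:
            # Collapse runs of whitespace before storing the paragraph.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._buf = None

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif self._buf is not None and self.skip_depth == 0:
            self._buf.append(data)


def toy_extract(html: str) -> dict:
    """Return a dict with 'title' and 'textContent' for an HTML string."""
    parser = ToyArticleParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "textContent": "\n\n".join(parser.paragraphs),
    }


SAMPLE_HTML = """
<html><head><title>Hello</title></head><body>
<nav><p>Home | About</p></nav>
<article><p>First   paragraph.</p><p>Second <b>bold</b> one.</p></article>
<footer><p>Copyright</p></footer>
</body></html>
"""

if __name__ == "__main__":
    print(toy_extract(SAMPLE_HTML))
```

On real pages a fixed tag list like this misses a great deal (div-based sidebars, cookie banners, comment sections), which is the gap a content-scoring approach such as Readability is meant to close.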
Advantages:
- Universal: Works on Medium, TechCrunch, WordPress blogs, Substack, CNN, NYTimes, and 99% of other article pages.
- Ultra-Fast: Uses HTTP requests (CheerioCrawler), extracting articles in less than a second per page.
- Cost-Effective: Because it doesn't open heavy browsers, your Apify Compute Unit (CU) costs are practically zero.
💰 Pricing: Pay-Per-Result
We charge only $0.50 per 1,000 articles extracted.
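As a quick budgeting aid, the per-result rate can be turned into a small estimate function. This is a sketch that assumes the flat $0.50 per 1,000 rate quoted here applies to every extracted article; the listing says "from $0.50", so the effective rate may differ by plan. The function name `estimate_cost_usd` is hypothetical.

```python
PRICE_PER_1000_USD = 0.50  # rate quoted in this README; assumed flat for this sketch


def estimate_cost_usd(num_articles: int) -> float:
    """Estimate the pay-per-result charge for a number of extracted articles."""
    if num_articles < 0:
        raise ValueError("num_articles must be non-negative")
    return num_articles / 1000 * PRICE_PER_1000_USD


if __name__ == "__main__":
    for n in (1_000, 25_000, 1_000_000):
        print(f"{n:>9,} articles -> ${estimate_cost_usd(n):,.2f}")
```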
📥 Input Schema
| Field | Type | Description |
|---|---|---|
| startUrls | Array | A list of article or blog URLs you want to extract. |
| proxyConfiguration | Object | Standard Apify proxy settings to bypass IP blocks. |
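The two fields above can be assembled into an input object along these lines. This is a sketch under assumptions: the README does not show the exact shape, so the `{"url": ...}` entry format for startUrls and the `useApifyProxy` key follow common Apify input conventions rather than anything confirmed here. The helper name `build_actor_input` is hypothetical.

```python
import json


def build_actor_input(urls, use_apify_proxy=True):
    """Build an input dict with the two documented fields.

    Assumptions (not confirmed by this README): each startUrls entry is an
    object with a "url" key, and proxyConfiguration accepts "useApifyProxy".
    """
    cleaned = [u.strip() for u in urls if u and u.strip()]
    if not cleaned:
        raise ValueError("at least one URL is required")
    return {
        "startUrls": [{"url": u} for u in cleaned],
        "proxyConfiguration": {"useApifyProxy": use_apify_proxy},
    }


if __name__ == "__main__":
    actor_input = build_actor_input(
        ["https://techcrunch.com/2023/12/20/example-article/"]
    )
    print(json.dumps(actor_input, indent=2))
```

The printed JSON can be pasted into the Actor's input editor, adjusting the shape if the Actor's actual input schema differs.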
📤 Output Schema
For each URL, the actor will produce a clean JSON object.
{
  "url": "https://techcrunch.com/2023/12/20/example-article/",
  "title": "The Future of Artificial Intelligence",
  "author": "Jane Doe",
  "publishedTime": "2023-12-20T10:00:00Z",
  "siteName": "TechCrunch",
  "textContent": "Artificial intelligence has been evolving rapidly... (clean text continues)",
  "readingTimeMins": 4,
  "scrapedAt": "2026-04-30T17:30:00.000Z"
}
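For RAG use, a record like the one above is typically split into smaller passages before embedding. The sketch below shows one possible approach, overlapping word-window chunking, using only the `url`, `title`, and `textContent` fields from the output schema. The function name `chunk_record` and the default window sizes are assumptions chosen for illustration, not part of the Actor.

```python
def chunk_record(record, max_words=200, overlap=20):
    """Split a record's textContent into overlapping word windows.

    Each chunk carries the source url and title so retrieved passages
    can be traced back to their article.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = record["textContent"].split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + max_words]
        chunks.append({
            "url": record["url"],
            "title": record["title"],
            "chunkIndex": len(chunks),
            "text": " ".join(piece),
        })
        # Stop once this window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = {
        "url": "https://techcrunch.com/2023/12/20/example-article/",
        "title": "The Future of Artificial Intelligence",
        "textContent": " ".join(f"w{i}" for i in range(450)),  # placeholder text
    }
    for chunk in chunk_record(sample):
        print(chunk["chunkIndex"], len(chunk["text"].split()))
```

Word windows are the simplest choice; splitting on paragraph or sentence boundaries usually gives more coherent passages when the text preserves them.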
Start extracting clean knowledge today!