Smart Article & Blog Extractor

Pricing

from $0.50 / 1,000 results

Extract clean text, author, title, and reading time from any news, blog, or article webpage. Perfect for AI/LLM training and RAG systems.

Rating: 0.0 (0)

Developer: Tan Yegen (Maintained by Community)

Actor stats: 0 bookmarked, 2 total users, 1 monthly active user, last modified 21 hours ago

🧠 Smart Article & Blog Extractor

The ultimate tool for LLMs, RAG pipelines, and Content Analyzers. Extract clean, ad-free text from any news site, blog, or article in seconds.

Why this Actor?

When you train AI models or build RAG (Retrieval-Augmented Generation) systems, you don't want menus, sidebars, cookie popups, or footer links ruining your dataset. You only want the Title, Author, and the actual Content.

This actor uses Mozilla's powerful Readability algorithm (the same engine that powers Firefox's Reader View) to automatically strip away all the junk and give you a beautifully clean text output.
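To make the idea of stripping page chrome concrete, here is a deliberately simplified sketch in Python. It is not Mozilla's Readability and not this Actor's code: it only drops text inside a fixed set of tags (an assumption chosen for illustration), whereas Readability scores candidate content blocks. The names `MainTextParser`, `SKIPPED_TAGS`, and `extract_main_text` are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these tags usually hold navigation, ads, or scripts rather than article text.
SKIPPED_TAGS = {"script", "style", "nav", "aside", "footer", "header", "form"}


class MainTextParser(HTMLParser):
    """Collects text that is not nested inside any tag listed in SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # how many skipped tags we are currently inside
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside every skipped tag.
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return " ".join(self._parts)


def extract_main_text(html):
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><body><nav>Home About</nav>"
    "<article><h1>Title</h1><p>Hello world.</p></article>"
    "<script>var x = 1;</script><footer>Contact us</footer></body></html>"
)
print(extract_main_text(sample))  # Title Hello world.
```

A tag blocklist like this breaks down on real pages, where sidebars and popups are often plain `div` elements. That gap is why a content-scoring approach such as Readability is used instead.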

Advantages:

  • Universal: Works on Medium, TechCrunch, WordPress blogs, Substack, CNN, NYTimes, and most other standard article pages.
  • Ultra-Fast: Fetches pages over plain HTTP (CheerioCrawler) instead of a browser, typically extracting an article in under a second.
  • Cost-Effective: Because it doesn't open heavy browsers, your Apify Compute Unit (CU) usage stays very low.

💰 Pricing: Pay-Per-Result

We charge only $0.50 per 1,000 articles extracted.
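As a rough budgeting aid, the per-result rate can be turned into a small calculation. The sketch below assumes the charge scales linearly at the stated $0.50 per 1,000 results, with no minimum fee or volume tiers, and it ignores any separate platform or proxy usage. The function name `estimate_cost` is hypothetical.

```python
from decimal import Decimal

# Rate taken from the listing: $0.50 per 1,000 extracted articles.
PRICE_PER_1000_RESULTS = Decimal("0.50")


def estimate_cost(article_count):
    """Return the estimated result charge in USD, assuming linear pricing."""
    if article_count < 0:
        raise ValueError("article_count must be non-negative")
    cost = PRICE_PER_1000_RESULTS * article_count / 1000
    return cost.quantize(Decimal("0.0001"))


print(estimate_cost(25_000))  # 12.5000
```

`Decimal` is used rather than `float` so that small per-article amounts do not pick up binary rounding noise.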

📥 Input Schema

Field | Type | Description
startUrls | Array | A list of article or blog URLs you want to extract.
proxyConfiguration | Object | Standard Apify proxy settings to bypass IP blocks.
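The following Python sketch shows one way to assemble an input object with these two fields. The listing does not show the exact shape of each entry, so two things here are assumptions: that `startUrls` items are objects with a `url` key (a common Apify convention), and that `proxyConfiguration` takes the usual `useApifyProxy` flag. The helper name `build_run_input` is hypothetical.

```python
import json
from urllib.parse import urlparse


def build_run_input(urls, use_apify_proxy=True):
    """Build an input dict containing the two documented fields."""
    start_urls = []
    for url in urls:
        url = url.strip()
        parsed = urlparse(url)
        # Reject anything that is not an absolute http(s) URL before starting a run.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"Not an absolute http(s) URL: {url!r}")
        # Assumption: each entry is an object with a "url" key.
        start_urls.append({"url": url})
    return {
        "startUrls": start_urls,
        # Assumption: the standard Apify proxy input shape.
        "proxyConfiguration": {"useApifyProxy": use_apify_proxy},
    }


run_input = build_run_input(["https://techcrunch.com/2023/12/20/example-article/"])
print(json.dumps(run_input, indent=2))
```

Validating URLs up front keeps malformed entries out of the run. Check the Actor's input tab for the authoritative schema before relying on this shape.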

📤 Output Schema

For each URL, the actor will produce a clean JSON object.

{
  "url": "https://techcrunch.com/2023/12/20/example-article/",
  "title": "The Future of Artificial Intelligence",
  "author": "Jane Doe",
  "publishedTime": "2023-12-20T10:00:00Z",
  "siteName": "TechCrunch",
  "textContent": "Artificial intelligence has been evolving rapidly... (clean text continues)",
  "readingTimeMins": 4,
  "scrapedAt": "2026-04-30T17:30:00.000Z"
}
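For a RAG pipeline, a record like this would typically be split into smaller passages before indexing. The sketch below is one possible approach in Python, using the field names from the sample above. It assumes `author` may be missing on some pages, and the 200-word chunk size is an arbitrary illustrative default. `Article`, `parse_article`, and `chunk_text` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Article:
    url: str
    title: str
    author: Optional[str]  # assumption: not every page exposes an author
    text: str


def parse_article(raw_json):
    """Parse one output record into an Article, using the sample's field names."""
    record = json.loads(raw_json)
    return Article(
        url=record["url"],
        title=record["title"],
        author=record.get("author"),
        text=record["textContent"],
    )


def chunk_text(text, max_words=200):
    """Split text into consecutive chunks of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


raw = json.dumps({
    "url": "https://techcrunch.com/2023/12/20/example-article/",
    "title": "The Future of Artificial Intelligence",
    "author": "Jane Doe",
    "textContent": "Artificial intelligence has been evolving rapidly",
})
article = parse_article(raw)
print(chunk_text(article.text, max_words=3))
# ['Artificial intelligence has', 'been evolving rapidly']
```

Word-count chunking is the simplest option. Splitting on sentences or by token count may suit a particular embedding model better.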

Start extracting clean knowledge today!