Pricing

from $0.50 / 1,000 results

Smart Article & Blog Extractor

Extract clean text, author, title, and reading time from any news, blog, or article webpage. Perfect for AI/LLM training and RAG systems.

Rating

0.0

(0)

Developer

Tan Yegen

Last modified: 21 hours ago

🧠 Smart Article & Blog Extractor

The ultimate tool for LLMs, RAG pipelines, and Content Analyzers. Extract clean, ad-free text from any news site, blog, or article in seconds.

Why this Actor?

When you train AI models or build RAG (Retrieval-Augmented Generation) systems, you don't want menus, sidebars, cookie popups, or footer links ruining your dataset. You only want the Title, Author, and the actual Content.

This actor uses Mozilla's powerful Readability algorithm (the same engine that powers Firefox's Reader View) to automatically strip away all the junk and give you a beautifully clean text output.

Advantages:

Universal: Works on Medium, TechCrunch, WordPress blogs, Substack, CNN, NYTimes, and most other article pages.
Ultra-Fast: Uses plain HTTP requests (CheerioCrawler) instead of a browser, typically extracting an article in under a second per page.
Cost-Effective: Because it doesn't launch a heavy browser, your Apify Compute Unit (CU) usage stays very low.

💰 Pricing: Pay-Per-Result

We charge only $0.50 per 1,000 articles extracted.
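As a rough budgeting aid, the pay-per-result rate above can be turned into a small calculation. This is a minimal sketch that assumes the flat $0.50 per 1,000 articles stated here; the function name `estimate_cost_usd` is hypothetical, and any platform fees or plan discounts are not modeled.

```python
def estimate_cost_usd(num_articles: int, price_per_1000: float = 0.50) -> float:
    """Estimate the pay-per-result charge for a number of extracted articles.

    Assumes a flat rate (default: $0.50 per 1,000 results, as quoted above).
    """
    if num_articles < 0:
        raise ValueError("num_articles must be non-negative")
    # Scale the per-1,000 price down to the requested article count.
    return round(num_articles * price_per_1000 / 1000, 4)


if __name__ == "__main__":
    for count in (1_000, 25_000, 100_000):
        print(f"{count:>7} articles: ${estimate_cost_usd(count):.2f}")
```

For example, under this assumption 25,000 articles come to $12.50.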

📥 Input Schema

Field	Type	Description
`startUrls`	Array	A list of article or blog URLs you want to extract.
`proxyConfiguration`	Object	Standard Apify proxy settings to bypass IP blocks.
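The two input fields above might be assembled like this before starting a run. This is a sketch under assumptions: the table does not say whether `startUrls` takes plain strings or `{"url": ...}` objects, so the object form common to many Apify Actors is assumed, and `useApifyProxy` is the usual key in Apify proxy settings. The helper name `build_run_input` is hypothetical.

```python
import json
from typing import Any


def build_run_input(urls: list[str], use_apify_proxy: bool = True) -> dict[str, Any]:
    """Build an input payload with the `startUrls` and `proxyConfiguration` fields."""
    cleaned: list[str] = []
    seen: set[str] = set()
    for raw in urls:
        url = raw.strip()
        # Reject anything that is not an absolute HTTP(S) URL.
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"not an absolute http(s) URL: {raw!r}")
        # Drop duplicates while keeping the original order.
        if url not in seen:
            seen.add(url)
            cleaned.append(url)
    return {
        # Assumption: each start URL is wrapped as {"url": ...}.
        "startUrls": [{"url": u} for u in cleaned],
        # Assumption: standard Apify proxy settings object.
        "proxyConfiguration": {"useApifyProxy": use_apify_proxy},
    }


if __name__ == "__main__":
    payload = build_run_input(["https://techcrunch.com/2023/12/20/example-article/"])
    print(json.dumps(payload, indent=2))
```

The resulting dictionary can be pasted into the Actor's JSON input editor or passed as the run input from an API client.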

📤 Output Schema

For each URL, the actor will produce a clean JSON object.

{
  "url": "https://techcrunch.com/2023/12/20/example-article/",
  "title": "The Future of Artificial Intelligence",
  "author": "Jane Doe",
  "publishedTime": "2023-12-20T10:00:00Z",
  "siteName": "TechCrunch",
  "textContent": "Artificial intelligence has been evolving rapidly... (clean text continues)",
  "readingTimeMins": 4,
  "scrapedAt": "2026-04-30T17:30:00.000Z"
}
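For RAG use, a record shaped like the one above would typically be split into smaller chunks that carry their source metadata. The sketch below assumes only the field names shown in the sample output; the chunk size, the 200 words-per-minute reading speed, and the helper names are illustrative assumptions, and the Actor's own `readingTimeMins` formula is not documented here.

```python
import math
from typing import Any


def reading_time_mins(text: str, words_per_minute: int = 200) -> int:
    """Approximate reading time; 200 wpm is an assumed speed, not the Actor's."""
    return math.ceil(len(text.split()) / words_per_minute)


def to_rag_chunks(record: dict[str, Any], max_words: int = 200) -> list[dict[str, Any]]:
    """Split `textContent` into word-bounded chunks tagged with source metadata."""
    for field in ("url", "title", "textContent"):
        if not record.get(field):
            raise ValueError(f"record is missing required field: {field}")
    words = record["textContent"].split()
    chunks = []
    for start in range(0, len(words), max_words):
        chunks.append({
            "url": record["url"],
            "title": record["title"],
            "author": record.get("author"),  # may be absent on some pages
            "chunkIndex": len(chunks),
            "text": " ".join(words[start:start + max_words]),
        })
    return chunks


if __name__ == "__main__":
    sample = {
        "url": "https://techcrunch.com/2023/12/20/example-article/",
        "title": "The Future of Artificial Intelligence",
        "author": "Jane Doe",
        "textContent": "Artificial intelligence has been evolving rapidly " * 100,
    }
    pieces = to_rag_chunks(sample)
    print(len(pieces), "chunks;", reading_time_mins(sample["textContent"]), "min read")
```

Splitting on word counts keeps the example dependency-free; a production pipeline would more likely chunk by tokens or by sentence boundaries.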

Start extracting clean knowledge today!

Article Extractor & News Scraper

web.harvester/article-extractor-news-scraper

Extract articles from any news site, blog, or webpage. Get title, full text, author, date, images & metadata using 7 extraction engines (Newspaper4k, Trafilatura, Goose3). Anti-bot bypass, proxy rotation, automatic fallback. Perfect for news monitoring, NLP datasets & content aggregation.

Web Harvester

5.0

(2)

Blog Scraper

naive_zing/blog-scraper

Company Blog Scraper, Blog Post Scraper, Corporate Blog Crawler, Automatic Blog Discovery, Blog Content Extractor, Article Metadata Scraper, Multi-Domain Blog Scraper, Competitor Blog Analysis, Content Marketing Scraper, Blog Post Metadata Extraction, Company Announcements Scraper.

Wyald

Universal Article Scraper

universal_scraping/universal-article-scraper

Universal article scraper for news websites, blogs, etc. It can scrape articles from multiple websites simultaneously, including metadata such as title, content, publication date, image, and author.

Michael Novak

5.0

(1)

AI Blog Dataset Creator

datapilot/ai-blog-dataset-creator

Smart Article Scraper Actor extracts structured article data from URLs using, and Newspaper3k. It collects title, author, publish date, tags, full content, language, and word count. Supports proxy usage, JavaScript-rendered pages, and outputs clean JSON datasets.

Data Pilot

Article Extraction API

tugelbay/article-extractor

Convert article URLs to clean Markdown, text, or HTML for RAG and LLM pipelines. Extract title, author, date, images, links, word count, canonical URL, Open Graph, and JSON-LD metadata while removing ads and boilerplate.

Tugelbay Konabayev

Website to RSS Feed Generator

junipr/website-to-rss

Convert any website to RSS 2.0. Smart content detection finds articles automatically. CSS selectors for custom targeting. Configurable field mapping. Schedule for auto updates. Output as valid RSS XML.

junipr

Smart Article Extractor

parseforge/article-extractor

Extract clean article content from any news, blog, or publisher site! Pull full body text, author, publish date, word count, language, reading time, images, and metadata at scale. Ideal for content research, media monitoring, SEO audits, and AI training. Start extracting articles in minutes!

ParseForge

Universal RAG Web Scraper

express_kingfisher/rag-web-scraper

Turn any website into clean, LLM-ready Markdown. Automatically strips ads, navigation, and noise using Mozilla Readability. Perfect for feeding data to ChatGPT, Claude, or Vector Databases (RAG).

Prince Raj

Smart Article Extractor

lukaskrivka/article-extractor-smart

📰 Smart Article Extractor extracts articles from any scientific, academic, or news website with just one click. The extractor crawls the whole website and automatically distinguishes articles from other web pages. Download your data as HTML table, JSON, Excel, RSS feed, and more.

Lukáš Křivka

7.4K

4.1

(9)

Website Metadata Extractor(sitemap, socialLinks, robotsTxt)

codescraper/website-metadata-extractor

A very fast metadata extractor to get all meta tags, robots.txt, sitemaps, social links, H1s, word count, and JSON-LD data. Also provides technology detection for a full analysis. Get your data fast for just $3/month.