Pricing

from $40.00 / 1,000 results

Smart Article Extractor

Extract clean article content from any news, blog, or publisher site! Pull full body text, author, publish date, word count, language, reading time, images, and metadata at scale. Ideal for content research, media monitoring, SEO audits, and AI training. Start extracting articles in minutes!

Rating

0.0

(0)

Developer

ParseForge

Actor stats

Last modified

2 days ago

📰 Smart Article Extractor

🚀 Parse any news article or blog post into clean structured text in seconds. Get more than 20 metadata fields per article, including authors, tags, publish date, lead image, paywall flag, and reading time. No API key, no registration, no manual parser maintenance.

Pull structured records from the Smart Article Extractor, with clean fields ready to export as CSV, JSON, JSONL, Excel, or XML for downstream pipelines.

Copy to your AI assistant

Copy this block into ChatGPT, Claude, Cursor, or any LLM to start using this actor.

parseforge/article-extractor on Apify. Call: ApifyClient("TOKEN").actor("parseforge/article-extractor").call(run_input={...}), then client.dataset(run["defaultDatasetId"]).list_items().items for results. Key inputs: startUrls (array, default [{"url": "https://www.bbc.com/news/articles/c86w8elez74o"}]), maxItems (integer, default 10). Full actor spec: fetch build via GET https://api.apify.com/v2/acts/parseforge~article-extractor (Bearer TOKEN). Get token: https://console.apify.com/account/integrations
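
The call described in the block above can be sketched in Python roughly as follows. This is a minimal sketch, assuming the `apify-client` package is installed. The helper names `build_run_input` and `run_extractor` are illustrative and not part of the actor; the client is passed in as an argument so the logic can be exercised without network access.

```python
from typing import Any

ACTOR_ID = "parseforge/article-extractor"


def build_run_input(urls: list[str], max_items: int = 10) -> dict[str, Any]:
    """Build the actor input from the two documented keys: startUrls and maxItems."""
    return {"startUrls": [{"url": url} for url in urls], "maxItems": max_items}


def run_extractor(client: Any, urls: list[str], max_items: int = 10) -> list[dict[str, Any]]:
    """Start a run, wait for it to finish, and return the dataset items.

    `client` is expected to behave like apify_client.ApifyClient.
    """
    run = client.actor(ACTOR_ID).call(run_input=build_run_input(urls, max_items))
    return client.dataset(run["defaultDatasetId"]).list_items().items
```

With a real token, this would be used as `run_extractor(ApifyClient("TOKEN"), ["https://www.bbc.com/news/articles/c86w8elez74o"])` after `from apify_client import ApifyClient`.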

The Smart Article Extractor takes any article URL and returns the main body as clean Markdown alongside 22 metadata fields. It scores DOM nodes by paragraph count, word count, and link density to identify the main content block, then strips navigation, sidebars, and ads. Author, tags, section, publishedAt, modifiedAt, and canonical URL are pulled from meta tags, JSON-LD, and itemprop attributes.
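
To make the scoring idea concrete, here is a small illustrative sketch of readability-style scoring over the three signals named above. It is not the actor's actual implementation: the `NodeStats` structure, the weights, and the sample numbers are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    """Text statistics for one candidate DOM node (hypothetical structure)."""
    name: str
    paragraph_count: int
    word_count: int
    link_word_count: int  # words that sit inside <a> tags


def link_density(node: NodeStats) -> float:
    """Share of the node's words that are link text (0.0 for an empty node)."""
    return node.link_word_count / node.word_count if node.word_count else 0.0


def score(node: NodeStats) -> float:
    """Reward paragraphs and words, then discount link-heavy nodes.

    The weights (25 per paragraph, 1 per word) are illustrative assumptions.
    """
    raw = 25 * node.paragraph_count + node.word_count
    return raw * (1.0 - link_density(node))


def pick_main_block(nodes: list[NodeStats]) -> NodeStats:
    """Return the highest-scoring candidate."""
    return max(nodes, key=score)


# Made-up statistics for three typical page regions.
candidates = [
    NodeStats("nav", paragraph_count=0, word_count=40, link_word_count=38),
    NodeStats("article", paragraph_count=12, word_count=900, link_word_count=30),
    NodeStats("sidebar", paragraph_count=2, word_count=120, link_word_count=90),
]
print(pick_main_block(candidates).name)  # prints "article"
```

Navigation and sidebars tend to lose under this kind of scheme because most of their words are link text, which is why link density is applied as a discount rather than as a separate threshold.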

Extras include a paywall-detection heuristic, inline image collection, lead image (Open Graph), language detection, word count, and reading time. Concurrent fetching keeps 10 articles flying in parallel, so a list of 100 news URLs finishes in about 15 seconds. Works out of the box on most major news sites, blogs, and publishing platforms.
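
The stated figures imply about 10 waves of 10 requests, or roughly 1.5 seconds per wave. Bounded concurrency of this kind can be sketched with a semaphore, as below. This is an assumption about the general technique, not the actor's code, and `fake_fetch` is a hypothetical stand-in that sleeps instead of making an HTTP request.

```python
import asyncio


async def fetch_all(urls, fetch, limit=10):
    """Run `fetch(url)` for every URL with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url):
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(guarded(url) for url in urls))


async def fake_fetch(url):
    """Hypothetical stand-in for a real HTTP request."""
    await asyncio.sleep(0.01)
    return {"url": url, "httpStatus": 200}


urls = [f"https://example.com/article/{i}" for i in range(100)]
results = asyncio.run(fetch_all(urls, fake_fetch, limit=10))
print(len(results))  # prints 100
```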

🎯 Target Audience: News aggregators, media monitoring teams, AI app developers, content researchers, data journalists, archivists
💡 Primary Use Cases: News datasets, summarization pipelines, media monitoring, sentiment analysis, archive assembly

📋 What the Smart Article Extractor does

Five extraction workflows in a single run:

📝 Main body extraction. DOM scoring isolates the article content and strips navigation, ads, and sidebars.
👥 Author detection. Pulls authors from meta tags, JSON-LD, and itemprop attributes.
📅 Date stamps. Captures both article:published_time and article:modified_time.
🏷️ Tags and section. Extracts article:tag and article:section metadata.
💰 Paywall flag. Heuristic detects common paywall markers so you can filter downstream.

Every record also includes the canonical URL, lead image, inline images, word count, reading time, language, site name, HTTP status, and timestamp.
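
As a rough illustration of the metadata workflows listed above, the sketch below reads `article:*` meta properties and JSON-LD author data using only the Python standard library. It is a simplified assumption of how such extraction can work, not the actor's code; it skips itemprop attributes, and the sample HTML is invented for the example.

```python
import json
from html.parser import HTMLParser


class ArticleMetaParser(HTMLParser):
    """Collect <meta property=...> values and JSON-LD blocks from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}       # property name -> list of content values
        self.json_ld = []    # parsed JSON-LD objects
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("property") and attributes.get("content"):
            self.meta.setdefault(attributes["property"], []).append(attributes["content"])
        elif tag == "script" and attributes.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                parsed = json.loads("".join(self._buffer))
            except json.JSONDecodeError:
                return  # ignore malformed JSON-LD
            self.json_ld.extend(parsed if isinstance(parsed, list) else [parsed])


def author_names(value):
    """Normalize a JSON-LD author value (string, object, or list) to names."""
    items = value if isinstance(value, list) else [value]
    names = []
    for item in items:
        name = item.get("name") if isinstance(item, dict) else item
        if isinstance(name, str) and name:
            names.append(name)
    return names


def extract_metadata(html):
    """Return a dict keyed by a few of the dataset field names."""
    parser = ArticleMetaParser()
    parser.feed(html)

    def first(prop):
        values = parser.meta.get(prop)
        return values[0] if values else None

    authors = []
    for block in parser.json_ld:
        if isinstance(block, dict):
            authors.extend(author_names(block.get("author")))

    return {
        "authors": authors,
        "publishedAt": first("article:published_time"),
        "modifiedAt": first("article:modified_time"),
        "section": first("article:section"),
        "tags": parser.meta.get("article:tag", []),
    }


# Invented sample page, for illustration only.
SAMPLE = """
<html><head>
<meta property="article:published_time" content="2024-05-01T08:00:00Z">
<meta property="article:modified_time" content="2024-05-02T09:30:00Z">
<meta property="article:section" content="Technology">
<meta property="article:tag" content="AI">
<meta property="article:tag" content="Chips">
<script type="application/ld+json">
{"@type": "NewsArticle", "author": [{"@type": "Person", "name": "Jane Doe"}]}
</script>
</head><body></body></html>
"""

print(extract_metadata(SAMPLE))
```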

💡 Why it matters: every news site has its own HTML structure. Per-site parsers are brittle and break whenever a publisher redesigns its pages. This Actor uses readability-style scoring that works across most article-shaped pages.

📊 Data fields

Each record includes: author, authors, canonicalUrl, description, hasPaywall, httpStatus, images, language, leadImage, markdown, modifiedAt, publishedAt, readingTimeMinutes, results, scrapedAt, section, siteName, subtitle, tags, title, url, wordCount. These field names come straight from the actor's dataset schema, so what you see here is what lands in your dataset.

⚠️ Good to Know: works best on article-shaped pages (one headline, one author, one body). Homepages, category pages, and list views return thin extractions because there is no single article to score.
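
Because thin or paywalled extractions still land in the dataset, a downstream filter is often useful. The sketch below uses the `hasPaywall`, `httpStatus`, and `wordCount` fields listed above. The `min_words` threshold of 150 is an arbitrary illustrative value rather than an actor setting, and the sample records are invented.

```python
def usable_articles(items, min_words=150):
    """Keep records that look like full, non-paywalled articles."""
    kept = []
    for item in items:
        if item.get("hasPaywall"):
            continue  # body is probably truncated
        if item.get("httpStatus") != 200:
            continue  # fetch did not succeed cleanly
        if (item.get("wordCount") or 0) < min_words:
            continue  # likely a homepage, category page, or list view
        kept.append(item)
    return kept


# Invented sample records using the documented field names.
records = [
    {"url": "https://example.com/a", "hasPaywall": False, "httpStatus": 200, "wordCount": 820},
    {"url": "https://example.com/b", "hasPaywall": True, "httpStatus": 200, "wordCount": 90},
    {"url": "https://example.com/", "hasPaywall": False, "httpStatus": 200, "wordCount": 35},
    {"url": "https://example.com/c", "hasPaywall": False, "httpStatus": 404, "wordCount": 400},
]
print([r["url"] for r in usable_articles(records)])  # prints ['https://example.com/a']
```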

🚀 How to use

📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
🌐 Open the Actor. Go to the Smart Article Extractor page on the Apify Store.
🎯 Paste URLs. Add article URLs to the startUrls field and set maxItems.
🚀 Run it. Click Start and let the Actor extract the content.
📥 Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.

⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.
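
For readers who prefer to script the download step, the sketch below builds an export URL for a finished run's dataset. It assumes the standard Apify dataset items endpoint (`/v2/datasets/{datasetId}/items` with a `format` query parameter) and Bearer-token authentication. The helper names are illustrative, and `download_items` is defined but not called here because it needs a real dataset ID and token.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Format values assumed to match the export types named on this page
# (Excel corresponds to "xlsx").
FORMATS = {"csv", "json", "jsonl", "xlsx", "xml"}


def dataset_items_url(dataset_id, fmt="json"):
    """Build the export URL for a dataset in the requested format."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    query = urlencode({"format": fmt})
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?{query}"


def download_items(dataset_id, token, fmt="json"):
    """Fetch the exported dataset as raw bytes (requires network access)."""
    request = Request(
        dataset_items_url(dataset_id, fmt),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urlopen(request) as response:
        return response.read()


print(dataset_items_url("DATASET_ID", "csv"))
```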

🔗 Recommended Actors

🤖 RAG Web Browser - Search or fetch URLs with LLM-ready output
🕸️ Website Content Crawler - Deep-crawl a domain with depth control
🔍 Google Search Scraper - SERP results with rank and description
📈 Google Trends Scraper - Interest over time and related queries
📧 Contact Info Scraper - Emails, phones, and socials from URLs

💡 Pro Tip: browse the complete ParseForge collection for more content-extraction tools.

⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with any publisher, news outlet, or readability library. Only publicly accessible article URLs are processed. Respect the copyright and terms of service of every publisher you extract from.

🆘 Need Help?

If you hit a bug, have questions about setup, or need a scraper we haven't built yet, open our contact form or write to parseforge@protonmail.com. We also take on paid custom data projects.

For faster answers, join our Discord. It's the best place to get support and suggest new actors.

Smart Article & Blog Extractor

lightkong/universal-blog-scraper

Extract clean text, author, title, and reading time from any news, blog, or article webpage. Perfect for AI/LLM training and RAG systems.

Lightkong

Article Content Extractor

codingfrontend/article-content-extractor

Extract clean article content, metadata and structured information from any web page. Returns title, description, author, publish date, plain content, word count, images, and more.

Coding Frontned

News Article Extractor

hermesexp/news-article-extractor

Extract clean article content from news websites. Get headline, author, date, body text, and images. Great for content curation and research.

Dima Radov

🧠 Smart Article Extractor

scraper-engine/smart-article-extractor

Extract clean article content from any webpage, including titles, authors, publish dates, text, images, metadata, and links. Remove ads and clutter automatically. Export structured data to JSON, CSV, Excel, or XML for research, AI training, content analysis, and archiving.

Scraper Engine

Smart Article Extractor

datapilot/smart-article-extractor

News Article Extractor Actor fetches article URLs and extracts structured content using libraries including Requests and Newspaper3k. It collects title, author, publish date, text, summary, keywords, images, and word count. Supports proxy use and outputs clean JSON results.

Data Pilot

Article / Content Extractor

chuckling_hemp/article-extractor

Extract the main readable content from any article or blog URL: title, author, published date, full body text, word count, lead image, and site name. Uses JSON-LD, Open Graph meta, and DOM fallbacks — fast and reliable, no login or browser needed for public pages.

Matt Cook

News Article Scraper — Newsroom & Press Release Extractor

scrapepilot/company-ok

Scrape full article content from any newsroom, press release page, or blog. Get title, author, publish date, summary, SEO keywords, word count, and full body text. Auto-discovers article links. Checkpoint resume. $5 per 1,000 articles

Scrape Pilot

AI Blog Dataset Creator

datapilot/ai-blog-dataset-creator

Smart Article Scraper Actor extracts structured article data from URLs using libraries including Newspaper3k. It collects title, author, publish date, tags, full content, language, and word count. Supports proxy usage, JavaScript-rendered pages, and outputs clean JSON datasets.

Data Pilot

Article Content Extractor 📄

easyapi/article-content-extractor

Extract clean article content, metadata and structured information from any web page. Supports multiple URLs and returns well-formatted JSON with title, description, content, author, publish date and more. 🔍📄

EasyApi

148

5.0

Article Text Extractor (Readability)

bgfc97/article-text-extractor

Extract the main content of article/news/blog pages — title, author, clean text, excerpt, word count and publish date — using Mozilla Readability. For content aggregation, summarization and RAG pipelines. No key, no proxy.