Smart Article Extractor

Pricing

from $40.00 / 1,000 results

Extract clean article content from any news, blog, or publisher site! Pull full body text, author, publish date, word count, language, reading time, images, and metadata at scale. Ideal for content research, media monitoring, SEO audits, and AI training. Start extracting articles in minutes!

Developer

ParseForge

Maintained by Community

📰 Smart Article Extractor

🚀 Parse any news article or blog post into clean structured text in seconds. Get 23 metadata fields per article including authors, tags, publish date, lead image, paywall flag, and reading time. No API key, no registration, no manual parser maintenance.

🕒 Last updated: 2026-04-21 · 📊 23 fields per article · 🌐 Works on any site · ⚡ 10 articles in ~10 seconds · 💰 Paywall detection

The Smart Article Extractor takes any article URL and returns the main body as clean Markdown alongside 22 metadata fields. It scores DOM nodes by paragraph count, word count, and link density to identify the main content block, then strips navigation, sidebars, and ads. Author, tags, section, publishedAt, modifiedAt, and canonical URL are pulled from meta tags, JSON-LD, and itemprop attributes.
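The scoring idea can be illustrated with a small sketch. The Actor's real formula and weights are not documented here, so the `Block` structure, the weight of 25 per paragraph, and the sample numbers below are all assumptions chosen only to show how paragraph count, word count, and link density might combine.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Summary statistics for one candidate DOM node (hypothetical structure)."""
    name: str
    paragraphs: int   # number of <p> descendants
    words: int        # total words of text inside the node
    link_words: int   # words that sit inside <a> elements


def link_density(block: Block) -> float:
    """Share of the node's words that are link text (0.0 for an empty node)."""
    return block.link_words / block.words if block.words else 0.0


def score(block: Block) -> float:
    """Illustrative score: reward paragraphs and words, penalize link-heavy nodes."""
    base = block.paragraphs * 25 + block.words  # 25 is an arbitrary example weight
    return base * (1.0 - link_density(block))


def pick_main_block(blocks: list[Block]) -> Block:
    """Return the highest-scoring candidate."""
    return max(blocks, key=score)


# Made-up candidates: a navigation bar, an article body, and a sidebar.
candidates = [
    Block("nav", paragraphs=0, words=40, link_words=40),
    Block("article", paragraphs=12, words=740, link_words=30),
    Block("sidebar", paragraphs=2, words=120, link_words=90),
]
best = pick_main_block(candidates)
print(best.name)
```

Navigation and sidebars tend to be mostly link text, so the link-density penalty pushes them well below a block made of long paragraphs.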

Extras include a paywall-detection heuristic, inline image collection, lead image (Open Graph), language detection, word count, and reading time. Concurrent fetching processes up to 10 articles in parallel, so a list of 100 news URLs finishes in about 15 seconds. The Actor works out of the box on most major news sites, blogs, and publishing platforms.
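One common way to cap parallel fetches at 10 is a semaphore around each request. The sketch below is not the Actor's code: `fetch_article` is a hypothetical stand-in that only sleeps, and the URLs are placeholders.

```python
import asyncio

MAX_CONCURRENCY = 10  # mirrors the "10 articles in parallel" figure above


async def fetch_article(url: str) -> dict:
    """Stand-in for a real HTTP fetch plus extraction (hypothetical)."""
    await asyncio.sleep(0.01)  # simulate network latency
    return {"url": url, "httpStatus": 200}


async def extract_all(urls: list[str], limit: int = MAX_CONCURRENCY) -> list[dict]:
    """Process every URL with at most `limit` fetches in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url: str) -> dict:
        async with semaphore:
            return await fetch_article(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))


urls = [f"https://example.com/article-{i}" for i in range(25)]
results = asyncio.run(extract_all(urls))
print(len(results))
```

With this pattern a slow page holds up only its own slot, so the other nine keep working through the list.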

| 🎯 Target Audience | 💡 Primary Use Cases |
| --- | --- |
| News aggregators, media monitoring teams, AI app developers, content researchers, data journalists, archivists | News datasets, summarization pipelines, media monitoring, sentiment analysis, archive assembly |

📋 What the Smart Article Extractor does

Five extraction workflows in a single run:

  • 📝 Main body extraction. DOM scoring isolates the article content and strips navigation, ads, and sidebars.
  • 👥 Author detection. Pulls authors from meta tags, JSON-LD, and itemprop attributes.
  • 📅 Date stamps. Captures both article:published_time and article:modified_time.
  • 🏷️ Tags and section. Extracts article:tag and article:section metadata.
  • 💰 Paywall flag. Heuristic detects common paywall markers so you can filter downstream.

Every record also includes the canonical URL, lead image, inline images, word count, reading time, language, site name, HTTP status, and timestamp.
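As a rough illustration of the meta-tag part of this, the sketch below reads the `article:*` Open Graph properties and the `author` meta tag with Python's standard `html.parser`. It is an assumption about how such extraction could look, not the Actor's implementation, and it ignores the JSON-LD and itemprop sources. The class name and sample HTML are made up.

```python
from html.parser import HTMLParser


class ArticleMetaParser(HTMLParser):
    """Collect a few article-related <meta> tags from an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.meta = {
            "authors": [],
            "tags": [],
            "section": None,
            "publishedAt": None,
            "modifiedAt": None,
        }

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("property") or attributes.get("name")
        content = attributes.get("content")
        if not key or not content:
            return
        if key == "article:published_time":
            self.meta["publishedAt"] = content
        elif key == "article:modified_time":
            self.meta["modifiedAt"] = content
        elif key == "article:section":
            self.meta["section"] = content
        elif key == "article:tag":
            self.meta["tags"].append(content)  # the tag may repeat once per keyword
        elif key == "author":
            self.meta["authors"].append(content)


sample = """
<html><head>
<meta property="article:published_time" content="2025-01-10T14:00:00Z">
<meta property="article:modified_time" content="2025-01-10T16:30:00Z">
<meta property="article:section" content="AI">
<meta property="article:tag" content="openai">
<meta property="article:tag" content="gpt-store">
<meta name="author" content="Kyle Wiggers">
</head><body></body></html>
"""
parser = ArticleMetaParser()
parser.feed(sample)
meta = parser.meta
print(meta)
```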

💡 Why it matters: news sites each have their own HTML structure. Writing per-site parsers is brittle and breaks every time a publisher redesigns their pages. This Actor uses readability-style scoring that works across any article-shaped page.


🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing extraction across news sites, blogs, and platforms.


⚙️ Input

| Input | Type | Default | Behavior |
| --- | --- | --- | --- |
| startUrls | array of URLs | required | One or more article URLs to extract. |
| maxItems | integer | 10 | Maximum number of articles returned. The free plan caps this at 10, paid plans at 1,000,000. |

Example: extract a single article.

{
  "startUrls": [
    { "url": "https://techcrunch.com/2025/01/10/openai-launches-gpt-store/" }
  ],
  "maxItems": 1
}

Example: batch extraction for media monitoring.

{
  "startUrls": [
    { "url": "https://www.theverge.com/2025/ai-coverage-1" },
    { "url": "https://www.wired.com/story/ai-agents-2026" },
    { "url": "https://arstechnica.com/ai/article" }
  ],
  "maxItems": 100
}

⚠️ Good to Know: works best on article-shaped pages (one headline, one author, one body). Homepages, category pages, and list views return thin extractions because there is no single article to score.


📊 Output

Each record contains the 23 fields below plus an error field, which is null unless extraction fails. Download the dataset as CSV, Excel, JSON, or XML.

🧾 Schema

| Field | Type | Example |
| --- | --- | --- |
| 🔗 url | string | "https://techcrunch.com/.../gpt-store/" |
| 🔁 canonicalUrl | string \| null | "https://techcrunch.com/.../gpt-store/" |
| 🏷️ title | string \| null | "OpenAI launches GPT Store" |
| 📑 subtitle | string \| null | "Available to Plus, Team, Enterprise" |
| 🧑 author | string \| null | "Kyle Wiggers" |
| 👥 authors | string[] | ["Kyle Wiggers"] |
| 📅 publishedAt | ISO 8601 \| null | "2025-01-10T14:00:00Z" |
| 🔁 modifiedAt | ISO 8601 \| null | "2025-01-10T16:30:00Z" |
| 🏢 siteName | string \| null | "TechCrunch" |
| 🗂️ section | string \| null | "AI" |
| 🏷️ tags | string[] | ["openai", "gpt-store"] |
| 🌍 language | string \| null | "en-US" |
| 📝 description | string \| null | "OpenAI rolled out the long-teased GPT Store..." |
| 🖼️ leadImage | string \| null | "https://.../og.jpg" |
| 🎨 images | string[] | ["https://...", "https://..."] |
| 📃 markdown | string | "# OpenAI launches GPT Store..." |
| 💬 text | string | plain text without markdown markers |
| 🧾 html | string | cleaned article HTML |
| 🔢 wordCount | number | 742 |
| ⏱️ readingTimeMinutes | number | 4 |
| 💰 hasPaywall | boolean | false |
| 🟢 httpStatus | number | 200 |
| 🕒 scrapedAt | ISO 8601 | "2026-04-21T12:00:00.000Z" |
| error | string \| null | "Timeout" on failure |
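The relationship between wordCount and readingTimeMinutes is not documented, but a typical approach divides the word count by an assumed reading speed and rounds up. The sketch below assumes 200 words per minute, which is a guess. It happens to turn the 742 words in the schema example into 4 minutes, though the Actor may compute the value differently.

```python
import math

WORDS_PER_MINUTE = 200  # assumed reading speed; the Actor's real value is not documented


def word_count(text: str) -> int:
    """Count whitespace-separated tokens in plain text."""
    return len(text.split())


def reading_time_minutes(words: int, wpm: int = WORDS_PER_MINUTE) -> int:
    """Round up so any non-empty article takes at least one minute."""
    return math.ceil(words / wpm) if words > 0 else 0


print(reading_time_minutes(742))
```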

✨ Why choose this Actor

| | Capability |
| --- | --- |
| 🧠 | DOM scoring. Readability-style extraction works across any article-shaped page without per-site rules. |
| 📊 | 23 fields. Authors, tags, section, dates, images, paywall, reading time, and canonical URL. |
| 💰 | Paywall detection. Flags articles likely behind a paywall so you can filter them out. |
| ⚡ | Fast. 10 articles in about 10 seconds with parallel fetching. |
| 🖼️ | Image capture. Lead image plus every inline image URL in the article body. |
| 🚫 | No credentials. Runs on any public article URL. |
| 🔌 | Integrations. Plugs into RSS feeds, newsroom tools, and news datasets. |
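Filtering on the paywall flag downstream can be as simple as the sketch below. It uses the hasPaywall, httpStatus, and error fields from the output schema. The sample records and the choice to require an HTTP 200 are illustrative assumptions.

```python
def is_usable(record: dict) -> bool:
    """Keep records that extracted cleanly and are not flagged as paywalled."""
    return (
        record.get("error") is None
        and record.get("httpStatus") == 200
        and not record.get("hasPaywall", False)
    )


# Made-up records shaped like the Actor's output
records = [
    {"url": "https://example.com/a", "httpStatus": 200, "hasPaywall": False, "error": None},
    {"url": "https://example.com/b", "httpStatus": 200, "hasPaywall": True, "error": None},
    {"url": "https://example.com/c", "httpStatus": 0, "hasPaywall": False, "error": "Timeout"},
]
usable = [r for r in records if is_usable(r)]
print([r["url"] for r in usable])
```

Because the flag comes from a heuristic, it may be worth spot-checking a sample of flagged and unflagged articles before relying on it.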

📊 Clean article text is the foundation of news summarization, sentiment analysis, and media monitoring. This Actor delivers it consistently without per-site parsers.


📈 How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Filters | Setup |
| --- | --- | --- | --- | --- | --- |
| ⭐ Smart Article Extractor (this Actor) | $5 free credit, then pay-per-use | Any public article URL | Live per run | 23 metadata fields | ⚡ 2 min |
| Open-source readability libs | Free | Whatever you host | Your code | Whatever you build | 🐢 Days |
| News API services | $99+/month | Curated feeds | Real-time | Per-plan limits | ⏳ Hours |
| Paid media monitoring | $$$+/month | Managed sources | Real-time | Rich UI | 🕒 Variable |

Pick this Actor when you want article text from arbitrary URLs without maintaining your own extraction library.


🚀 How to use

  1. 📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
  2. 🌐 Open the Actor. Go to the Smart Article Extractor page on the Apify Store.
  3. 🎯 Paste URLs. Add article URLs to the startUrls field and set maxItems.
  4. 🚀 Run it. Click Start and let the Actor extract the content.
  5. 📥 Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.

⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.


💼 Business use cases

📰 News Aggregation

  • Build custom news feeds across sources
  • Deduplicate stories across outlets
  • Normalize article structure for downstream apps
  • Feed summarization pipelines
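A first pass at deduplication could key each record on canonicalUrl, falling back to url when it is null. This sketch catches the same article reached through several URLs. It will not catch rewritten or syndicated copies that declare their own canonical URL, which would need text-similarity matching. The function name and sample records are hypothetical.

```python
def dedupe_by_canonical(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each canonical URL."""
    seen: set[str] = set()
    unique: list[dict] = []
    for record in records:
        # Fall back to the fetched URL when no canonical URL was found
        key = (record.get("canonicalUrl") or record["url"]).rstrip("/")
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique


records = [
    {"url": "https://example.com/story?utm=x", "canonicalUrl": "https://example.com/story/"},
    {"url": "https://example.com/story", "canonicalUrl": "https://example.com/story"},
    {"url": "https://example.org/other", "canonicalUrl": None},
]
unique = dedupe_by_canonical(records)
print(len(unique))
```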

🧠 AI & Summarization

  • Extract clean text for LLM summaries
  • Build news datasets for fine-tuning
  • Ground chatbots with current media
  • Power question-answering over news

📡 Media Monitoring

  • Track brand mentions across outlets
  • Monitor coverage of products or events
  • Capture executive quotes and bylines
  • Detect paywalled coverage to license

📚 Research & Archives

  • Build academic text corpora
  • Archive public journalism
  • Extract metadata for bibliographies
  • Preserve retracted or deleted articles

🔌 Automating Smart Article Extractor

Control the scraper programmatically for scheduled runs and pipeline integrations:

  • 🟢 Node.js. Install the apify-client NPM package.
  • 🐍 Python. Use the apify-client PyPI package.
  • 📚 See the Apify API documentation for full details.

The Apify Schedules feature lets you trigger this Actor on any cron interval. Pair it with an RSS reader or Google News feed for continuous media monitoring.
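A minimal Python sketch of a programmatic run with the apify-client package might look like this. The `ACTOR_ID` value is a placeholder to replace with the ID shown on the Actor's page, and reading the token from an `APIFY_TOKEN` environment variable is an assumption about your setup. The import sits inside the function so the input helper can be used without the package installed.

```python
import os

ACTOR_ID = "<username>/<actor-name>"  # placeholder: copy the real ID from the Actor's page


def build_run_input(urls: list[str], max_items: int = 10) -> dict:
    """Build the input object in the shape shown in the Input section."""
    return {"startUrls": [{"url": u} for u in urls], "maxItems": max_items}


def run_extractor(urls: list[str], max_items: int = 10) -> list[dict]:
    """Start the Actor, wait for it to finish, and return the dataset items."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor(ACTOR_ID).call(run_input=build_run_input(urls, max_items))
    if run is None:
        raise RuntimeError("Actor run did not return run details")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


run_input = build_run_input(
    ["https://techcrunch.com/2025/01/10/openai-launches-gpt-store/"], max_items=1
)
print(run_input)
```

Calling `run_extractor([...])` with a valid token and Actor ID would return a list of records with the fields from the Output schema.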


🔌 Integrate with any app

Smart Article Extractor connects to any cloud service via Apify integrations:

  • Make - Automate multi-step workflows
  • Zapier - Connect with 5,000+ apps
  • Slack - Post article summaries to channels
  • Airbyte - Pipe articles into your warehouse
  • GitHub - Trigger runs from commits
  • Google Drive - Export articles to Docs

You can also use webhooks to trigger summarization and alerting pipelines when new articles finish extracting.
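On the receiving side, a webhook handler mostly needs to pull the dataset ID out of the request body. The sketch below assumes Apify's default webhook payload, in which `eventType` names the event and `resource` holds the run object. A custom payload template would change these keys. The function name and sample payload are hypothetical.

```python
import json
from typing import Optional


def dataset_id_from_webhook(body: bytes) -> Optional[str]:
    """Return the run's dataset ID for a succeeded run, otherwise None."""
    payload = json.loads(body)
    if payload.get("eventType") != "ACTOR.RUN.SUCCEEDED":
        return None
    # With the default payload template, `resource` is the run object
    return payload.get("resource", {}).get("defaultDatasetId")


# Made-up payload in the assumed default shape
sample_body = json.dumps(
    {
        "eventType": "ACTOR.RUN.SUCCEEDED",
        "resource": {"id": "run-123", "defaultDatasetId": "dataset-456"},
    }
).encode("utf-8")
dataset_id = dataset_id_from_webhook(sample_body)
print(dataset_id)
```

The returned ID can then be passed to whatever step fetches the dataset items and runs summarization or alerting.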


💡 Pro Tip: browse the complete ParseForge collection for more content-extraction tools.


🆘 Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.


⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with any publisher, news outlet, or readability library. Only publicly accessible article URLs are processed. Respect the copyright and terms of service of every publisher you extract from.