Smart Article Extractor
Pricing
from $40.00 / 1,000 results
Extract clean article content from any news, blog, or publisher site! Pull full body text, author, publish date, word count, language, reading time, images, and metadata at scale. Ideal for content research, media monitoring, SEO audits, and AI training. Start extracting articles in minutes!
Developer
ParseForge

📰 Smart Article Extractor
🚀 Parse any news article or blog post into clean structured text in seconds. Get 23 metadata fields per article including authors, tags, publish date, lead image, paywall flag, and reading time. No API key, no registration, no manual parser maintenance.
🕒 Last updated: 2026-04-21 · 📊 23 fields per article · 🌐 Works on any site · ⚡ 10 articles in ~10 seconds · 💰 Paywall detection
The Smart Article Extractor takes any article URL and returns the main body as clean Markdown alongside 22 metadata fields. It scores DOM nodes by paragraph count, word count, and link density to identify the main content block, then strips navigation, sidebars, and ads. Author, tags, section, publishedAt, modifiedAt, and canonical URL are pulled from meta tags, JSON-LD, and itemprop attributes.
Extras include a paywall-detection heuristic, inline image collection, lead image (Open Graph), language detection, word count, and reading time. Up to 10 articles are fetched in parallel, so a list of 100 news URLs finishes in about 15 seconds. Works out of the box on most major news sites, blogs, and publishing platforms.
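The parallel fetching described above can be pictured as a bounded worker pool. The sketch below is illustrative only and is not the Actor's code: `fetch_article` is a hypothetical stand-in that sleeps instead of making a network call, and the limit of 10 simply mirrors the figure quoted above.

```python
import asyncio

MAX_CONCURRENCY = 10  # mirrors the "10 articles in parallel" figure; an assumption


async def fetch_article(url: str) -> dict:
    """Hypothetical stand-in for a real HTTP fetch plus extraction."""
    await asyncio.sleep(0.01)  # simulate network latency
    return {"url": url, "httpStatus": 200}


async def fetch_all(urls: list[str], limit: int = MAX_CONCURRENCY) -> list[dict]:
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url: str) -> dict:
        async with semaphore:  # at most `limit` fetches are in flight at once
            return await fetch_article(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))


def run_demo() -> list[dict]:
    urls = [f"https://example.com/article-{i}" for i in range(25)]
    return asyncio.run(fetch_all(urls))
```

A semaphore keeps the code simple while capping concurrent requests, which is usually kinder to publisher servers than launching every request at once.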
| 🎯 Target Audience | 💡 Primary Use Cases |
|---|---|
| News aggregators, media monitoring teams, AI app developers, content researchers, data journalists, archivists | News datasets, summarization pipelines, media monitoring, sentiment analysis, archive assembly |
📋 What the Smart Article Extractor does
Five extraction workflows in a single run:
- 📝 Main body extraction. DOM scoring isolates the article content and strips navigation, ads, and sidebars.
- 👥 Author detection. Pulls authors from meta tags, JSON-LD, and itemprop attributes.
- 📅 Date stamps. Captures both `article:published_time` and `article:modified_time`.
- 🏷️ Tags and section. Extracts `article:tag` and `article:section` metadata.
- 💰 Paywall flag. Heuristic detects common paywall markers so you can filter downstream.
Every record also includes the canonical URL, lead image, inline images, word count, reading time, language, site name, HTTP status, and timestamp.
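To make the meta-tag part of the workflows above concrete, here is a minimal sketch using only the Python standard library. It covers `<meta>` tags only (not JSON-LD or itemprop), and the mapping of `og:image` to `leadImage` and `og:site_name` to `siteName` is an assumption for illustration, not a description of the Actor's internals.

```python
from html.parser import HTMLParser


class ArticleMetaParser(HTMLParser):
    """Collects the content of <meta property=...> and <meta name=...> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: dict[str, list[str]] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property") or attr.get("name")
        content = attr.get("content")
        if key and content:
            # A key such as article:tag can repeat, so keep every value.
            self.meta.setdefault(key, []).append(content)


def extract_metadata(html: str) -> dict:
    parser = ArticleMetaParser()
    parser.feed(html)

    def first(key: str):
        values = parser.meta.get(key)
        return values[0] if values else None

    return {
        "publishedAt": first("article:published_time"),
        "modifiedAt": first("article:modified_time"),
        "section": first("article:section"),
        "tags": parser.meta.get("article:tag", []),
        "leadImage": first("og:image"),      # assumed mapping
        "siteName": first("og:site_name"),   # assumed mapping
    }


SAMPLE_HTML = """
<html><head>
<meta property="article:published_time" content="2025-01-10T14:00:00Z">
<meta property="article:modified_time" content="2025-01-10T16:30:00Z">
<meta property="article:section" content="AI">
<meta property="article:tag" content="openai">
<meta property="article:tag" content="gpt-store">
<meta property="og:site_name" content="TechCrunch">
</head><body></body></html>
"""
```

Calling `extract_metadata(SAMPLE_HTML)` returns both tags in document order and leaves `leadImage` as `None`, because the sample has no `og:image` tag.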
💡 Why it matters: news sites each have their own HTML structure. Writing per-site parsers is brittle and breaks every time a publisher redesigns their pages. This Actor uses readability-style scoring that works across any article-shaped page.
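As a rough illustration of readability-style scoring, the sketch below ranks candidate blocks by paragraph count, word count, and link density. The `BlockStats` type, the weights, and the sample numbers are all hypothetical; the Actor's actual formula is not published in this README.

```python
from dataclasses import dataclass


@dataclass
class BlockStats:
    name: str        # label for a candidate DOM node, e.g. "article"
    paragraphs: int  # number of <p> descendants
    words: int       # total words of text inside the node
    link_words: int  # words that sit inside <a> tags


def link_density(block: BlockStats) -> float:
    # An empty block is treated as all links so that it scores zero.
    return block.link_words / block.words if block.words else 1.0


def score(block: BlockStats) -> float:
    # Hypothetical weights: reward paragraphs and text volume,
    # then scale down by the share of text that is link text.
    base = block.paragraphs * 25 + block.words
    return base * (1.0 - link_density(block))


def pick_main_block(blocks: list[BlockStats]) -> BlockStats:
    return max(blocks, key=score)


candidates = [
    BlockStats("nav", paragraphs=0, words=40, link_words=40),
    BlockStats("article", paragraphs=12, words=742, link_words=30),
    BlockStats("sidebar", paragraphs=2, words=120, link_words=90),
]
```

With these sample numbers, `pick_main_block(candidates)` selects the `article` block: navigation is almost entirely link text, so its score collapses to zero, and the link-heavy sidebar scores far below the body.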
🎬 Full Demo
🚧 Coming soon: a 3-minute walkthrough showing extraction across news sites, blogs, and platforms.
⚙️ Input
| Input | Type | Default | Behavior |
|---|---|---|---|
| `startUrls` | array of URLs | required | One or more article URLs to extract. |
| `maxItems` | integer | 10 | Articles returned. Free plan caps at 10, paid plan at 1,000,000. |
Example: extract a single article.

    {
      "startUrls": [
        { "url": "https://techcrunch.com/2025/01/10/openai-launches-gpt-store/" }
      ],
      "maxItems": 1
    }

Example: batch extraction for media monitoring.

    {
      "startUrls": [
        { "url": "https://www.theverge.com/2025/ai-coverage-1" },
        { "url": "https://www.wired.com/story/ai-agents-2026" },
        { "url": "https://arstechnica.com/ai/article" }
      ],
      "maxItems": 100
    }

⚠️ Good to Know: works best on article-shaped pages (one headline, one author, one body). Homepages, category pages, and list views return thin extractions because there is no single article to score.
📊 Output
Each record contains 23 fields. Download the dataset as CSV, Excel, JSON, or XML.
🧾 Schema
| Field | Type | Example |
|---|---|---|
| 🔗 `url` | string | "https://techcrunch.com/.../gpt-store/" |
| 🔁 `canonicalUrl` | string \| null | "https://techcrunch.com/.../gpt-store/" |
| 🏷️ `title` | string \| null | "OpenAI launches GPT Store" |
| 📑 `subtitle` | string \| null | "Available to Plus, Team, Enterprise" |
| 🧑 `author` | string \| null | "Kyle Wiggers" |
| 👥 `authors` | string[] | ["Kyle Wiggers"] |
| 📅 `publishedAt` | ISO 8601 \| null | "2025-01-10T14:00:00Z" |
| 🔁 `modifiedAt` | ISO 8601 \| null | "2025-01-10T16:30:00Z" |
| 🏢 `siteName` | string \| null | "TechCrunch" |
| 🗂️ `section` | string \| null | "AI" |
| 🏷️ `tags` | string[] | ["openai", "gpt-store"] |
| 🌍 `language` | string \| null | "en-US" |
| 📝 `description` | string \| null | "OpenAI rolled out the long-teased GPT Store..." |
| 🖼️ `leadImage` | string \| null | "https://.../og.jpg" |
| 🎨 `images` | string[] | ["https://...", "https://..."] |
| 📃 `markdown` | string | "# OpenAI launches GPT Store..." |
| 💬 `text` | string | plain text without markdown markers |
| 🧾 `html` | string | cleaned article HTML |
| 🔢 `wordCount` | number | 742 |
| ⏱️ `readingTimeMinutes` | number | 4 |
| 💰 `hasPaywall` | boolean | false |
| 🟢 `httpStatus` | number | 200 |
| 🕒 `scrapedAt` | ISO 8601 | "2026-04-21T12:00:00.000Z" |
| ❗ `error` | string \| null | "Timeout" on failure |
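Downstream code will often want to drop failed, paywalled, or thin records before analysis. The following sketch filters on the `error`, `httpStatus`, `hasPaywall`, and `wordCount` fields from the schema above. The `min_words` threshold of 100 and the sample records are assumptions chosen for illustration.

```python
from typing import Iterable


def usable_articles(records: Iterable[dict], min_words: int = 100) -> list[dict]:
    """Keep records that fetched cleanly, are not paywalled, and have real text."""
    kept = []
    for rec in records:
        if rec.get("error"):               # extraction failed, e.g. "Timeout"
            continue
        if rec.get("httpStatus") != 200:   # non-OK or missing HTTP status
            continue
        if rec.get("hasPaywall"):          # flagged by the paywall heuristic
            continue
        if (rec.get("wordCount") or 0) < min_words:
            continue                       # likely a thin, non-article page
        kept.append(rec)
    return kept


sample_records = [
    {"url": "https://example.com/a", "httpStatus": 200,
     "hasPaywall": False, "wordCount": 742, "error": None},
    {"url": "https://example.com/b", "httpStatus": 200,
     "hasPaywall": True, "wordCount": 900, "error": None},
    {"url": "https://example.com/c", "httpStatus": None,
     "hasPaywall": False, "wordCount": 0, "error": "Timeout"},
]
```

Here `usable_articles(sample_records)` keeps only the first record. Tune `min_words` to your own tolerance for short posts.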
📦 Sample records
✨ Why choose this Actor
| | Capability |
|---|---|
| 🧠 | DOM scoring. Readability-style extraction works across any article-shaped page without per-site rules. |
| 📊 | 23 fields. Authors, tags, section, dates, images, paywall, reading time, and canonical URL. |
| 💰 | Paywall detection. Flags articles likely behind a paywall so you can filter them out. |
| ⚡ | Fast. 10 articles in under 10 seconds with parallel fetching. |
| 🖼️ | Image capture. Lead image plus every inline image URL in the article body. |
| 🚫 | No credentials. Runs on any public article URL. |
| 🔌 | Integrations. Plugs into RSS feeds, newsroom tools, and news datasets. |
📊 Clean article text is the foundation of news summarization, sentiment analysis, and media monitoring. This Actor delivers it consistently without per-site parsers.
📈 How it compares to alternatives
| Approach | Cost | Coverage | Refresh | Filters | Setup |
|---|---|---|---|---|---|
| ⭐ Smart Article Extractor (this Actor) | $5 free credit, then pay-per-use | Any public article URL | Live per run | 23 metadata fields | ⚡ 2 min |
| Open-source readability libs | Free | Whatever you host | Your code | Whatever you build | 🐢 Days |
| News API services | $99+/month | Curated feeds | Real-time | Per-plan limits | ⏳ Hours |
| Paid media monitoring | $$$+/month | Managed sources | Real-time | Rich UI | 🕒 Variable |
Pick this Actor when you want article text from arbitrary URLs without maintaining your own extraction library.
🚀 How to use
- 📝 Sign up. Create a free account with $5 credit (takes 2 minutes).
- 🌐 Open the Actor. Go to the Smart Article Extractor page on the Apify Store.
- 🎯 Paste URLs. Add article URLs to the `startUrls` field and set `maxItems`.
- 🚀 Run it. Click Start and let the Actor extract the content.
- 📥 Download. Grab your results in the Dataset tab as CSV, Excel, JSON, or XML.
⏱️ Total time from signup to downloaded dataset: 3-5 minutes. No coding required.
💼 Business use cases
🔌 Automating Smart Article Extractor
Control the scraper programmatically for scheduled runs and pipeline integrations:
- 🟢 Node.js. Install the `apify-client` NPM package.
- 🐍 Python. Use the `apify-client` PyPI package.
- 📚 See the Apify API documentation for full details.
The Apify Schedules feature lets you trigger this Actor on any cron interval. Pair it with an RSS reader or Google News feed for continuous media monitoring.
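A programmatic run with the Python `apify-client` package might look like the sketch below. The `ACTOR_ID` value is a placeholder (copy the real ID from the Actor's Store page), and reading the token from an `APIFY_TOKEN` environment variable is an assumption about your setup. `build_run_input` is a hypothetical helper that shapes plain URL strings into the `startUrls`/`maxItems` input format.

```python
import os

ACTOR_ID = "<username>/<actor-name>"  # placeholder, not the real Actor ID


def build_run_input(urls: list[str], max_items: int = 10) -> dict:
    """Shape URL strings into the startUrls/maxItems input, dropping blanks and duplicates."""
    unique = list(dict.fromkeys(u.strip() for u in urls if u.strip()))
    return {"startUrls": [{"url": u} for u in unique], "maxItems": max_items}


def run_extractor(urls: list[str], max_items: int = 10) -> list[dict]:
    # Imported lazily so build_run_input works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    # call() starts the Actor and waits for the run to finish.
    run = client.actor(ACTOR_ID).call(run_input=build_run_input(urls, max_items))
    # Results land in the run's default dataset.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Keeping input construction separate from the network call makes the input easy to unit test and to reuse from a scheduled job.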
❓ Frequently Asked Questions
🔌 Integrate with any app
Smart Article Extractor connects to any cloud service via Apify integrations:
- Make - Automate multi-step workflows
- Zapier - Connect with 5,000+ apps
- Slack - Post article summaries to channels
- Airbyte - Pipe articles into your warehouse
- GitHub - Trigger runs from commits
- Google Drive - Export articles to Docs
You can also use webhooks to trigger summarization and alerting pipelines when new articles finish extracting.
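A webhook receiver needs to know which dataset to read once a run finishes. The sketch below assumes Apify's default webhook payload, where `eventType` names the event and `resource` carries the run object. Check the payload template configured on your own webhook before relying on these field names.

```python
import json
from typing import Optional


def dataset_id_from_webhook(body: str) -> Optional[str]:
    """Return the finished run's dataset ID, or None for any other event."""
    payload = json.loads(body)
    if payload.get("eventType") != "ACTOR.RUN.SUCCEEDED":
        return None
    # Assumes the default payload shape: resource holds the run object.
    return (payload.get("resource") or {}).get("defaultDatasetId")


# Hypothetical payload for illustration; the ID is made up.
example_body = json.dumps({
    "eventType": "ACTOR.RUN.SUCCEEDED",
    "resource": {"defaultDatasetId": "example-dataset-id"},
})
```

The returned ID can then be passed to whatever summarization or alerting step consumes the extracted articles.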
🔗 Recommended Actors
- 🤖 RAG Web Browser - Search or fetch URLs with LLM-ready output
- 🕸️ Website Content Crawler - Deep-crawl a domain with depth control
- 🔍 Google Search Scraper - SERP results with rank and description
- 📈 Google Trends Scraper - Interest over time and related queries
- 📧 Contact Info Scraper - Emails, phones, and socials from URLs
💡 Pro Tip: browse the complete ParseForge collection for more content-extraction tools.
🆘 Need Help? Open our contact form to request a new scraper, propose a custom data project, or report an issue.
⚠️ Disclaimer: this Actor is an independent tool and is not affiliated with any publisher, news outlet, or readability library. Only publicly accessible article URLs are processed. Respect the copyright and terms of service of every publisher you extract from.