🧠 Smart Article Extractor

Pricing: from $4.99 / 1,000 results

Developer: Scrapier

Last modified: 15 days ago

🧠 Smart Article Extractor — News & Blog Scraper

Smart Article Extractor is an Apify Actor that bulk-extracts clean article content (title, author, publish date, full text, summary, images, videos, in-body links and rich metadata) from any news site, blog or sitemap. Point it at a homepage, section or topic URL and it discovers, classifies and extracts the articles it finds, up to the configured caps, using a BFS crawler, sitemap scanning and configurable URL-shape heuristics.
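
To make the discovery step concrete, here is a minimal sketch of a breadth-first crawl with depth and page caps. It is not the Actor's implementation: the `discover` function, the `ARTICLE_RE` heuristic (a four-digit year segment in the path) and the in-memory `GRAPH` are all hypothetical stand-ins, and real link fetching is replaced by a dictionary lookup.

```python
import re
from collections import deque

# Hypothetical URL-shape heuristic: a /YYYY/ path segment marks an article.
ARTICLE_RE = re.compile(r"/\d{4}/")


def discover(start_url, get_links, max_depth=2, max_pages=50):
    """Breadth-first crawl from start_url; return article URLs in discovery order."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    articles = []
    pages_fetched = 0
    while queue and pages_fetched < max_pages:
        url, depth = queue.popleft()
        pages_fetched += 1
        if ARTICLE_RE.search(url):
            articles.append(url)
        if depth >= max_depth:
            continue  # do not enqueue links beyond the depth cap
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return articles


# Toy link graph standing in for real HTTP fetches (assumption for illustration).
GRAPH = {
    "https://example.com": ["https://example.com/world", "https://example.com/2026/a"],
    "https://example.com/world": ["https://example.com/2026/b", "https://example.com/2026/a"],
    "https://example.com/2026/b": ["https://example.com/2026/c"],
}


def toy_links(url):
    return GRAPH.get(url, [])


found = discover("https://example.com", toy_links)
```

With the default depth of 2, the article linked only from another article at depth 2 is never enqueued, which is the kind of trade-off `maxDepth` controls.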

🚀 Why Choose Us?

Feature	Smart Article Extractor	Typical 1-URL article scraper
Bulk discovery (BFS crawler)	✅ Yes	❌ One URL at a time
Sitemap & robots.txt scanning	✅ Built-in	❌
Sub-domain / sub-path scoping	✅ Per Start URL	❌
`onlyNewArticles` cross-run dedup	✅ Per-domain & global	❌
Date filters (`dateFrom`, `onlyArticlesForLastDays`, `mustHaveDate`)	✅ All three	⚠️ Limited
Anti-block proxy fallback (none → DC → RES)	✅ Automatic	❌
Optional Playwright rendering	✅ Toggle	❌
Extend-output Python hook	✅ Inline snippet	❌
Live dataset push + state KVS	✅	⚠️
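
As an illustration of the sitemap row above, the sketch below collects `Sitemap:` directives from a robots.txt body and then appends a few common fallback locations. The helper name `sitemap_candidates` and the contents of `COMMON_SITEMAP_PATHS` are assumptions; the Actor's actual candidate list is not documented here.

```python
from urllib.parse import urljoin

# Assumed fallback locations; the Actor's real list may differ.
COMMON_SITEMAP_PATHS = ["/sitemap.xml", "/sitemap_index.xml"]


def sitemap_candidates(base_url, robots_txt):
    """Return sitemap URLs declared in robots.txt, followed by common fallbacks."""
    candidates = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own "https:" stays intact.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            candidates.append(value.strip())
    for path in COMMON_SITEMAP_PATHS:
        url = urljoin(base_url, path)
        if url not in candidates:
            candidates.append(url)
    return candidates


robots = "User-agent: *\nDisallow: /private\nSitemap: https://example.com/news-sitemap.xml\n"
candidates = sitemap_candidates("https://example.com", robots)
```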

🔥 Key Features

📰 Clean article extraction — trafilatura + BeautifulSoup combo for high recall.
🌐 Bulk discovery — drop a homepage URL and the actor discovers articles via BFS.
🗺️ Sitemap & robots.txt — automatic Sitemap: parsing + common candidates.
🛡️ Smart proxy fallback — starts direct, then datacenter, then residential.
🎭 Headless browser mode — Playwright + Chromium for JS-heavy or protected sites.
🧠 Cross-run memory — onlyNewArticles and onlyNewArticlesPerDomain.
🪜 Depth / page / article caps — never over-crawl.
📅 Date filters — dateFrom, onlyArticlesForLastDays, mustHaveDate.
🛠️ extendOutputFunction — inject your own Python extend(soup, article, html).
💾 Save HTML / snapshots — full HTML in-record or as KVS link, PNG screenshots.
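
The proxy fallback can be pictured as a loop over tiers. The sketch below is an assumption-laden illustration, not the Actor's code: the tier names, the `Blocked` exception and the `fetch` callable are hypothetical, and the attempt counts follow the FAQ's mention of up to 3 residential retries.

```python
# Tier names and attempt counts are assumptions based on the feature list and FAQ.
PROXY_TIERS = [("none", 1), ("datacenter", 1), ("residential", 3)]


class Blocked(Exception):
    """Raised by the fetch callable when the target site blocks the request."""


def fetch_with_fallback(url, fetch):
    """Try each proxy tier in order; return (tier, body) on the first success."""
    for tier, attempts in PROXY_TIERS:
        for _ in range(attempts):
            try:
                return tier, fetch(url, tier)
            except Blocked:
                continue  # retry the same tier, or move on once attempts run out
    raise Blocked(f"all proxy tiers exhausted for {url}")
```

Here `fetch(url, tier)` is any callable that performs the request through the given tier and raises `Blocked` when it is refused.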

📥 Input

Field	Type	Default	Description
`startUrls`	array	required	Homepages, sections, topic pages — used as crawl seeds.
`articleUrls`	array	`[]`	Direct article URLs to extract (no discovery needed).
`onlyNewArticles`	boolean	`false`	Skip URLs already seen in any previous run.
`onlyNewArticlesPerDomain`	boolean	`false`	Per-domain dedup memory.
`onlyInsideArticles`	boolean	`true`	Enqueue only same-domain links from articles.
`onlySubdomainArticles`	boolean	`false`	Restrict to URLs sharing the Start URL path prefix.
`enqueueFromArticles`	boolean	`true`	Discover further links inside extracted articles.
`crawlWholeSubdomain`	boolean	`true`	Treat any same-subdomain link as a category candidate.
`scanSitemaps`	boolean	`true`	Discover articles from `robots.txt` and common sitemap paths.
`useGoogleBotHeaders`	boolean	`true`	Identify as Googlebot.
`useBrowser`	boolean	`false`	Render with headless Chromium.
`scrollToBottom`	boolean	`false`	Force lazy-loaded content (browser mode only).
`mustHaveDate`	boolean	`false`	Drop articles with no detectable date.
`dateFrom`	string (ISO date)	—	Earliest article date.
`onlyArticlesForLastDays`	integer	—	Convenience cut-off.
`minWords`	integer	`150`	Reject short articles.
`maxDepth`	integer	`2`	BFS depth.
`maxPagesPerCrawl`	integer	`50`	Hard cap on fetched pages.
`maxArticlesPerCrawl`	integer	`25`	Hard cap on saved articles.
`maxArticlesPerStartUrl`	integer	`25`	Cap per Start URL.
`isUrlArticleDefinition`	object	see schema	URL-shape heuristic.
`linkSelector`	string	—	CSS selector restricting where links are collected from.
`pseudoUrls`	array	`[]`	Custom URL patterns for category pages.
`sitemapUrls`	array	`[]`	Explicit sitemap URLs (skip auto-discovery).
`saveHtml`	boolean	`false`	Include raw HTML in the dataset record.
`saveHtmlAsLink`	boolean	`false`	Save HTML to KVS and put a link in the record.
`saveSnapshots`	boolean	`false`	PNG screenshot (browser mode only).
`extendOutputFunction`	string	—	Python snippet — must define `extend(soup, article, html) -> dict`.
`proxyConfiguration`	object	`{useApifyProxy: false}`	Default = no proxy; auto-fallback to DC → RES if blocked.
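
To illustrate how `onlyNewArticles` and `onlyNewArticlesPerDomain` might differ, the sketch below keeps seen URLs either in one global bucket or in one bucket per host. The function `filter_new` and the state layout are hypothetical; a JSON round trip stands in for saving to and loading from a key-value store between runs.

```python
import json
from urllib.parse import urlsplit


def filter_new(urls, state, per_domain=False):
    """Return unseen URLs and record them in `state` (dict of bucket key -> list of URLs)."""
    fresh = []
    for url in urls:
        # Hypothetical key layout: one bucket per host, or a single global bucket.
        key = urlsplit(url).hostname if per_domain else "global"
        seen = state.setdefault(key, [])
        if url not in seen:
            seen.append(url)
            fresh.append(url)
    return fresh


state = {}
run1 = filter_new(["https://a.com/1", "https://b.com/1"], state, per_domain=True)
state = json.loads(json.dumps(state))  # stand-in for persisting state between runs
run2 = filter_new(["https://a.com/1", "https://a.com/2"], state, per_domain=True)
```

Both modes skip the same URLs in this sketch; the per-domain layout only changes how the memory is partitioned, presumably so that each bucket stays small.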

Example input:

{
  "startUrls": [{ "url": "https://www.theguardian.com" }],
  "onlyArticlesForLastDays": 2,
  "minWords": 150,
  "maxArticlesPerCrawl": 5,
  "useGoogleBotHeaders": true,
  "scanSitemaps": true,
  "proxyConfiguration": { "useApifyProxy": false }
}
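
The date and length filters in this input can be read as a chain of checks. The following sketch shows one plausible ordering; the function `skip_reason`, the evaluation order, and the choice to let undated articles through unless `mustHaveDate` is set are assumptions rather than documented behaviour.

```python
from datetime import datetime, timedelta, timezone


def skip_reason(article, *, now, date_from=None, last_days=None,
                must_have_date=False, min_words=150):
    """Return the name of the first failing filter, or None if the article passes."""
    if article.get("wordCount", 0) < min_words:
        return "minWords"
    raw = article.get("date")
    if raw is None:
        # Assumption: undated articles pass the date filters unless mustHaveDate is set.
        return "mustHaveDate" if must_have_date else None
    published = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    if date_from is not None and published < date_from:
        return "dateFrom"
    if last_days is not None and published < now - timedelta(days=last_days):
        return "onlyArticlesForLastDays"
    return None


now = datetime(2026, 5, 22, tzinfo=timezone.utc)
article = {"date": "2026-05-21T04:00:02.000Z", "wordCount": 1620}
reason = skip_reason(article, now=now, last_days=2)
```

Pass `date_from` and `now` as timezone-aware datetimes so they compare cleanly with the parsed publication date.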

📤 Output

Each pushed record contains:

Field	Type	Description
`url`, `loadedUrl`	string	Original / resolved URL.
`domain`, `loadedDomain`	string	Bare host.
`referrer`, `startUrl`	string	Page where the link was found / Start URL that seeded the crawl.
`depth`	integer	BFS depth at time of crawl.
`title`, `softTitle`	string	Best-effort headline.
`date`	string (ISO)	Publication date if found.
`author`	array	Author URL(s) or name(s).
`publisher`, `copyright`, `lang`, `favicon`, `canonicalLink`	string	Site metadata.
`description`, `keywords`	string	Meta description / keywords.
`tags`	array	`article:tag` values.
`image`	string	Hero / OG image URL.
`videos`	array	`<video> / <iframe> / <source>` URLs.
`links`	array of `{text, href}`	Inner-body links.
`wordCount`	integer	Word count of the extracted text.
`text`	string	Cleaned article body.
`html`	string	Full HTML (only if `saveHtml` / `saveHtmlAsLink`).
`screenshotUrl`	string	KVS link (only if `saveSnapshots` + `useBrowser`).

Example output (truncated):

{
  "url": "https://www.theguardian.com/lifeandstyle/2026/may/21/how-often-should-you-go-to-the-toilet…",
  "domain": "theguardian.com",
  "title": "How often should you go to the toilet?…",
  "date": "2026-05-21T04:00:02.000Z",
  "author": ["https://www.theguardian.com/profile/sarahphillips"],
  "publisher": "the Guardian",
  "wordCount": 1620,
  "text": "Think balance, diversity and routine. \"Our gut is a complex machine,\" says…",
  "image": "https://i.guim.co.uk/img/media/…"
}
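
Once records like the one above are exported, they can be post-processed as plain dictionaries. As a small illustration, this sketch totals `wordCount` per `domain`; the helper name `words_per_domain` and the sample records are made up for the example.

```python
from collections import defaultdict


def words_per_domain(records):
    """Sum `wordCount` per `domain`, skipping records that lack either field."""
    totals = defaultdict(int)
    for record in records:
        domain = record.get("domain")
        words = record.get("wordCount")
        if domain is None or words is None:
            continue
        totals[domain] += words
    return dict(totals)


# Illustrative records only; real runs return many more fields per record.
sample = [
    {"domain": "theguardian.com", "wordCount": 1620},
    {"domain": "theguardian.com", "wordCount": 380},
    {"domain": "example.com"},
]
totals = words_per_domain(sample)
```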

🚀 How to Use (Apify Console)

1. Log in at https://console.apify.com → Actors.
2. Open Smart Article Extractor.
3. Configure inputs (Start URLs, date filters, caps, proxy).
4. Click Start.
5. Watch logs in real time; the actor prints a per-article live feed.
6. Open the Output tab once the run completes.
7. Export to JSON / CSV / XLSX or wire to a webhook.

🤖 Use via API / MCP

curl -X POST "https://api.apify.com/v2/acts/<USERNAME>~smart-article-extractor/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "startUrls": [{"url": "https://www.theguardian.com"}],
       "maxArticlesPerCrawl": 5,
       "onlyArticlesForLastDays": 2,
       "proxyConfiguration": {"useApifyProxy": false}
     }'

MCP-server tool name: smart-article-extractor.
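
The same call as the curl example can be made from Python with only the standard library. This is a sketch under assumptions: `build_run_request` and `run_actor` are hypothetical helper names, the 300-second timeout is arbitrary, and the response is assumed to be a JSON array of dataset items.

```python
import json
import urllib.parse
import urllib.request


def build_run_request(username, token, actor_input):
    """Build the POST request for the run-sync-get-dataset-items endpoint."""
    actor_id = f"{username}~smart-article-extractor"
    query = urllib.parse.urlencode({"token": token})
    url = f"https://api.apify.com/v2/acts/{actor_id}/run-sync-get-dataset-items?{query}"
    body = json.dumps(actor_input).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def run_actor(username, token, actor_input, timeout=300):
    """Send the request and return the parsed response (assumed to be a JSON list)."""
    request = build_run_request(username, token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call such as `run_actor("<USERNAME>", token, {"startUrls": [{"url": "https://www.theguardian.com"}], "maxArticlesPerCrawl": 5})` mirrors the curl payload; substitute the real account name and token.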

💡 Best Use Cases

📰 News monitoring on a topic / publisher
📊 NLP / sentiment / summarisation datasets
🏛️ Brand or competitor coverage tracking
🔍 SEO / SERP enrichment with full article text
📚 Knowledge-base construction for RAG / LLMs
🗞️ Press-clipping archives

💰 Pricing

Pay-per-usage. You only pay the Apify platform charges (compute time + proxies + transfer). No separate developer fee.

❓ Frequently Asked Questions

Q: Why are some articles skipped?
A: They failed at least one filter — date cut-off, mustHaveDate, minWords, or onlyNewArticles (already seen in a previous run). The log line states which one.

Q: The site keeps blocking me.
A: Leave proxyConfiguration.useApifyProxy = false. The actor will auto-escalate to datacenter and then residential proxies, retrying up to 3 times on residential. If even that fails, enable useBrowser.

Q: Will it work for paywalled articles?
A: It can get past some soft paywalls by identifying as Googlebot (useGoogleBotHeaders), but it does not bypass strict authentication.

Q: How do I keep cross-run memory?
A: Toggle onlyNewArticles or onlyNewArticlesPerDomain. The actor keeps state in a named KVS — if that fails (e.g. Store run with limited permissions) it falls back to the run-default store.

Q: Can I customise the output?
A: Yes — supply extendOutputFunction as a Python snippet defining extend(soup, article, html) -> dict. The returned dict is merged into the record.
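
As a hedged example of the `extend(soup, article, html)` hook, the snippet below adds three fields to each record. It assumes `soup` is a BeautifulSoup object for the page and `article` is the record built so far; the field names `subheadings`, `readingTimeMinutes` and `htmlLength` are invented for illustration, as is the 200 words-per-minute reading speed.

```python
def extend(soup, article, html):
    """Return extra fields to merge into the article record."""
    # `soup` is assumed to be a BeautifulSoup object for the fetched page.
    headings = [tag.get_text(strip=True) for tag in soup.find_all("h2")]
    words = article.get("wordCount") or 0
    return {
        "subheadings": headings,
        # Rough reading time at an assumed 200 words per minute, rounded up.
        "readingTimeMinutes": -(-words // 200),
        "htmlLength": len(html),
    }
```

The whole function body would be supplied as the string value of `extendOutputFunction` in the input.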

🛟 Support & Feedback

Use the Issues tab on the Actor page, or open a discussion on the Apify community forum. Pull requests are welcome.

⚖️ Cautions / legal

Data is collected only from publicly available sources.
Do not scrape private accounts or content behind authentication unless explicitly authorised.
The end user is responsible for legal compliance (GDPR, CCPA, anti-spam laws, target site ToS, etc.).
The actor reads robots.txt only to discover sitemaps; it does not enforce robots.txt Disallow rules on the URLs it crawls, so please be a good citizen.
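
Because crawl URLs are not checked against robots.txt, one option is to pre-filter your own Start URLs before submitting them. The sketch below uses Python's standard `urllib.robotparser`; the helper name `allowed_urls` is hypothetical, and fetching the robots.txt body is left to the caller.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(urls, robots_txt, user_agent="*"):
    """Keep only the URLs that the given robots.txt allows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


robots = "User-agent: *\nDisallow: /private/\n"
ok = allowed_urls(
    ["https://example.com/news/a", "https://example.com/private/b"], robots
)
```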