Smart Article Extractor
Pricing
from $4.99 / 1,000 results
Developer
Scrapier
Maintained by Community
Smart Article Extractor: News & Blog Scraper
One-paragraph summary: Smart Article Extractor is an Apify Actor that bulk-extracts clean article content (title, author, publish date, full text, summary, images, videos, in-body links and rich metadata) from any news site, blog or sitemap. Point it at a homepage, section or topic URL and it discovers, classifies and extracts articles automatically, up to the configured caps, using a BFS crawler, sitemap scanning and configurable URL-shape heuristics.
Why Choose Us?
| Feature | Smart Article Extractor | Typical 1-URL article scraper |
|---|---|---|
| Bulk discovery (BFS crawler) | ✅ Yes | ❌ One URL at a time |
| Sitemap & robots.txt scanning | ✅ Built-in | ❌ |
| Sub-domain / sub-path scoping | ✅ Per Start URL | ❌ |
| `onlyNewArticles` cross-run dedup | ✅ Per-domain & global | ❌ |
| Date filters (`dateFrom`, `onlyArticlesForLastDays`, `mustHaveDate`) | ✅ All three | ⚠️ Limited |
| Anti-block proxy fallback (none → DC → RES) | ✅ Automatic | ❌ |
| Optional Playwright rendering | ✅ Toggle | ❌ |
| Extend-output Python hook | ✅ Inline snippet | ❌ |
| Live dataset push + state KVS | ✅ | ⚠️ |
Key Features
- Clean article extraction: trafilatura + BeautifulSoup combo for high recall.
- Bulk discovery: drop in a homepage URL and the actor discovers articles via BFS.
- Sitemap & robots.txt: automatic `Sitemap:` parsing + common candidates.
- Smart proxy fallback: starts direct, then datacenter, then residential.
- Headless browser mode: Playwright + Chromium for JS-heavy or protected sites.
- Cross-run memory: `onlyNewArticles` and `onlyNewArticlesPerDomain`.
- Depth / page / article caps: never over-crawl.
- Date filters: `dateFrom`, `onlyArticlesForLastDays`, `mustHaveDate`.
- `extendOutputFunction`: inject your own Python `extend(soup, article, html)`.
- Save HTML / snapshots: full HTML in-record or as KVS link, PNG screenshots.
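The bulk-discovery and cap features above can be pictured as a breadth-first walk with a depth limit and a page budget. This is a minimal sketch and not the actor's actual code: `discover`, the `get_links` callback and the in-memory `GRAPH` are hypothetical names, chosen so the example runs without network access.

```python
from collections import deque
from typing import Callable, Iterable, List, Tuple


def discover(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
    max_pages: int = 50,
) -> List[Tuple[str, int]]:
    """Breadth-first walk from start_url; returns (url, depth) in visit order."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited: List[Tuple[str, int]] = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # depth cap reached: keep the page, do not expand its links
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical link graph standing in for real HTTP fetches.
GRAPH = {
    "https://example.com/": ["https://example.com/world", "https://example.com/tech"],
    "https://example.com/world": ["https://example.com/world/a1"],
    "https://example.com/tech": ["https://example.com/tech/a2", "https://example.com/"],
    "https://example.com/world/a1": ["https://example.com/deep"],
}

pages = discover("https://example.com/", lambda u: GRAPH.get(u, []), max_depth=2)
```

With `max_depth=2`, the page at `https://example.com/deep` is never visited because it sits three links away from the seed. Lowering `max_pages` stops the walk earlier, regardless of depth.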
Input
| Field | Type | Default | Description |
|---|---|---|---|
| `startUrls` | array | required | Homepages, sections, topic pages; used as crawl seeds. |
| `articleUrls` | array | `[]` | Direct article URLs to extract (no discovery needed). |
| `onlyNewArticles` | boolean | `false` | Skip URLs already seen in any previous run. |
| `onlyNewArticlesPerDomain` | boolean | `false` | Per-domain dedup memory. |
| `onlyInsideArticles` | boolean | `true` | Enqueue only same-domain links from articles. |
| `onlySubdomainArticles` | boolean | `false` | Restrict to URLs sharing the Start URL path prefix. |
| `enqueueFromArticles` | boolean | `true` | Discover further links inside extracted articles. |
| `crawlWholeSubdomain` | boolean | `true` | Treat any same-subdomain link as a category candidate. |
| `scanSitemaps` | boolean | `true` | Discover articles from robots.txt and common sitemap paths. |
| `useGoogleBotHeaders` | boolean | `true` | Identify as Googlebot. |
| `useBrowser` | boolean | `false` | Render with headless Chromium. |
| `scrollToBottom` | boolean | `false` | Scroll to the bottom of the page to trigger lazy-loaded content (browser mode only). |
| `mustHaveDate` | boolean | `false` | Drop articles with no detectable date. |
| `dateFrom` | string (ISO date) | - | Earliest article date. |
| `onlyArticlesForLastDays` | integer | - | Convenience cut-off: keep only articles from the last N days. |
| `minWords` | integer | `150` | Reject articles shorter than this word count. |
| `maxDepth` | integer | `2` | Maximum BFS depth. |
| `maxPagesPerCrawl` | integer | `50` | Hard cap on fetched pages. |
| `maxArticlesPerCrawl` | integer | `25` | Hard cap on saved articles. |
| `maxArticlesPerStartUrl` | integer | `25` | Cap per Start URL. |
| `isUrlArticleDefinition` | object | see schema | URL-shape heuristic for deciding whether a URL is an article. |
| `linkSelector` | string | - | CSS selector restricting where links are collected from. |
| `pseudoUrls` | array | `[]` | Custom URL patterns for category pages. |
| `sitemapUrls` | array | `[]` | Explicit sitemap URLs (skip auto-discovery). |
| `saveHtml` | boolean | `false` | Include raw HTML in the dataset record. |
| `saveHtmlAsLink` | boolean | `false` | Save HTML to KVS and put a link in the record. |
| `saveSnapshots` | boolean | `false` | PNG screenshot (browser mode only). |
| `extendOutputFunction` | string | - | Python snippet; must define `extend(soup, article, html) -> dict`. |
| `proxyConfiguration` | object | `{"useApifyProxy": false}` | Default = no proxy; auto-fallback to DC → RES if blocked. |
Example input:

    {
      "startUrls": [{ "url": "https://www.theguardian.com" }],
      "onlyArticlesForLastDays": 2,
      "minWords": 150,
      "maxArticlesPerCrawl": 5,
      "useGoogleBotHeaders": true,
      "scanSitemaps": true,
      "proxyConfiguration": { "useApifyProxy": false }
    }

Output
Each pushed record contains:
| Field | Type | Description |
|---|---|---|
| `url`, `loadedUrl` | string | Original / resolved URL. |
| `domain`, `loadedDomain` | string | Bare host. |
| `referrer`, `startUrl` | string | Where the link was discovered. |
| `depth` | integer | BFS depth at time of crawl. |
| `title`, `softTitle` | string | Best-effort headline. |
| `date` | string (ISO) | Publication date if found. |
| `author` | array | Author URL(s) or name(s). |
| `publisher`, `copyright`, `lang`, `favicon`, `canonicalLink` | string | Site metadata. |
| `description`, `keywords` | string | Meta description / keywords. |
| `tags` | array | `article:tag` values. |
| `image` | string | Hero / OG image URL. |
| `videos` | array | `<video>` / `<iframe>` / `<source>` URLs. |
| `links` | array of `{text, href}` | Inner-body links. |
| `wordCount` | integer | Word count of the extracted text. |
| `text` | string | Cleaned article body. |
| `html` | string | Full HTML (only if `saveHtml` / `saveHtmlAsLink`). |
| `screenshotUrl` | string | KVS link (only if `saveSnapshots` + `useBrowser`). |
Example output (truncated):

    {
      "url": "https://www.theguardian.com/lifeandstyle/2026/may/21/how-often-should-you-go-to-the-toilet…",
      "domain": "theguardian.com",
      "title": "How often should you go to the toilet?…",
      "date": "2026-05-21T04:00:02.000Z",
      "author": ["https://www.theguardian.com/profile/sarahphillips"],
      "publisher": "the Guardian",
      "wordCount": 1620,
      "text": "Think balance, diversity and routine. \"Our gut is a complex machine,\" says…",
      "image": "https://i.guim.co.uk/img/media/…"
    }

How to Use (Apify Console)
- Log in at https://console.apify.com → Actors.
- Open Smart Article Extractor.
- Configure inputs (Start URLs, date filters, caps, proxy).
- Click Start.
- Watch logs in real time; the actor prints a per-article live feed.
- Open the Output tab once the run completes.
- Export to JSON / CSV / XLSX or wire to a webhook.
Use via API / MCP

    curl -X POST "https://api.apify.com/v2/acts/<USERNAME>~smart-article-extractor/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "startUrls": [{"url": "https://www.theguardian.com"}],
        "maxArticlesPerCrawl": 5,
        "onlyArticlesForLastDays": 2,
        "proxyConfiguration": {"useApifyProxy": false}
      }'

MCP-server tool name: smart-article-extractor.
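The same call can be made from Python. The sketch below uses only the standard library and builds the request without sending it; `build_run_request`, `"my-username"` and `"my-token"` are placeholders for illustration, and the username part of the actor ID is left open in this README (`<USERNAME>`), so substitute your own values.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2/acts"


def build_run_request(username: str, token: str, actor_input: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for a synchronous run."""
    actor_id = f"{username}~smart-article-extractor"
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/{actor_id}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_run_request(
    "my-username",  # placeholder for <USERNAME>
    "my-token",     # placeholder; read it from an environment variable in practice
    {
        "startUrls": [{"url": "https://www.theguardian.com"}],
        "maxArticlesPerCrawl": 5,
        "onlyArticlesForLastDays": 2,
        "proxyConfiguration": {"useApifyProxy": False},
    },
)

# To start the run and read the dataset items once it finishes:
# with urllib.request.urlopen(request) as response:
#     items = json.loads(response.read())
```

The send step is left commented out because it starts a real, billable run and blocks until the run completes.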
Best Use Cases
- News monitoring on a topic / publisher
- NLP / sentiment / summarisation datasets
- Brand or competitor coverage tracking
- SEO / SERP enrichment with full article text
- Knowledge-base construction for RAG / LLMs
- Press-clipping archives
Pricing
Pay-per-usage. You only pay the Apify platform charges (compute time + proxies + transfer). No separate developer fee.
Frequently Asked Questions
Q: Why are some articles skipped?
A: They failed at least one filter: date cut-off, `mustHaveDate`, `minWords`, or `onlyNewArticles` (already seen in a previous run). The log line states which one.
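To make the filter order concrete, here is a minimal sketch of how such checks could be combined. It is an illustration rather than the actor's code: `skip_reason` and `parse_iso` are hypothetical helpers, the order of the checks is assumed, and an undated article is assumed to pass the date filters unless `mustHaveDate` is set. The `onlyNewArticles` check is left out because it needs stored state.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_iso(value: str) -> datetime:
    """Parse an ISO date or datetime; naive values are treated as UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def skip_reason(
    article: dict,
    *,
    date_from: Optional[str] = None,
    last_days: Optional[int] = None,
    must_have_date: bool = False,
    min_words: int = 150,
    now: Optional[datetime] = None,
) -> Optional[str]:
    """Return the name of the first filter the article fails, or None if it passes."""
    now = now or datetime.now(timezone.utc)
    raw_date = article.get("date")
    if raw_date is None:
        if must_have_date:
            return "mustHaveDate"
    else:
        published = parse_iso(raw_date)
        if date_from and published < parse_iso(date_from):
            return "dateFrom"
        if last_days is not None and published < now - timedelta(days=last_days):
            return "onlyArticlesForLastDays"
    if article.get("wordCount", 0) < min_words:
        return "minWords"
    return None


# Fixed clock so the example is reproducible.
NOW = datetime(2026, 5, 22, tzinfo=timezone.utc)
record = {"date": "2026-05-21T04:00:02.000Z", "wordCount": 1620}
kept = skip_reason(record, last_days=2, now=NOW)
dropped = skip_reason({"date": "2026-05-10", "wordCount": 900}, last_days=2, now=NOW)
```

Here `kept` is `None` (the article passes) and `dropped` names the filter that rejected the older article.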
Q: The site keeps blocking me.
A: Leave `proxyConfiguration.useApifyProxy` set to `false`. The actor escalates automatically to datacenter and then residential proxies, retrying up to 3 times on residential. If even that fails, enable `useBrowser`.
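The escalation described in this answer can be sketched as a loop over proxy tiers. This is only an illustration: `fetch_with_fallback`, `TIERS` and the `fetch` callback are hypothetical, and the single attempt for the direct and datacenter tiers is an assumption (only the 3 residential attempts are stated above).

```python
from typing import Callable, List, Optional, Tuple

# (tier, attempts) in escalation order. One attempt for "none" and
# "datacenter" is an assumption; 3 residential attempts comes from the FAQ.
TIERS: List[Tuple[str, int]] = [("none", 1), ("datacenter", 1), ("residential", 3)]


def fetch_with_fallback(
    url: str, fetch: Callable[[str, str], Optional[str]]
) -> Tuple[Optional[str], Optional[str]]:
    """Try each tier in order; return (tier, body) for the first success."""
    for tier, attempts in TIERS:
        for _ in range(attempts):
            body = fetch(url, tier)
            if body is not None:
                return tier, body
    return None, None  # every tier was blocked


# Fake fetcher: blocked everywhere until the second residential attempt.
calls: List[str] = []


def fake_fetch(url: str, tier: str) -> Optional[str]:
    calls.append(tier)
    if tier == "residential" and calls.count("residential") == 2:
        return "<html>ok</html>"
    return None


tier_used, body = fetch_with_fallback("https://example.com/article", fake_fetch)
```

In this run the fetcher is called four times (direct, datacenter, residential twice) before a body comes back through the residential tier.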
Q: Will it work for paywalled articles?
A: It can get past some soft paywalls by identifying as Googlebot (Googlebot UA), but it does not bypass strict authentication.
Q: How do I keep cross-run memory?
A: Toggle `onlyNewArticles` or `onlyNewArticlesPerDomain`. The actor keeps state in a named KVS; if that fails (e.g. a Store run with limited permissions) it falls back to the run-default store.
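As a rough picture of what cross-run memory involves, the sketch below keeps a JSON-serialisable record of seen URLs, bucketed either globally or per domain. The layout of the stored state is an assumption, not the actor's real format, and `filter_new` and `state_key` are hypothetical helpers. A plain dict stands in for the key-value store.

```python
import json
from typing import Dict, List
from urllib.parse import urlsplit


def state_key(url: str, per_domain: bool) -> str:
    """Pick the bucket a URL is remembered in: one per host, or a single global one."""
    return (urlsplit(url).hostname or "unknown") if per_domain else "global"


def filter_new(urls: List[str], state: Dict[str, List[str]], per_domain: bool = False) -> List[str]:
    """Return only URLs not seen before, and record them in state."""
    fresh = []
    for url in urls:
        bucket = state.setdefault(state_key(url, per_domain), [])
        if url not in bucket:
            bucket.append(url)
            fresh.append(url)
    return fresh


A = "https://example.com/news/a"
B = "https://example.com/news/b"
C = "https://example.org/blog/c"

state: Dict[str, List[str]] = {}        # stand-in for the KVS record
run1 = filter_new([A, B], state)
state = json.loads(json.dumps(state))   # simulate saving and reloading between runs
run2 = filter_new([B, C], state)
```

The first run keeps both URLs; the second run keeps only `C`, because `B` was already recorded.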
Q: Can I customise the output?
A: Yes. Supply `extendOutputFunction` as a Python snippet defining `extend(soup, article, html) -> dict`. The returned dict is merged into the record.
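A snippet for `extendOutputFunction` might look like the following. It is a sketch under assumptions: `soup` is taken to be a BeautifulSoup object for the page, `article` the dict of fields extracted so far, and `html` the raw page source. The output field names (`subheadings`, `htmlLength`, `isLongRead`) are made up for illustration.

```python
def extend(soup, article, html):
    """Return extra fields to merge into the record."""
    # Collect the text of every <h2> as a rough outline of the article.
    headings = [h.get_text(strip=True) for h in soup.find_all("h2")]
    return {
        "subheadings": headings,
        "htmlLength": len(html),
        # Hypothetical flag: treat 1,000+ words as a long read.
        "isLongRead": article.get("wordCount", 0) >= 1000,
    }
```

The function only reads its arguments and returns a new dict, which keeps it safe to run on every article regardless of what the page contains.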
Support & Feedback
Use the Issues tab on the Actor page, or open a discussion on the Apify community forum. Pull requests are welcome.
Cautions / legal
- Data is collected only from publicly available sources.
- Do not scrape private accounts or content behind authentication unless explicitly authorised.
- The end user is responsible for legal compliance (GDPR, CCPA, anti-spam laws, target site ToS, etc.).
- The actor honours `robots.txt` for sitemap discovery; it does not enforce robots.txt blocks on crawl URLs, so please be a good citizen.
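Because robots.txt blocks are not enforced on crawl URLs, you may want to check your Start URLs yourself before a run. The sketch below uses Python's standard `urllib.robotparser` on an inline, made-up robots.txt (the `example.com` rules are illustrative); it shows both the `Sitemap:` lines used for discovery and a per-URL permission check. It does not reflect how the actor parses robots.txt internally.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; in practice you would fetch the site's real file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Sitemap URLs declared in robots.txt (site_maps() needs Python 3.8+).
sitemaps = parser.site_maps() or []

# Per-URL checks for a generic crawler ("*").
allowed = parser.can_fetch("*", "https://example.com/world/article-1")
blocked = parser.can_fetch("*", "https://example.com/private/draft")
```

Here `allowed` is `True` and `blocked` is `False`, so the second URL would be one to leave out of `startUrls` or `articleUrls`.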