🧠 Smart Article Extractor

Pricing

from $4.99 / 1,000 results

Developer

Scrapier

Maintained by Community

🧠 Smart Article Extractor: News & Blog Scraper

Smart Article Extractor is an Apify Actor that bulk-extracts clean article content (title, author, publish date, full text, summary, images, videos, in-body links and rich metadata) from any news site, blog or sitemap. Point it at a homepage, section or topic URL and it discovers, classifies and extracts articles automatically, up to the configured caps, using a BFS crawler, sitemap scanning and configurable URL-shape heuristics.
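As a rough illustration of the discovery step, the sketch below runs a breadth-first crawl over an in-memory link graph with a depth cap and a page cap. The graph, the `looks_like_article` heuristic and every function name are hypothetical stand-ins; the actor's real crawler and its `isUrlArticleDefinition` rules may work differently.

```python
from collections import deque
import re

# Hypothetical URL-shape heuristic: a /YYYY/ path segment marks an article.
ARTICLE_RE = re.compile(r"/\d{4}/")


def looks_like_article(url: str) -> bool:
    return bool(ARTICLE_RE.search(url))


def discover(start_url, get_links, max_depth=2, max_pages=50):
    """Breadth-first discovery; returns article URLs in the order they are visited."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    articles, fetched = [], 0
    while queue and fetched < max_pages:
        url, depth = queue.popleft()
        fetched += 1  # every dequeued URL counts as one fetched page
        if looks_like_article(url):
            articles.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth cap
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return articles


# Toy link graph standing in for real HTTP fetches.
GRAPH = {
    "https://example.com": ["https://example.com/world", "https://example.com/2026/a"],
    "https://example.com/world": ["https://example.com/2026/b", "https://example.com/2026/a"],
    "https://example.com/2026/b": ["https://example.com/2026/c"],
}

found = discover("https://example.com", lambda u: GRAPH.get(u, []))
print(found)
```

With `max_depth=2` the link found on `/2026/b` is never followed, which is how a depth cap keeps a crawl from wandering far from its seed.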


🚀 Why Choose Us?

| Feature | Smart Article Extractor | Typical 1-URL article scraper |
| --- | --- | --- |
| Bulk discovery (BFS crawler) | ✅ Yes | ❌ One URL at a time |
| Sitemap & robots.txt scanning | ✅ Built-in | ❌ |
| Sub-domain / sub-path scoping | ✅ Per Start URL | ❌ |
| onlyNewArticles cross-run dedup | ✅ Per-domain & global | ❌ |
| Date filters (dateFrom, onlyArticlesForLastDays, mustHaveDate) | ✅ All three | ⚠️ Limited |
| Anti-block proxy fallback (none → DC → RES) | ✅ Automatic | ❌ |
| Optional Playwright rendering | ✅ Toggle | ❌ |
| Extend-output Python hook | ✅ Inline snippet | ❌ |
| Live dataset push + state KVS | ✅ | ⚠️ |

🔥 Key Features

  • 📰 Clean article extraction: trafilatura + BeautifulSoup combo for high recall.
  • 🌐 Bulk discovery: drop in a homepage URL and the actor discovers articles via BFS.
  • 🗺️ Sitemap & robots.txt: automatic parsing of Sitemap: directives, plus common candidate paths.
  • 🛡️ Smart proxy fallback: starts direct, then datacenter, then residential.
  • 🎭 Headless browser mode: Playwright + Chromium for JS-heavy or protected sites.
  • 🧠 Cross-run memory: onlyNewArticles and onlyNewArticlesPerDomain.
  • 🪜 Depth / page / article caps: never over-crawl.
  • 📅 Date filters: dateFrom, onlyArticlesForLastDays, mustHaveDate.
  • 🛠️ extendOutputFunction: inject your own Python extend(soup, article, html).
  • 💾 Save HTML / snapshots: full HTML in-record or as a KVS link, plus PNG screenshots.
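To make the sitemap bullet above more concrete, here is a minimal sketch of collecting sitemap candidates from a robots.txt body with Python's standard `urllib.robotparser`. The fallback paths in `COMMON_SITEMAP_PATHS` and the function name are assumptions for illustration; the actor's actual candidate list is not documented here.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Assumed fallback locations, tried when robots.txt declares nothing useful.
COMMON_SITEMAP_PATHS = ["/sitemap.xml", "/sitemap_index.xml"]


def sitemap_candidates(base_url: str, robots_txt: str) -> list[str]:
    """Return declared Sitemap: URLs first, then common fallback paths."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    declared = parser.site_maps() or []  # site_maps() is None without Sitemap: lines
    fallbacks = [urljoin(base_url, path) for path in COMMON_SITEMAP_PATHS]
    # dict.fromkeys keeps the order and drops duplicates.
    return list(dict.fromkeys(declared + fallbacks))


ROBOTS = """User-agent: *
Disallow: /private/
Sitemap: https://example.com/news-sitemap.xml
"""

candidates = sitemap_candidates("https://example.com", ROBOTS)
print(candidates)
```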

📥 Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| startUrls | array | required | Homepages, sections or topic pages, used as crawl seeds. |
| articleUrls | array | [] | Direct article URLs to extract (no discovery needed). |
| onlyNewArticles | boolean | false | Skip URLs already seen in any previous run. |
| onlyNewArticlesPerDomain | boolean | false | Per-domain dedup memory. |
| onlyInsideArticles | boolean | true | Enqueue only same-domain links from articles. |
| onlySubdomainArticles | boolean | false | Restrict to URLs sharing the Start URL path prefix. |
| enqueueFromArticles | boolean | true | Discover further links inside extracted articles. |
| crawlWholeSubdomain | boolean | true | Treat any same-subdomain link as a category candidate. |
| scanSitemaps | boolean | true | Discover articles from robots.txt and common sitemap paths. |
| useGoogleBotHeaders | boolean | true | Identify as Googlebot. |
| useBrowser | boolean | false | Render with headless Chromium. |
| scrollToBottom | boolean | false | Scroll to the bottom to trigger lazy-loaded content (browser mode only). |
| mustHaveDate | boolean | false | Drop articles with no detectable date. |
| dateFrom | string (ISO date) | - | Earliest article date. |
| onlyArticlesForLastDays | integer | - | Keep only articles from the last N days (a convenience cut-off). |
| minWords | integer | 150 | Reject articles shorter than this word count. |
| maxDepth | integer | 2 | BFS depth. |
| maxPagesPerCrawl | integer | 50 | Hard cap on fetched pages. |
| maxArticlesPerCrawl | integer | 25 | Hard cap on saved articles. |
| maxArticlesPerStartUrl | integer | 25 | Cap per Start URL. |
| isUrlArticleDefinition | object | see schema | URL-shape heuristic. |
| linkSelector | string | - | CSS selector restricting where links are collected from. |
| pseudoUrls | array | [] | Custom URL patterns for category pages. |
| sitemapUrls | array | [] | Explicit sitemap URLs (skip auto-discovery). |
| saveHtml | boolean | false | Include raw HTML in the dataset record. |
| saveHtmlAsLink | boolean | false | Save HTML to KVS and put a link in the record. |
| saveSnapshots | boolean | false | PNG screenshot (browser mode only). |
| extendOutputFunction | string | - | Python snippet; must define extend(soup, article, html) -> dict. |
| proxyConfiguration | object | {useApifyProxy: false} | Default = no proxy; auto-fallback to DC → RES if blocked. |

Example input:

{
  "startUrls": [{ "url": "https://www.theguardian.com" }],
  "onlyArticlesForLastDays": 2,
  "minWords": 150,
  "maxArticlesPerCrawl": 5,
  "useGoogleBotHeaders": true,
  "scanSitemaps": true,
  "proxyConfiguration": { "useApifyProxy": false }
}

📤 Output

Each pushed record contains:

| Field | Type | Description |
| --- | --- | --- |
| url, loadedUrl | string | Original / resolved URL. |
| domain, loadedDomain | string | Bare host. |
| referrer, startUrl | string | Where the link was discovered. |
| depth | integer | BFS depth at time of crawl. |
| title, softTitle | string | Best-effort headline. |
| date | string (ISO) | Publication date if found. |
| author | array | Author URL(s) or name(s). |
| publisher, copyright, lang, favicon, canonicalLink | string | Site metadata. |
| description, keywords | string | Meta description / keywords. |
| tags | array | article:tag values. |
| image | string | Hero / OG image URL. |
| videos | array | <video> / <iframe> / <source> URLs. |
| links | array of {text, href} | Inner-body links. |
| wordCount | integer | Word count of the extracted text. |
| text | string | Cleaned article body. |
| html | string | Full HTML (only if saveHtml / saveHtmlAsLink). |
| screenshotUrl | string | KVS link (only if saveSnapshots + useBrowser). |

Example output (truncated):

{
  "url": "https://www.theguardian.com/lifeandstyle/2026/may/21/how-often-should-you-go-to-the-toilet…",
  "domain": "theguardian.com",
  "title": "How often should you go to the toilet?…",
  "date": "2026-05-21T04:00:02.000Z",
  "author": ["https://www.theguardian.com/profile/sarahphillips"],
  "publisher": "the Guardian",
  "wordCount": 1620,
  "text": "Think balance, diversity and routine. \"Our gut is a complex machine,\" says…",
  "image": "https://i.guim.co.uk/img/media/…"
}
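Downstream code usually needs the `date` string as a real timestamp. The following sketch, using only the standard library, parses the ISO format shown in the example record and totals `wordCount` per domain. The two records are made up for illustration and the helper name `parse_date` is not part of the actor.

```python
import json
from collections import Counter
from datetime import datetime

# Two made-up records in the shape of the example output above.
RAW = """[
  {"url": "https://example.com/a", "domain": "example.com",
   "date": "2026-05-21T04:00:02.000Z", "wordCount": 1620},
  {"url": "https://example.com/b", "domain": "example.com",
   "date": null, "wordCount": 300}
]"""


def parse_date(value):
    """Parse an ISO timestamp with a trailing Z; return None when the date is missing."""
    if not value:
        return None
    # Replacing Z keeps this working on Python versions older than 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


records = json.loads(RAW)
dated = [r for r in records if parse_date(r.get("date")) is not None]

words_per_domain = Counter()
for record in records:
    words_per_domain[record["domain"]] += record["wordCount"]

print(len(dated), dict(words_per_domain))
```

Records without a date stay in the totals here; whether to drop them is the same decision that `mustHaveDate` makes on the actor side.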

🚀 How to Use (Apify Console)

  1. Log in at https://console.apify.com → Actors.
  2. Open Smart Article Extractor.
  3. Configure inputs (Start URLs, date filters, caps, proxy).
  4. Click Start.
  5. Watch logs in real time; the actor prints a per-article live feed.
  6. Open the Output tab once the run completes.
  7. Export to JSON / CSV / XLSX or wire to a webhook.

🤖 Use via API / MCP

curl -X POST "https://api.apify.com/v2/acts/<USERNAME>~smart-article-extractor/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": [{"url": "https://www.theguardian.com"}],
    "maxArticlesPerCrawl": 5,
    "onlyArticlesForLastDays": 2,
    "proxyConfiguration": {"useApifyProxy": false}
  }'

MCP-server tool name: smart-article-extractor.
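The same call can be prepared from Python with the standard library alone. This sketch only builds the request and mirrors the endpoint and payload of the curl example; `build_request` and the `my-username` placeholder are hypothetical, and the network call is left commented out because it needs a valid token.

```python
import json
import os
import urllib.parse
import urllib.request


def build_request(username: str, token: str, run_input: dict) -> urllib.request.Request:
    """Build the POST request for the run-sync-get-dataset-items endpoint."""
    actor_id = f"{username}~smart-article-extractor"
    query = urllib.parse.urlencode({"token": token})
    url = f"https://api.apify.com/v2/acts/{actor_id}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


run_input = {
    "startUrls": [{"url": "https://www.theguardian.com"}],
    "maxArticlesPerCrawl": 5,
    "onlyArticlesForLastDays": 2,
    "proxyConfiguration": {"useApifyProxy": False},
}

# "my-username" is a placeholder for the actor owner's Apify username.
request = build_request("my-username", os.environ.get("APIFY_TOKEN", "TOKEN"), run_input)
print(request.get_method(), request.full_url.split("?")[0])

# Uncomment to run the actor for real (needs network access and a valid token):
# with urllib.request.urlopen(request) as response:
#     items = json.load(response)
```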


💡 Best Use Cases

  • 📰 News monitoring on a topic / publisher
  • 📊 NLP / sentiment / summarisation datasets
  • 🛍️ Brand or competitor coverage tracking
  • 🔍 SEO / SERP enrichment with full article text
  • 📚 Knowledge-base construction for RAG / LLMs
  • 🗞️ Press-clipping archives

💰 Pricing

Pay-per-usage. You only pay the Apify platform charges (compute time + proxies + transfer). No separate developer fee.


❓ Frequently Asked Questions

Q: Why are some articles skipped?
A: They failed at least one filter: the date cut-off, mustHaveDate, minWords, or onlyNewArticles (already seen in a previous run). The log line states which one.
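One way to picture the filter chain is a function that returns the name of the first filter an article fails. This is a sketch under assumptions: the order of the checks, the choice to keep undated articles when `mustHaveDate` is off, and the `skip_reason` name are all illustrative and may not match the actor's real behaviour.

```python
from datetime import datetime, timedelta, timezone


def skip_reason(article, *, now, last_days=None, must_have_date=False,
                min_words=150, seen_urls=frozenset()):
    """Return the name of the first filter the article fails, or None if it passes."""
    if article["url"] in seen_urls:
        return "onlyNewArticles"
    date = article.get("date")
    if date is None:
        # Assumption: undated articles are kept unless mustHaveDate is set.
        if must_have_date:
            return "mustHaveDate"
    elif last_days is not None and date < now - timedelta(days=last_days):
        return "onlyArticlesForLastDays"
    if article.get("wordCount", 0) < min_words:
        return "minWords"
    return None


NOW = datetime(2026, 5, 22, tzinfo=timezone.utc)
old = {"url": "https://example.com/old", "date": NOW - timedelta(days=10), "wordCount": 900}
short = {"url": "https://example.com/short", "date": NOW, "wordCount": 40}
undated = {"url": "https://example.com/undated", "date": None, "wordCount": 900}

for article in (old, short, undated):
    print(article["url"], skip_reason(article, now=NOW, last_days=2))
```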

Q: The site keeps blocking me.
A: Leave proxyConfiguration.useApifyProxy = false. The actor auto-escalates to datacenter and then residential proxies, retrying up to 3 times on residential. If even that fails, enable useBrowser.
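The escalation described in this answer can be sketched as a loop over proxy tiers. Only the residential limit of 3 attempts comes from the text above; the single attempt for the first two tiers, the `Blocked` exception and the injected `fetch` callable are assumptions made so the example runs without any network access.

```python
class Blocked(Exception):
    """Raised by the fetcher when the target site refuses the request."""


# Tiers in escalation order with attempts per tier. One attempt for the first
# two tiers is an assumption; only the residential limit of 3 is documented.
TIERS = [("none", 1), ("datacenter", 1), ("residential", 3)]


def fetch_with_fallback(url, fetch):
    """Try each proxy tier in order; return (tier, body) for the first success."""
    for tier, attempts in TIERS:
        for _ in range(attempts):
            try:
                return tier, fetch(url, tier)
            except Blocked:
                continue  # retry this tier, or fall through to the next one
    raise Blocked(f"all proxy tiers failed for {url}")


calls = []


def fake_fetch(url, tier):
    """Stand-in fetcher: succeeds only on the second residential attempt."""
    calls.append(tier)
    if tier == "residential" and calls.count("residential") == 2:
        return "<html>ok</html>"
    raise Blocked(url)


result = fetch_with_fallback("https://example.com/article", fake_fetch)
print(result, calls)
```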

Q: Will it work for paywalled articles?
A: It can get past some soft paywalls by identifying as Googlebot (useGoogleBotHeaders), but it does not bypass strict authentication.

Q: How do I keep cross-run memory?
A: Toggle onlyNewArticles or onlyNewArticlesPerDomain. The actor keeps state in a named KVS; if that fails (e.g. a Store run with limited permissions), it falls back to the run-default store.
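Conceptually, the two toggles differ only in how the memory of seen URLs is keyed. In the sketch below a plain dict stands in for the key-value store, and the key names `seen:global` and `seen:<host>` are invented for illustration; the actor's real storage layout is not documented here.

```python
from urllib.parse import urlsplit


def filter_new(urls, store, per_domain=False):
    """Return unseen URLs and record them in `store` (a dict standing in for a KVS)."""
    fresh = []
    for url in urls:
        # Hypothetical key layout: one bucket per host, or one shared bucket.
        key = f"seen:{urlsplit(url).hostname}" if per_domain else "seen:global"
        seen = store.setdefault(key, set())
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


store = {}
first_run = filter_new(["https://a.example/1", "https://b.example/1"], store, per_domain=True)
second_run = filter_new(["https://b.example/1", "https://b.example/2"], store, per_domain=True)
print(first_run, second_run, sorted(store))
```

Because the store outlives the function call, the second call only returns the URL that the first call had not recorded, which is the behaviour a persistent named KVS provides across runs.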

Q: Can I customise the output?
A: Yes. Supply extendOutputFunction as a Python snippet defining extend(soup, article, html) -> dict. The returned dict is merged into the record.
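A minimal snippet in that shape might look like the following. It assumes `soup` behaves like a BeautifulSoup object (so `find_all` and `get_text` are available) and that `html` is the page source as a string; the field names `subheadings` and `htmlLength` are made up for this example.

```python
def extend(soup, article, html):
    """Return extra fields to merge into the record.

    Assumes `soup` is BeautifulSoup-like and `html` is the raw page source.
    """
    # Collect the text of every <h2> as a list of sub-headings.
    headings = [h.get_text(strip=True) for h in soup.find_all("h2")]
    return {
        "subheadings": headings,
        "htmlLength": len(html),
    }
```

Keeping the function free of side effects and returning only new keys avoids accidentally overwriting fields the actor has already extracted.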


🛟 Support & Feedback

Use the Issues tab on the Actor page, or open a discussion on the Apify community forum. Pull requests are welcome.


  • Data is collected only from publicly available sources.
  • Do not scrape private accounts or content behind authentication unless explicitly authorised.
  • The end user is responsible for legal compliance (GDPR, CCPA, anti-spam laws, target site ToS, etc.).
  • The actor honours robots.txt for sitemap discovery; it does not enforce robots.txt blocks on crawl URLs, so please be a good citizen.