Smart Article Extractor
lukaskrivka/article-extractor-smart
📰 Smart Article Extractor extracts articles from any scientific, academic, or news website with just one click. The extractor crawls the whole website and automatically distinguishes articles from other web pages. Download your data as an HTML table, JSON, Excel, RSS feed, and more.
2024-03-21
Features
- Add navigationWaitUntil input option for the browser to allow faster or slower loading depending on the use case
2023-09-12
Features
- Add maxArticlesPerStartUrl to input to limit the number of articles per start URL
2023-08-03
Features
- Add onlyArticlesForLastDays to input for easier dynamic date filtering
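The idea of a relative "last N days" filter can be sketched as follows. This is a minimal illustration of dynamic date filtering in general, not the Actor's actual implementation; the function name `is_recent` and the sample article records are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_recent(published, days, now=None):
    """Return True if `published` is no older than `days` days before `now`.

    Hypothetical helper: the cutoff moves with the current time, so the
    same setting keeps working on every scheduled run.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return published >= cutoff


# Fixed "now" so the example is reproducible.
now = datetime(2023, 8, 3, tzinfo=timezone.utc)
articles = [
    {"title": "fresh", "date": datetime(2023, 8, 1, tzinfo=timezone.utc)},
    {"title": "stale", "date": datetime(2023, 7, 1, tzinfo=timezone.utc)},
]

# Keep only articles published within the last 7 days.
recent = [a["title"] for a in articles if is_recent(a["date"], days=7, now=now)]
```

Compared with a fixed date, a relative window does not need to be edited before each run.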
2023-03-27
Changes
- snapshotUrls output has been replaced by screenshotUrl
- extendOutputFunction is run after all fields were assigned for full control
Fixes
- extendOutputFunction now correctly works with undefined fields for browser
2023-03-20
Features
- Add crawlWholeSubdomain to input so you don't need to set pseudoUrls or linkSelector
- Add onlySubdomainArticles to input to limit articles and enqueueing to the subdomain of the start URL
- Add saveHtmlAsLink to input to save HTML of articles as a link in the output
- Add referrer, startUrl and depth to output
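Restricting a crawl to the subdomain of the start URL can be pictured with a simple hostname comparison. This is only a sketch under the assumption that "same subdomain" means "identical hostname"; the Actor may define it differently, and `same_subdomain` is a hypothetical helper.

```python
from urllib.parse import urlparse


def same_subdomain(start_url, candidate_url):
    """Return True if both URLs share exactly the same hostname.

    Assumption: the subdomain check is an exact hostname match, so
    blog.example.com and www.example.com count as different.
    """
    return urlparse(start_url).hostname == urlparse(candidate_url).hostname


start = "https://blog.example.com/"
links = [
    "https://blog.example.com/post/1",
    "https://www.example.com/about",
    "https://other.org/news",
]

# Only links on the start URL's hostname would be enqueued.
to_enqueue = [link for link in links if same_subdomain(start, link)]
```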
2023-03-01
Features
- Update SDK to version 3
2022-10-13
Features
- Deprecate saveSnapshotsOfInvalidArticles input field in favor of the new saveSnapshots input field, which saves snapshots for all articles
- Deprecate pageWaitSelector and instead add pageWaitSelectorCategory and pageWaitSelectorArticle inputs
2022-09-29
Features
- Added infinite scroll feature for browsers with 3 inputs: scrollToBottom, scrollToBottomButtonSelector, scrollToBottomMaxSecs
2022-09-21
Features
- Nicer messages explaining why an article was marked as invalid
- Added saveSnapshotsOfInvalidArticles option to input
2021-06-17
Features
- Added enqueueFromArticles option to enqueue articles from article pages to get even more articles from the website. You need to enable it in input.
- Added scanSitemaps and sitemapUrls parameters. scanSitemaps automatically searches sitemaps for articles for each start URL, and sitemapUrls allows you to add the sitemaps manually if necessary. Be careful that scanSitemaps may dump a huge amount of (sometimes old) article URLs into the scraping process.
2021-03-12
Fixes
- onlyNewArticles and onlyNewArticlesPerDomain were loading duplicate items, which caused excess dataset reads
2021-03-31
Features
- Added new input option onlyNewArticlesPerDomain. This is a much more efficient way to deduplicate articles, so use it instead of onlyNewArticles. onlyNewArticlesPerDomain also works on local datasets.
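Per-domain deduplication can be sketched as below. How the Actor actually stores previously seen articles is not described here, so this is an assumption-laden illustration: the class `PerDomainSeen` is hypothetical and keeps one set of URLs per hostname, so a lookup only touches the entries for that domain.

```python
from collections import defaultdict
from urllib.parse import urlparse


class PerDomainSeen:
    """Hypothetical in-memory tracker of already-seen article URLs per domain."""

    def __init__(self):
        # Maps hostname -> set of article URLs already seen on that host.
        self._seen = defaultdict(set)

    def is_new(self, url):
        """Return True the first time a URL is seen, False on repeats."""
        domain = urlparse(url).hostname or ""
        if url in self._seen[domain]:
            return False
        self._seen[domain].add(url)
        return True


seen = PerDomainSeen()
urls = [
    "https://a.example/article-1",
    "https://a.example/article-1",  # duplicate, should be skipped
    "https://b.example/article-1",
]

# Keep each article URL only once.
new_urls = [u for u in urls if seen.is_new(u)]
```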
2021-01-21
- Fix: Now works with Start URLs from a public spreadsheet
2020-09-28
- Upgraded Apify version 0.21.0 that sometimes crashed at the start of the run
- Added currentItem param to extendOutputFunction
- Improved logs
- Increased request timeouts to work better on very slow sites
2020-07-07
- Added option to run with browser (Puppeteer)
- Added option to wait for page load or for selector (browser only)
- Added articleUrls as an input option to parse article pages directly
Developer
Maintained by Apify
Actor Metrics
200 monthly users
65 stars
>99% runs succeeded
1.2 days response time
Created in Nov 2019
Modified 4 months ago