Smart Article Extractor

lukaskrivka/article-extractor-smart

📰 Smart Article Extractor extracts articles from any scientific, academic, or news website with just one click. The extractor crawls the whole website and automatically distinguishes articles from other web pages. Download your data as an HTML table, JSON, Excel, RSS feed, and more.

2024-03-21

Features

  • Add navigationWaitUntil input option for browser to allow faster or slower loading depending on the use-case

2023-09-12

Features

  • Add maxArticlesPerStartUrl to input to limit the number of articles per start URL
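To illustrate the idea of a per-start-URL limit, here is a minimal Python sketch. It is not the Actor's implementation: the helper name cap_per_start_url and the sample records are hypothetical, and it only assumes that each output item carries a startUrl field (listed under the 2023-03-20 entry).

```python
from collections import defaultdict


def cap_per_start_url(articles, max_per_start_url):
    """Keep at most max_per_start_url articles per start URL, preserving order.

    Hypothetical helper: it only illustrates what a per-start-URL limit means.
    """
    kept = []
    counts = defaultdict(int)
    for article in articles:
        start_url = article["startUrl"]
        if counts[start_url] < max_per_start_url:
            counts[start_url] += 1
            kept.append(article)
    return kept


# Made-up sample data for illustration only.
sample_articles = [
    {"url": "https://example.com/a", "startUrl": "https://example.com"},
    {"url": "https://example.com/b", "startUrl": "https://example.com"},
    {"url": "https://example.com/c", "startUrl": "https://example.com"},
    {"url": "https://example.org/x", "startUrl": "https://example.org"},
]
capped = cap_per_start_url(sample_articles, 2)
```

With a limit of 2, the third article from the first start URL is dropped while the article from the second start URL is kept.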

2023-08-03

Features

  • Add onlyArticlesForLastDays to input for easier dynamic date filtering
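A "last N days" filter can be sketched as a comparison against a moving cutoff date. The Python snippet below is only an illustration under assumptions: the function name is_within_last_days is hypothetical, and the Actor may treat time zones or missing dates differently.

```python
from datetime import datetime, timedelta, timezone


def is_within_last_days(published_at, days, now=None):
    """Return True if published_at falls within the last `days` days.

    Hypothetical helper; `now` can be passed in to make the check reproducible.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return published_at >= cutoff


# Fixed reference time so the example does not depend on the current date.
reference_now = datetime(2023, 8, 3, tzinfo=timezone.utc)
recent = datetime(2023, 8, 1, tzinfo=timezone.utc)
old = datetime(2023, 7, 1, tzinfo=timezone.utc)
```

Because the cutoff is computed relative to the current time, a scheduled run does not need its date filter edited before each run, which is presumably what "dynamic" refers to here.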

2023-03-27

Changes

  • snapshotUrls output has been replaced by screenshotUrl
  • extendOutputFunction now runs after all fields have been assigned, for full control

Fixes

  • extendOutputFunction now correctly works with undefined fields for browser

2023-03-20

Features

  • Add crawlWholeSubdomain to input so you don't need to set pseudoUrls or linkSelector
  • Add onlySubdomainArticles to input to limit articles and enqueueing to the subdomain of the start URL
  • Add saveHtmlAsLink to input to save HTML of articles as a link in the output
  • Add referrer, startUrl and depth to output
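One simple way to think about limiting a crawl to the subdomain of the start URL is a hostname comparison. The Python sketch below is an assumption about the general idea, not the Actor's actual matching rules; the helper name same_subdomain is hypothetical, and details such as the handling of a "www." prefix may differ.

```python
from urllib.parse import urlsplit


def same_subdomain(start_url, candidate_url):
    """Return True if both URLs share exactly the same hostname.

    Hypothetical helper: a strict hostname match, ignoring scheme, port and path.
    """
    start_host = urlsplit(start_url).hostname
    candidate_host = urlsplit(candidate_url).hostname
    return start_host is not None and start_host == candidate_host


# Made-up URLs for illustration only.
start = "https://blog.example.com/"
inside = "https://blog.example.com/2023/03/some-article"
outside = "https://shop.example.com/product/1"
```

Under this strict rule, links on shop.example.com would not be enqueued when the start URL is on blog.example.com.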

2023-03-01

Features

  • Update SDK to version 3

2022-10-13

Features

  • Deprecate saveSnapshotsOfInvalidArticles input field in favor of the new saveSnapshots input field, which saves snapshots for all articles.
  • Deprecate pageWaitSelector and instead add pageWaitSelectorCategory and pageWaitSelectorArticle inputs

2022-09-29

Features

  • Added infinite scroll feature for browsers with 3 inputs: scrollToBottom, scrollToBottomButtonSelector, scrollToBottomMaxSecs

2022-09-21

Features

  • Nicer messages explaining why an article was marked as invalid
  • Added saveSnapshotsOfInvalidArticles option to input

2021-06-17

Features

  • Added enqueueFromArticles option to enqueue articles from article pages to get even more articles from the website. You need to enable it in input.
  • Added scanSitemaps and sitemapUrls parameters. scanSitemaps automatically searches sitemaps for articles for each start URL, and sitemapUrls lets you add sitemaps manually if necessary. Be careful: scanSitemaps may dump a huge number of (sometimes old) article URLs into the scraping process.
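To show why a sitemap scan can add many URLs at once, here is a minimal Python sketch that pulls every loc entry out of a standard XML sitemap. It is an illustration only: the function name urls_from_sitemap and the sample document are hypothetical, and the Actor's own sitemap handling is not described in this changelog.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text):
    """Return all <loc> values found in a sitemap XML string."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


# Made-up sitemap for illustration only.
sample_sitemap = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/news/first-article</loc></url>
  <url><loc>https://example.com/news/second-article</loc></url>
</urlset>
""".strip()

sitemap_urls = urls_from_sitemap(sample_sitemap)
```

A real sitemap can list thousands of URLs, including old ones, so combining a sitemap scan with a date filter or an article limit is worth considering.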

2021-03-12

Fixes

  • onlyNewArticles and onlyNewArticlesPerDomain were loading duplicate items, which caused excessive dataset reads.

2021-03-31

Features

  • Added new input option onlyNewArticlesPerDomain. This is a much more efficient way to deduplicate articles, so use it instead of onlyNewArticles.
  • onlyNewArticlesPerDomain also works on local datasets
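The idea of per-domain deduplication can be sketched as keeping a separate set of already-seen URLs for each domain. The Python snippet below is a guess at the general approach, not the Actor's implementation: the helper name filter_new_articles and the in-memory dictionary are hypothetical stand-ins for whatever persistent storage the Actor uses.

```python
from urllib.parse import urlsplit


def filter_new_articles(urls, seen_by_domain):
    """Return URLs not seen before, recording them per domain as a side effect.

    seen_by_domain maps a hostname to the set of article URLs already stored.
    """
    new_urls = []
    for url in urls:
        domain = urlsplit(url).hostname
        seen = seen_by_domain.setdefault(domain, set())
        if url not in seen:
            seen.add(url)
            new_urls.append(url)
    return new_urls


# Made-up state and URLs for illustration only.
state = {"example.com": {"https://example.com/old-article"}}
first_run = filter_new_articles(
    ["https://example.com/old-article", "https://example.com/new-article"],
    state,
)
second_run = filter_new_articles(["https://example.com/new-article"], state)
```

Splitting the state by domain means a run presumably only needs to load the records for the domains it actually crawls, which may be one reason this option is described as more efficient.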

2021-01-21

  • Fix: Now works with Start URLs from a public spreadsheet

2020-09-28

  • Upgraded the Apify SDK from version 0.21.0, which sometimes crashed at the start of the run
  • Added currentItem param to extendOutputFunction
  • Improved logs
  • Increased request timeouts to work better on very slow sites

2020-07-07

  • Added option to run with browser (Puppeteer)
  • Added option to wait for page load or for selector (browser only)
  • Added articleUrls as an input option to parse article pages directly
Developer
Maintained by Apify
Actor metrics
  • 172 monthly users
  • 73.8% runs succeeded
  • 2.8 days response time
  • Created in Nov 2019
  • Modified about 1 month ago