
Smart Article Extractor

lukaskrivka/article-extractor-smart

📰 Smart Article Extractor extracts articles from any scientific, academic, or news website with just one click. The extractor crawls the whole website and automatically distinguishes articles from other web pages. Download your data as an HTML table, JSON, Excel, RSS feed, and more.

Author: Lukáš Křivka
  • Users: 899
  • Runs: 96,511

Website/category URLs

startUrls

Optional

array

These can be the main page URL or any category/subpage URL, e.g. https://www.bbc.com/. Article pages are detected and crawled from these. If you prefer to supply direct article URLs, use the `articleUrls` input instead.

Article URLs

articleUrls

Optional

array

These are direct URLs for the articles to be extracted, e.g. https://www.bbc.com/news/uk-62836057. No extra pages are crawled from article pages.
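
For illustration, a minimal input sketch combining the two URL inputs. It assumes the common Apify request-list shape for URL entries (`{ url: ... }`); treat the exact shape as an assumption and check the input schema if plain strings are expected instead.

```typescript
// Minimal input sketch (not the authoritative schema): category pages go
// into startUrls, individual articles into articleUrls.
const input = {
  // Category/home pages: articles are detected and enqueued from these.
  startUrls: [{ url: 'https://www.bbc.com/' }],
  // Direct article pages: extracted as-is, no further crawling from them.
  articleUrls: [{ url: 'https://www.bbc.com/news/uk-62836057' }],
};
```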

Only new articles (only for small runs)

onlyNewArticles

Optional

boolean

This option is only viable for smaller runs. If you plan to use this on a large scale, use the 'Only new articles (saved per domain)' option below instead. If this option is selected, the extractor will only scrape new articles each time you run it. (Scraped URLs are saved in a dataset named `articles-state` and compared with newly found ones.)

Only new articles (saved per domain, preferable)

onlyNewArticlesPerDomain

Optional

boolean

If this option is selected, the extractor will only scrape new articles each time you run it. (Scraped articles are saved in one dataset per domain, named 'ARTICLES-SCRAPED-domain', and compared with newly found ones.)

Only inside domain articles

onlyInsideArticles

Optional

boolean

If this option is selected, the extractor will only scrape articles hosted on the same domain they are linked from. If a page links to articles on a different domain, those articles will not be scraped, e.g. https://www.bbc.com/ vs. https://www.bbc.co.uk/.

Enqueue articles from articles

enqueueFromArticles

Optional

boolean

Normally, the scraper only extracts articles from category pages. This option allows the scraper to also extract articles linked within articles.

Find articles in sitemaps (caution)

scanSitemaps

Optional

boolean

We recommend using `Sitemap URLs` instead. If this option is selected, the extractor will scan the sitemaps it discovers from the initial URLs. Keep in mind that this can load a huge number of (sometimes old) articles, which increases the time and cost of the scrape.

Sitemap URLs (safer)

sitemapUrls

Optional

array

You can provide selected sitemap URLs that include the articles you need to extract.
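
A hedged sketch of the two sitemap options, assuming `sitemapUrls` accepts plain URL strings; the sitemap URL below is a hypothetical placeholder.

```typescript
const input = {
  startUrls: [{ url: 'https://www.example.com/' }],
  // Safer: scan only the sitemaps you list here (hypothetical URL).
  sitemapUrls: ['https://www.example.com/sitemap-news.xml'],
  // Riskier: discover and scan sitemaps automatically; this can enqueue a
  // huge number of old articles and increase cost, so it is left off here.
  scanSitemaps: false,
};
```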

Save full HTML

saveHtml

Optional

boolean

If this option is selected, the scraper will save the full HTML of the article page, but this will make the data less readable.

Save HTML and screenshots of article pages

saveSnapshots

Optional

boolean

Stores HTML and screenshot for each article page to Key-Value Store. Useful for debugging.

Use Googlebot headers

useGoogleBotHeaders

Optional

boolean

This option will allow you to bypass protection and paywalls on some websites. Use with caution as it might lead to getting blocked.

Minimum words

minWords

Optional

integer

The article must contain at least this number of words to be extracted.

Extract articles from [date]

dateFrom

Optional

string

Only articles from this date onward will be scraped. If empty, all articles will be scraped. Use the YYYY-MM-DD format, e.g. 2019-12-31, or a relative expression such as '1 week' or '20 days'.

Must have date

mustHaveDate

Optional

boolean

If checked, the article must have a date of release to be extracted.
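
The filtering options above can be combined. A sketch with assumed example values:

```typescript
const input = {
  startUrls: [{ url: 'https://www.example.com/' }],
  minWords: 150,           // skip pages with fewer than 150 words of text
  dateFrom: '2023-01-01',  // absolute date (YYYY-MM-DD) ...
  // dateFrom: '1 week',   // ... or a relative expression
  mustHaveDate: true,      // drop articles without a detected publication date
};
```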

Is the URL an article?

isUrlArticleDefinition

Optional

object

Here you can provide JSON settings that define which URLs the scraper should treat as articles. If any of the checks evaluates to `true`, the link is opened and the article extracted.
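
As an illustration only, a sketch of such a definition; the key names (`minDashes`, `hasDate`, `linkIncludes`) are assumptions based on typical usage of this actor, so check the actor's README for the exact supported fields.

```typescript
const input = {
  startUrls: [{ url: 'https://www.example.com/' }],
  // Hypothetical checks: a URL is treated as an article if any of them passes.
  isUrlArticleDefinition: {
    minDashes: 4,                               // URL path contains at least 4 dashes
    hasDate: true,                              // URL contains a date segment
    linkIncludes: ['article', 'story', 'blog'], // URL contains one of these substrings
  },
};
```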

Pseudo URLs

pseudoUrls

Optional

array

Use this to enqueue additional pages, such as pagination or category links. It does not apply to articles, which are detected automatically by the article recognition system.
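
A sketch assuming the standard Apify pseudo-URL shape, where each entry is an object with a `purl` and square brackets enclose a regular expression; the patterns themselves are hypothetical.

```typescript
const input = {
  startUrls: [{ url: 'https://www.example.com/' }],
  // Also enqueue category and pagination pages matching these patterns.
  pseudoUrls: [
    { purl: 'https://www.example.com/category/[.+]' },
    { purl: 'https://www.example.com/news[.*]page=[0-9]+' },
  ],
};
```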

Max depth

maxDepth

Optional

integer

Maximum crawling depth, i.e. how many times the scraper follows links to other web pages. Level 0 refers to the start URLs, level 1 to links found on them, and so on. This only applies to pseudo URLs.

Max pages per crawl

maxPagesPerCrawl

Optional

integer

Maximum number of total pages crawled. It includes the home page, pagination pages, invalid articles, and so on. The crawler will stop automatically after reaching this number.

Max articles per crawl

maxArticlesPerCrawl

Optional

integer

Maximum number of valid articles scraped. The crawler will stop automatically after reaching this number.

Max concurrency

maxConcurrency

Optional

integer

Maximum number of pages processed in parallel. You can lower this to slow the scraper down and avoid getting blocked.
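
The limits above can be combined to cap run time and cost. A sketch with assumed values:

```typescript
const input = {
  startUrls: [{ url: 'https://www.example.com/' }],
  maxDepth: 2,              // follow pseudo-URL links at most 2 levels deep
  maxPagesPerCrawl: 500,    // cap on all loaded pages (home, pagination, invalid articles, ...)
  maxArticlesPerCrawl: 200, // stop once 200 valid articles have been scraped
  maxConcurrency: 10,       // fewer parallel requests, lower risk of getting blocked
};
```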

Proxy configuration

proxyConfiguration

Optional

object

Proxy configuration
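
A sketch using the standard Apify proxy input shape; the proxy group name is an example, not a recommendation.

```typescript
const input = {
  startUrls: [{ url: 'https://www.example.com/' }],
  proxyConfiguration: {
    useApifyProxy: true,               // route requests through Apify Proxy
    apifyProxyGroups: ['RESIDENTIAL'], // example group; omit for automatic selection
  },
};
```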

Use browser (Puppeteer)

useBrowser

Optional

boolean

This option is more expensive, but it allows you to evaluate JavaScript and wait for dynamically loaded data.

Wait on each page (ms)

pageWaitMs

Optional

integer

How many milliseconds to wait on each page before extracting data

Wait for selector on each category page

pageWaitSelectorCategory

Optional

string

CSS selector to wait for on each category page before extracting data.

Wait for selector on each article page

pageWaitSelectorArticle

Optional

string

CSS selector to wait for on each article page before extracting data.

Scroll to bottom of the page (infinite scroll)

scrollToBottom

Optional

boolean

Scroll to the bottom of the page to load dynamically added articles.

Scroll to bottom button selector

scrollToBottomButtonSelector

Optional

string

CSS selector for a button to load more articles

Scroll to bottom max seconds

scrollToBottomMaxSecs

Optional

integer

Limit on how long the scrolling can run, so that it does not continue indefinitely.
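
The browser-related options above typically come into play together on JavaScript-heavy sites. A sketch with hypothetical selectors:

```typescript
const input = {
  startUrls: [{ url: 'https://www.example.com/' }],
  useBrowser: true,                      // render pages with Puppeteer (more expensive)
  pageWaitMs: 2000,                      // extra wait on every page, in milliseconds
  pageWaitSelectorCategory: '.teaser',   // hypothetical selector to wait for on category pages
  pageWaitSelectorArticle: 'article h1', // hypothetical selector to wait for on article pages
  scrollToBottom: true,                  // trigger infinite scroll to load more articles
  scrollToBottomButtonSelector: 'button.load-more', // hypothetical "load more" button
  scrollToBottomMaxSecs: 60,             // stop scrolling after 60 seconds
};
```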

Extend output function

extendOutputFunction

Optional

string

This function allows you to merge your custom extraction with the default one. It must return an object, which is then merged with (and can overwrite fields of) the default output for each article.
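
A sketch of a custom extraction, assuming the function is passed as a string and receives a Cheerio-style `$` handle (verify the exact signature in the actor's README); the returned fields are merged into each article's default output.

```typescript
const input = {
  articleUrls: [{ url: 'https://www.bbc.com/news/uk-62836057' }],
  extendOutputFunction: `($) => {
    // Hypothetical extra fields merged into the default article output.
    return {
      headline: $('h1').first().text().trim(),
      imageCount: $('article img').length,
    };
  }`,
};
```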

Limit CU consumption

stopAfterCUs

Optional

integer

The scraper will stop running after reaching this number of compute units.

Email addresses for notifications

notificationEmails

Optional

array

Notifications will be sent to these email addresses.

Notify after [number] CUs

notifyAfterCUs

Optional

integer

The scraper will send notifications to the provided email when it reaches this number of CUs.

Notify every [number] CUs

notifyAfterCUsPeriodically

Optional

integer

The scraper will send a notification to the provided emails each time this many additional CUs have been consumed since the last notification.
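
Putting it together, a hedged sketch of running the actor from Node.js with the apify-client package; the token and email address are placeholders, and the input mirrors the options documented above.

```typescript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'MY_APIFY_TOKEN' }); // placeholder token

// Start the actor and wait for it to finish.
const run = await client.actor('lukaskrivka/article-extractor-smart').call({
  startUrls: [{ url: 'https://www.example.com/' }],
  onlyNewArticlesPerDomain: true,         // incremental scraping, state kept per domain
  stopAfterCUs: 50,                       // hard cost cap in compute units
  notificationEmails: ['me@example.com'], // placeholder address
  notifyAfterCUsPeriodically: 10,         // email after every additional 10 CUs
});

// Read the scraped articles from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Scraped ${items.length} articles`);
```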