
Smart Article Extractor

lukaskrivka/article-extractor-smart

Article Extractor scrapes detailed data from all articles on any website. It automatically recognizes whether a page is an article. It can be used to extract news from BBC, CNN, Bloomberg, and other popular news websites.

Author: Lukáš Křivka
  • Users: 354
  • Runs: 48,127

Category URLs

startUrls

Optional

array

Can be a main page URL or any category URLs. Article pages are found and enqueued from these. If you want to supply direct article URLs, use the `articleUrls` input instead.

Article URLs

articleUrls

Optional

array

Direct URLs to the articles to be parsed. No extra pages are enqueued from article pages.
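
As an illustration, a minimal input combining both fields might look like this (the URLs are placeholders; the `{ "url": ... }` object shape follows the usual Apify start-URL convention):

```json
{
  "startUrls": [{ "url": "https://www.bbc.com/news" }],
  "articleUrls": [{ "url": "https://www.bbc.com/news/world-12345678" }]
}
```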

Only new articles (only for small runs)

onlyNewArticles

Optional

boolean

This option is only viable for smaller runs. If you plan to use this at a large scale, use `onlyNewArticlesPerDomain` instead. If true, only new articles will be scraped each time you run the actor. All URLs you have scraped are saved in a dataset called `articles-state` and are compared with newly found ones.

Only new articles (saved per domain, preferable)

onlyNewArticlesPerDomain

Optional

boolean

If true, only new articles will be scraped each time you run the actor. Previously scraped URLs are compared with newly found ones. Scraped articles are saved in one dataset per domain; the datasets are named `ARTICLES-SCRAPED-domain`.

Only inside domain articles

onlyInsideArticles

Optional

boolean

If true, only articles on the same domain as the page that links to them will be scraped.

Enqueue articles from articles

enqueueFromArticles

Optional

boolean

Normally, the scraper only enqueues articles from category pages. This option also enqueues links found on article pages, which can help gather more articles per run.

Find articles in sitemaps (dangerous)

scanSitemaps

Optional

boolean

Scans the sitemaps of the first start URL for articles. Be very careful with this option, as it can load a huge number of (sometimes old) articles, and the scrape time/cost will rise accordingly.

Sitemap URLs (safer)

sitemapUrls

Optional

array

Optionally, you can provide specific sitemap URLs that contain the articles you need to extract.
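
For example (assuming the field accepts a plain array of sitemap URL strings; the URL is a placeholder):

```json
{
  "sitemapUrls": ["https://example.com/sitemap.xml"]
}
```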

Save full HTML

saveHtml

Optional

boolean

Saves the full HTML of the article page, but makes the data less readable.

Use Google Bot headers

useGoogleBotHeaders

Optional

boolean

This option can allow you to bypass protection and/or paywalls on some sites. Use it with caution, as some sites may block requests with these headers.

Minimum words

minWords

Optional

integer

An article must contain at least this number of words to be extracted.

Date from

dateFrom

Optional

string

Only articles from this day to the present will be scraped. If empty, all articles will be scraped. The format is YYYY-MM-DD, e.g. 2019-12-31, or a relative value, e.g. 1 week or 20 days.

Must have date

mustHaveDate

Optional

boolean

If checked, the article must have a date of release to be considered valid.

Is URL article?

isUrlArticleDefinition

Optional

object

JSON settings defining what is considered a link to an article. If any of the conditions is true, the link will be opened.
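
A sketch of what such a definition might look like; the field names here (`minDashes`, `hasDate`, `linkIncludes`) are illustrative, so check the actor's README for the exact supported keys:

```json
{
  "isUrlArticleDefinition": {
    "minDashes": 4,
    "hasDate": true,
    "linkIncludes": ["article", "/story/", ".html"]
  }
}
```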

Pseudo URLs

pseudoUrls

Optional

array

Can be used to enqueue more pages, like pagination or category pages. Doesn't apply to articles; those are found by the article recognition system.
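
For example, to enqueue paginated category pages (a sketch using Apify's pseudo-URL bracket syntax; the URL pattern is a placeholder):

```json
{
  "pseudoUrls": [
    { "purl": "https://example.com/category/page/[\\d+]" }
  ]
}
```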

Max depth

maxDepth

Optional

integer

Maximum depth of crawling: 0 means only the start URLs, 1 means first-level links, etc. Only applies to pseudo URLs.

Max pages per crawl

maxPagesPerCrawl

Optional

integer

Maximum number of total pages crawled. Includes the home page, pagination pages, invalid articles, etc.

Max articles per crawl

maxArticlesPerCrawl

Optional

integer

Maximum number of valid articles scraped. The crawler will stop automatically after reaching this number.

Max concurrency

maxConcurrency

Optional

integer

You can limit the speed (concurrency) of the scraper. Don't forget to lower the memory allocation too, to save Compute units.

Proxy configuration

proxyConfiguration

Optional

object

Proxy configuration
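
A typical value is the standard Apify proxy object, for example (a sketch; proxy groups and other options are optional):

```json
{
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
```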

Use browser (Puppeteer)

useBrowser

Optional

boolean

Using a browser is more expensive, but gives you the ability to evaluate JavaScript and wait for dynamically loaded data.

Wait for on each page

pageWaitMs

Optional

integer

How long (in milliseconds) to wait on each page before extracting data.

Wait for selector on each page

pageWaitSelector

Optional

string

A CSS selector to wait for on each page before extracting data.

Extend output function

extendOutputFunction

Optional

string

A function that allows you to merge your custom extraction with the default one. You have to return an object from this function; it will be merged with, and can overwrite, the default output for each article.
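
A minimal sketch of such a function, assuming it receives a Cheerio-style handle `$` (the `.author-bio` selector is made up for illustration):

```javascript
// Sketch of an extend output function. `$` is assumed to be a Cheerio-style
// handle to the article page; `.author-bio` is an illustrative selector only.
const extendOutputFunction = ($) => {
  return {
    // Custom fields merged into (and possibly overwriting) the default output:
    authorBio: $ ? $('.author-bio').text().trim() : null,
    scrapedAt: new Date().toISOString(),
  };
};
```

Because the returned object is merged over the default output, returning a key such as `title` here would overwrite the default `title` field for each article.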

Stop after CUs

stopAfterCUs

Optional

integer

The actor run will finish after reaching the given number of Compute units.

Notification emails

notificationEmails

Optional

array

Emails to which the notifications below should be sent.

Notify after CUs

notifyAfterCUs

Optional

integer

The actor will send a notification to the provided emails when it reaches the given number of Compute units.

Notify every CUs (periodically)

notifyAfterCUsPeriodically

Optional

integer

The actor will send a notification to the provided emails every time the given number of Compute units has been consumed since the last notification.