Smart Article Extractor

lukaskrivka/article-extractor-smart

Given a website URL, this actor extracts detailed data from all articles on that site. It is a smarter version of Article Extractor that automatically recognizes which pages are articles. It can be used to extract news from BBC, CNN, Bloomberg, and other popular news websites.

Smart article extractor

This actor is an extension of Apify's Article Text Extractor. It has several extra features:

  • Extraction from any number of URLs - support for Start URLs, Pseudo URLs and a maximum crawling depth
  • Smart article recognition - The actor decides which pages on a website are in fact articles to be scraped. This is customizable.
  • Additional filters - Article date, minimum number of words
  • Date normalization
  • Some extra data fields
  • Custom scraping function - You can add your own fields or overwrite existing ones using the parsed HTML
  • Optional Google Bot headers (bypassing paywalls)

Example output:

More detailed documentation to come...

Extend output function (optional)

You can use this function to update the default output of this actor. The function receives a jQuery handle $ as an argument, so you can choose what data from the page you want to scrape. The output of this function is merged with the default output.

The return value of this function has to be an object!

You can return fields to achieve 3 different things:

  • Add a new field - Return object with a field that is not in the default output
  • Change a field - Return an existing field with a new value
  • Remove a field - Return an existing field with the value undefined
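
The three cases above can be illustrated with a small sketch of the merge step. This is not the actor's actual source code: mergeOutput and the sample objects are hypothetical names, and the sketch assumes a shallow merge in which fields set to undefined are dropped from the result.

```javascript
// Hypothetical helper, NOT the actor's real implementation.
// Assumes a shallow merge where undefined values remove a field.
function mergeOutput(defaultOutput, extension) {
    // The extend output function has to return an object.
    if (extension === null || typeof extension !== 'object' || Array.isArray(extension)) {
        throw new TypeError('Extend output function must return an object');
    }
    // Fields from the extension win over fields from the default output.
    const merged = { ...defaultOutput, ...extension };
    // Drop every field whose final value is undefined.
    for (const key of Object.keys(merged)) {
        if (merged[key] === undefined) delete merged[key];
    }
    return merged;
}

// Illustrative default output (field values are made up for this sketch).
const defaultOutput = {
    title: 'Example headline',
    date: null,
    links: ['https://example.com/a'],
    videos: [],
};

const merged = mergeOutput(defaultOutput, {
    links: undefined,        // remove a field
    videos: undefined,       // remove a field
    pageTitle: 'Example',    // add a new field
    date: '2020-01-01',      // change a field
});

console.log(merged);
```

In this sketch the merged object keeps title, gets the new date and pageTitle, and no longer contains links or videos.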

Let's say that you want to accomplish the following:

  • Remove links and videos fields from the output
  • Add a pageTitle field
  • Change the date selector (In rare cases the scraper is not able to find it)
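($) => {
    return {
        // Returning undefined removes these fields from the output
        links: undefined,
        videos: undefined,
        // New field that is not part of the default output
        pageTitle: $('title').text(),
        // Overwrites the default date with the text of a custom selector
        date: $('.my-date-selector').text(),
    };
}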
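
One caveat with the example above: if .my-date-selector matches nothing on a page, .text() returns an empty string, which would replace whatever date the scraper found on its own. A more defensive variant could look like the sketch below. The name extendOutputFunction is only used here to make the snippet self-contained, and the selector is the same placeholder as above.

```javascript
// Sketch of a more defensive extend output function.
// The variable name is illustrative; the actor only needs the function itself.
const extendOutputFunction = ($) => {
    const result = {
        // Returning undefined removes these fields from the output
        links: undefined,
        videos: undefined,
        // Trim surrounding whitespace from the page title
        pageTitle: $('title').text().trim(),
    };

    // Only override the date when the custom selector yields some text,
    // so the date detected by the scraper is kept otherwise.
    const customDate = $('.my-date-selector').first().text().trim();
    if (customDate) {
        result.date = customDate;
    }

    return result;
};
```

Leaving date out of the returned object, instead of returning an empty value, means the default date field stays untouched on pages where the selector does not match.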