Smart Article Extractor

  • lukaskrivka/article-extractor-smart
  • Users 2.2k
  • Runs 410.9k
  • Created by Lukáš Křivka

📰 Smart Article Extractor extracts articles from any scientific, academic, or news website with just one click. The extractor crawls the whole website and automatically distinguishes articles from other web pages. Download your data as an HTML table, JSON, Excel, RSS feed, and more.

Smart Article Extractor scrapes articles from any academic, scientific, or news website or blog with just a single click. It uses a smart algorithm to decide what pages are actually articles and automatically extracts information from them.

What does Smart Article Extractor do?

If you want to download articles from websites, this tool will help you extract content using smart scraping features:

✅ Allows opening pages with a browser (Puppeteer) which can wait for dynamically loaded data

✅ Allows extraction of articles from any number of URLs

✅ Smart article recognition - the extractor can decide what pages on a website are in fact articles to be scraped (this function is customizable)

✅ Additional filters - date of articles, minimum words, and more

✅ Allows custom scraping function - you can add/overwrite your own fields from the parsed HTML

✅ Allows usage of Googlebot headers (which can bypass some paywalls)

Why extract articles with Smart Article Extractor?

👉 Academic research: You can use Smart Article Extractor to download multiple articles and build a corpus from them for research and article citations.

👉 Journalism: If you want to know more about how extracting articles with this tool can help text analysis and data journalism, you might like to read "Terror or Clickbait?" or "Czech media and their word choices".

👉 Fight fake news: Monitor content by selected media to react promptly if they publish misinformation.

👉 Save time: Whatever your reason for collecting articles with Smart Article Extractor, you will definitely save a lot of time and energy.

Extracting articles is legal, as you are scraping publicly available content. Please be aware that most articles are protected by copyright laws. Before you publish extracted articles anywhere, check the terms of use of the scraped website.

How many results can you scrape with Smart Article Extractor?

Smart Article Extractor can typically return thousands of results. Keep in mind, though, that scraping news websites involves many variables, so results fluctuate from case to case and there is no one-size-fits-all number. The maximum number of results depends on the complexity of the input, location, and other factors. Some of the most frequent causes are:

  • website gives a different number of results depending on the type/value of the input
  • website has an internal limit that no scraper can cross
  • scraper has a limit that we are working on improving

Therefore, while we regularly run Actor tests to keep the benchmarks in check, the results may also fluctuate without our knowing. The best way to know for sure for your particular use case is to do a test run yourself.

How much will scraping articles with Smart Article Extractor cost you?

When it comes to scraping, it can be challenging to estimate the resources needed to extract data as use cases may vary significantly. That's why the best course of action is to run a test scrape with a small sample of input data and limited output. You’ll get your price per scrape, which you’ll then multiply by the number of scrapes you intend to do.
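The multiplication described above can be sketched in a few lines. This is only an illustration: the helper name `estimateTotalCost`, its parameter names, and the numbers are made up, and the real price per scrape has to come from your own test run.

```javascript
// Hypothetical helper: multiply the price of one test scrape by the number of planned scrapes.
function estimateTotalCost({ pricePerScrapeUsd, plannedScrapes }) {
    if (!(pricePerScrapeUsd >= 0) || !Number.isInteger(plannedScrapes) || plannedScrapes < 0) {
        throw new Error('Expected a non-negative price and a non-negative whole number of scrapes.');
    }
    return pricePerScrapeUsd * plannedScrapes;
}

// Made-up numbers for illustration only; take the real price from your own test run.
const total = estimateTotalCost({ pricePerScrapeUsd: 0.4, plannedScrapes: 30 });
console.log(`Estimated total: $${total.toFixed(2)}`);
```

The estimate is linear, so it only holds if your later scrapes resemble the test scrape in input size and limits.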

Watch this video for a few helpful tips. And don't forget that choosing a higher plan will save you money in the long run.

⚠️ This can be a high-consumption actor if you don't set limits. Please make sure you set a compute unit limit in the Limit CU consumption field. ⚠️

How do I extract articles with Smart Article Extractor?

Smart Article Extractor can be run as an Apify actor on the Apify platform where it is seamlessly integrated with a nice input UI. You can also run it locally or on any other infrastructure.

On the Apify platform:

  1. Click on Try for free.
  2. Enter the URL of the website(s) you want to scrape (and other input fields to narrow down the search).
  3. Click on Save & Start.
  4. When Smart Article Extractor has finished, preview or download your results from the Output tab.

For more detailed instructions, read our step-by-step guide on how to extract articles.

Output example

If you run Smart Article Extractor on the Apify platform, you can get the output in many formats, like JSON, CSV, XML, Excel, RSS, and more. Here is a JSON example:

{
  "url": "https://www.thetimes.co.uk/edition/news/ex-mp-charlie-elphicke-sang-i-m-a-naughty-tory-after-groping-woman-court-told-nnr6nlw89",
  "loadedUrl": "https://www.thetimes.co.uk/edition/news/ex-mp-charlie-elphicke-sang-i-m-a-naughty-tory-after-groping-woman-court-told-nnr6nlw89",
  "title": "Ex-MP Charlie Elphicke sang ‘I’m a naughty Tory’ after groping woman, court told",
  "softTitle": "Ex-MP Charlie Elphicke sang ‘I’m a naughty Tory’ after groping woman, court told",
  "date": "2020-07-07T12:13:00.000Z",
  "author": [ "Fariha Karim" ],
  "publisher": null,
  "copyright": "Times Newspapers Limited 2020",
  "favicon": "/d/img/icons/favicon-ab3ea01fbe.ico",
  "description": "A woman broke down in tears as she told a court today how a former Tory MP sexually assaulted her at his home while his children were in bed.The woman, who cannot be identified for legal reasons, told",
  "lang": "en",
  "canonicalLink": "https://www.thetimes.co.uk/article/ex-mp-charlie-elphicke-sang-i-m-a-naughty-tory-after-groping-woman-court-told-nnr6nlw89",
  "tags": [],
  "image": "https://www.thetimes.co.uk/imageserver/image/%2Fmethode%2Ftimes%2Fprod%2Fweb%2Fbin%2Fdfdec16c-bf85-11ea-bb37-3d3cce807650.jpg?crop=3023%2C1700%2C238%2C316&resize=685",
  "videos": [],
  "links": [],
  "text": "A woman broke down in tears as she told a court today how a former Tory MP sexually assaulted her at his home while his children were in bed.\n\nThe woman, who cannot be identified for legal reasons, told Southwark crown court that Charlie Elphicke had invited her for a drink in 2007 while his wife Natalie was away on a business trip.\n\nShe said that the children were in bed and she had a cup of tea while Mr Elphicke drank wine in the garden and they chatted.\n\nAfter about an hour, she said, “the weather changed so he suggested they go inside to the lounge” and they shared a £40 bottle of wine.\n\nShe said they carried on talking in the living room"
}
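Once you have downloaded items in this shape, you may want to post-process them yourself. The sketch below filters items by word count and publication date using only the `url`, `date`, and `text` fields from the example above. The helper names `countWords` and `filterArticles` and the sample items are hypothetical, and this is a client-side illustration, not the extractor's own filtering logic.

```javascript
// Count whitespace-separated words in the extracted article text.
function countWords(text) {
    return (text || '').trim().split(/\s+/).filter(Boolean).length;
}

// Keep items that have at least `minWords` words and, if `dateFrom` is set,
// a parseable date on or after it. Items without a parseable date are dropped
// whenever a date filter is active.
function filterArticles(items, { minWords = 0, dateFrom = null } = {}) {
    const from = dateFrom ? new Date(dateFrom).getTime() : null;
    return items.filter((item) => {
        if (countWords(item.text) < minWords) return false;
        if (from !== null) {
            const published = Date.parse(item.date);
            if (Number.isNaN(published) || published < from) return false;
        }
        return true;
    });
}

// Made-up sample items that mimic the output shape.
const items = [
    { url: 'https://example.com/a', date: '2020-07-07T12:13:00.000Z', text: 'A woman broke down in tears as she told a court today' },
    { url: 'https://example.com/b', date: '2019-01-01T00:00:00.000Z', text: 'Too short' },
    { url: 'https://example.com/c', date: null, text: 'An article without a date that still has plenty of words in it' },
];

const kept = filterArticles(items, { minWords: 5, dateFrom: '2020-01-01' });
console.log(kept.map((item) => item.url));
```

Dropping undated items under a date filter is a design choice of this sketch; you could just as well keep them and review them manually.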

Extend output function

You can use this optional function to update the default output of this actor. This function gets a jQuery handle $ as an argument, so you can choose what data from the page you want to scrape. It also receives the currentItem parameter, which is the default output parsed by the scraper, so you can explore any of its fields. The output from this function will get merged with the default output.

The return value of this function has to be an object!

You can return fields to achieve 3 different things:

  • Add a new field - Return an object with a field that is not in the default output
  • Change a field - Return an existing field with a new value
  • Remove a field - Return an existing field with the value undefined

Let's say you want to accomplish this:

  • Remove links and videos fields from the output
  • Add a pageTitle field
  • Change the date selector (In rare cases the scraper is not able to find it)
  • Save the original date parsed so you can compare with your date

($, currentItem) => {
    return {
        links: undefined,
        videos: undefined,
        pageTitle: $('title').text(),
        date: $('.my-date-selector').text(),
        originalDate: currentItem.date,
    }
}
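To make the add, change, and remove rules concrete, here is a minimal sketch of how such a merge could behave. It is an assumption for illustration, not the actor's actual code: `mergeOutput` is a hypothetical helper, and the sample values stand in for what the jQuery selectors would return on a real page.

```javascript
// Assumed merge behavior: returned fields overwrite defaults,
// and fields returned as undefined are removed from the result.
function mergeOutput(defaultItem, extension) {
    if (extension === null || typeof extension !== 'object' || Array.isArray(extension)) {
        throw new Error('The extend output function has to return an object!');
    }
    const merged = { ...defaultItem, ...extension };
    for (const key of Object.keys(merged)) {
        if (merged[key] === undefined) delete merged[key];
    }
    return merged;
}

// Made-up default output where the scraper did not find a date.
const currentItem = { title: 'Example', date: null, links: [], videos: [] };

// What the example function might return for that page (selector results are made up).
const extension = {
    links: undefined,
    videos: undefined,
    pageTitle: 'Example | Site',
    date: '2020-07-07',
    originalDate: currentItem.date,
};

console.log(mergeOutput(currentItem, extension));
```

In this sketch, links and videos disappear, date is replaced, and pageTitle and originalDate are added, while title passes through untouched.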

Integrations and Smart Article Extractor

Last but not least, Smart Article Extractor can be connected with almost any cloud service or web app thanks to integrations on the Apify platform. You can integrate with Make, Zapier, Slack, Airbyte, GitHub, Google Sheets, Google Drive, and more. Or you can use webhooks to carry out an action whenever an event occurs, e.g. get a notification whenever Smart Article Extractor successfully finishes a run.
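As a rough sketch of the webhook idea, the function below turns a webhook payload into a notification message. It assumes the payload follows Apify's default template, with an `eventType` such as `ACTOR.RUN.SUCCEEDED` and a `resource` object describing the run. The helper name `describeRunWebhook` and the IDs are hypothetical, so check the payload your own webhook sends before relying on these fields.

```javascript
// Build a short notification message from a webhook payload.
// Assumes the default payload template: { eventType, resource: { id, defaultDatasetId, ... } }.
function describeRunWebhook(payload) {
    const { eventType, resource = {} } = payload;
    if (eventType !== 'ACTOR.RUN.SUCCEEDED') {
        return `Run ${resource.id} ended with event ${eventType}`;
    }
    return `Run ${resource.id} succeeded; results are in dataset ${resource.defaultDatasetId}`;
}

// Made-up payload with placeholder IDs.
const samplePayload = {
    eventType: 'ACTOR.RUN.SUCCEEDED',
    resource: { id: 'RUN_ID_PLACEHOLDER', defaultDatasetId: 'DATASET_ID_PLACEHOLDER' },
};

console.log(describeRunWebhook(samplePayload));
```

You could call a function like this from whatever endpoint receives the webhook and forward the message to Slack, email, or another service.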

Using Smart Article Extractor with the Apify API

The Apify API gives you programmatic access to the Apify platform. The API is organized around RESTful HTTP endpoints that enable you to manage, schedule, and run Apify actors. The API also lets you access any datasets, monitor actor performance, fetch results, create and update versions, and more.
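For example, dataset results can be fetched in different formats through the dataset items endpoint. The sketch below only builds the URL. The helper name `buildDatasetItemsUrl` and the dataset ID are placeholders, and the format list reflects the formats we believe the endpoint accepts, so confirm it against the API reference.

```javascript
// Formats assumed to be accepted by the dataset items endpoint.
const SUPPORTED_FORMATS = ['json', 'jsonl', 'csv', 'xml', 'xlsx', 'html', 'rss'];

// Build the URL for downloading dataset items in the requested format.
function buildDatasetItemsUrl(datasetId, format = 'json') {
    if (!SUPPORTED_FORMATS.includes(format)) {
        throw new Error(`Unsupported format: ${format}`);
    }
    const url = new URL(`https://api.apify.com/v2/datasets/${encodeURIComponent(datasetId)}/items`);
    url.searchParams.set('format', format);
    return url.toString();
}

// 'DATASET_ID_PLACEHOLDER' stands in for the default dataset ID of a finished run.
console.log(buildDatasetItemsUrl('DATASET_ID_PLACEHOLDER', 'csv'));
```

Requests for private datasets also need your API token, which is left out here on purpose.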

To access the API using Node.js, use the apify-client NPM package. To access the API using Python, use the apify-client PyPI package.

Check out the Apify API reference docs for full details or click on the API tab for code examples.
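As a minimal sketch of a programmatic run, the example below calls the REST endpoint that runs an Actor synchronously and returns its dataset items. It uses the `fetch` built into recent Node.js versions instead of the apify-client package, purely to stay dependency-free. The `startUrls` input field is an assumption; check the Actor's input tab for the actual field names. The synchronous endpoint waits only a limited time, so a long crawl may be better started asynchronously.

```javascript
// Actor ID in the "username~actor-name" form used in API URLs.
const ACTOR_ID = 'lukaskrivka~article-extractor-smart';

// Build the request without sending it, so it can be inspected or logged first.
function buildRunRequest(token, input) {
    return {
        url: `https://api.apify.com/v2/acts/${ACTOR_ID}/run-sync-get-dataset-items`,
        options: {
            method: 'POST',
            headers: {
                'Content-Type': 'application/json',
                Authorization: `Bearer ${token}`,
            },
            body: JSON.stringify(input),
        },
    };
}

// Run the Actor and resolve with the extracted items.
async function extractArticles(token, input) {
    const { url, options } = buildRunRequest(token, input);
    const response = await fetch(url, options);
    if (!response.ok) {
        throw new Error(`Apify API responded with HTTP ${response.status}`);
    }
    return response.json();
}

// Only call the API when a token is provided in the environment.
if (process.env.APIFY_TOKEN) {
    // `startUrls` is an assumed input field name.
    extractArticles(process.env.APIFY_TOKEN, { startUrls: [{ url: 'https://www.example.com' }] })
        .then((items) => console.log(`Extracted ${items.length} articles`))
        .catch((error) => console.error(error.message));
}
```

Remember the consumption warning above: set limits in the input before running this against a large website.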

Not your cup of tea? Build your own scraper

Smart Article Extractor doesn’t do exactly what you need? You can always build your own! We have various scraper templates in Python, JavaScript, and TypeScript to get you started. Alternatively, you can write it from scratch using our open-source library Crawlee. You can keep the scraper to yourself or make it public by adding it to Apify Store (and find users for it).

Or let us know if you need a custom scraping solution.

Your feedback

We’re always working on improving the performance of our Actors. So if you’ve got any technical feedback for Smart Article Extractor or simply found a bug, please create an issue on the Actor’s Issues tab in Apify Console.