Smart Article Extractor

lukaskrivka/article-extractor-smart

📰 Smart Article Extractor extracts articles from any scientific, academic, or news website with just one click. The extractor crawls the whole website and automatically distinguishes articles from other web pages. Download your data as an HTML table, JSON, Excel, RSS feed, and more.

Author: Lukáš Křivka
  • Users: 899
  • Runs: 96,517
Smart Article Extractor

Smart Article Extractor scrapes articles from any academic, scientific, or news website or blog with just a single click. It uses a smart algorithm to decide which pages are actually articles and automatically extracts information from them.

What does Smart Article Extractor do?

If you want to download articles from websites, this tool will help you extract content using smart scraping features:

✅ Can open pages in a browser (Puppeteer), which can wait for dynamically loaded data

✅ Allows extraction of articles from any number of URLs

✅ Smart article recognition - the extractor can decide which pages on a website are in fact articles to be scraped (this function is customizable)

✅ Additional filters - date of articles, minimum words, and more

✅ Supports a custom scraping function - you can add your own fields or overwrite existing ones using the parsed HTML

✅ Supports Google Bot headers, which can help bypass paywalls

Why extract articles with Smart Article Extractor?

👉 Academic research: You can use Smart Article Extractor to download multiple articles and build a corpus from them for research and article citations.

👉 Journalism: If you want to know more about how extracting articles with this tool can help text analysis and data journalism, you might like to read "Terror or Clickbait?" or "Czech media and their word choices".

👉 Fight fake news: Monitor content by selected media to react promptly if they publish misinformation.

👉 Save time: Whatever your reason for collecting articles with Smart Article Extractor, you will definitely save a lot of time and energy.

Extracting articles is legal, as you are scraping publicly available content. Please be aware that most articles are protected by copyright laws. Before you publish extracted articles anywhere, check the terms of use of the scraped website.

How much will using Smart Article Extractor cost me?

Apify gives you $5 in free usage credits every month on the Apify Free plan. That covers roughly 20k results per month from Smart Article Extractor, so those 20k results are completely free!

But if you regularly need more data from Smart Article Extractor, you should grab an Apify subscription. We recommend our $49/month Personal plan, which gets you up to 200k results every month!

Or get two million results for $499 with the Team plan - wow!
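
To compare the plans quoted above, it can help to work out the cost per 1,000 results. The snippet below is only a rough sketch: it uses the prices and result counts stated in this section, and the plans array and costPerThousand helper are hypothetical names introduced for illustration. Actual consumption depends on the websites you scrape and the limits you set.

```javascript
// Plan figures quoted in the text above: monthly price in USD and the
// approximate number of results that price covers.
const plans = [
  { name: 'Free', priceUsd: 5, results: 20_000 },
  { name: 'Personal', priceUsd: 49, results: 200_000 },
  { name: 'Team', priceUsd: 499, results: 2_000_000 },
];

// Hypothetical helper: cost per 1,000 results, rounded to four decimals.
function costPerThousand(plan) {
  const raw = (plan.priceUsd * 1000) / plan.results;
  return Math.round(raw * 10_000) / 10_000;
}

for (const plan of plans) {
  console.log(`${plan.name}: $${costPerThousand(plan)} per 1,000 results`);
}
```

With these figures, every plan works out to roughly $0.25 per 1,000 results, so the plans differ mainly in volume rather than in unit price.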

⚠️ This can be a high-consumption actor if you don't set limits. Please make sure you set a compute unit limit in the Limit CU consumption field. ⚠️

How do I extract articles with Smart Article Extractor?

Smart Article Extractor can be run as an Apify actor on the Apify platform where it is seamlessly integrated with a nice input UI. You can also run it locally or on any other infrastructure.

On the Apify platform:

  1. Click on Try for free.
  2. Enter the URL of the website(s) you want to scrape (and other input fields to narrow down the search).
  3. Click on Save & Start.
  4. When Smart Article Extractor has finished, preview or download your results from the Output tab.

For more detailed instructions, read our step-by-step guide on how to extract articles.

Output example

If you run Smart Article Extractor on the Apify platform, you can get the output in many formats, like JSON, CSV, XML, Excel, RSS, and more. Here is a JSON example:

{
  "url": "https://www.thetimes.co.uk/edition/news/ex-mp-charlie-elphicke-sang-i-m-a-naughty-tory-after-groping-woman-court-told-nnr6nlw89",
  "loadedUrl": "https://www.thetimes.co.uk/edition/news/ex-mp-charlie-elphicke-sang-i-m-a-naughty-tory-after-groping-woman-court-told-nnr6nlw89",
  "title": "Ex-MP Charlie Elphicke sang ‘I’m a naughty Tory’ after groping woman, court told",
  "softTitle": "Ex-MP Charlie Elphicke sang ‘I’m a naughty Tory’ after groping woman, court told",
  "date": "2020-07-07T12:13:00.000Z",
  "author": [
    "Fariha Karim"
  ],
  "publisher": null,
  "copyright": "Times Newspapers Limited 2020",
  "favicon": "/d/img/icons/favicon-ab3ea01fbe.ico",
  "description": "A woman broke down in tears as she told a court today how a former Tory MP sexually assaulted her at his home while his children were in bed.The woman, who cannot be identified for legal reasons, told",
  "lang": "en",
  "canonicalLink": "https://www.thetimes.co.uk/article/ex-mp-charlie-elphicke-sang-i-m-a-naughty-tory-after-groping-woman-court-told-nnr6nlw89",
  "tags": [],
  "image": "https://www.thetimes.co.uk/imageserver/image/%2Fmethode%2Ftimes%2Fprod%2Fweb%2Fbin%2Fdfdec16c-bf85-11ea-bb37-3d3cce807650.jpg?crop=3023%2C1700%2C238%2C316&resize=685",
  "videos": [],
  "links": [],
  "text": "A woman broke down in tears as she told a court today how a former Tory MP sexually assaulted her at his home while his children were in bed.\n\nThe woman, who cannot be identified for legal reasons, told Southwark crown court that Charlie Elphicke had invited her for a drink in 2007 while his wife Natalie was away on a business trip.\n\nShe said that the children were in bed and she had a cup of tea while Mr Elphicke drank wine in the garden and they chatted.\n\nAfter about an hour, she said, “the weather changed so he suggested they go inside to the lounge” and they shared a £40 bottle of wine.\n\nShe said they carried on talking in the living room"
}

Extend output function

You can use this optional function to update the default output of this actor. This function gets a jQuery handle $ as an argument, so you can choose what data from the page you want to scrape. It also receives the currentItem parameter, which is the default output parsed by the scraper, so you can explore any of its fields. The output from this function will get merged with the default output.

The return value of this function has to be an object!

You can return fields to achieve 3 different things:

  • Add a new field - Return an object with a field that is not in the default output
  • Change a field - Return an existing field with a new value
  • Remove a field - Return an existing field with the value undefined

Let's say you want to accomplish this:

  • Remove links and videos fields from the output
  • Add a pageTitle field
  • Change the date selector (In rare cases the scraper is not able to find it)
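  • Save the originally parsed date so you can compare it with your date

($, currentItem) => {
    return {
        // Returning undefined removes a field from the output
        links: undefined,
        videos: undefined,
        // New field read from the page's <title> element
        pageTitle: $('title').text(),
        // Override the date with the text of a custom selector
        date: $('.my-date-selector').text(),
        // Keep the originally parsed date for comparison
        originalDate: currentItem.date,
    }
}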
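
To make the merge rules concrete, here is a minimal sketch of how the returned object could be combined with the default output. The mergeOutput helper and the sample values are hypothetical and written only for illustration. The actor's real merge code is not shown in this document, so treat this as one plausible reading of the add, change, and remove rules described above.

```javascript
// Hypothetical helper: merges the extend-function result into the default
// output and drops every field whose merged value is undefined.
function mergeOutput(currentItem, extended) {
  const merged = { ...currentItem, ...extended };
  return Object.fromEntries(
    Object.entries(merged).filter(([, value]) => value !== undefined),
  );
}

// Shortened stand-in for the default output parsed by the scraper.
const currentItem = {
  title: 'Example article',
  date: '2020-07-07T12:13:00.000Z',
  links: [],
  videos: [],
};

// Stand-in for the object returned by an extend output function.
const extended = {
  links: undefined,
  videos: undefined,
  pageTitle: 'Example article | Example site',
  date: '7 July 2020',
  originalDate: currentItem.date,
};

const result = mergeOutput(currentItem, extended);
console.log(result);
```

In this sketch the result keeps title, replaces date, gains pageTitle and originalDate, and no longer contains links or videos.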

Integrations and Smart Article Extractor

Last but not least, Smart Article Extractor can be connected with almost any cloud service or web app thanks to integrations on the Apify platform. You can integrate with Make, Zapier, Slack, Airbyte, GitHub, Google Sheets, Google Drive, and more. Or you can use webhooks to carry out an action whenever an event occurs, e.g. get a notification whenever Smart Article Extractor successfully finishes a run.

Using Smart Article Extractor with the Apify API

The Apify API gives you programmatic access to the Apify platform. The API is organized around RESTful HTTP endpoints that enable you to manage, schedule, and run Apify actors. The API also lets you access any datasets, monitor actor performance, fetch results, create and update versions, and more.

To access the API using Node.js, use the apify-client NPM package. To access the API using Python, use the apify-client PyPI package.

Check out the Apify API reference docs for full details or click on the API tab for code examples.
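
As a dependency-free illustration of calling the API, the sketch below uses the REST endpoint that runs an actor synchronously and returns its dataset items, rather than the apify-client package mentioned above. It assumes Node.js 18 or later (for the built-in fetch), an API token in a hypothetical APIFY_TOKEN environment variable, and an input with a startUrls field. The input shape is an assumption, so check the actor's input schema before relying on it. The buildRunUrl, buildInput, and extractArticles helpers and the example URL are placeholders.

```javascript
// Actor ID from this page; the REST API writes it as "username~actor-name".
const ACTOR_ID = 'lukaskrivka~article-extractor-smart';

// Builds the URL of the endpoint that runs an actor synchronously and
// returns its dataset items in the response body.
function buildRunUrl(actorId, token) {
  const url = new URL(
    `https://api.apify.com/v2/acts/${actorId}/run-sync-get-dataset-items`,
  );
  url.searchParams.set('token', token);
  return url.toString();
}

// Assumption: the actor input accepts a `startUrls` array of { url } objects.
function buildInput(urls) {
  return { startUrls: urls.map((url) => ({ url })) };
}

async function extractArticles(urls, token) {
  const response = await fetch(buildRunUrl(ACTOR_ID, token), {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildInput(urls)),
  });
  if (!response.ok) {
    throw new Error(`Apify API returned ${response.status}`);
  }
  return response.json();
}

// Only calls the API when a token is present in the environment.
if (process.env.APIFY_TOKEN) {
  extractArticles(['https://example.com/news'], process.env.APIFY_TOKEN)
    .then((items) => console.log(`Got ${items.length} items`))
    .catch((err) => console.error(err.message));
}
```

The synchronous endpoint waits for the run to finish, so it suits short runs. For a crawl of a whole website, starting the run asynchronously and fetching the dataset afterwards is likely the safer choice.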
