Pricing

Pay per usage

Smart Article Extractor

Developed by

Lukáš Křivka

📰 Smart Article Extractor extracts articles from any scientific, academic, or news website with just one click. The extractor crawls the whole website and automatically distinguishes articles from other web pages. Download your data as an HTML table, JSON, Excel, RSS feed, and more.

4.7 (6)

Issues response

26 days

Last modified

5 months ago

News

Smart Article Extractor

Smart Article Extractor scrapes articles from any academic, scientific, or news website or blog with just a single click. It uses a smart algorithm to decide what pages are actually articles and automatically extracts information from them.

What does Smart Article Extractor do?

If you want to download articles from websites, this tool will help you extract content using smart scraping features:

✅ Allows opening pages with a browser (Puppeteer) which can wait for dynamically loaded data

✅ Allows extraction of articles from any number of URLs

✅ Smart article recognition - the extractor can decide what pages on a website are in fact articles to be scraped (this function is customizable)

✅ Additional filters - date of articles, minimum words, and more

✅ Allows a custom scraping function - you can add your own fields or overwrite existing ones using the parsed HTML

✅ Allows usage of Google Bot headers (bypassing paywalls)

Why extract articles with Smart Article Extractor?

👉 Academic research: You can use Smart Article Extractor to download multiple articles and build a corpus from them for research and article citations.

👉 Journalism: If you want to know more about how extracting articles with this tool can help text analysis and data journalism, you might like to read Terror or Clickbait? or Czech media and their word choices.

👉 Fight fake news: Monitor content by selected media to react promptly if they publish misinformation.

👉 Save time: Whatever your reason for collecting articles with Smart Article Extractor, you will definitely save a lot of time and energy.

Is it legal to extract articles?

Extracting articles is legal, as you are scraping publicly available content. Please be aware that most articles are protected by copyright laws. Before you publish extracted articles anywhere, check the terms of use of the scraped website.

How many results can you scrape with Smart Article Extractor?

Smart Article Extractor returns thousands of results on average. However, scraping news websites involves many variables, so results fluctuate from case to case and there is no one-size-fits-all number. The maximum number of results depends on the complexity of the input, the location, and other factors. The most frequent causes are:

The website gives a different number of results depending on the type or value of the input.
The website has an internal limit that no scraper can cross.
The scraper has a limit that we are working on improving.

Therefore, while we regularly run Actor tests to keep the benchmarks in check, the results may also fluctuate without our knowing. The best way to know for sure for your particular use case is to do a test run yourself.

How much will scraping articles with Smart Article Extractor cost you?

When it comes to scraping, it can be challenging to estimate the resources needed to extract data as use cases may vary significantly. That's why the best course of action is to run a test scrape with a small sample of input data and limited output. You’ll get your price per scrape, which you’ll then multiply by the number of scrapes you intend to do.
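The multiplication described above can be sketched in a few lines. This is only an illustration: the function name and the sample figures are made up, not real pricing, and the estimate assumes your later scrapes resemble the test scrape.

```javascript
// Hypothetical helper: extrapolate a total from the price of one test scrape.
function estimateTotalCost(pricePerScrapeUsd, plannedScrapes) {
    if (pricePerScrapeUsd < 0 || plannedScrapes < 0) {
        throw new Error('Price and number of scrapes must not be negative.');
    }
    return pricePerScrapeUsd * plannedScrapes;
}

// Illustrative numbers only: a test scrape that cost $0.25, repeated 40 times.
const estimate = estimateTotalCost(0.25, 40);
console.log(`Estimated cost: $${estimate.toFixed(2)}`);
```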

Watch this video for a few helpful tips. And don't forget that choosing a higher plan will save you money in the long run.

⚠️ This can be a high-consumption Actor if you don't set limits. Please make sure you set a compute unit limit in the Limit CU consumption field. ⚠️

How do I extract articles with Smart Article Extractor?

Smart Article Extractor can be run as an Apify Actor on the Apify platform, where it comes with a ready-made input UI. You can also run it locally or on any other infrastructure.

On the Apify platform:

Click on Try for free.
Enter the URL of the website(s) you want to scrape (and other input fields to narrow down the search).
Click on Save & Start.
When Smart Article Extractor has finished, preview or download your results from the Output tab.

For more detailed instructions, read our step-by-step guide on how to extract articles.

Output example

If you run Smart Article Extractor on the Apify platform, you can get the output in many formats, like JSON, CSV, XML, Excel, RSS, and more. Here is a JSON example:

{
  "url": "https://www.thetimes.co.uk/edition/news/ex-mp-charlie-elphicke-sang-i-m-a-naughty-tory-after-groping-woman-court-told-nnr6nlw89",
  "loadedUrl": "https://www.thetimes.co.uk/edition/news/ex-mp-charlie-elphicke-sang-i-m-a-naughty-tory-after-groping-woman-court-told-nnr6nlw89",
  "title": "Ex-MP Charlie Elphicke sang ‘I’m a naughty Tory’ after groping woman, court told",
  "softTitle": "Ex-MP Charlie Elphicke sang ‘I’m a naughty Tory’ after groping woman, court told",
  "date": "2020-07-07T12:13:00.000Z",
  "author": [
    "Fariha Karim"
  ],
  "publisher": null,
  "copyright": "Times Newspapers Limited 2020",
  "favicon": "/d/img/icons/favicon-ab3ea01fbe.ico",
  "description": "A woman broke down in tears as she told a court today how a former Tory MP sexually assaulted her at his home while his children were in bed.The woman, who cannot be identified for legal reasons, told",
  "lang": "en",
  "canonicalLink": "https://www.thetimes.co.uk/article/ex-mp-charlie-elphicke-sang-i-m-a-naughty-tory-after-groping-woman-court-told-nnr6nlw89",
  "tags": [],
  "image": "https://www.thetimes.co.uk/imageserver/image/%2Fmethode%2Ftimes%2Fprod%2Fweb%2Fbin%2Fdfdec16c-bf85-11ea-bb37-3d3cce807650.jpg?crop=3023%2C1700%2C238%2C316&resize=685",
  "videos": [],
  "links": [],
  "text": "A woman broke down in tears as she told a court today how a former Tory MP sexually assaulted her at his home while his children were in bed.\n\nThe woman, who cannot be identified for legal reasons, told Southwark crown court that Charlie Elphicke had invited her for a drink in 2007 while his wife Natalie was away on a business trip.\n\nShe said that the children were in bed and she had a cup of tea while Mr Elphicke drank wine in the garden and they chatted.\n\nAfter about an hour, she said, “the weather changed so he suggested they go inside to the lounge” and they shared a £40 bottle of wine.\n\nShe said they carried on talking in the living room"
}
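
Once you have results in this shape, you can post-filter them in your own code. The sketch below assumes items carrying the date and text fields from the example above. The helper name and thresholds are hypothetical, and this is not the Actor's internal filter logic.

```javascript
// Hypothetical post-processing helper, not part of the Actor itself.
function filterArticles(items, { minWords = 0, publishedAfter = null } = {}) {
    const cutoff = publishedAfter ? new Date(publishedAfter).getTime() : null;
    return items.filter((item) => {
        // Count whitespace-separated words in the extracted text.
        const wordCount = (item.text || '').split(/\s+/).filter(Boolean).length;
        if (wordCount < minWords) return false;
        if (cutoff !== null) {
            const published = item.date ? new Date(item.date).getTime() : NaN;
            // When a cutoff is set, drop items with a missing or unparseable date.
            if (Number.isNaN(published) || published < cutoff) return false;
        }
        return true;
    });
}

// Made-up sample items using the same field names as the output example.
const sample = [
    { title: 'Long and recent', date: '2020-07-07T12:13:00.000Z', text: 'one two three four five' },
    { title: 'Too short', date: '2020-07-07T12:13:00.000Z', text: 'one two' },
    { title: 'Too old', date: '2019-01-01T00:00:00.000Z', text: 'one two three four five' },
    { title: 'No date', date: null, text: 'one two three four five' },
];

const kept = filterArticles(sample, { minWords: 3, publishedAfter: '2020-01-01T00:00:00.000Z' });
console.log(kept.map((item) => item.title));
```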

Extend output function

You can use this optional function to update the default output of this Actor. The function receives a jQuery handle $ as an argument, so you can choose which data from the page to scrape. It also receives the currentItem parameter, the default output parsed by the scraper, so you can inspect or reuse any of its fields. The output of this function is merged with the default output.

The return value of this function has to be an object!

You can return fields to achieve 3 different things:

Add a new field - Return an object with a field that is not in the default output
Change a field - Return an existing field with a new value
Remove a field - Return an existing field with the value undefined
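
These three cases can be pictured as a shallow merge in which undefined deletes a key. The sketch below only illustrates the behaviour as described here. mergeOutput is a hypothetical name, and the Actor's real merge code may differ in detail.

```javascript
// Hypothetical illustration of the merge rules, not the Actor's own code.
function mergeOutput(defaultItem, extension) {
    const merged = { ...defaultItem };
    for (const [key, value] of Object.entries(extension)) {
        if (value === undefined) {
            delete merged[key]; // remove a field
        } else {
            merged[key] = value; // add a new field or change an existing one
        }
    }
    return merged;
}

const defaultItem = { title: 'Example', links: [], lang: 'en' };
const result = mergeOutput(defaultItem, {
    pageTitle: 'Example | Site', // add
    lang: 'en-GB',               // change
    links: undefined,            // remove
});
console.log(result);
```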

Let's say you want to accomplish this:

Remove links and videos fields from the output
Add a pageTitle field
Change the date selector (in rare cases, the scraper is not able to find the date)
Save the original date parsed so you can compare with your date

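($, currentItem) => {
    return {
        // Setting a field to undefined removes it from the output
        links: undefined,
        videos: undefined,
        // New field that is not in the default output
        pageTitle: $('title').text(),
        // Override the date using a custom selector
        date: $('.my-date-selector').text(),
        // Keep the originally parsed date for comparison
        originalDate: currentItem.date,
    };
}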
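
One thing to watch with a custom selector: if it matches nothing, .text() returns an empty string, which would overwrite a date the scraper did find. A variant that falls back to the parsed date might look like the sketch below. The makeFakeDollar stub is a hypothetical stand-in that supports only $(selector).text(), so the function can be tried outside the platform, where the Actor supplies the real handle.

```javascript
const extendOutput = ($, currentItem) => {
    const customDate = $('.my-date-selector').text().trim();
    return {
        // Keep the scraper's date when the custom selector matches nothing.
        date: customDate || currentItem.date,
        originalDate: currentItem.date,
    };
};

// Hypothetical minimal fake of the $ handle, for local experiments only.
const makeFakeDollar = (textBySelector) => (selector) => ({
    text: () => textBySelector[selector] || '',
});

const currentItem = { date: '2020-07-07T12:13:00.000Z' };
const withMatch = extendOutput(makeFakeDollar({ '.my-date-selector': ' 7 July 2020 ' }), currentItem);
const withoutMatch = extendOutput(makeFakeDollar({}), currentItem);
console.log(withMatch.date);
console.log(withoutMatch.date);
```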

Integrations and Smart Article Extractor

Last but not least, Smart Article Extractor can be connected with almost any cloud service or web app thanks to integrations on the Apify platform. You can integrate with Make, Zapier, Slack, Airbyte, GitHub, Google Sheets, Google Drive, and more. Or you can use webhooks to carry out an action whenever an event occurs, e.g. get a notification whenever Smart Article Extractor successfully finishes a run.

Using Smart Article Extractor with the Apify API

The Apify API gives you programmatic access to the Apify platform. The API is organized around RESTful HTTP endpoints that enable you to manage, schedule, and run Apify Actors. The API also lets you access datasets, monitor Actor performance, fetch results, create and update versions, and more.

To access the API using Node.js, use the apify-client NPM package. To access the API using Python, use the apify-client PyPI package.
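
As a rough sketch of the Node.js route, the snippet below runs an Actor with apify-client and reads the items from its default dataset. Several things here are assumptions rather than facts from this page: the Actor ID is read from a hypothetical ACTOR_ID environment variable, and the startUrls input field is illustrative, so check the Actor's input schema for the real field names. The real client is only created when APIFY_TOKEN is set.

```javascript
// Runs an Actor and returns the items from its default dataset.
// `client` is expected to behave like an ApifyClient instance.
async function runAndFetchItems(client, actorId, input) {
    const run = await client.actor(actorId).call(input);
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    return items;
}

// Only talk to the real API when a token is provided.
if (process.env.APIFY_TOKEN) {
    const { ApifyClient } = require('apify-client'); // npm install apify-client
    const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
    // ASSUMPTION: ACTOR_ID holds the ID shown on the Actor's API tab.
    const actorId = process.env.ACTOR_ID;
    // ASSUMPTION: illustrative input; the real field names may differ.
    const input = { startUrls: [{ url: 'https://example.com' }] };
    runAndFetchItems(client, actorId, input)
        .then((items) => console.log(`Fetched ${items.length} articles`))
        .catch((err) => console.error(err));
}
```

Passing the client in as an argument keeps the function easy to test with a fake client and avoids hard-coding credentials.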

Check out the Apify API reference docs for full details or click on the API tab for code examples.

Not your cup of tea? Build your own scraper

Smart Article Extractor doesn’t do exactly what you need? You can always build your own! We have various scraper templates in Python, JavaScript, and TypeScript to get you started. Alternatively, you can write it from scratch using our open-source library Crawlee. You can keep the scraper to yourself or make it public by adding it to Apify Store (and find users for it).

Or let us know if you need a custom scraping solution.

Your feedback

We’re always working on improving the performance of our Actors. So if you’ve got any technical feedback for Smart Article Extractor or simply found a bug, please create an issue on the Actor’s Issues tab in Apify Console.
