Smart Article Extractor

lukaskrivka/article-extractor-smart

📰 Smart Article Extractor extracts articles from any scientific, academic, or news website with just one click. The extractor crawls the whole website and automatically distinguishes articles from other web pages. Download your data as HTML table, JSON, Excel, RSS feed, and more.

Smart Article Extractor scrapes articles from any academic, scientific, or news website or blog with just a single click. It uses a smart algorithm to decide what pages are actually articles and automatically extracts information from them.

What does Smart Article Extractor do?

If you want to download articles from websites, this tool will help you extract content using smart scraping features:

✅ Allows opening pages with a browser (Puppeteer) which can wait for dynamically loaded data

✅ Allows extraction of articles from any number of URLs

✅ Smart article recognition - the extractor can decide what pages on a website are in fact articles to be scraped (this function is customizable)

✅ Additional filters - date of articles, minimum words, and more

✅ Allows custom scraping function - you can add/overwrite your own fields from the parsed HTML

✅ Allows usage of Googlebot headers (for bypassing paywalls)
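
The additional filters in the list above can be pictured as a simple predicate applied to each extracted item. The sketch below is only an illustration and not the Actor's actual implementation: `passesFilters`, `countWords`, `minWords`, and `dateFrom` are hypothetical names, and the item shape is assumed to follow the `text` and `date` fields of the output example.

```javascript
// Hypothetical filter sketch; this is not the Actor's real code.
// `minWords` and `dateFrom` are assumed option names.
function countWords(text) {
    // Split on any whitespace and ignore empty strings.
    return (text || '').split(/\s+/).filter(Boolean).length;
}

function passesFilters(item, { minWords = 0, dateFrom = null } = {}) {
    if (countWords(item.text) < minWords) return false;
    if (dateFrom) {
        // Items without a parseable date are rejected when a date filter is set.
        const published = Date.parse(item.date);
        if (Number.isNaN(published) || published < Date.parse(dateFrom)) return false;
    }
    return true;
}

const article = {
    text: 'A short article body with seven words',
    date: '2020-07-07T12:13:00.000Z',
};
console.log(passesFilters(article, { minWords: 5, dateFrom: '2020-01-01' }));
console.log(passesFilters(article, { minWords: 50 }));
```

Keeping the filter as a pure function of one item makes it easy to try different thresholds on a small sample before starting a large crawl.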

Why extract articles with Smart Article Extractor?

👉 Academic research: You can use Smart Article Extractor to download multiple articles and build a corpus from them for research and article citations.

👉 Journalism: If you want to know more about how extracting articles with this tool can help text analysis and data journalism, you might like to read "Terror or Clickbait?" or "Czech media and their word choices".

👉 Fight fake news: Monitor content by selected media to react promptly if they publish misinformation.

👉 Save time: Whatever your reason for collecting articles with Smart Article Extractor, you will definitely save a lot of time and energy.

Is it legal to extract articles?

Extracting articles is legal, as you are scraping publicly available content. Please be aware that most articles are protected by copyright laws. Before you publish extracted articles anywhere, check the terms of use of the scraped website.

How many results can you scrape with Smart Article Extractor?

Smart Article Extractor can return thousands of results on average. However, scraping news websites involves many variables, so results can fluctuate from case to case, and there is no one-size-fits-all number. The maximum number of results depends on the complexity of the input, location, and other factors. Some of the most frequent cases are:

website gives a different number of results depending on the type/value of the input
website has an internal limit that no scraper can cross
scraper has a limit that we are working on improving

Therefore, while we regularly run Actor tests to keep the benchmarks in check, the results may also fluctuate without our knowing. The best way to know for sure for your particular use case is to do a test run yourself.

How much will scraping articles with Smart Article Extractor cost you?

When it comes to scraping, it can be challenging to estimate the resources needed to extract data as use cases may vary significantly. That's why the best course of action is to run a test scrape with a small sample of input data and limited output. You’ll get your price per scrape, which you’ll then multiply by the number of scrapes you intend to do.
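
As a worked example of that arithmetic, the sketch below multiplies the price of one test scrape by the number of scrapes you plan to run. The figures are hypothetical placeholders, and `estimateTotalCost` is an illustrative helper, not part of the Actor or the Apify platform.

```javascript
// Hypothetical numbers; replace them with the figures from your own test run.
function estimateTotalCost(pricePerScrapeUsd, plannedScrapes) {
    return pricePerScrapeUsd * plannedScrapes;
}

const testRunCostUsd = 0.25; // assumed cost of one small test scrape
const plannedScrapes = 40;   // assumed number of scrapes you intend to run
console.log(estimateTotalCost(testRunCostUsd, plannedScrapes).toFixed(2));
```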

Watch this video for a few helpful tips. And don't forget that choosing a higher plan will save you money in the long run.

⚠️ This can be a high-consumption actor if you don't set limits. Please make sure you set a compute unit limit in the Limit CU consumption field. ⚠️

How do I extract articles with Smart Article Extractor?

Smart Article Extractor can be run as an Apify actor on the Apify platform where it is seamlessly integrated with a nice input UI. You can also run it locally or on any other infrastructure.

On the Apify platform:

1. Click on Try for free.
2. Enter the URL of the website(s) you want to scrape (and other input fields to narrow down the search).
3. Click on Save & Start.
4. When Smart Article Extractor has finished, preview or download your results from the Output tab.

For more detailed instructions, read our step-by-step guide on how to extract articles.

Output example

If you run Smart Article Extractor on the Apify platform, you can get the output in many formats, like JSON, CSV, XML, Excel, RSS, and more. Here is a JSON example:

{
  "url": "https://www.thetimes.co.uk/edition/news/ex-mp-charlie-elphicke-sang-i-m-a-naughty-tory-after-groping-woman-court-told-nnr6nlw89",
  "loadedUrl": "https://www.thetimes.co.uk/edition/news/ex-mp-charlie-elphicke-sang-i-m-a-naughty-tory-after-groping-woman-court-told-nnr6nlw89",
  "title": "Ex-MP Charlie Elphicke sang ‘I’m a naughty Tory’ after groping woman, court told",
  "softTitle": "Ex-MP Charlie Elphicke sang ‘I’m a naughty Tory’ after groping woman, court told",
  "date": "2020-07-07T12:13:00.000Z",
  "author": [
    "Fariha Karim"
  ],
  "publisher": null,
  "copyright": "Times Newspapers Limited 2020",
  "favicon": "/d/img/icons/favicon-ab3ea01fbe.ico",
  "description": "A woman broke down in tears as she told a court today how a former Tory MP sexually assaulted her at his home while his children were in bed.The woman, who cannot be identified for legal reasons, told",
  "lang": "en",
  "canonicalLink": "https://www.thetimes.co.uk/article/ex-mp-charlie-elphicke-sang-i-m-a-naughty-tory-after-groping-woman-court-told-nnr6nlw89",
  "tags": [],
  "image": "https://www.thetimes.co.uk/imageserver/image/%2Fmethode%2Ftimes%2Fprod%2Fweb%2Fbin%2Fdfdec16c-bf85-11ea-bb37-3d3cce807650.jpg?crop=3023%2C1700%2C238%2C316&resize=685",
  "videos": [],
  "links": [],
  "text": "A woman broke down in tears as she told a court today how a former Tory MP sexually assaulted her at his home while his children were in bed.\n\nThe woman, who cannot be identified for legal reasons, told Southwark crown court that Charlie Elphicke had invited her for a drink in 2007 while his wife Natalie was away on a business trip.\n\nShe said that the children were in bed and she had a cup of tea while Mr Elphicke drank wine in the garden and they chatted.\n\nAfter about an hour, she said, “the weather changed so he suggested they go inside to the lounge” and they shared a £40 bottle of wine.\n\nShe said they carried on talking in the living room"
}
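
Once downloaded, items of this shape are easy to post-process. The sketch below derives a few convenience values from one item; `summarizeItem` is a hypothetical helper rather than part of the Actor, and the sample item is shortened, with a placeholder URL and title. Note that `favicon` can be a relative path, as in the example above, so the sketch resolves it against `loadedUrl`.

```javascript
// Sketch: derive a few convenience values from one output item.
// `summarizeItem` is a hypothetical helper, not part of the Actor.
function summarizeItem(item) {
    return {
        title: item.title,
        authors: (item.author || []).join(', '),
        wordCount: (item.text || '').split(/\s+/).filter(Boolean).length,
        // The favicon may be a relative path, so resolve it against the loaded page URL.
        faviconUrl: item.favicon ? new URL(item.favicon, item.loadedUrl).href : null,
    };
}

// Shortened sample item; URL and title are placeholders.
const item = {
    loadedUrl: 'https://www.thetimes.co.uk/edition/news/example',
    title: 'Example title',
    author: ['Fariha Karim'],
    favicon: '/d/img/icons/favicon-ab3ea01fbe.ico',
    text: 'First paragraph.\n\nSecond paragraph.',
};
console.log(summarizeItem(item));
```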

Extend output function

You can use this optional function to update the default output of this actor. The function receives a jQuery handle, $, as its first argument, so you can choose what data from the page you want to scrape. It also receives the currentItem parameter, which is the default output parsed by the scraper, so you can inspect any of its fields. The output of this function is merged with the default output.

The return value of this function has to be an object!

You can return fields to achieve 3 different things:

Add a new field - Return an object with a field that is not in the default output
Change a field - Return an existing field with a new value
Remove a field - Return an existing field with the value undefined

Let's say you want to accomplish this:

Remove links and videos fields from the output
Add a pageTitle field
Change the date selector (In rare cases the scraper is not able to find it)
Save the original date parsed so you can compare with your date

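($, currentItem) => {
    return {
        // Returning undefined removes these fields from the output
        links: undefined,
        videos: undefined,
        // New field read from the page's <title> element
        pageTitle: $('title').text(),
        // Override the date using your own selector
        date: $('.my-date-selector').text(),
        // Keep the date the scraper parsed, for comparison
        originalDate: currentItem.date,
    }
}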
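
To make the add, change, and remove rules concrete, here is a minimal sketch of how such a merge could behave. It is an illustration under stated assumptions, not the Actor's internal code: `mergeOutput` is a hypothetical name, and the sample values are invented.

```javascript
// Illustrative sketch of the merge behaviour described above.
// `mergeOutput` is a hypothetical name; the Actor's internal code may differ.
function mergeOutput(defaultItem, extension) {
    if (typeof extension !== 'object' || extension === null || Array.isArray(extension)) {
        throw new Error('Extend output function must return an object!');
    }
    // Fields from the extension win over fields from the default output.
    const merged = { ...defaultItem, ...extension };
    // Fields returned as undefined are dropped from the final item.
    for (const key of Object.keys(merged)) {
        if (merged[key] === undefined) delete merged[key];
    }
    return merged;
}

const defaultItem = { title: 'Example', date: '2020-07-07T12:13:00.000Z', links: [], videos: [] };
const extension = {
    links: undefined,
    videos: undefined,
    originalDate: defaultItem.date,
    date: '7 July 2020',
};
console.log(mergeOutput(defaultItem, extension));
```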

Integrations and Smart Article Extractor

Last but not least, Smart Article Extractor can be connected with almost any cloud service or web app thanks to integrations on the Apify platform. You can integrate with Make, Zapier, Slack, Airbyte, GitHub, Google Sheets, Google Drive, and more. Or you can use webhooks to carry out an action whenever an event occurs, e.g. get a notification whenever Smart Article Extractor successfully finishes a run.

Using Smart Article Extractor with the Apify API

The Apify API gives you programmatic access to the Apify platform. The API is organized around RESTful HTTP endpoints that enable you to manage, schedule, and run Apify actors. The API also lets you access any datasets, monitor actor performance, fetch results, create and update versions, and more.

To access the API using Node.js, use the apify-client NPM package. To access the API using Python, use the apify-client PyPI package.

Check out the Apify API reference docs for full details or click on the API tab for code examples.
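
As a rough sketch of the Node.js route, the function below runs the Actor through an apify-client instance and reads the results from the run's default dataset. The `startUrls` input field is an assumption made for illustration, so check the Actor's input schema before relying on it; `extractArticles` is a hypothetical helper name. The client is passed in as a parameter so the function itself has no dependencies.

```javascript
// Sketch of running the Actor through an apify-client instance.
// Assumption: the `startUrls` input field is illustrative; check the Actor's
// input schema for the real field names.
async function extractArticles(client, urls) {
    const input = { startUrls: urls.map((url) => ({ url })) };
    // Start the Actor and wait for the run to finish.
    const run = await client.actor('lukaskrivka/article-extractor-smart').call(input);
    // Read the extracted articles from the run's default dataset.
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    return items;
}

// Usage (requires `npm install apify-client` and an Apify API token):
// const { ApifyClient } = require('apify-client');
// const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
// extractArticles(client, ['https://www.example.com']).then(console.log);
```

Set a limit on compute unit consumption before running this against a large site, since an unrestricted crawl can be expensive.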

Not your cup of tea? Build your own scraper

Smart Article Extractor doesn’t do exactly what you need? You can always build your own! We have various scraper templates in Python, JavaScript, and TypeScript to get you started. Alternatively, you can write it from scratch using our open-source library Crawlee. You can keep the scraper to yourself or make it public by adding it to Apify Store (and find users for it).

Or let us know if you need a custom scraping solution.

Your feedback

We’re always working on improving the performance of our Actors. So if you’ve got any technical feedback for Smart Article Extractor or simply found a bug, please create an issue on the Actor’s Issues tab in Apify Console.

Developer

Lukáš Křivka

Actor metrics

194 monthly users
61 stars
99.5% runs succeeded
1.2 days response time
Created in Nov 2019
Modified 3 months ago

Categories

News
