
Smart Article Extractor

lukaskrivka/article-extractor-smart

Article Extractor scrapes detailed data from all articles on any website. It automatically recognizes whether a page is an article. It can be used to extract news from BBC, CNN, Bloomberg, and other popular news websites.

Author: Lukáš Křivka
  • Users: 354
  • Runs: 48,127

Category URLs

startUrls

Optional

array

Can be a main page URL or any category URLs. Article pages are found and enqueued from these. If you want to supply direct article URLs, use the `articleUrls` input instead.

Article URLs

articleUrls

Optional

array

Direct URLs to the articles to be parsed. No extra pages are enqueued from article pages.
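
As an illustration, a minimal input combining both fields might look like this (the URLs are placeholders; the `{ "url": ... }` object shape follows the usual Apify start-URL convention):

```json
{
  "startUrls": [{ "url": "https://www.bbc.com/news" }],
  "articleUrls": [{ "url": "https://www.bbc.com/news/world-12345678" }]
}
```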

Only new articles (only for small runs)

onlyNewArticles

Optional

boolean

This option is only viable for smaller runs. If you plan to use this at a large scale, use `onlyNewArticlesPerDomain` instead. If true, only new articles will be scraped each time you run the actor. All URLs you have scraped are saved in a dataset called `articles-state` and are compared with newly found ones.

Only new articles (saved per domain, preferable)

onlyNewArticlesPerDomain

Optional

boolean

If true, only new articles will be scraped each time you run the actor. Previously scraped URLs are compared with newly found ones. Scraped articles are saved in one dataset per domain; the datasets are named `ARTICLES-SCRAPED-domain`.

Only inside domain articles

onlyInsideArticles

Optional

boolean

If true, only articles on the same domain as the page that links to them will be scraped.

Enqueue articles from articles

enqueueFromArticles

Optional

boolean

Normally, the scraper only enqueues articles from category pages. This option also enqueues links found on article pages, which can help gather more articles per run.

Find articles in sitemaps (dangerous)

scanSitemaps

Optional

boolean

Scans the sitemaps of the first start URL for articles. Be very careful with this option, as it can load a huge number of (sometimes old) articles, and the scrape time/cost will rise accordingly.

Sitemap URLs (safer)

sitemapUrls

Optional

array

Optionally, you can provide specific sitemap URLs that contain the articles you need to extract.
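
For example (assuming the field accepts a plain array of sitemap URL strings; the URL is a placeholder):

```json
{
  "sitemapUrls": ["https://example.com/sitemap.xml"]
}
```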

Save full HTML

saveHtml

Optional

boolean

Saves the full HTML of the article page, but makes the data less readable.

Use Google Bot headers

useGoogleBotHeaders

Optional

boolean

This option can allow you to bypass protection and/or paywalls on some sites. Use it with caution, as some sites may block requests with these headers.

Minimum words

minWords

Optional

integer

An article must contain at least this number of words to be extracted.

Date from

dateFrom

Optional

string

Only articles from this day to the present will be scraped. If empty, all articles will be scraped. The format is YYYY-MM-DD, e.g. 2019-12-31, or a relative value, e.g. 1 week or 20 days.

Must have date

mustHaveDate

Optional

boolean

If checked, the article must have a date of release to be considered valid.

Is URL article?

isUrlArticleDefinition

Optional

object

JSON settings defining what is considered a link to an article. If any of the conditions is true, the link will be opened.
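
A sketch of what such a definition might look like; the field names here (`minDashes`, `hasDate`, `linkIncludes`) are illustrative, so check the actor's README for the exact supported keys:

```json
{
  "isUrlArticleDefinition": {
    "minDashes": 4,
    "hasDate": true,
    "linkIncludes": ["article", "/story/", ".html"]
  }
}
```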

Pseudo URLs

pseudoUrls

Optional

array

Can be used to enqueue more pages, like pagination or category pages. Doesn't apply to articles; those are found by the article recognition system.
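
For example, to enqueue paginated category pages (a sketch using Apify's pseudo-URL bracket syntax; the URL pattern is a placeholder):

```json
{
  "pseudoUrls": [
    { "purl": "https://example.com/category/page/[\\d+]" }
  ]
}
```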

Max depth

maxDepth

Optional

integer

Maximum depth of crawling: 0 means only the start URLs, 1 means first-level links, etc. Only applies to pseudo URLs.

Max pages per crawl

maxPagesPerCrawl

Optional

integer

Maximum number of total pages crawled. Includes the home page, pagination pages, invalid articles, etc.

Max articles per crawl

maxArticlesPerCrawl

Optional

integer

Maximum number of valid articles scraped. The crawler will stop automatically after reaching this number.

Max concurrency

maxConcurrency

Optional

integer

You can limit the speed (concurrency) of the scraper. Don't forget to lower the memory allocation too, to save Compute units.

Proxy configuration

proxyConfiguration

Optional

object

Proxy configuration
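
A typical value is the standard Apify proxy object, for example (a sketch; proxy groups and other options are optional):

```json
{
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
```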

Use browser (Puppeteer)

useBrowser

Optional

boolean

Using a browser is more expensive, but gives you the ability to evaluate JavaScript and wait for dynamically loaded data.

Wait for on each page

pageWaitMs

Optional

integer

How long (in milliseconds) to wait on each page before extracting data.

Wait for selector on each page

pageWaitSelector

Optional

string

A CSS selector to wait for on each page before extracting data.

Extend output function

extendOutputFunction

Optional

string

A function that allows you to merge your custom extraction with the default one. You have to return an object from this function; it will be merged with, and can overwrite, the default output for each article.
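
A minimal sketch of such a function, assuming it receives a Cheerio-style handle `$` (the `.author-bio` selector is made up for illustration):

```javascript
// Sketch of an extend output function. `$` is assumed to be a Cheerio-style
// handle to the article page; `.author-bio` is an illustrative selector only.
const extendOutputFunction = ($) => {
  return {
    // Custom fields merged into (and possibly overwriting) the default output:
    authorBio: $ ? $('.author-bio').text().trim() : null,
    scrapedAt: new Date().toISOString(),
  };
};
```

Because the returned object is merged over the default output, returning a key such as `title` here would overwrite the default `title` field for each article.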

Stop after CUs

stopAfterCUs

Optional

integer

The actor run will finish after reaching the given number of Compute units.

Notification emails

notificationEmails

Optional

array

Emails to which the notifications below should be sent.

Notify after CUs

notifyAfterCUs

Optional

integer

The actor will send a notification to the provided emails when it reaches the given number of Compute units.

Notify every CUs (periodically)

notifyAfterCUsPeriodically

Optional

integer

The actor will send a notification to the provided emails every time the given number of Compute units has been consumed since the last notification.