Smart Article Extractor
📰 Smart Article Extractor extracts articles from any scientific, academic, or news website with just one click. The extractor crawls the whole website and automatically distinguishes articles from other web pages. Download your data as an HTML table, JSON, Excel, RSS feed, and more.
Website/category URLs
startUrls
arrayOptional
These can be the main page URL or any category/subpage URL, e.g. https://www.bbc.com/. Article pages are detected and crawled from these. If you prefer to supply direct article URLs, use the articleUrls input instead.
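For example, a minimal input that starts from a homepage might look like this (a sketch, assuming the usual Apify request-object shape for URL lists):

    {
        "startUrls": [
            { "url": "https://www.bbc.com/" }
        ]
    }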
Article URLs
articleUrls
arrayOptional
These are direct URLs for the articles to be extracted, e.g. https://www.bbc.com/news/uk-62836057. No extra pages are crawled from article pages.
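If you already know which articles you want, a sketch of a direct-URL input (no crawling) could be:

    {
        "articleUrls": [
            { "url": "https://www.bbc.com/news/uk-62836057" }
        ]
    }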
Only new articles (only for small runs)
onlyNewArticles
booleanOptional
This option is only viable for smaller runs. If you plan to use this at a large scale, use the 'Only new articles (saved per domain)' option below instead. If this option is selected, the extractor will only scrape new articles each time you run it. (Scraped URLs are saved in a dataset named articles-state and compared with new ones.)
Default value of this property is false
Only new articles (saved per domain, preferable)
onlyNewArticlesPerDomain
booleanOptional
If this option is selected, the extractor will only scrape new articles each time you run it. (Scraped articles are saved per domain in a dataset named 'ARTICLES-SCRAPED-domain' and compared with new ones.)
Default value of this property is false
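For repeated scheduled runs, a sketch enabling per-domain deduplication might be:

    {
        "startUrls": [
            { "url": "https://www.bbc.com/" }
        ],
        "onlyNewArticlesPerDomain": true
    }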
Only inside domain articles
onlyInsideArticles
booleanOptional
If this option is selected, the extractor will only scrape articles hosted on the same domain as the page that links to them. Articles linked from the domain but hosted elsewhere will be skipped, e.g. https://www.bbc.com/ vs. https://www.bbc.co.uk/.
Default value of this property is true
Enqueue articles from articles
enqueueFromArticles
booleanOptional
Normally, the scraper only extracts articles from category pages. This option allows the scraper to also extract articles linked within articles.
Default value of this property is false
Crawl whole subdomain (same base as Start URL)
crawlWholeSubdomain
booleanOptional
Automatically enqueues categories and articles from the whole subdomain sharing the same base path. E.g. if the Start URL is https://apify.com/store, all pages starting with https://apify.com/store will be enqueued.
Default value of this property is false
Limit articles to the Start URL subdomain
onlySubdomainArticles
booleanOptional
Only loads articles whose URL begins with the same path as the Start URL. E.g. if the Start URL is https://apify.com/store, only articles starting with https://apify.com/store will be loaded.
Default value of this property is false
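To keep a crawl inside one section of a site, the two subdomain options can be combined, as in this illustrative sketch:

    {
        "startUrls": [
            { "url": "https://apify.com/store" }
        ],
        "crawlWholeSubdomain": true,
        "onlySubdomainArticles": true
    }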
Find articles in sitemaps (caution)
scanSitemaps
booleanOptional
We recommend using Sitemap URLs instead.
If this function is selected, the extractor will scan different sitemaps from the initial article URL. Keep in mind that this option can lead to the loading of a huge amount of (sometimes old) articles, in which case the time and cost of the scrape will increase.
Default value of this property is false
Sitemap URLs (safer)
sitemapUrls
arrayOptional
You can provide selected sitemap URLs that include the articles you need to extract.
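A sketch with a hypothetical sitemap URL (check whether your Actor version expects plain strings or request objects as items):

    {
        "sitemapUrls": [
            { "url": "https://www.bbc.com/sitemap.xml" }
        ]
    }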
Save full HTML
saveHtml
booleanOptional
If this function is selected, the scraper will save the full HTML of the article page, but this will make the data less readable.
Save full HTML (only as link to it)
saveHtmlAsLink
booleanOptional
If this function is selected, the scraper will save the full HTML of the article page as a URL to keep the dataset clean and small.
Save screenshots of article pages (browser only)
saveSnapshots
booleanOptional
Stores a screenshot of each article page in the Key-Value Store and provides it as screenshotUrl. Useful for debugging.
Default value of this property is false
Use Googlebot headers
useGoogleBotHeaders
booleanOptional
This option will allow you to bypass protection and paywalls on some websites. Use with caution as it might lead to getting blocked.
Default value of this property is false
Minimum words
minWords
integerOptional
The article needs to contain at least this number of words to be extracted
Default value of this property is 150
Extract articles from [date]
dateFrom
stringOptional
Only articles from this day onward will be scraped. If empty, all articles will be scraped. The format is either an absolute date, YYYY-MM-DD, e.g. 2019-12-31, or a relative duration, e.g. 1 week or 20 days.
Only articles for last X days
onlyArticlesForLastDays
integerOptional
Only scrapes articles published in the last X days before the time the scraping starts. Use either this or the absolute dateFrom date.
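A sketch combining the word and recency filters (all values illustrative; swap in dateFrom for an absolute cutoff):

    {
        "minWords": 150,
        "onlyArticlesForLastDays": 7
    }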
Must have date
mustHaveDate
booleanOptional
If checked, the article must have a date of release to be extracted.
Default value of this property is true
Is the URL an article?
isUrlArticleDefinition
objectOptional
Here you can input JSON settings defining which URLs the scraper should consider articles. If any of the conditions is true, the link will be opened and the article extracted.
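A hypothetical definition object might look like the following; verify the supported keys (minDashes, hasDate, linkIncludes are assumptions here) against the Actor's README:

    {
        "isUrlArticleDefinition": {
            "minDashes": 4,
            "hasDate": true,
            "linkIncludes": ["article", "/news/", ".html"]
        }
    }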
Pseudo URLs
pseudoUrls
arrayOptional
This option can be used to enqueue more pages, i.e. include more links like pagination or categories. It doesn't apply to articles, which are detected automatically by the article recognition system.
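Pseudo URLs typically follow Apify's pattern syntax, where [.*] matches any substring; whether the item key is purl or url may depend on the Actor version, so treat this as a sketch:

    {
        "pseudoUrls": [
            { "purl": "https://www.bbc.com/news/[.*]" }
        ]
    }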
Max depth
maxDepth
integerOptional
Maximum depth of crawling, i.e. how many times the scraper follows a link to other web pages. Level 0 refers to the start URLs, 1 to first-level links, and so on. This only applies to pseudo URLs.
Max pages per crawl
maxPagesPerCrawl
integerOptional
Maximum number of total pages crawled. It includes the home page, pagination pages, invalid articles, and so on. The crawler will stop automatically after reaching this number.
Max articles per crawl
maxArticlesPerCrawl
integerOptional
Maximum number of valid articles scraped. The crawler will stop automatically after reaching this number.
Max articles per start URL
maxArticlesPerStartUrl
integerOptional
Maximum number of articles scraped per start URL.
Max concurrency
maxConcurrency
integerOptional
You can limit the speed of the scraper to avoid getting blocked.
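A sketch capping crawl size and speed (all numbers illustrative):

    {
        "maxDepth": 2,
        "maxPagesPerCrawl": 500,
        "maxArticlesPerCrawl": 200,
        "maxConcurrency": 10
    }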
Override proxy group
overrideProxyGroup
stringOptional
If you want to override the default proxy group, you can specify it here. This is useful if you want to use a different proxy group for each crawler.
Use browser (Puppeteer)
useBrowser
booleanOptional
This option is more expensive, but it allows you to evaluate JavaScript and wait for dynamically loaded data.
Default value of this property is false
Wait on each page (ms)
pageWaitMs
integerOptional
How many milliseconds to wait on each page before extracting data
Wait until navigation event is finished
navigationWaitUntil
enumOptional
Which navigation event to wait for before the page is considered loaded. domcontentloaded fires when the initial HTML is loaded and is the fastest. load fires after JS is executed and is the default. networkidle0 and networkidle2 wait for background network activity to settle but cannot cause infinite loading (navigation times out instead).
Value options: "load", "domcontentloaded", "networkidle0", "networkidle2"
Default value of this property is "load"
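For dynamically rendered sites, a hypothetical browser configuration could be:

    {
        "useBrowser": true,
        "pageWaitMs": 2000,
        "navigationWaitUntil": "networkidle2"
    }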
Wait for selector on each category page
pageWaitSelectorCategory
stringOptional
CSS selector to wait for on each category page before extracting data.
Wait for selector on each article page
pageWaitSelectorArticle
stringOptional
CSS selector to wait for on each article page before extracting data.
Scroll to bottom of the page (infinite scroll)
scrollToBottom
booleanOptional
Scroll to the bottom of the page to load dynamically added articles (infinite scroll).
Scroll to bottom button selector
scrollToBottomButtonSelector
stringOptional
CSS selector for a button to load more articles
Scroll to bottom max seconds
scrollToBottomMaxSecs
integerOptional
Limit for how long the scrolling can run so it does not continue indefinitely.
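An illustrative infinite-scroll setup (the button selector is hypothetical):

    {
        "useBrowser": true,
        "scrollToBottom": true,
        "scrollToBottomButtonSelector": ".load-more-button",
        "scrollToBottomMaxSecs": 30
    }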
Extend output function
extendOutputFunction
stringOptional
This function allows you to merge your custom extraction with the default one. The function must return an object, which is merged with (and can overwrite fields of) the default output for each article.
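As a sketch, assuming the function receives a Cheerio-style $ handle as is common in Apify scrapers, a custom author field (the .author-name selector is hypothetical) could be added like this:

    {
        "extendOutputFunction": "($) => { return { author: $('.author-name').text().trim() }; }"
    }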
Limit CU consumption
stopAfterCUs
integerOptional
The scraper will stop running after reaching this number of compute units.
Email addresses for notifications
notificationEmails
arrayOptional
Notifications will be sent to these email addresses.