Pricing

Pay per usage

Smart Article Extractor

Developed by

Lukáš Křivka

📰 Smart Article Extractor extracts articles from any scientific, academic, or news website with just one click. The extractor crawls the whole website and automatically distinguishes articles from other web pages. Download your data as HTML table, JSON, Excel, RSS feed, and more.

4.7 (6)

115

Total users: 4.9k
Monthly users: 347
Runs succeeded: >99%
Issue response: 3.5 days
Last modified: a month ago

News

Make it possible to know startURL of scraped article

Closed

ybierens opened this issue

Can you make it possible to separate the output per original input URL? I want to pass multiple URLs to the scraper instead of starting a run for each URL, but right now that's impossible because there is no effective way to know which startURL each article was scraped from.

Patai5

Hi Yannick, could you please provide us with the details of the run, either by sharing a link or the input? I just ran the actor and the startUrl field was there and working.

ybierens

Hey, thank you for your reply. I think I did not describe it clearly. What I mean is this: if I put two URLs in the startUrls array, e.g. [facebook.com, instagram.com], and the tool starts scraping both websites, there is no clear way to tell for each result whether it came from facebook.com or instagram.com. I could of course take the link of each scraped article and work out the original start URL from there, but that would stop working if a website posted its blogs on a completely different URL. Most helpful would be a 'startUrl' column for each scraped article. Am I explaining it well now, or perhaps not? Let me know :)

Thank you for making this tool, it works amazingly well.

Patai5

Thanks, you explained it well enough in the original suggestion. Perhaps you didn't check the "all fields" option under your results, because the startUrl field is not present in the simplified view. If, due to some bug, it's not present there either, please share the run's link here. Thanks!

ybierens

Hey,

Here is an input example.

{
  "extendOutputFunction": "($) => {\n const result = {};\n // Uncomment to add a title to the output\n // result.pageTitle = $('title').text().trim();\n\n return result;\n}",
  "isUrlArticleDefinition": {
    "minDashes": 4,
    "hasDate": true,
    "linkIncludes": ["article", "storyid", "?p=", "id=", "/fpss/track", ".html", "/content/"]
  },
  "maxArticlesPerCrawl": 30,
  "minWords": 30,
  "mustHaveDate": false,
  "onlyInsideArticles": true,
  "onlyNewArticles": false,
  "proxyConfiguration": { "useApifyProxy": true },
  "saveHtml": false,
  "useBrowser": false,
  "useGoogleBotHeaders": false,
  "startUrls": [
    { "url": "https://www.nova-incasso.nl/blogs-en-nieuws/" }
  ]
}
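The request in this thread is to pass several start URLs in a single run. A minimal Python sketch of building such an input, reusing a few fields from the example above, could look like this. The helper name `build_input` and the second URL are hypothetical and only serve as illustration.

```python
import json


def build_input(start_urls, max_articles=30):
    """Build an actor input with one startUrls entry per site (hypothetical helper)."""
    return {
        # Same shape as the "startUrls" field in the input example above.
        "startUrls": [{"url": url} for url in start_urls],
        "maxArticlesPerCrawl": max_articles,
        "minWords": 30,
        "onlyInsideArticles": True,
        "proxyConfiguration": {"useApifyProxy": True},
    }


run_input = build_input([
    "https://www.nova-incasso.nl/blogs-en-nieuws/",
    "https://example.com/blog/",  # placeholder second site, not from the thread
])
print(json.dumps(run_input, indent=2))
```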

Lukáš Křivka (lukaskrivka)

Hi Yannick,

It seems you are using the old build 0.0.10. The new version is 1.0.68, and that one contains the "startUrl" field. See https://console.apify.com/view/runs/zjlf4Sk2NK4k4qsNa

ybierens

Hey, the build is set to 'latest', so that's weird

ybierens

Hey, I checked out https://console.apify.com/view/runs/zjlf4Sk2NK4k4qsNa and I did not find a start URL in the output JSON.

Patai5

Hi, try checking the "all fields" option. It's not displayed under "overview".

Lukáš Křivka (lukaskrivka)

The 'latest' build should not be there (the platform team needs to remove it manually because the version was already deleted). You need to use 'version-1' instead.
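For runs started from code rather than from the console, the build tag can be pinned explicitly. The sketch below assumes the Python `apify-client` package, whose `ActorClient.call` accepts a `build` keyword as far as I know. The helper name `run_on_build` is hypothetical, and the actor ID and token in the usage comment are placeholders.

```python
def run_on_build(client, actor_id, run_input, build="version-1"):
    """Start the actor on an explicit build tag instead of the stale 'latest' one.

    `client` is expected to behave like apify_client.ApifyClient (assumption):
    client.actor(actor_id).call(run_input=..., build=...) starts a run and
    waits for it to finish.
    """
    return client.actor(actor_id).call(run_input=run_input, build=build)


# Usage (needs network access and an API token, so it is left commented out):
# from apify_client import ApifyClient
# client = ApifyClient("<YOUR_API_TOKEN>")
# run = run_on_build(client, "<username>/<actor-name>", {"startUrls": []})
```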

ybierens

I see it now! Under the 'referer' column. Thank you! :)

Lukáš Křivka (lukaskrivka)

Referer is only the previous page that linked to this one. But there is a startUrl column as well.
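To illustrate the difference, here is a small Python sketch that splits results per start URL once the dataset items are downloaded. The `startUrl` and `referer` field names come from this thread. The `url` field name and the sample items are assumptions made up for the example.

```python
from collections import defaultdict


def group_by_start_url(items):
    """Group scraped articles by the start URL they were discovered from."""
    groups = defaultdict(list)
    for item in items:
        # Items from old builds may lack the field, so fall back to a marker.
        groups[item.get("startUrl", "<unknown>")].append(item)
    return dict(groups)


# Made-up sample items: 'referer' changes from page to page,
# while 'startUrl' stays the same for everything found from one site.
items = [
    {"url": "https://a.example/news/post-1",
     "referer": "https://a.example/news/",
     "startUrl": "https://a.example/"},
    {"url": "https://blog.a-cdn.example/post-2",
     "referer": "https://a.example/news/post-1",
     "startUrl": "https://a.example/"},
    {"url": "https://b.example/story/3",
     "referer": "https://b.example/",
     "startUrl": "https://b.example/"},
]

grouped = group_by_start_url(items)
for start_url, articles in grouped.items():
    print(start_url, len(articles))
```

Grouping on `startUrl` rather than on the article's own domain also covers the case raised earlier in the thread, where a site publishes its blog on a completely different URL.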

News Website Crawler & Article Extractor

xtech/news-source-crawler

Scrape all articles from any news website. Extract full text, metadata, keywords, and summaries. Ideal for content analysis, research, and news aggregation.

Xtech

Articles Extractor

web.harvester/articles-extractor

The Article Extractor is an enterprise-grade web scraping solution designed specifically for extracting structured data from news articles, blog posts, and online publications. Our advanced HTML parsing engine delivers unmatched accuracy in content extraction across thousands of websites.

Web Harvester

470

Ultimate Articles Extractor

web.harvester/ultimate-articles-extractor

A powerful and modular web scraping tool designed to extract content from any webpage, article, or news site. Get clean, structured data from any website with optimized extraction algorithms, anti-bot detection avoidance, and proxy support.

Web Harvester

Article Content Extractor 📄

easyapi/article-content-extractor

Extract clean article content, metadata and structured information from any web page. Supports multiple URLs and returns well-formatted JSON with title, description, content, author, publish date and more. 🔍📄

EasyApi

Smart Article Scraper - Text, Data & Insights

xtech/article-extractor

Unlock valuable insights from any article! Get clean text, publication data, keywords, summaries, and more. Ideal for research, content marketing, and competitive analysis. Fast, reliable, and easy to use.

Xtech

News Article Scraper for Feeding LLM

proscraper/newsarticlescraper

Scrape news articles metadata to feed into LLM models. Returns article body, published date, article title, author etc.

Owais Nazir

Tech News Article Scraper

inquisitive_sarangi/news-article-scraper

Tech News Article Scraper is a simple yet powerful tool to extract news articles from a variety of popular news websites. Supported sites: The Verge, CNET, Wired, TechCrunch, Ars Technica.

API Master

Article Text Extractor

mtrunkat/article-text-extractor

Simply extracts article texts and other meta info from the given URL. Uses https://github.com/ageitgey/node-unfluff which is a NodeJS implementation of https://github.com/grangier/python-goose.

Marek Trunkát

🤖 Any Website URL to Article Summarizer

easyapi/any-website-url-to-article-summarizer

Transform any article, blog post, or web content into concise, AI-powered summaries. Get key insights and main points instantly with smart text analysis and markdown formatting. Perfect for researchers, content creators, and busy professionals who need quick, accurate content digests.

EasyApi

Ultimate News API

glitch_404/Ultimate-News-Scraper

News scraper that collects up to 10K news articles from over 1000 news sources in less than 20 minutes, covering more than 20 categories, e.g. Crypto News, World News, Latest News, Celebrities News, and a lot more. You can get news from websites like Fox News, BBC News, CNN News, Crypto and Cryptocurrencies.