Smart Article Extractor
📰 Smart Article Extractor extracts articles from any scientific, academic, or news website with just one click. The extractor crawls the whole website and automatically distinguishes articles from other web pages. Download your data as an HTML table, JSON, Excel, RSS feed, and more.
Can you make it possible to separate the output per originally entered URL? I want to pass multiple URLs to the scraper instead of starting a run for each URL, but right now that's impossible because there is no effective way to know which start URL each article was scraped from.
Hi Yannick,
Could you please provide us with the details of the run, either by sharing a link or the input?
I just ran the Actor and the startUrl field was there and working.
Hey, thank you for your reply. I think I did not describe it clearly. What I mean is this: if I put 2 URLs in the startUrls array, e.g. [facebook.com, instagram.com], and the tool starts scraping both websites, there is no clear way to tell for each result whether it came from facebook.com or instagram.com. I could of course take the link of each scraped article and work out the original start URL from there, but if a website posted its blogs on a completely different URL, that would no longer work. The most helpful thing would be a 'startUrl' column for each scraped article. Am I explaining it well now, or perhaps not? Let me know :)
Thank you for making this tool, it works amazingly well.
Thanks, you have explained it well enough in the original suggestion.
Perhaps you didn't check the "all fields" option under your results, because the startUrl field is not present in the simplified view.
If, due to some bug, it's not present there, please share the run's link here. Thanks!
Hey,
Here is an input example.
{
  "extendOutputFunction": "($) => {\n const result = {};\n // Uncomment to add a title to the output\n // result.pageTitle = $('title').text().trim();\n\n return result;\n}",
  "isUrlArticleDefinition": {
    "minDashes": 4,
    "hasDate": true,
    "linkIncludes": [
      "article",
      "storyid",
      "?p=",
      "id=",
      "/fpss/track",
      ".html",
      "/content/"
    ]
  },
  "maxArticlesPerCrawl": 30,
  "minWords": 30,
  "mustHaveDate": false,
  "onlyInsideArticles": true,
  "onlyNewArticles": false,
  "proxyConfiguration": {
    "useApifyProxy": true
  },
  "saveHtml": false,
  "useBrowser": false,
  "useGoogleBotHeaders": false,
  "startUrls": [
    {
      "url": "https://www.nova-incasso.nl/blogs-en-nieuws/"
    }
  ]
}
Hi Yannick,
It seems you are using the old build 0.0.10. The new version is 1.0.68, and that one contains the "startUrl" field. See https://console.apify.com/view/runs/zjlf4Sk2NK4k4qsNa
Hey, the build is set to 'latest', so that's weird.
Hey, I checked out https://console.apify.com/view/runs/zjlf4Sk2NK4k4qsNa and I did not find a start URL in the output JSON.
Hi, try checking the "all fields" option. It's not displayed under "overview".
The 'latest' build should not be there; the platform team needs to remove it manually because the version was already deleted. You need to use 'version-1'.
I see it now! Under the 'referer' column. Thank you! :)
Referer is only the previous page that linked to this one. But there is a startUrl column as well.
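For anyone who wants to split one run's results by start URL afterwards, here is a minimal sketch in Python. It assumes the dataset has already been exported as a list of dicts and that each item carries startUrl as a plain string. The sample items, the 'url' field name, and the function name group_by_start_url are made up for illustration and are not taken from the Actor's documentation.

```python
from collections import defaultdict


def group_by_start_url(items):
    """Group exported dataset items by their 'startUrl' field.

    Assumption: each item is a dict and 'startUrl' is a plain string.
    Items that lack the field are collected under the key None.
    """
    groups = defaultdict(list)
    for item in items:
        groups[item.get("startUrl")].append(item)
    return dict(groups)


# Hypothetical sample items, not real output from the Actor.
items = [
    {"url": "https://example.com/blog/a", "startUrl": "https://example.com/"},
    # Hosted on a different domain, yet still attributed to its start URL.
    {"url": "https://blog.example.org/b", "startUrl": "https://example.com/"},
    {"url": "https://example.net/news/c", "startUrl": "https://example.net/"},
]

grouped = group_by_start_url(items)
for start_url, articles in grouped.items():
    print(start_url, len(articles))
```

Grouping on startUrl rather than on the article's own hostname keeps articles hosted on a different domain with the start URL they were reached from, which is the case raised earlier in the thread. The referer field would not work for this, since it only names the previous page.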
Actor Metrics
197 monthly users
65 stars
>99% runs succeeded
1.2 days response time
Created in Nov 2019
Modified 4 months ago