Smart Article Extractor

lukaskrivka/article-extractor-smart
📰 Smart Article Extractor extracts articles from any scientific, academic, or news website with just one click. The extractor crawls the whole website and automatically distinguishes articles from other web pages. Download your data as HTML table, JSON, Excel, RSS feed, and more.

Make it possible to know startURL of scraped article

Closed

ybierens opened this issue
10 months ago

Can you make it possible to separate the output per original input URL? I want to pass multiple URLs to the scraper instead of starting a run for each URL, but right now that's impossible because there is no effective way to know which startURL each article was scraped from.

Patai5

10 months ago

Hi Yannick, could you please provide us with the details of the run, either by sharing a link or the input? I just ran the actor and the startUrl field was there and working.

ybierens

10 months ago

Hey, thank you for your reply. I think I did not describe it clearly. What I mean is this: if I put 2 URLs in the startUrls array, e.g. [facebook.com, instagram.com], and the tool starts scraping both websites, there is no clear way to tell for each result whether it came from facebook.com or instagram.com. I could of course take the link of each scraped article and figure out the original startUrl from there, but if a website posted its blogs on a completely different URL, that would no longer work. Most helpful would be a 'startUrl' column for each scraped article. Am I explaining it well now, or perhaps not? Let me know :)

Thank you for making this tool, it works amazing.

Patai5

10 months ago

Thanks, you explained it well enough in the original suggestion. Perhaps you didn't check the "all fields" option under your results, because the startUrl field is not present in the simplified view. If, due to some bug, it's not present there either, please share the run's link here. Thanks!

ybierens

10 months ago

Hey,

Here is an input example.

{
  "extendOutputFunction": "($) => {\n const result = {};\n // Uncomment to add a title to the output\n // result.pageTitle = $('title').text().trim();\n\n return result;\n}",
  "isUrlArticleDefinition": {
    "minDashes": 4,
    "hasDate": true,
    "linkIncludes": [
      "article",
      "storyid",
      "?p=",
      "id=",
      "/fpss/track",
      ".html",
      "/content/"
    ]
  },
  "maxArticlesPerCrawl": 30,
  "minWords": 30,
  "mustHaveDate": false,
  "onlyInsideArticles": true,
  "onlyNewArticles": false,
  "proxyConfiguration": {
    "useApifyProxy": true
  },
  "saveHtml": false,
  "useBrowser": false,
  "useGoogleBotHeaders": false,
  "startUrls": [
    {
      "url": "https://www.nova-incasso.nl/blogs-en-nieuws/"
    }
  ]
}

User avatar

Hi Yannick,

It seems you are using the old build 0.0.10. The new version is 1.0.68, and it contains the "startUrl" field. See https://console.apify.com/view/runs/zjlf4Sk2NK4k4qsNa

ybierens

10 months ago

Hey, the build is set to 'latest', so that's weird

ybierens

10 months ago

Hey, I checked out https://console.apify.com/view/runs/zjlf4Sk2NK4k4qsNa and I did not find a start URL in the output JSON.

Patai5

10 months ago

Hi, try checking the "all fields" option. It's not displayed under "overview".

User avatar

The 'latest' build should not be there (the platform team needs to remove it manually because the version was already deleted). You need to use 'version-1'.

ybierens

10 months ago

I see it now! Under the 'referer' column. Thank you! :)

User avatar

Referer is only the previous page that linked to this one. But there is a startUrl column as well.
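
With a startUrl field on every item, splitting the results per input URL becomes a simple grouping step once the dataset has been exported as JSON. The sketch below assumes each item is a dict carrying `startUrl` and `referer` as mentioned in this thread; the `url` field name, the `group_by_start_url` helper, and the sample items are hypothetical, made up for illustration.

```python
from collections import defaultdict


def group_by_start_url(items):
    """Group exported dataset items by their 'startUrl' field."""
    grouped = defaultdict(list)
    for item in items:
        # Items without the field (e.g. from an old build) go into one bucket.
        grouped[item.get("startUrl", "<unknown>")].append(item)
    return dict(grouped)


# Hypothetical sample items; real items contain many more fields.
items = [
    {"url": "https://site-a.example/post-1", "startUrl": "https://site-a.example/",
     "referer": "https://site-a.example/"},
    {"url": "https://site-b.example/news/2", "startUrl": "https://site-b.example/",
     "referer": "https://site-b.example/news"},
    {"url": "https://blog.other.example/post-3", "startUrl": "https://site-a.example/",
     "referer": "https://site-a.example/post-1"},
]

grouped = group_by_start_url(items)
for start_url, articles in grouped.items():
    print(start_url, len(articles))
```

Note that the third sample item lives on a different domain and has a referer that is not the start URL, yet it still lands in the right group, because grouping relies on startUrl rather than on the article link or the referer.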

Developer
Maintained by Apify
Actor metrics
  • 185 monthly users
  • 100.0% runs succeeded
  • 2.8 days response time
  • Created in Nov 2019
  • Modified about 2 months ago