Smart Article Extractor

lukaskrivka/article-extractor-smart

📰 Smart Article Extractor extracts articles from any scientific, academic, or news website with just one click. The extractor crawls the whole website and automatically distinguishes articles from other web pages. Download your data as HTML table, JSON, Excel, RSS feed, and more.

Make it possible to know startURL of scraped article

Closed

ybierens opened this issue
a year ago

Can you make it possible to separate the output per originally inputted URL? I want to pass multiple URLs to the scraper instead of starting a run for each URL, but right now that's impossible because there is no effective way to know which start URL each article was scraped from.

Patai5

a year ago

Hi Yannick, could you please provide us with the details of the run, either by sharing a link or the input? I just ran the Actor and the startUrl field was there and working.

ybierens

a year ago

Hey, thank you for your reply. I think I did not describe it clearly. What I mean is this: if I put 2 URLs in the startUrls array, e.g. [facebook.com, instagram.com], and the tool starts scraping both websites, there is no clear way to tell for each result whether it came from facebook.com or instagram.com. I could of course take the link of each scraped article and figure out the original start URL from there, but if a website posts its blogs on a completely different URL, that would no longer work. Most helpful would be a 'startUrl' column for each scraped article. Am I explaining it well now, or perhaps not? Let me know :)

Thank you for making this tool, it works amazingly well.

Patai5

a year ago

Thanks, you explained it well enough in the original suggestion. Perhaps you didn't check the "all fields" option under your results; the startUrl field is not shown in the simplified view. If, due to some bug, it's not present there either, please share the run's link here. Thanks!

ybierens

a year ago

Hey,

Here is an input example.

{
  "extendOutputFunction": "($) => {\n const result = {};\n // Uncomment to add a title to the output\n // result.pageTitle = $('title').text().trim();\n\n return result;\n}",
  "isUrlArticleDefinition": {
    "minDashes": 4,
    "hasDate": true,
    "linkIncludes": ["article", "storyid", "?p=", "id=", "/fpss/track", ".html", "/content/"]
  },
  "maxArticlesPerCrawl": 30,
  "minWords": 30,
  "mustHaveDate": false,
  "onlyInsideArticles": true,
  "onlyNewArticles": false,
  "proxyConfiguration": { "useApifyProxy": true },
  "saveHtml": false,
  "useBrowser": false,
  "useGoogleBotHeaders": false,
  "startUrls": [
    { "url": "https://www.nova-incasso.nl/blogs-en-nieuws/" }
  ]
}
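
To pass several sites in one run, as the original request describes, the startUrls array in an input like the one in this comment would hold more than one entry. A minimal Python sketch, assuming each entry is an object with a "url" key as in that example; the helper name build_start_urls and the second URL are hypothetical and only there for illustration:

```python
import json

def build_start_urls(urls):
    """Wrap plain URL strings in the {"url": ...} objects the input uses."""
    return [{"url": u} for u in urls]

# Only a few of the input fields are repeated here to keep the sketch short.
run_input = {
    "maxArticlesPerCrawl": 30,
    "minWords": 30,
    "startUrls": build_start_urls([
        "https://www.nova-incasso.nl/blogs-en-nieuws/",
        "https://example.com/blog/",  # hypothetical second site
    ]),
}

print(json.dumps(run_input, indent=2))
```

The resulting dictionary can be serialized to JSON and used as the run input; the remaining fields from the example would be added in the same way.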

lukaskrivka

Hi Yannick,

It seems you are using the old build 0.0.10. The new version is 1.0.68, and it contains the "startUrl" field. See https://console.apify.com/view/runs/zjlf4Sk2NK4k4qsNa

ybierens

a year ago

Hey, the build is set to 'latest', so that's weird

ybierens

a year ago

Hey, I checked out https://console.apify.com/view/runs/zjlf4Sk2NK4k4qsNa and I did not find a start URL in the output JSON.

Patai5

a year ago

Hi, try checking the "all fields" option. It's not displayed under "overview".

lukaskrivka

The 'latest' build should not be there. The version behind it was already deleted, so the platform team needs to remove the tag manually. You need to use 'version-1'.
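
For runs started through the API rather than the console, the build can be pinned in the request itself. The sketch below only constructs the request URL. It assumes the public Apify API v2 "run Actor" endpoint, its build query parameter, and the "~" separator in the Actor ID, none of which are stated in this thread; the function name run_actor_url is hypothetical. Check the API reference before relying on it.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"  # assumed public API base URL

def run_actor_url(actor_id, build):
    """Build the URL for starting a run of `actor_id` pinned to `build`."""
    # Assumption: the API path uses "~" where the Actor name uses "/".
    path_id = actor_id.replace("/", "~")
    return f"{API_BASE}/acts/{path_id}/runs?{urlencode({'build': build})}"

url = run_actor_url("lukaskrivka/article-extractor-smart", "version-1")
print(url)
```

Starting a run would mean sending a POST request to this URL with an API token and the JSON input as the body; that step is left out because it needs credentials.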

ybierens

a year ago

I see it now! Under the 'referer' column. Thank you! :)

lukaskrivka

Referer is only the previous page that linked to this one. But there is a startUrl column as well.
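
Once the startUrl column is present, splitting the output per start URL, as the original request asked, can be done after download. A minimal Python sketch, assuming each dataset item is a dictionary whose startUrl value is a plain string; the sample items, their url values, and the function name group_by_start_url are hypothetical:

```python
from collections import defaultdict

def group_by_start_url(items):
    """Group dataset items by startUrl (not by referer, which is only the linking page)."""
    groups = defaultdict(list)
    for item in items:
        # Items without the field are collected under "unknown".
        groups[item.get("startUrl", "unknown")].append(item)
    return dict(groups)

# Hypothetical items; the second article lives on a different host than its start URL.
items = [
    {"url": "https://a.example/post-1", "referer": "https://a.example/blog", "startUrl": "https://a.example/"},
    {"url": "https://cdn.b.example/post-2", "referer": "https://b.example/news", "startUrl": "https://b.example/"},
    {"url": "https://a.example/post-3", "referer": "https://a.example/post-1", "startUrl": "https://a.example/"},
]

grouped = group_by_start_url(items)
for start_url, articles in grouped.items():
    print(start_url, len(articles))
```

Grouping on startUrl keeps working when a site publishes articles under a different host, which is the case where deriving the origin from the article link or the referer breaks down.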

Developer
Maintained by Apify
Actor metrics
  • 194 monthly users
  • 47 stars
  • 99.9% runs succeeded
  • 1.9 days response time
  • Created in Nov 2019
  • Modified 15 days ago
Categories