Web Scraper

Pricing

Pay per usage

Developed by Apify

Maintained by Apify

Crawls arbitrary websites using a web browser and extracts structured data from web pages using a provided JavaScript function. The Actor supports both recursive crawling and lists of URLs, and automatically manages concurrency for maximum performance.

4.5 (22)

665

Total users: 81.2k
Monthly users: 3.8k
Runs succeeded: >99%
Response time: 21 days
Last modified: 6 days ago

Exclude globs not being applied

Closed

friendly_countryside opened this issue
a year ago

Despite being excluded, the web-scraper still seems to be trying to scan jpg and other graphic files.

"excludes": [ { "glob": "/**/*.{png,jpg,jpeg,pdf}" } ],

jindrich.bar

Hello and thank you for your interest in this Actor!

You are (literally) one step from the correct solution :) To disable enqueuing files with certain extensions, you can use the following glob: **/*.{png,jpg,jpeg,pdf}, i.e. without the leading slash. Note that you can also test your globs using the Test glob feature in Apify Console. This tiny tool replicates the glob resolution behavior from the Actor, so you can be sure that your globs are doing exactly what you expect them to do.

Did this answer your question? I'll close this issue now, but feel free to ask any additional questions in case of any problems. Thanks!

friendly_countryside

a year ago

Hi Jindřich, thanks for the response.

That would suggest that the default value for the Exclude glob isn't correct!

jindrich.bar

Oops, haha, I didn't notice that this is indeed the default setting :) The point is that the default setting only disables the enqueuing of files with relative URLs (or, more precisely, only URLs that begin with a forward slash).

While this might remove some of the png / jpg / jpeg / pdf files from the crawl, it won't (as you have noticed in your run) remove all such files. To do so, I recommend using the glob without the leading slash.

Thank you for noticing though, I'll discuss this with our team and we'll most likely replace this example with something more usable. Cheers! :)
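
To make the difference between the two globs concrete, here is a minimal sketch in plain JavaScript. It assumes a deliberately simplified glob-to-regex translation (`globToRegExp` and `matchesGlob` are hypothetical helpers written for this illustration, and the `example.com` URLs are made up); it is not the Actor's actual glob resolution, so the Test glob feature in Apify Console remains the authoritative check.

```javascript
// Simplified glob matcher for illustration only. It is a stand-in built on
// assumptions, not the glob resolution the Actor actually uses.
function globToRegExp(glob) {
  let pattern = '';
  let braceDepth = 0;
  for (let i = 0; i < glob.length; i++) {
    const ch = glob[i];
    if (ch === '*' && glob[i + 1] === '*') {
      pattern += '.*'; // "**" may span "/" separators
      i++;
    } else if (ch === '*') {
      pattern += '[^/]*'; // "*" stays within one path segment
    } else if (ch === '{') {
      pattern += '(?:'; // "{a,b}" becomes the alternation (?:a|b)
      braceDepth++;
    } else if (ch === '}' && braceDepth > 0) {
      pattern += ')';
      braceDepth--;
    } else if (ch === ',' && braceDepth > 0) {
      pattern += '|';
    } else {
      // Escape characters that are special in regular expressions.
      pattern += ch.replace(/[.+?^$()|[\]\\{}]/g, '\\$&');
    }
  }
  // Anchor the pattern so the whole URL string has to match.
  return new RegExp(`^${pattern}$`);
}

function matchesGlob(glob, url) {
  return globToRegExp(glob).test(url);
}

const withSlash = '/**/*.{png,jpg,jpeg,pdf}'; // the default quoted in this thread
const withoutSlash = '**/*.{png,jpg,jpeg,pdf}'; // the glob suggested in this thread

// Hypothetical URLs used only for this demonstration.
const urls = [
  'https://example.com/images/logo.png', // absolute URL of an image
  '/images/logo.png', // URL beginning with a forward slash
  'https://example.com/about', // ordinary page, should never be excluded
];

for (const url of urls) {
  console.log(url, {
    withSlash: matchesGlob(withSlash, url),
    withoutSlash: matchesGlob(withoutSlash, url),
  });
}
```

Under these assumptions, the glob with the leading slash matches only `/images/logo.png`, while the glob without it matches both image URLs and leaves the `/about` page alone. That is consistent with the behavior described above: a pattern anchored on `/` cannot match a string that starts with `https://`.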

friendly_countryside

a year ago

Sounds good, thanks!