
Web Scraper
Pricing
Pay per usage

Crawls arbitrary websites using a web browser and extracts structured data from web pages using a provided JavaScript function. The Actor supports both recursive crawling and lists of URLs, and automatically manages concurrency for maximum performance.
Exclude globs not being applied
Closed
Despite being excluded, the web-scraper still seems to be trying to scan jpg and other graphic files.
"excludes": [ { "glob": "/**/*.{png,jpg,jpeg,pdf}" } ],
Hello and thank you for your interest in this Actor!
You are (literally) one step from the correct solution :) To disable enqueuing files with certain extensions, you can use the following glob: **/*.{png,jpg,jpeg,pdf}, i.e. without the leading slash.
Note that you can also test your globs using the Test glob feature in Apify Console. This tiny tool replicates the glob resolution behavior from the Actor, so you can be sure that your globs are doing exactly what you expect them to do.
Did this answer your question? I'll close this issue now, but feel free to ask any additional questions in case of any problems. Thanks!
friendly_countryside
Hi Jindřich, thanks for the response.
That would suggest that the default value for the Exclude glob isn't correct!
Oops, haha, I didn't notice that this is indeed the default setting :) The point is that the default setting only disables the enqueuing of files with relative URLs (or, more precisely - only URLs that begin with a forward slash). While this might remove some of the png / jpg / jpeg / pdf files from the crawl, it won't (as you have noticed in your run) remove all such files. To do so, I recommend using the glob without the leading slash.
Thank you for noticing though, I'll discuss this with our team and we'll most likely replace this example with something more usable. Cheers! :)
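
To illustrate why the leading slash matters, here is a minimal Python sketch. It is not the Actor's actual matcher: glob_to_regex and is_excluded are hypothetical helpers that approximate a small glob subset (**, *, ?, {a,b}), and the example.com URLs are made up for the demonstration. The real glob resolution may differ in edge cases, so the Test glob feature remains the authoritative check.

```python
import re


def glob_to_regex(glob: str) -> re.Pattern:
    """Translate a small glob subset (**, *, ?, {a,b}) into an anchored regex.

    Hypothetical helper for illustration only; it approximates glob
    semantics and is not the implementation used by the Actor.
    """
    out = []
    i = 0
    in_braces = False
    while i < len(glob):
        ch = glob[i]
        if glob.startswith("**/", i):
            # Globstar plus slash: zero or more whole path segments.
            out.append("(?:.*/)?")
            i += 3
            continue
        if glob.startswith("**", i):
            # Bare globstar: anything, including slashes.
            out.append(".*")
            i += 2
            continue
        if ch == "*":
            out.append("[^/]*")  # any run of characters within one segment
        elif ch == "?":
            out.append("[^/]")  # exactly one character within a segment
        elif ch == "{":
            out.append("(?:")
            in_braces = True
        elif ch == "}" and in_braces:
            out.append(")")
            in_braces = False
        elif ch == "," and in_braces:
            out.append("|")
        else:
            out.append(re.escape(ch))
        i += 1
    return re.compile("".join(out) + r"\Z")


def is_excluded(url: str, glob: str) -> bool:
    """Return True if the whole URL string matches the exclude glob."""
    return glob_to_regex(glob).match(url) is not None


DEFAULT_GLOB = "/**/*.{png,jpg,jpeg,pdf}"  # glob from the issue report
FIXED_GLOB = "**/*.{png,jpg,jpeg,pdf}"  # suggested glob, no leading slash

absolute_url = "https://example.com/images/logo.png"  # made-up example URL
print(is_excluded(absolute_url, DEFAULT_GLOB))  # absolute URL vs. leading-slash glob
print(is_excluded(absolute_url, FIXED_GLOB))  # absolute URL vs. suggested glob
print(is_excluded("/images/logo.png", DEFAULT_GLOB))  # slash-prefixed URL
```

Under these assumptions, the glob with the leading slash can only match strings that themselves start with a forward slash, so an absolute URL beginning with https:// is never excluded by it, while the glob without the leading slash matches the same URL.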
friendly_countryside
Sounds good, thanks!