Web Scraper

Pricing

Pay per usage

Developed by Apify

Maintained by Apify

Crawls arbitrary websites using a web browser and extracts structured data from web pages using a provided JavaScript function. The Actor supports both recursive crawling and lists of URLs, and automatically manages concurrency for maximum performance.

4.4 (23)

926

Total users: 91K

Monthly users: 5K

Runs succeeded: >99%

Issues response: 7.8 days

Last modified: 2 months ago

Crawling not working well

Closed

agat opened this issue a year ago

I attempted to crawl the website https://jcyared.com, setting the maximum number of pages per crawl (maxPagesPerCrawl parameter) to 20. However, I only managed to retrieve 2 pages. Could someone explain why this might have occurred?

adamek replied

There are two problems in your input:

  • You set maxCrawlingDepth to 0, which means nothing nested will be enqueued.
  • You set the globs to https://jcyared.com/*, which also prevents nesting: it accepts https://jcyared.com/foo but not https://jcyared.com/foo/bar. Use https://jcyared.com/** instead to allow multiple slashes in the URL path.

The second one is the important bit. Here is a run with both fixed, which seems to work as expected (I aborted it after a few minutes, but it had already gone through more than 40 pages):

https://console.apify.com/view/runs/pDfb703n0fEdaUHyv
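
To illustrate the glob difference described above, here is a minimal Python sketch of how a single `*` and a double `**` can behave differently with respect to slashes. It is a simplified stand-in written for this thread, not the matcher the Actor actually uses; the helper names `glob_to_regex` and `matches` are hypothetical, and only `*` and `**` are handled.

```python
import re


def glob_to_regex(glob: str) -> re.Pattern:
    """Translate a simplified glob into a compiled regex.

    Assumption for illustration: `*` matches any run of characters
    except `/`, and `**` matches any run of characters including `/`.
    Every other character is treated literally.
    """
    parts = []
    i = 0
    while i < len(glob):
        if glob.startswith("**", i):
            parts.append(".*")       # may cross path segments
            i += 2
        elif glob[i] == "*":
            parts.append("[^/]*")    # stays within one path segment
            i += 1
        else:
            parts.append(re.escape(glob[i]))
            i += 1
    return re.compile("".join(parts))


def matches(glob: str, url: str) -> bool:
    """Return True if the whole URL matches the simplified glob."""
    return glob_to_regex(glob).fullmatch(url) is not None


if __name__ == "__main__":
    urls = ["https://jcyared.com/foo", "https://jcyared.com/foo/bar"]
    for glob in ("https://jcyared.com/*", "https://jcyared.com/**"):
        for url in urls:
            print(glob, url, matches(glob, url))
```

Under these assumptions, the single-star glob accepts only the one-segment URL, while the double-star glob accepts both, which is consistent with the crawl stopping early when nested pages were filtered out.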