Cheerio Scraper

apify/cheerio-scraper

Crawls websites using raw HTTP requests, parses the HTML with the Cheerio library, and extracts data from the pages using Node.js code. Supports both recursive crawling and lists of URLs. This Actor is a high-performance alternative to apify/web-scraper for websites that do not require JavaScript.

Some sites failing with "session error"

Closed

dominicz opened this issue
8 months ago

This scraper mostly works fine, but for some sites (for example, https://uggrenew.com/) it fails almost immediately with the error below. Is there a way to fix this in the scraper? Or is this due to some setting on the site (I have access to the site, so we might be able to fix that)? Help appreciated.

WARN CheerioCrawler: Reclaiming failed request back to the list or queue. Detected a session error, rotating session...
2023-09-16T03:11:30.669Z Proxy responded with 590 UPSTREAM502: 0 bytes
2023-09-16T03:11:30.670Z
2023-09-16T03:11:30.671Z {"id":"btKVLbPR2NzCFj5","url":"https://uggrenew.com/","retryCount":1}

User avatar

Hello, looking into the logs, this looks like some kind of a temporary server error or bad proxy luck. I just managed to run the actor with identical input without any problems.

If the server is dead, the crawler run will always fail - but in case this is caused by bad proxies, there are a few things you can try:

  • set the proxy setting to Automatic proxy
  • set the proxy rotation option to Use recommended settings
  • unset the proxy country (the crawler will have a larger proxy pool to pick from) - I see that this is the only thing you haven't tried yet :)
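For anyone starting runs through the API rather than the Console, the three settings above could look roughly like the input below. This is only a sketch: `buildScraperInput` is a hypothetical helper, the page function is a placeholder, and the field names (`proxyConfiguration`, `useApifyProxy`, `proxyRotation`) are my reading of the Actor's input schema, so please check them against the Input tab before relying on them.

```javascript
// Hypothetical helper (not part of the Actor): builds a Cheerio Scraper input
// with automatic proxy, recommended rotation, and no proxy country.
// Field names are assumed from the Actor's input schema; verify in the Input tab.
function buildScraperInput(startUrl) {
  return {
    startUrls: [{ url: startUrl }],
    // Placeholder page function, passed as a string in the Actor input.
    pageFunction: `async function pageFunction(context) {
      const { $, request } = context;
      return { url: request.url, title: $('title').text() };
    }`,
    // Automatic proxy: no apifyProxyGroups and no apifyProxyCountry,
    // so the proxy pool is as large as the account allows.
    proxyConfiguration: { useApifyProxy: true },
    // Assumed enum value for "Use recommended settings".
    proxyRotation: 'RECOMMENDED',
  };
}

console.log(JSON.stringify(buildScraperInput('https://uggrenew.com/'), null, 2));
```

Leaving the groups and country out entirely, rather than setting them to empty values, is what keeps the selection automatic in this sketch.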

Unfortunately, without reproducing this issue, I cannot really provide more help right now.

Can you please go through the above steps and confirm whether they helped? Thank you!

dominicz

8 months ago

Thanks for getting back to me!

I've followed your instructions, including unsetting the proxy country. I tried a couple of times (see for example runs QTDp6S4dk02iYvJos and lCJRIlByoOqf3Q0qb), but I keep getting this specific error. And it's just for https://uggrenew.com/ (this is a client site we need to scrape; all other client sites have worked fine so far).

So you're saying that running with the same inputs (e.g. for https://uggrenew.com/) works fine on your side? Let me know if you have a dataset_id that you could share. Could it be that there is some caching on my account that causes it to always fail after it has failed once?

User avatar

Hello, sorry for taking longer - I have looked into your runs and tried running them under my account with the exact same input once again - here are the results (https://console.apify.com/view/runs/gFCbuRzozEINfZQce).

Looking into your account, I see that you have only two Proxy groups enabled (BUYPROXIES94952 and StaticUS3) - this is based on your plan. The "Automatic" proxy option can only choose from those two groups. I tried running the Cheerio Scraper under my account with those proxies and finally got the same error as you did 🎉

I can also see that you have the RESIDENTIAL proxies available. These are larger proxy groups with IP addresses from the consumer ranges, so they are very hard for webmasters to block. If you route the scraper traffic through these groups (by choosing Proxy and HTTP Configuration > Selected Proxies > RESIDENTIAL), you should finally be able to crawl the uggrenew.com page.

Keep in mind that the residential proxies come at a higher price than the datacenter ones, though. I would advise you to use them only when needed (just like in this case). You can learn more about the proxies and their pricing at https://apify.com/proxy.
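If you scrape several client sites from your own code, one way to keep residential traffic to a minimum is to pick the proxy configuration per site before starting each run. The snippet below is just a sketch of that idea: `proxyConfigurationFor` and the `residentialHosts` set are hypothetical names, and the `apifyProxyGroups` field is assumed from the Actor's input schema.

```javascript
// Hypothetical helper: returns the proxyConfiguration to put into the Actor
// input for a given site. Only hosts listed in residentialHosts are routed
// through the RESIDENTIAL group; everything else stays on automatic proxy.
function proxyConfigurationFor(url, residentialHosts) {
  const { hostname } = new URL(url);
  if (residentialHosts.has(hostname)) {
    // Assumed field name from the Actor's input schema.
    return { useApifyProxy: true, apifyProxyGroups: ['RESIDENTIAL'] };
  }
  return { useApifyProxy: true };
}

// Sites known to reject the datacenter proxies (here, the one from this issue).
const residentialHosts = new Set(['uggrenew.com']);

console.log(proxyConfigurationFor('https://uggrenew.com/', residentialHosts));
console.log(proxyConfigurationFor('https://example.com/', residentialHosts));
```

Since the proxy settings apply to a whole run, this assumes one run per site (or one run per group of sites that share the same proxy needs).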

Once again, sorry for the delay, thank you for your patience and let us know whether this has solved your issue. Thanks!

User avatar

Closing due to inactivity.

Developer
Maintained by Apify