Pricing

Pay per usage

Cheerio Scraper

Developed by

Apify

Crawls websites using raw HTTP requests, parses the HTML with the Cheerio library, and extracts data from the pages using Node.js code. Supports both recursive crawling and lists of URLs. This actor is a high-performance alternative to apify/web-scraper for websites that do not require JavaScript.

4.7 (10)

177

Total users

9.1K

Monthly users

924

Runs succeeded

>99%

Issues response

12 days

Last modified

2 months ago

Developer tools

Open source

Some sites failing with "session error"

Closed

dominicz opened this issue

This scraper works mostly fine, but for some sites (for example, https://uggrenew.com/) it fails almost immediately with the error below. Is there a way to fix this in the scraper? Or is this due to some setting on the site (I have access to the site, so we might be able to fix that)? Help appreciated.

WARN CheerioCrawler: Reclaiming failed request back to the list or queue. Detected a session error, rotating session...
2023-09-16T03:11:30.669Z Proxy responded with 590 UPSTREAM502: 0 bytes
2023-09-16T03:11:30.670Z
2023-09-16T03:11:30.671Z {"id":"btKVLbPR2NzCFj5","url":"https://uggrenew.com/","retryCount":1}
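
For triage, the proxy line in a log like the one above can be picked apart programmatically. The following is a minimal sketch, not part of Cheerio Scraper or Crawlee: `parseProxyError` and its regular expression are hypothetical helpers, and the message shape is assumed from this single log excerpt.

```javascript
// Hypothetical helper: the pattern is inferred from one log line in this
// thread ("Proxy responded with 590 UPSTREAM502: 0 bytes"), so other
// proxy messages may not match it.
const PROXY_ERROR_PATTERN = /Proxy responded with (\d{3}) ([A-Z_]+?)(\d{3})?: (\d+) bytes/;

// Returns the parsed parts of a proxy error line, or null if the line
// does not look like a proxy error.
function parseProxyError(line) {
    const match = PROXY_ERROR_PATTERN.exec(line);
    if (!match) return null;
    const [, status, label, upstreamStatus, bytes] = match;
    return {
        status: Number(status),           // status reported by the proxy
        label,                            // textual label, e.g. "UPSTREAM"
        upstreamStatus: upstreamStatus ? Number(upstreamStatus) : null,
        bytes: Number(bytes),             // size of the response body
    };
}

const exampleLine = '2023-09-16T03:11:30.669Z Proxy responded with 590 UPSTREAM502: 0 bytes';
console.log(parseProxyError(exampleLine));
```

For the line in this issue, the helper reports status 590, label UPSTREAM, upstream status 502, and 0 bytes. Reading that as "the proxy's upstream answered with 502" is an interpretation of the message text, not something confirmed in this thread.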

Jindřich Bär (jindrich.bar)

Hello, looking into the logs, this looks like some kind of a temporary server error or bad proxy luck. I just managed to run the actor with identical input without any problems.

If the server is dead, the crawler run will always fail - but in case this is caused by bad proxies, there are a few things you can try:

set the proxy setting to Automatic proxy
set the proxy rotation option to Use recommended settings
unset the proxy country (the crawler will have a larger proxy pool to pick from) - I see that this is the only thing you haven't tried yet :)

Unfortunately, without reproducing this issue, I cannot really provide more help right now.

Can you please go through the above steps and confirm whether they helped? Thank you!
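
The three suggestions above map roughly onto the Actor's input. A minimal sketch, assuming the input uses the field names `proxyConfiguration.useApifyProxy`, `apifyProxyGroups`, `apifyProxyCountry`, and `proxyRotation` with the value `RECOMMENDED`. None of these names appear in this thread, so check them against the Actor's input schema. `withWidestProxyPool` and the sample input are hypothetical.

```javascript
// Hypothetical helper: returns a copy of an Actor input with the proxy
// settings relaxed as suggested above. All field names are assumptions.
function withWidestProxyPool(input) {
    // Drop the country, explicit groups, and custom proxy URLs so that the
    // "Automatic proxy" option can pick from the whole available pool.
    const {
        apifyProxyCountry,
        apifyProxyGroups,
        proxyUrls,
        ...restProxy
    } = input.proxyConfiguration ?? {};

    return {
        ...input,
        proxyRotation: 'RECOMMENDED', // assumed value for "Use recommended settings"
        proxyConfiguration: { ...restProxy, useApifyProxy: true },
    };
}

// Hypothetical input that pins both a proxy group and a country.
const pinnedInput = {
    startUrls: [{ url: 'https://uggrenew.com/' }],
    proxyConfiguration: {
        useApifyProxy: true,
        apifyProxyGroups: ['EXAMPLE_GROUP'], // placeholder group name
        apifyProxyCountry: 'US',
    },
};

console.log(JSON.stringify(withWidestProxyPool(pinnedInput), null, 2));
```

The helper returns a new object and leaves the original input untouched, so the pinned and relaxed variants can be compared side by side.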

dominicz

Thanks for getting back to me!

I've followed your instructions, including unsetting the proxy country. I tried a couple of times (see for example runs QTDp6S4dk02iYvJos and lCJRIlByoOqf3Q0qb), but I keep getting this specific error. And it's just for https://uggrenew.com/ (this is a client site we need to scrape - all other client sites so far have worked fine).

So you're saying that running with the same inputs (e.g. for https://uggrenew.com/) works fine on your side? Let me know if you have a dataset_id that you could share. Could it be that there is some caching going on in my account which causes it to always fail after it failed once?

Jindřich Bär (jindrich.bar)

Hello, sorry for taking longer - I have looked into your runs and tried running them under my account with the exact same input once again - here are the results (https://console.apify.com/view/runs/gFCbuRzozEINfZQce).

Looking into your account, I see that you have only two Proxy groups enabled (BUYPROXIES94952 and StaticUS3) - this is based on your plan. The "Automatic" proxy option can only choose from those two groups. I tried running the Cheerio Scraper under my account with those proxies and finally got the same error as you did 🎉

I can also see that you have the RESIDENTIAL proxies available. These are larger proxy groups with IP addresses from the consumer ranges, so they are very hard for webmasters to block. If you route the scraper traffic through these groups (by choosing Proxy and HTTP Configuration > Selected Proxies > RESIDENTIAL), you should finally be able to crawl the uggrenew.com page.

Keep in mind that the residential proxies come at a higher price than the datacenter ones, though. I would advise you to use them only when needed (just like in this case). You can learn more about the proxies and different pricing at https://apify.com/proxy.

Once again, sorry for the delay, thank you for your patience and let us know whether this has solved your issue. Thanks!
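
The "use residential proxies only when needed" advice can be expressed as a simple escalation: try the cheaper option first and fall back to RESIDENTIAL only if it fails. This is a sketch under stated assumptions, not an Apify API: `scrapeWithEscalation`, `PROXY_TIERS`, and the caller-supplied `runScrape` function are hypothetical, and the `useApifyProxy` / `apifyProxyGroups` field names are assumed rather than taken from this thread.

```javascript
// Proxy tiers ordered from cheapest to most expensive. An empty group list
// is assumed to stand for the "Automatic" option; 'RESIDENTIAL' is the
// group named in the comment above.
const PROXY_TIERS = [
    { name: 'automatic', apifyProxyGroups: [] },
    { name: 'residential', apifyProxyGroups: ['RESIDENTIAL'] },
];

// `runScrape(url, proxyConfiguration)` is a hypothetical async function
// supplied by the caller. It should resolve with the scrape result and
// reject when the run fails (for example, on repeated session errors).
async function scrapeWithEscalation(url, runScrape, tiers = PROXY_TIERS) {
    const errors = [];
    for (const tier of tiers) {
        const proxyConfiguration = {
            useApifyProxy: true,
            apifyProxyGroups: tier.apifyProxyGroups,
        };
        try {
            const result = await runScrape(url, proxyConfiguration);
            return { tier: tier.name, result }; // stop at the first tier that works
        } catch (error) {
            errors.push(`${tier.name}: ${error.message}`);
        }
    }
    throw new Error(`All proxy tiers failed for ${url} (${errors.join('; ')})`);
}
```

Because the tiers are tried in order, sites that work with the cheaper proxies never touch the residential pool. Only a site that fails on the first tier, as uggrenew.com did here, incurs the higher cost.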

Jindřich Bär (jindrich.bar)

Closing due to inactivity.

Puppeteer Scraper

apify/puppeteer-scraper

Crawls websites with headless Chrome and the Puppeteer library using provided server-side Node.js code. This crawler is an alternative to apify/web-scraper that gives you finer control over the process. Supports both recursive crawling and lists of URLs. Supports logging in to websites.

Apify

8.4K

5.0

Web Scraper

apify/web-scraper

Crawls arbitrary websites using a web browser and extracts structured data from web pages using a provided JavaScript function. The Actor supports both recursive crawling and lists of URLs, and automatically manages concurrency for maximum performance.

Apify

90K

4.4

BeautifulSoup Scraper

apify/beautifulsoup-scraper

Crawls websites using raw HTTP requests. It parses the HTML with the BeautifulSoup library and extracts data from the pages using Python code. Supports both recursive crawling and lists of URLs. This Actor is a Python alternative to Cheerio Scraper.

Apify

870

4.2

Vanilla JS Scraper

mstephen190/vanilla-js-scraper

Scrape the web using familiar JavaScript methods! Crawls websites using raw HTTP requests, parses the HTML with the JSDOM package, and extracts data from the pages using Node.js code. Supports both recursive crawling and lists of URLs. This actor is a non-jQuery alternative to Cheerio Scraper.

Matthias Stephens

471

Playwright Scraper

apify/playwright-scraper

Crawls websites with a headless Chromium, Chrome, or Firefox browser and the Playwright library using provided server-side Node.js code. Supports both recursive crawling and a list of URLs. Supports login to a website.

Apify

3.6

HTML Scraper pro

scrapingxpert/html-scraper-pro

The HTML Scraper Pro is a powerful tool designed to extract the HTML source code and metadata from websites. It uses advanced web scraping techniques to retrieve the full HTML content of web pages, the page title, and the HTTP status code. This tool is ideal for data extraction, website analysis, and archiving.

scrapingxpert

100

Camoufox Scraper

apify/camoufox-scraper

Crawls websites with the stealthy Camoufox browser and the Playwright library using provided server-side Node.js code. Supports both recursive crawling and a list of URLs. Supports login to a website.

Apify

Javascript Library Detail Scraper

cykieffodh/javascript-library-detail-scraper

Javascript Library Detail Scraper

Michael Laflin

JSDOM Scraper

apify/jsdom-scraper

Parses the HTML using the JSDOM library, providing the same DOM API as browsers do (e.g. `window`). It is able to process client-side JavaScript without using a real browser. Performance-wise, it stands somewhere between the Cheerio Scraper and the browser scrapers.

Apify

4.3

Nodejs Runner

martin.forejt/nodejs-runner

This Actor allows you to quickly run arbitrary JavaScript code in a real Node.js environment, making it ideal for testing, debugging, or executing small scripts without setting up a local Node.js instance. The actor spawns a separate Node.js process to run the provided code and captures the logs.