Extended GPT Scraper

drobnikj/extended-gpt-scraper

Extract data from any website and feed it into GPT via the OpenAI API. Use ChatGPT to proofread content, analyze sentiment, summarize reviews, extract contact details, and much more.

Lots of problems with large crawls

Closed

knowing_omen opened this issue
a month ago

Hello,

Could you have a look at our Actor's log? We have around 3000 websites to crawl. It goes well at the beginning, but after a while a lot of errors appear when crawling very basic websites. I'm OK with the SSL errors, but not with errors on websites that have no problems.

Could you please fine-tune your script to avoid so many retries of websites that load correctly?

knowing_omen

a month ago

pwoChdGX9BndgCIrJ

knowing_omen

a month ago

Anyone here?

lukas.prusa

Hi Hosting, thanks for opening this issue!

Yes, I can confirm that this is a bug in the scraper. Exactly as you say, this is a problem for large crawls.

Basically, the crawler automatically scales request concurrency up and down, but for large enough crawls with a lot of memory it appears to scale too rapidly, so concurrency swings up and down periodically. We will investigate this and find a proper value for the scaling function so that it stays at an optimal speed.
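
To illustrate the kind of behaviour described here, below is a minimal toy simulation. It is not the scraper's actual scaling function: the `simulate` and `peak_overshoot` helpers, the fixed `capacity` of 100 concurrent requests, and the percentage step rule are all assumptions made up for this sketch. It only shows why a large scaling step tends to swing far past the sustainable concurrency, while a small step climbs more slowly but stays close to it.

```python
def simulate(step_percent: int, capacity: int = 100,
             start: int = 10, ticks: int = 200) -> list[int]:
    """Toy autoscaler (hypothetical model, not the Actor's real logic).

    Each tick, concurrency grows by step_percent of its current value
    while the system is at or under capacity, and shrinks by the same
    percentage once it is over capacity.
    """
    concurrency = start
    history = []
    for _ in range(ticks):
        history.append(concurrency)
        # Integer ceiling of concurrency * step_percent / 100, at least 1.
        step = max(1, -(-concurrency * step_percent // 100))
        if concurrency > capacity:
            concurrency = max(1, concurrency - step)  # overloaded: back off
        else:
            concurrency += step                       # healthy: scale up
    return history


def peak_overshoot(history: list[int], capacity: int = 100) -> int:
    """How far above capacity the concurrency climbed at its worst."""
    return max(history) - capacity


aggressive = simulate(step_percent=50)  # large steps: fast but unstable
gentle = simulate(step_percent=5)       # small steps: slower but steadier

print("aggressive peak overshoot:", peak_overshoot(aggressive))
print("gentle peak overshoot:", peak_overshoot(gentle))
```

In this model the gentle setting needs more ticks to reach capacity, which is the trade-off to expect from any fix that reduces the scaling step.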

By the way, we will also look into the SSL error. I think we could adjust the browser to ignore it, because it is most likely just a security concern on the website's side. I can't promise that, though, because it might not be possible to disable that check in the browser.

I will keep you updated here, thanks!

knowing_omen

a month ago

OK, thank you. Please keep us posted; more large crawls to come.

lukas.prusa

Hi again, thanks for your patience!

We've just updated the scraper with both fixes :) It will now scale accordingly and ignore HTTPS certificate errors for the broken websites. Note that it will scale up a little more slowly than before, but at least it should no longer overshoot.

Try it out and let me know how it works, thanks!

Developer
Maintained by Apify
Actor metrics
  • 75 monthly users
  • 26 stars
  • 99.2% runs succeeded
  • 7.6 days response time
  • Created in Jun 2023
  • Modified 24 days ago