Web Scraper

Pricing

Pay per usage

Developed by

Apify

Maintained by Apify

Crawls arbitrary websites using a web browser and extracts structured data from web pages using a provided JavaScript function. The Actor supports both recursive crawling and lists of URLs, and automatically manages concurrency for maximum performance.

4.5 (23)

864

Total users: 88K
Monthly users: 4.6K
Runs succeeded: >99%
Issues response: 10 days
Last modified: a month ago

Cannot process 13.6 MB file, all of a sudden

Closed

Kaloyan Pavlov (abcmallorca) opened this issue a year ago

Hello, I have the following issue. The Actor has been working well for me, but all of a sudden, when one of the websites started providing a larger XML file (it contains more data), Apify can no longer process it. I tried giving it a longer timeout, more memory, and everything else I could think of, even waiting 40 minutes, but nothing helps. Smaller XML files still work fine. How can I debug this, and do you know what the fix might be?

adamek

Hi, and sorry for the wait. We looked into this problem, but unfortunately we haven't found a way to make it work with Web Scraper.

The timeouts you see in the logs should be pretty clear: you are trying to do something in the request handler that takes too long. Those timeouts are configurable. Web Scraper has two options for this: pageLoadTimeoutSecs, the navigation timeout (making the request and fetching the data), and pageFunctionTimeoutSecs, the processing timeout (your request handler function).

In your case, the problem is in the first part, which fails to load (and parse) such a huge file within the 60 s limit. If you increase pageLoadTimeoutSecs to 300 s (5 minutes), you can get past the error you see now, only to hit another one that says Page crashed!. That is the one I was trying to deal with yesterday, and unfortunately I haven't succeeded.
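For illustration, the two timeout options could be set in the Actor input roughly like this. This is a minimal sketch, not the input from the original run: the helper name buildWebScraperInput, the example URL, and the placeholder page function are made up, and the 60 s default for pageFunctionTimeoutSecs is an assumption (the thread only mentions a 60 s limit for page load).

```javascript
// Hypothetical helper that builds an input object for the Web Scraper Actor.
// Only pageLoadTimeoutSecs and pageFunctionTimeoutSecs come from the thread;
// the URL, the page function body, and the defaults are placeholders/assumptions.
function buildWebScraperInput(
    startUrl,
    { pageLoadTimeoutSecs = 60, pageFunctionTimeoutSecs = 60 } = {},
) {
    return {
        startUrls: [{ url: startUrl }],
        // The page function is passed to the Actor as a string of JavaScript source.
        pageFunction: `async function pageFunction(context) {
            return { url: context.request.url };
        }`,
        // Navigation timeout: making the request and loading the page.
        pageLoadTimeoutSecs,
        // Processing timeout: how long the page function itself may run.
        pageFunctionTimeoutSecs,
    };
}

// Raise only the navigation timeout to 5 minutes, as suggested above.
const input = buildWebScraperInput('https://example.com/feed.xml', {
    pageLoadTimeoutSecs: 300,
});

console.log(JSON.stringify(input, null, 2));
```

As noted above, raising the navigation timeout only moves the failure further along for a file this large; it does not remove the browser's own limits.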

With that said, the problem comes from trying to use a browser for this task (which has its own limits). I ran the same code with Cheerio Scraper and it works fine there (since it just downloads the file directly, there is no browser involved). Would that be a solution for you?

Here is a run link for the Cheerio version. I only had to make a few minor changes to the code:

https://console.apify.com/view/runs/DsWobimMCAY1F9Xg2

I guess you won't be able to see into the input of my run, so here is the request handler code, I only had to change few small bi... [trimmed]

abcmallorca

Thank you, the solution worked perfectly. The other Actor works better for getting the XML content.

jindrich.bar

Closing due to inactivity (and also because it seems solved :))