Web Scraper

apify/web-scraper

Crawls arbitrary websites using the Chrome browser and extracts data from pages using provided JavaScript code. The actor supports both recursive crawling and lists of URLs, and it automatically manages concurrency for maximum performance. This is Apify's basic tool for web crawling and scraping.


Cannot process 13.6 MB file, all of a sudden

Closed

Kaloyan Pavlov (abcmallorca) opened this issue
3 months ago

Hello, I have the following issue: the actor has been working well for me, but all of a sudden, when one of the websites provides a larger XML file (because it contains more data), Apify cannot process it anymore. I tried allowing more time and more memory and everything I could think of, even waiting 40 minutes, but nothing helps. For other, smaller XML files it still works fine. How can I debug this, and do you know what could fix my problem?


Hi, and sorry for the wait. We were looking into this problem, but unfortunately we haven't found a way to make it work with Web Scraper. The timeouts you see in the logs should be fairly clear: the work done for each request takes too long. Those timeouts are configurable; Web Scraper has two options for this, one for navigation (making the request and fetching the data) called pageLoadTimeoutSecs, and one for processing (your page function) called pageFunctionTimeoutSecs. In your case the problem is in the first part, which fails to load (and parse) such a huge file within the 60-second limit. If you increase pageLoadTimeoutSecs to 300 seconds (5 minutes), you get past the error you see now, only to hit another one saying "Page crashed!". That one is what I was trying to deal with yesterday, and unfortunately I haven't succeeded.
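For reference, raising those limits comes down to two fields in the Web Scraper JSON input. A minimal sketch of the relevant part (the start URL here is just a placeholder):

{
    "startUrls": [{ "url": "https://example.com/feed.xml" }],
    "pageLoadTimeoutSecs": 300,
    "pageFunctionTimeoutSecs": 300
}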

With that said, the problem comes from trying to use a browser for this task (and the browser has its own limits). I tried to run the same code with Cheerio Scraper and it works fine there, since it just downloads the file directly and no browser is involved. Would that be a solution for you?
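For a bit of context on why this works: Cheerio Scraper hands your page function a Cheerio handle that parses the downloaded XML directly, with no page rendering involved. A standalone sketch, independent of any actor run:

// Cheerio parses an XML string directly; there is no page to render.
const cheerio = require('cheerio');

const xml = '<properties><property><reference>R1</reference></property></properties>';
const $ = cheerio.load(xml, { xmlMode: true });

$('properties property').each(function (i, el) {
    console.log($(el).find('reference').text()); // prints "R1"
});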

Here is a run link for the Cheerio version; I only had to make a few minor changes to the code:

https://console.apify.com/view/runs/DsWobimMCAY1F9Xg2

I guess you won't be able to see the input of my run, so here is the request handler code. I only had to change a few small bits, like getting the jQuery handle directly from the context and using context.request.loadedUrl instead of document.location (a short side-by-side sketch of those two changes follows the code).

pageFunction: async function pageFunction(context) {
    const languageList = ["EN-GB", "ES-ES", "DE-DE"];
    let returnList = [];
    const $ = context.$;
    let isOrigin = context.request.loadedUrl.includes('abc-mallorca');
    if (isOrigin) {
        // The origin page only holds the XML feed URL in its body; enqueue it and return.
        let xmlLink = $('body').first().text();
        await context.enqueueRequest({ url: xmlLink });
        return [];
    } else {
        $('properties property').each(function(key, item) {
            let $property = $(item);
            languageList.forEach(function(lang) {
                let propertyData = mapPropertyData(lang, $property);
                // for debugging, uncomment below
                // console.log(propertyData);
                returnList.push(propertyData);
            });
        });
    }

    // This is where we need to map the data to our new object.
    function mapPropertyData(lang, $property) {
        let propertyData = {};
        propertyData.lang = lang.substr(0, 2).toLowerCase();
        propertyData.reference = $property.find('reference').text();
        propertyData.location = $property.find('location > civilparish').text() || "Mallorca";
        propertyData.type = $property.find('type').text() || "Property"; // Apartment, Land or Villa?
        propertyData.price = $property.find('price > Value').text() || "";
        propertyData.area = $property.find('areas > plot').text();
        // propertyData.terrace = $property.find('terracesize').text();
        propertyData.constructedarea = $property.find('areas > used').text(); // or switch
        propertyData.bedrooms = $property.find("features > feature[name='Bedrooms']").text();
        propertyData.bathrooms = $property.find("features > feature[name='Bathrooms']").text();
        propertyData.year = $property.find('year').text().replace(/0/, "");
        propertyData.renovated = $property.find('condition > entry').text();
        propertyData.frequency = "";
        propertyData.features = $property.find("features > feature[type='BOOL']").map(function() {
            return $(this).attr('name');
        }).toArray().filter(e => e);
        propertyData.energycert = $property.find('energy').text();

        propertyData.shortdesc = $property.find("title > entry[language='" + lang + "']").text();
        var refer = "#ref:" + propertyData.reference;
        propertyData.longdesc = $property.find("description > entry[language='" + lang + "']").text().replace(refer, "");

        propertyData.images = $property.find("media url").map(function() {
            return $(this).text();
        }).toArray().filter(e => e);

        let itemWrapper = {};
        itemWrapper.id = propertyData.reference;
        itemWrapper.url = $property.find('> url').text();
        itemWrapper.pageFunctionResult = propertyData;
        return itemWrapper;
    }

    context.log.info("THIS IS THE returnList with a size of: " + returnList.length + " elements");
    return returnList;
},
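For comparison, the two changes mentioned above look roughly like this side by side (a sketch; the browser-side version is my assumption of what the original Web Scraper code did, based on the description above):

// Browser-based Web Scraper page function (assumed original shape):
async function browserPageFunction(context) {
    const $ = jQuery;                                                  // jQuery injected into the page by Web Scraper
    const isOrigin = document.location.href.includes('abc-mallorca');
    // ... same mapping logic as in the code above ...
}

// Cheerio Scraper page function (matching the code above):
async function cheerioPageFunction(context) {
    const $ = context.$;                                               // Cheerio handle provided via the context
    const isOrigin = context.request.loadedUrl.includes('abc-mallorca');
    // ... same mapping logic as in the code above ...
}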

Thank you, the solution worked perfectly. The other actor works better for getting the XML content.


Closing due to inactivity (and also because it seems solved :))

Developer
Maintained by Apify
Actor metrics
  • 3.7k monthly users
  • 98.8% runs succeeded
  • 3.6 days response time
  • Created in Mar 2019
  • Modified about 1 month ago