Web Scraper

apify/web-scraper

Crawls arbitrary websites using the Chrome browser and extracts structured data from web pages using a provided JavaScript function. The Actor supports both recursive crawling and lists of URLs, and automatically manages concurrency for maximum performance.

Can't get pagination working.

Closed
quarterly_quicklime opened this issue
25 days ago

I tried it without a timeout, with fewer pages, and with glob URLs instead of pseudo-URLs. I was basically following the docs, but it scrapes only the first page of the website, although it does find the button it should click.

jindrich.bar

Hi,

Thanks for your question! The issue is that links are being enqueued before the new page content has fully loaded. To fix this, you must wait for the new content to appear before proceeding. You can do this by adding this line:

await waitFor(`.search-result-article:nth-child(${i*10 + 1})`, { timeoutMillis });

at the end of your loop. Altogether, your Page Function (or at least the inner loop) should look something like this:

for (let i = 1; i < 10; i++) {
    log.info('Waiting for the "Mehr laden" button.');
    try {
        // Default timeout first time.
        await waitFor('input[value="Mehr laden"]', { timeoutMillis });
    } catch (err) {
        // Ignore the timeout error.
        log.info('Could not find the "Mehr laden", '
            + 'we\'ve reached the end.');
        break;
    }
    log.info('Clicking the "Mehr laden" button.');
    $(buttonSelector).click();

    await waitFor(`.search-result-article:nth-child(${i*10 + 1})`, { timeoutMillis });
}

The :nth-child CSS pseudo-class makes the loop wait until the (i*10 + 1)-th search result (so the 11th, 21st, 31st, ...) is loaded before you click the button for the second, third, ... time.
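To make the selector arithmetic concrete, here is a small sketch of how the selector changes per click. The helper name firstResultOfNextBatch is made up for illustration, and it assumes each click loads exactly 10 results and that the .search-result-article elements are the only children of their container (:nth-child counts all siblings, not only matching ones).

```javascript
// Hypothetical helper (not part of the Web Scraper API): builds the selector for
// the first result that the given click is expected to load.
// Assumption: every click appends `batchSize` results to the same container.
function firstResultOfNextBatch(clickNumber, batchSize = 10) {
    return `.search-result-article:nth-child(${clickNumber * batchSize + 1})`;
}

// Selectors to wait for after the 1st, 2nd and 3rd click.
const selectors = [1, 2, 3].map((clickNumber) => firstResultOfNextBatch(clickNumber));
console.log(selectors);
```

If the page loads a different number of results per click, the batch size needs to be adjusted accordingly.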

I'll close this issue now, but feel free to ask additional questions if you have any. Cheers!

quarterly_quicklime

25 days ago

Thanks for your quick reply. I already tested it, but it doesn't work.

https://console.apify.com/actors/runs/2LpcOC0aIFHE3skqG#log

jindrich.bar

Sorry, I forgot to mention this. This is happening because the timeout you're setting in the Page Function is too short.

In my Run, I updated the code to timeoutMillis = 3000 (3 seconds), which makes the Actor work as expected. See my run here - I aborted it prematurely, but you can see that the Actor enqueues 101 requests.

Feel free to shoot us a message if anything else is unclear. Cheers!

quarterly_quicklime

25 days ago

Thank you, now it's all crystal clear. You are the MAN.

jindrich.bar

I think this is caused by the condition in your for loop - see:

...
for (i=1;i<11;i++) {
...

If each button press (one per loop iteration) loads only 10 results, this code can load at most 11 * 10 = 110 results: the 10 results shown initially plus 10 more for each of the 10 iterations. Try bumping this limit to a higher number, or replace the for loop with a while loop (checking for the existence of the button) or similar. You might also want to increase the Page Function timeout so that the Actor manages to enqueue all the pages before it hits the timeout.

See my example run with a higher search results limit (30 iterations - this is just an arbitrary number higher than 25) and a longer Page Function timeout:

https://console.apify.com/view/runs/jPsIk8B85zdBwBzdB
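For illustration, the while-loop variant could look roughly like the sketch below. This is not the code from the run above. The function name clickUntilDone and the fakePage stand-in are hypothetical; waitFor and click are passed in so the loop can be tried without a browser. In a real Page Function they would be context.waitFor and a function calling $(buttonSelector).click(). It again assumes 10 results per click.

```javascript
// Sketch: keep clicking until the button no longer shows up within the timeout.
async function clickUntilDone({ waitFor, click, buttonSelector, timeoutMillis, batchSize = 10 }) {
    let clicks = 0;
    while (true) {
        try {
            await waitFor(buttonSelector, { timeoutMillis });
        } catch (err) {
            // No button within the timeout: assume all results are loaded.
            break;
        }
        click();
        clicks += 1;
        // Wait for the first result of the freshly loaded batch before continuing.
        await waitFor(`.search-result-article:nth-child(${clicks * batchSize + 1})`, { timeoutMillis });
    }
    return clicks;
}

// Hypothetical in-memory stand-in for the page, used only to exercise the loop.
function fakePage(totalResults, batchSize = 10) {
    let loaded = Math.min(batchSize, totalResults);
    return {
        loadedCount: () => loaded,
        click: () => { loaded = Math.min(loaded + batchSize, totalResults); },
        waitFor: async (selector) => {
            const match = selector.match(/nth-child\((\d+)\)/);
            // A result is "present" once it is loaded; the button is "present"
            // while more results remain.
            const present = match ? Number(match[1]) <= loaded : loaded < totalResults;
            if (!present) throw new Error(`Timed out waiting for ${selector}`);
        },
    };
}

const page = fakePage(250);
const demo = clickUntilDone({
    waitFor: page.waitFor,
    click: page.click,
    buttonSelector: 'input[value="Mehr laden"]',
    timeoutMillis: 3000,
}).then((clicks) => ({ clicks, loaded: page.loadedCount() }));
demo.then((result) => console.log(result));
```

Unlike the fixed for loop, this version has no hard-coded upper bound, so the Page Function timeout becomes the only limit on how many results get loaded.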

"And even from 111 results it leaves 3-4 always out, don't understand why."

I'm not sure about this part; I don't see it happening in my run.

Hope this helped. Feel free to shoot us another message if anything else is unclear. Cheers!

quarterly_quicklime

10 days ago

You are so right, I should have known that, sorry.

Thank you for your patience.

quarterly_quicklime

10 days ago

In this log, I reproduced the error where 21 sites were not crawled at the end.

https://console.apify.com/actors/tasks/dGiZgjaGdyya7SYrm/runs/YHKnPmU6E4KRSxm5N#log
