Web Scraper
Crawls arbitrary websites using the Chrome browser and extracts structured data from web pages using a provided JavaScript function. The Actor supports both recursive crawling and lists of URLs, and automatically manages concurrency for maximum performance.
Can't get pagination working.
I tried without a timeout, with fewer pages, and with glob patterns instead of pseudo-URLs. I was basically following the docs, but it scrapes only the first page of the website, although it finds the button it should click.
Hi,
Thanks for your question! The issue is that links are being enqueued before the new page content has fully loaded. To fix this, you must wait for the new content to appear before proceeding. You can do this by adding this line:
await waitFor(`.search-result-article:nth-child(${i*10 + 1})`, { timeoutMillis });
at the end of your loop. Altogether, your Page Function (or at least the inner loop) should look something like this:
for (let i = 1; i < 10; i++) {
    log.info('Waiting for the "Mehr laden" button.');
    try {
        // Default timeout first time.
        await waitFor('input[value="Mehr laden"]', { timeoutMillis });
    } catch (err) {
        // Ignore the timeout error.
        log.info('Could not find the "Mehr laden", '
            + 'we\'ve reached the end.');
        break;
    }
    log.info('Clicking the "Mehr laden" button.');
    $(buttonSelector).click();

    await waitFor(`.search-result-article:nth-child(${i*10 + 1})`, { timeoutMillis });
}
The :nth-child CSS pseudo-class ensures that search result number i*10 + 1 (the 11th, 21st, 31st, ...) is loaded before you click the button for the second, third, ... time.
I'll close this issue now, but feel free to ask additional questions if you have any. Cheers!
quarterly_quicklime
Thanks for your quick reply. I tested it already but it won't work.
Sorry, I forgot to mention this. This is happening because the timeout you're setting in the Page Function is too short.
In my Run, I updated the code to timeoutMillis = 3000 (3 seconds), which makes the Actor work as expected. See my run here - I aborted it prematurely, but you can see that the Actor enqueues 101 requests.
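To illustrate why a too-short timeout makes the wait fail, here is a minimal sketch of a polling wait with a deadline. It is not Web Scraper's actual waitFor implementation; pollUntil and its check callback are hypothetical names used only for this example, and the option names simply mirror the timeoutMillis used above.

```javascript
// Illustrative stand-in for a waitFor-style helper (an assumption, NOT the
// Actor's own code). Polls `check` until it returns a truthy value or until
// `timeoutMillis` has elapsed, in which case it throws.
async function pollUntil(check, { timeoutMillis = 3000, pollingIntervalMillis = 50 } = {}) {
    const deadline = Date.now() + timeoutMillis;
    while (true) {
        if (check()) return;
        if (Date.now() >= deadline) {
            throw new Error(`Timed out after ${timeoutMillis} ms.`);
        }
        // Sleep for one polling interval before checking again.
        await new Promise((resolve) => setTimeout(resolve, pollingIntervalMillis));
    }
}
```

If the new results take longer to render than the timeout allows, the wait throws even though the content would have appeared a moment later, which is why raising the timeout helps.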
Feel free to shoot us a message if anything else is unclear. Cheers!
quarterly_quicklime
Thank you, now all crystal clear. You are the MAN.
quarterly_quicklime
Can I ask you a question?
From the URL above, why does the scraper scrape only 111 results when there should be 250? And even of those 111 results, it always leaves 3-4 out; I don't understand why.
Thank you if you have time for me.
I think this is caused by the condition in your for loop - see:
...
for (i=1;i<11;i++) {
...
If each button press (one per loop iteration) loads only 10 results, this code can load at most 11 * 10 = 110 results: the 10 shown initially plus 10 for each of the 10 iterations. Try raising this limit, or replace the for loop with a while loop that checks whether the button still exists. You might also want to increase the Page Function timeout so that the Actor manages to enqueue all the pages before it hits the timeout.
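The arithmetic can be checked with a small sketch. It assumes the page shows 10 results initially and adds 10 per click (inferred from the selector used earlier in this thread, not confirmed for the site); both function names are hypothetical.

```javascript
// Assumptions: 10 results are visible before any click, and each click on
// the "Mehr laden" button loads 10 more. Adjust the defaults if that differs.
function resultsAfterClicks(clicks, perClick = 10, initial = 10) {
    return initial + clicks * perClick;
}

// Number of clicks needed to reveal `total` results under the same assumptions.
function clicksNeeded(total, perClick = 10, initial = 10) {
    return Math.max(0, Math.ceil((total - initial) / perClick));
}

console.log(resultsAfterClicks(10)); // 110: what a loop with 10 iterations can reach
console.log(clicksNeeded(250));      // 24: clicks required for 250 results
```

Under these assumptions the loop needs at least 24 iterations for 250 results, so any limit above that works.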
See my example run with a higher search results limit (30 iterations - this is just an arbitrary number higher than 25) and a longer Page Function timeout:
https://console.apify.com/view/runs/jPsIk8B85zdBwBzdB
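A while-loop variant of the suggestion above could look roughly like this. It is a sketch, not the code from the linked run: loadAllResults, clickButton, perClick and maxClicks are hypothetical names, and waitFor and log are passed in so the function does not depend on any globals. The selectors are the ones used earlier in this thread.

```javascript
// Sketch: keep clicking "Mehr laden" until the button no longer appears.
// `waitFor` is expected to reject on timeout, as in the loop shown earlier.
async function loadAllResults(
    { waitFor, log, clickButton },
    { perClick = 10, maxClicks = 100, timeoutMillis = 3000 } = {},
) {
    let clicks = 0;
    // maxClicks is only a safety cap against an endless loop.
    while (clicks < maxClicks) {
        try {
            await waitFor('input[value="Mehr laden"]', { timeoutMillis });
        } catch (err) {
            log.info('No "Mehr laden" button found, assuming the end of the list.');
            break;
        }
        clickButton();
        clicks++;
        try {
            // Wait for the first result of the newly loaded batch.
            await waitFor(`.search-result-article:nth-child(${clicks * perClick + 1})`, { timeoutMillis });
        } catch (err) {
            log.info('No new results appeared after the click, stopping.');
            break;
        }
    }
    return clicks;
}
```

Inside a Page Function this might be called as loadAllResults({ waitFor, log, clickButton: () => $(buttonSelector).click() }), assuming those names are in scope as in the loop above.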
"And even from 111 results it leaves 3-4 always out, don't understand why."
I'm not sure about this part; I don't see it in my run.
Hope this helped. Feel free to shoot us another message if anything else is unclear. Cheers!
quarterly_quicklime
You are so right. I should have known that, sorry.
Thank you for your patience.
quarterly_quicklime
In this log I reproduced the error where 21 sites were not crawled at the end.
https://console.apify.com/actors/tasks/dGiZgjaGdyya7SYrm/runs/YHKnPmU6E4KRSxm5N#log
Actor Metrics
3.3k monthly users
456 bookmarks
>99% runs succeeded
4.8 days response time
Created in Mar 2019
Modified a month ago