Web Scraper

apify/web-scraper

Crawls arbitrary websites using the Chrome browser and extracts data from pages using a provided JavaScript code. The actor supports both recursive crawling and lists of URLs and automatically manages concurrency for maximum performance. This is Apify's basic tool for web crawling and scraping.

Crawl is stopping after 40 listings with no error message

Closed

ColabReggie opened this issue
4 months ago

The crawl stops after 40 listings with no error message, even though there are many more (over 800). I don't understand the issue. There are only 10 listings per page, so if it were a pagination issue, I would expect it to pull only those 10. And if it were an issue with the code, I would expect it to pull none, or at least to report an error.

Hello @ColabReggie and thank you for your interest in this Actor!

It is a pagination issue - the Actor never visits any index page other than the first one. All the additional (~30) listing results are being enqueued from the "You May Also Be Interested In" sections of the (first 10) listings (see e.g. https://medspa.com/listing/removery-denver and scroll all the way down).

There are multiple ways of scraping such websites with this Actor. I'll present the one I consider the easiest and most flexible:

Note: If you're in a hurry, I fixed your code in this run - feel free to copy the input :)

  • Remove the Link Selector (leave the field blank). Because of different page types, we'll handle the link enqueueing ourselves.
  • For the first start URL, click the button that says Advanced and add {"label": "START"} to the user data field. This way, we'll be able to tell the index page apart from the actual listings.
  • In your page function, you can do something like:
async function pageFunction(context) {
    const { url, userData: { label } } = context.request;
    const $ = context.jQuery;
    const log = context.log.info;

    if (label === 'START') {
        log('Scraping the index page on ' + url);
        // Find and enqueue all the listings from the current page
        const listingsOnThisPage = $('.lf-item > a')
            .map((_, element) => $(element).attr('href'))
            .get(); // convert the jQuery collection to a plain array

        for (const listingUrl of listingsOnThisPage) {
            await context.enqueueRequest({
                url: listingUrl,
                label: 'LISTING',
            });
        }

        log('Added ' + listingsOnThisPage.length + ' listings');

        // Enqueue the next "index" page, if there is one
        // (the last index page has no rel=next link)
        const nextIndexPageUrl = $('[rel=next]').first().attr('href');

        if (nextIndexPageUrl) {
            await context.enqueueRequest({
                url: nextIndexPageUrl,
                label: 'START',
            });
            log('The next index page is on ' + nextIndexPageUrl);
        }
    } else if (label === 'LISTING') {
        log('Scraping a listing page on ' + url);
        // ... your data extraction code
    }
}

As you can see, for pages labeled START (you can rename this), we enqueue the listings and the next index page. For pages labeled LISTING, we extract data without enqueueing anything.
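If you want to check the enqueueing decisions outside the browser, the logic for an index page can be pulled into a small pure function. The following is only a sketch: `planIndexPageRequests` is a hypothetical helper name, the URLs are made up, and resolving relative links with `new URL(href, base)` is an assumption about what the site's `href` values might look like.

```javascript
// Hypothetical helper: decides what an index ("START") page should enqueue.
// `pageUrl` is the URL of the current page; `listingHrefs` and `nextHref`
// are the raw href values collected from it (possibly relative or missing).
function planIndexPageRequests(pageUrl, listingHrefs, nextHref) {
    const toAbsolute = (href) => new URL(href, pageUrl).href;

    // Drop empty hrefs and duplicates so the same listing is not planned twice.
    const listingUrls = [...new Set(listingHrefs.filter(Boolean).map(toAbsolute))];
    const requests = listingUrls.map((url) => ({ url, label: 'LISTING' }));

    // The last index page has no "next" link, so only add it when present.
    if (nextHref) {
        requests.push({ url: toAbsolute(nextHref), label: 'START' });
    }
    return requests;
}

// Example with made-up URLs:
const planned = planIndexPageRequests(
    'https://example.com/listings?page=2',
    ['/listing/a', '/listing/b', '/listing/a'],
    '/listings?page=3',
);
console.log(planned);
```

Inside the page function, each returned object could then be passed to `context.enqueueRequest`, which keeps the selectors and the routing rules separate.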

You can check out the actual code in my example run here (https://console.apify.com/view/runs/Ew514jolIS8Jk9gxO) - I aborted it prematurely, but it seems to be working just fine.
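For reference, the two input changes from the list above (blank Link Selector, labeled start URL) might look roughly like this when written as an input object. This is a sketch, not a copy of the example run: the start URL is a placeholder, and the field names `startUrls`, `linkSelector` and `userData` are how I understand this Actor's input, so please verify them against the Input tab.

```javascript
// Sketch of the Actor input described above. The start URL is a placeholder.
const input = {
    startUrls: [
        {
            url: 'https://example.com/listings', // hypothetical index page URL
            userData: { label: 'START' },        // marks this as an index page
        },
    ],
    linkSelector: '', // left blank: the page function enqueues links itself
    // pageFunction: '...' // the page function source goes here as a string
};

console.log(JSON.stringify(input, null, 2));
```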

I'll close this issue now, but feel free to reopen it if you have any additional questions regarding this Actor. Thanks!

Developer
Maintained by Apify
Actor metrics
  • 3.4k monthly users
  • 99.9% runs succeeded
  • 3.2 days response time
  • Created in Mar 2019
  • Modified about 2 months ago