Web Scraper

Developed and maintained by Apify

Crawls arbitrary websites using a web browser and extracts structured data from web pages using a provided JavaScript function. The Actor supports both recursive crawling and lists of URLs, and automatically manages concurrency for maximum performance.

Rating: 4.5 (23)
Pricing: Pay per usage

Total users: 88K
Monthly users: 4.5K
Runs succeeded: >99%
Issues response: 9 days
Last modified: a month ago

Crawl is stopping after 40 listings with no error message

Closed

ColabReggie opened this issue a year ago

The crawl is stopping after 40 listings with no error message, even though there are many more (over 800). I don't understand the issue. There are only 10 listings per page, so if it were a pagination issue, I would expect it to pull only those 10. And if it were an issue with the code, it wouldn't pull anything, or at least it would show an error.

jindrich.bar replied:

Hello @ColabReggie and thank you for your interest in this Actor!

It is a pagination issue - the Actor never visits any index page other than the first one. All the additional (~30) listing results are being enqueued from the "You May Also Be Interested In" sections of the (first 10) listings (see e.g. https://medspa.com/listing/removery-denver and scroll all the way down).
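
For illustration only (the original input isn't shown in this thread, so the selector and URL pattern below are assumptions), a broad Link selector combined with a listing Pseudo-URL effectively does something like this on every visited page, which is why listings linked from those recommendation widgets keep getting enqueued even though the crawler never reaches a second index page:

// Rough sketch of what a broad Link selector plus a listing Pseudo-URL
// produces on every visited page (selector and URL pattern are assumptions):
async function enqueueMatchingLinks(context) {
    const $ = context.jQuery;
    for (const element of $('a[href]').toArray()) {
        const href = $(element).attr('href');
        if (href && href.startsWith('https://medspa.com/listing/')) {
            // Links in the "You May Also Be Interested In" section match this
            // pattern too, so each detail page enqueues more listings.
            await context.enqueueRequest({ url: href });
        }
    }
}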

There are multiple ways of scraping such websites with this Actor; I'll present the one I consider the easiest and most flexible:

Note: If you're in a hurry, I fixed your code in this run - feel free to copy the input :)

  • Remove the Link Selector (leave the field blank). Because of different page types, we'll handle the link enqueueing ourselves.
  • For the first start URL, click the button that says Advanced and add {"label": "START"} to the user data field. This way, we'll be able to tell the index page apart from the actual listings.
  • In your page function, you can do something like:
async function pageFunction(context) {
    const { url, userData: { label } } = context.request;
    const $ = context.jQuery;
    const log = context.log.info;

    if (label === 'START') {
        log('Scraping the index page on ' + url);
        // Find and enqueue all the listings from the current page
        const listingsOnThisPage = $('.lf-item > a').map((_, element) => $(element).attr('href'));
        for (let listingUrl of ... [trimmed]
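
The rest of the page function is trimmed in the quote above. Purely as an illustration of where it is heading, a completed version might look roughly like the sketch below; the 'LISTING' label, the 'a.next' pagination selector, and the extracted fields are placeholders of mine, not the values from the fixed run mentioned above:

// Illustrative completion only - the label names and selectors are assumptions.
async function pageFunction(context) {
    const { url, userData: { label } } = context.request;
    const $ = context.jQuery;
    const log = context.log.info;

    if (label === 'START') {
        log('Scraping the index page on ' + url);

        // Enqueue every listing found on the current index page.
        const listingsOnThisPage = $('.lf-item > a')
            .map((_, element) => $(element).attr('href'))
            .get();
        for (const listingUrl of listingsOnThisPage) {
            await context.enqueueRequest({
                url: listingUrl,
                userData: { label: 'LISTING' }, // hypothetical label for detail pages
            });
        }

        // Enqueue the next index page so the crawl continues past page 1.
        // 'a.next' is a placeholder - use the site's real pagination selector.
        const nextPageUrl = $('a.next').attr('href');
        if (nextPageUrl) {
            await context.enqueueRequest({ url: nextPageUrl, userData: { label: 'START' } });
        }
        return;
    }

    // Every other request is a listing detail page - return the fields you need.
    // These selectors are placeholders as well.
    log('Scraping a listing on ' + url);
    return {
        url,
        name: $('h1').first().text().trim(),
    };
}

With the START-labelled start URL from the step above, the index branch keeps enqueueing both listings and further index pages, while every other request falls through to the detail-page branch and produces one dataset item.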