Puppeteer Scraper

apify/puppeteer-scraper

Crawls websites with headless Chrome and the Puppeteer library, using server-side Node.js code that you provide. This crawler is an alternative to apify/web-scraper that gives you finer control over the process. It supports both recursive crawling and a list of URLs, and it can log in to websites.

Ignore URLs with certain query strings

Open

linen_torch opened this issue
9 months ago

The site I want to scrape is available in multiple languages, each denoted by the query '?=hl'. Can I get the crawler to ignore these URLs?

adamek commented

By ignoring them, do you mean you want to skip enqueuing such URLs if the same page was already processed? Are you sure the URLs are otherwise the same? If so, you could trim the query parameter manually and provide uniqueKey explicitly when adding new requests to the queue. You would need to add the requests yourself, inside your page function, and disable the automatic enqueueing, e.g. by setting the selector option to some gibberish that matches nothing.
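
The uniqueKey approach could look roughly like the sketch below. It assumes the language is selected by a query parameter named `hl`, that the page function's context exposes `page`, `request` and `enqueueRequest` as in the Puppeteer Scraper README, and that link discovery through a plain `a[href]` selector is good enough for the site. `stripQueryParam` and `buildRequests` are hypothetical helper names, not part of any Apify API.

```javascript
// Assumption: the language variant is chosen by a query parameter named `hl`.
const IGNORED_PARAM = 'hl';

// Hypothetical helper: returns rawUrl with one query parameter removed.
function stripQueryParam(rawUrl, paramName) {
    const url = new URL(rawUrl);
    url.searchParams.delete(paramName);
    return url.toString();
}

// Hypothetical helper: turns absolute link URLs into request objects whose
// uniqueKey ignores the language parameter. Non-HTTP links are dropped.
function buildRequests(hrefs) {
    return hrefs
        .filter((href) => /^https?:\/\//i.test(href))
        .map((href) => ({
            url: href,
            uniqueKey: stripQueryParam(href, IGNORED_PARAM),
        }));
}

// Sketch of a page function that enqueues links manually. In the Actor input
// the helpers above would have to live inside this function's body.
async function pageFunction(context) {
    // Collect absolute URLs of all links on the current page.
    const hrefs = await context.page.$$eval('a[href]', (links) => links.map((a) => a.href));
    for (const request of buildRequests(hrefs)) {
        // Requests sharing a uniqueKey are treated as the same page by the queue.
        await context.enqueueRequest(request);
    }
    return { url: context.request.url };
}
```

With this in place, `https://example.com/page?hl=de` and `https://example.com/page?hl=fr` share one uniqueKey, so only the first variant that reaches the queue is crawled. Which language that is depends on link order, so this only makes sense if the variants are interchangeable for your purposes.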

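Alternatively, you could let the crawler enqueue these URLs as usual and skip the pages inside the request handler (the page function), returning early when the URL contains the query parameter.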
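
A minimal sketch of skipping inside the page function, again assuming the parameter is named `hl`. `hasQueryParam` is a hypothetical helper, and the call to `context.skipLinks()` assumes that method is available on the context, as described in the Puppeteer Scraper README.

```javascript
// Hypothetical helper: true if rawUrl carries the given query parameter,
// with or without a value.
function hasQueryParam(rawUrl, paramName) {
    return new URL(rawUrl).searchParams.has(paramName);
}

// Sketch of a page function that ignores language variants.
async function pageFunction(context) {
    // Assumption: language variants are marked by an `hl` query parameter.
    if (hasQueryParam(context.request.url, 'hl')) {
        // Assumed context method: stops automatic link enqueueing for this page.
        context.skipLinks();
        return null;
    }
    // Normal processing for pages without the parameter.
    return {
        url: context.request.url,
        title: await context.page.title(),
    };
}
```

Note that the page function runs after the browser has already loaded the page, so this variant saves processing and keeps the results clean but does not save the request itself.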

Developer
Maintained by Apify

Actor Metrics

  • 369 monthly users

  • 67 stars

  • >99% runs succeeded

  • 22 days response time

  • Created in Apr 2019

  • Modified 6 months ago