Playwright Scraper

Pricing

Pay per usage

Developed by

Apify

Maintained by Apify

Crawls websites with a headless Chromium, Chrome, or Firefox browser and the Playwright library, using provided server-side Node.js code. Supports both recursive crawling and a list of URLs, as well as logging in to a website.

Enqueued links not processed

Closed

trivo opened this issue
11 days ago

We've had several cases, on different websites, where only the homepage (the start URL) is scraped: links get enqueued, but they are never followed or processed, and the Actor stops once it is done with the homepage.

There are no errors or warnings in the logs.

Here are some run IDs where it happened:

  • DlbeLbxFkz3lpwGi4
  • wIOBtIvVin5ntFkG8
  • rRp1RWW1A7vQ8vhWc

jindrich.bar

Hello, and thank you for your interest in this Actor!

A large part of what you implemented in your Page function is already built into Playwright Scraper (or Crawlee).

The following snippet is equivalent to your implementation with transformRequestFunction:

await enqueueLinks({
    selector: 'a',
    strategy: 'same-domain',
    exclude: [
        /\.(docx?|pdf|webp|jpe?g|gif|png|php|asp)$/i,
        /blog|archive|arhiv/i,
    ],
});
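
To see what the two exclude patterns in the snippet above actually filter out, here is a small standalone check. It is only a sketch: the helper isExcluded and the example.com URLs are hypothetical, and it assumes each pattern is tested against the full URL string.

```javascript
// The two exclude patterns from the snippet above.
const exclude = [
    /\.(docx?|pdf|webp|jpe?g|gif|png|php|asp)$/i,
    /blog|archive|arhiv/i,
];

// Hypothetical helper: true when any pattern matches the URL string.
const isExcluded = (url) => exclude.some((pattern) => pattern.test(url));

// Hypothetical sample URLs, for illustration only.
const samples = [
    'https://example.com/about',
    'https://example.com/files/report.PDF',
    'https://example.com/files/report.pdf?download=1',
    'https://example.com/blog/post-1',
    'https://example.com/index.php',
];

// Map each sample URL to whether it would be excluded.
const results = Object.fromEntries(samples.map((url) => [url, isExcluded(url)]));
console.log(results);
```

Note that the first pattern is anchored with $, so it only matches when the extension is the very end of the URL. A link such as report.pdf?download=1 is not excluded by it, while the second pattern matches anywhere in the URL.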

If you want to stay with your implementation, you absolutely can. The issue is that, by default, Crawlee uses strategy: 'same-hostname', which matches 0 links on the first page, so the Actor finishes early. You can pass strategy: 'all' to enqueueLinks so that Crawlee doesn't filter the links prematurely and instead passes all of them to your transform function:

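await enqueueLinks({
    selector: 'a',
    strategy: 'all',
    transformRequestFunction: (req) => {
        // your transformRequestFunction logic goes here;
        // return the request to enqueue it, or a falsy value to skip it
        return req;
    },
});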
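
As an illustration of why a hostname-based filter can match nothing while a domain-based one still works, here is a minimal standalone sketch. It is not Crawlee's implementation. The helpers, the example.com URLs, and the www-versus-apex scenario are all assumptions made for illustration, since the thread does not say why the links were filtered in these runs. The registrable domain is approximated by the last two hostname labels, which is wrong for suffixes such as .co.uk.

```javascript
// Simplified stand-ins for hostname-based and domain-based link checks.
// Assumption: "domain" is approximated by the last two hostname labels.
const hostnameOf = (url) => new URL(url).hostname;
const domainOf = (url) => hostnameOf(url).split('.').slice(-2).join('.');

const sameHostname = (a, b) => hostnameOf(a) === hostnameOf(b);
const sameDomain = (a, b) => domainOf(a) === domainOf(b);

// Exclude patterns taken from the earlier snippet in this thread.
const exclude = [
    /\.(docx?|pdf|webp|jpe?g|gif|png|php|asp)$/i,
    /blog|archive|arhiv/i,
];

// Hypothetical transform: keep same-domain links that match no exclude
// pattern; return false (a falsy value) to skip everything else.
function makeTransform(startUrl) {
    return (req) => {
        if (!sameDomain(startUrl, req.url)) return false;
        if (exclude.some((pattern) => pattern.test(req.url))) return false;
        return req;
    };
}

// Hypothetical scenario: the start URL is on the apex domain, but the
// links found on the page point to the www subdomain or elsewhere.
const startUrl = 'https://example.com/';
const links = [
    'https://www.example.com/contact',
    'https://www.example.com/blog/news',
    'https://other.org/',
];

// A strict hostname comparison keeps none of these links.
const matchedBySameHostname = links.filter((url) => sameHostname(startUrl, url));

// The transform keeps only the same-domain, non-excluded link.
const transform = makeTransform(startUrl);
const kept = links
    .map((url) => ({ url }))
    .filter((req) => transform(req))
    .map((req) => req.url);

console.log(matchedBySameHostname);
console.log(kept);
```

In this scenario the hostname comparison keeps no links at all, whereas the transform keeps the contact page. With strategy: 'all', a transform function along these lines becomes the only filter, so it has to reject off-site links itself.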

I'll close this issue now, but feel free to ask additional questions if you have any. Cheers!