Playwright Scraper

Pricing

Pay per usage

Developed by

Apify

Maintained by Apify

Crawls websites with a headless Chromium, Chrome, or Firefox browser and the Playwright library, using provided server-side Node.js code. Supports both recursive crawling and a list of URLs. Supports logging in to a website.

4.3 (7)

Total users: 1.4k
Monthly users: 233
Runs succeeded: 99%
Issue response: 8.9 days
Last modified: 24 days ago

When trying to scrape a sitemap.xml - getting back a "document.body is null" error

Closed

oren_clearya opened this issue 7 months ago

When running the Playwright Scraper Actor with a start URL that points to a sitemap XML, I get the following error:

2024-10-19T14:54:41.339Z DEBUG PlaywrightCrawler:AutoscaledPool: scaling up {"oldConcurrency":2,"newConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2024-10-19T14:54:49.152Z WARN PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.evaluate: document.body is null
2024-10-19T14:54:49.153Z @debugger eval code line 226 > eval:1:7
2024-10-19T14:54:49.154Z evaluate@debugger eval code:228:17
2024-10-19T14:54:49.155Z @debugger eval code:1:44
2024-10-19T14:54:49.155Z
2024-10-19T14:54:49.156Z at CrawlerSetup._requestHandler (/home/myuser/dist/internals/crawler_setup.js:379:35) {"id":"vwv0onJJ2YlCPdo","url":"https://apify.com/sitemap.xml","retryCount":1}

It seems to fail before reaching the page function itself. However, here is the pageFunction that was used:

async function pageFunction(context) {
    const { page, request, log } = context;
    async function pageEvaluate(context) {
        return {
            url: document.URL,
            html: document.body?.innerHTML ?? document.querySelector('urlset')?.innerHTML,
        };
    }
    let data = await page.evaluate(pageEva... [trimmed]
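
The null-safe fallback used in the page function above can be sketched as a standalone helper. This is only an illustration: `extractMarkup` is a hypothetical name, the mock objects stand in for real browser documents so the snippet runs in plain Node.js, and the final fallback to `documentElement` is an assumption that is not in the original code. Since the error appears to be raised before the page function runs, this sketch may not avoid it.

```javascript
// Hypothetical helper (not part of Playwright Scraper): returns the markup of a
// document, tolerating XML documents such as sitemaps that have no <body>.
function extractMarkup(doc) {
    // HTML documents: use the body as usual.
    if (doc.body) return doc.body.innerHTML;
    // Sitemaps: fall back to the <urlset> root, as the page function above does.
    const urlset = doc.querySelector('urlset');
    if (urlset) return urlset.innerHTML;
    // Assumption: for any other XML document, fall back to the root element.
    return doc.documentElement ? doc.documentElement.innerHTML : null;
}

// Plain objects standing in for browser documents, so the sketch runs in Node.js.
const htmlDoc = {
    body: { innerHTML: '<p>hello</p>' },
    querySelector: () => null,
    documentElement: { innerHTML: '<head></head><body><p>hello</p></body>' },
};

const sitemapMarkup = '<url><loc>https://example.com/</loc></url>';
const sitemapDoc = {
    body: null, // XML documents have no body element
    querySelector: (selector) => (selector === 'urlset' ? { innerHTML: sitemapMarkup } : null),
    documentElement: { innerHTML: sitemapMarkup },
};

const otherXmlDoc = {
    body: null,
    querySelector: () => null,
    documentElement: { innerHTML: '<item>1</item>' },
};

console.log(extractMarkup(htmlDoc));
console.log(extractMarkup(sitemapDoc));
console.log(extractMarkup(otherXmlDoc));
```

Inside `page.evaluate`, the same checks would be applied to the global `document` rather than to a passed-in object, because the evaluated function runs in the browser context.
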
jindrich.bar

Hello @oren_clearya,

Thank you for bringing this issue to our attention, and I apologize for the delayed response. I attempted to replicate this using a similar setup, and scraping the XML document worked as expected on my end. Unfortunately, the linked run has expired, so I cannot investigate further or reproduce the issue.

It’s possible that the error was caused by a temporary issue or specific conditions during the run. If you encounter this problem again, please provide a fresh run link and any additional context, and we’ll be happy to assist further.

I’ll close this issue for now, but don’t hesitate to create a new one if needed. Thank you for your understanding!