Playwright Scraper

Developed by

Apify

Crawls websites with a headless Chromium, Chrome, or Firefox browser and the Playwright library, using provided server-side Node.js code. Supports both recursive crawling and a list of URLs. Supports logging in to a website.

Scraping a sitemap.xml fails with a "document.body is null" error

Closed

oren_clearya opened this issue

When I run the Playwright Scraper Actor with a startUrl that points to a sitemap XML file, I get the following error:

2024-10-19T14:54:41.339Z DEBUG PlaywrightCrawler:AutoscaledPool: scaling up {"oldConcurrency":2,"newConcurrency":3,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
2024-10-19T14:54:49.152Z WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. page.evaluate: document.body is null
2024-10-19T14:54:49.153Z @debugger eval code line 226 > eval:1:7
2024-10-19T14:54:49.154Z evaluate@debugger eval code:228:17
2024-10-19T14:54:49.155Z @debugger eval code:1:44
2024-10-19T14:54:49.155Z
2024-10-19T14:54:49.156Z     at CrawlerSetup._requestHandler (/home/myuser/dist/internals/crawler_setup.js:379:35) {"id":"vwv0onJJ2YlCPdo","url":"https://apify.com/sitemap.xml","retryCount":1}

It seems to fail before reaching the page function itself. However, here is the pageFunction that was used:

async function pageFunction(context) {
  const { page, request, log } = context;

  async function pageEvaluate(context) {
    return {
      url: document.URL,
      html: document.body?.innerHTML ?? document.querySelector('urlset')?.innerHTML,
    };
  }

  let data = await page.evaluate(pageEva... [trimmed]
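The fallback that the pageFunction above attempts can be sketched in isolation. A document with no `<body>` element, such as a raw XML sitemap, has `document.body` set to `null`, while `document.documentElement` still points at the root element. This is a minimal sketch and not the Actor's own code: the helper name `serializeDocument` and the stub documents are hypothetical, introduced only so the logic can run outside a browser. It also would not help if the error is raised before the page function runs, as the log suggests.

```javascript
// Hypothetical helper (not part of Playwright Scraper): returns the URL and
// markup of a document, falling back to the root element when there is no
// <body>, which is the case for a raw XML document.
function serializeDocument(doc) {
  const root = doc.body ?? doc.documentElement;
  return {
    url: doc.URL,
    html: root ? root.innerHTML : null,
  };
}

// Stub objects standing in for the browser's `document` (assumption, for
// demonstration only; the URLs and markup are made up).
const htmlDoc = {
  URL: 'https://example.com/',
  body: { innerHTML: '<p>hi</p>' },
  documentElement: { innerHTML: '<head></head><body><p>hi</p></body>' },
};
const xmlDoc = {
  URL: 'https://example.com/sitemap.xml',
  body: null,
  documentElement: { innerHTML: '<url><loc>https://example.com/</loc></url>' },
};

console.log(serializeDocument(htmlDoc));
console.log(serializeDocument(xmlDoc));
```

Inside `page.evaluate`, the same logic would have to be defined in the browser context and called with the real `document`, since functions from the Node.js side are not available there.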

Jindřich Bär (jindrich.bar)

Hello @oren_clearya,

Thank you for bringing this issue to our attention, and I apologize for the delayed response. I attempted to replicate this using a similar setup, and scraping the XML document worked as expected on my end. Unfortunately, the linked run has expired, so I cannot investigate further or reproduce the issue.

It’s possible that the error was caused by a temporary issue or specific conditions during the run. If you encounter this problem again, please provide a fresh run link and any additional context, and we’ll be happy to assist further.

I’ll close this issue for now, but don’t hesitate to create a new one if needed. Thank you for your understanding!
