Website Content Crawler

apify/website-content-crawler

Automatically crawl and extract text content from websites with documentation, knowledge bases, help centers, or blogs. This Actor is designed to provide data to feed, fine-tune, or train large language models such as ChatGPT or LLaMA.

Remove cookies not behaving as expected

Closed

embrace_ai opened this issue
2 months ago

We noticed that many of the pages in one of our crawls had the "cookie warning" as the extracted text. Our actor input contains the following settings:

{ "removeCookieWarnings": true, "crawlerType": "playwright:firefox" }

Are we missing another setting or is this a limitation of the "I don't care about cookies" browser extension? Thank you!

https://addons.mozilla.org/en-US/firefox/addon/i-dont-care-about-cookies/

Hello and thank you for your interest in this Actor!

Indeed, this looks like an issue with the browser extension we use for this. You can still remove the cookie warning from your content manually by adding its selector to the HTML Processing > Remove HTML elements (CSS selector) option. On this website, the cookie overlay can be targeted with the selector dialog[open]. Don't forget to put a comma (,) before this selector when you append it - the entire textbox is one CSS selector!

All in all, your Remove HTML elements input option should look something like this:

...
[role="region"][aria-label*="skip" i],
[aria-modal="true"],
dialog[open]
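
If you start runs through the API rather than the Console, the same change can be sketched in Python. This is only a minimal sketch under a few assumptions: the helper names append_selector and build_input are hypothetical, the field name removeElementsCssSelector is assumed to be the API counterpart of the Remove HTML elements option (please check it against the Actor's input schema), and the existing selector list is whatever your input already contains, passed in unchanged.

```python
import json


def append_selector(existing: str, extra: str) -> str:
    """Append one selector to a comma-separated CSS selector list.

    Hypothetical helper: it only trims stray commas and whitespace at the
    join point, so selectors that contain commas internally stay intact.
    """
    base = existing.strip().rstrip(",").rstrip()
    extra = extra.strip().lstrip(",").strip()
    if not base:
        return extra
    return f"{base},\n{extra}"


def build_input(existing_selectors: str) -> dict:
    """Build an Actor input that also strips the cookie dialog.

    "removeElementsCssSelector" is an assumed field name; the other two
    settings are the ones quoted in this issue.
    """
    return {
        "removeCookieWarnings": True,
        "crawlerType": "playwright:firefox",
        "removeElementsCssSelector": append_selector(
            existing_selectors, "dialog[open]"
        ),
    }


if __name__ == "__main__":
    # Pass in the selector list your input already uses; the value here
    # is just the tail of the list shown above.
    current = '[role="region"][aria-label*="skip" i],\n[aria-modal="true"]'
    print(json.dumps(build_input(current), indent=2))
```

The helper joins on the boundary only, instead of splitting the whole list on commas, because a selector such as :is(a, b) may legitimately contain commas of its own.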

Check out my example run (and feel free to look at the input - or even reuse it in your Actor runs :))

I'll close this issue now, but feel free to reopen it, in case you encounter any issues with the approach above. Thanks again!
