
Website Content Crawler
Pricing
Pay per usage

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Rating: 4.5 (39)
Total users: 51.4k
Monthly users: 7.5k
Runs succeeded: >99%
Issues response: 6.1 days
Last modified: 3 days ago
Remove cookies not behaving as expected
Closed
We noticed that many of the pages in one of our crawls had the "cookie warning" as the extracted text. Our actor input contains the following settings:
{
  "removeCookieWarnings": true,
  "crawlerType": "playwright:firefox"
}
Are we missing another setting or is this a limitation of the "I don't care about cookies" browser extension? Thank you!
https://addons.mozilla.org/en-US/firefox/addon/i-dont-care-about-cookies/
Hello and thank you for your interest in this Actor!
Indeed, this seems like an issue with the browser extension we're using for this. You can still remove the cookie warning from your content manually by adding its selector to the HTML Processing > Remove HTML elements (CSS selector) option. On this website, the cookie overlay can be targeted by the selector dialog[open]. Don't forget to prepend this selector with a comma (,) - the entire textbox is one CSS selector!
All in all, your Remove HTML elements input option should look something like this:
...[role="region"][aria-label*="skip" i],[aria-modal="true"],dialog[open]
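To see what this option does under the hood, here is a minimal sketch of selector-based element removal using BeautifulSoup (an illustration only, not the Actor's actual implementation; the sample HTML and selector subset are assumptions):

```python
# Sketch: remove elements matching a comma-separated CSS selector
# before extracting page text - the same idea as the Actor's
# "Remove HTML elements (CSS selector)" input option.
from bs4 import BeautifulSoup

# Subset of the selector discussed above (the leading "..." part is omitted here)
REMOVE_SELECTOR = '[aria-modal="true"],dialog[open]'

html = """
<html><body>
  <dialog open>We use cookies! Please accept our cookie policy.</dialog>
  <main><p>Actual page content.</p></main>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
for element in soup.select(REMOVE_SELECTOR):
    element.decompose()  # drop the matched element and its subtree

text = soup.get_text(" ", strip=True)
print(text)  # the cookie dialog's text is gone, the main content remains
```

The key point is that the whole textbox is parsed as one selector list, which is why each additional selector must be joined with a comma.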
Check out my example run (and feel free to look at the input - or even reuse it in your Actor runs :))
I'll close this issue now, but feel free to reopen it, in case you encounter any issues with the approach above. Thanks again!
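For reference, the combined Actor input would look roughly like this (the removeElementsCssSelector field name is an assumption based on the Actor's input schema; verify it against the JSON view of your Actor's input tab, and note the "..." stands for the rest of the default selector list):

```json
{
  "crawlerType": "playwright:firefox",
  "removeCookieWarnings": true,
  "removeElementsCssSelector": "...[role=\"region\"][aria-label*=\"skip\" i],[aria-modal=\"true\"],dialog[open]"
}
```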