Website Content Crawler

Pricing

Pay per usage

Developed by

Apify

Maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

4.5 (39)

1355

Total users: 51.4k
Monthly users: 7.5k
Runs succeeded: >99%
Issues response: 6.1 days
Last modified: 3 days ago

Remove cookies not behaving as expected

Closed

embrace_ai opened this issue
a year ago

We noticed that for many of the pages in one of our crawls, the extracted text was just the cookie warning. Our Actor input contains the following settings:

{ "removeCookieWarnings": true, "crawlerType": "playwright:firefox" }

Are we missing another setting or is this a limitation of the "I don't care about cookies" browser extension? Thank you!

https://addons.mozilla.org/en-US/firefox/addon/i-dont-care-about-cookies/

jindrich.bar replied

Hello and thank you for your interest in this Actor!

Indeed, this seems to be an issue with the browser extension we use for this. You can still remove the cookie warning from your content manually by adding its selector to the HTML Processing > Remove HTML elements (CSS selector) option. On this website, the cookie overlay can be targeted with the selector dialog[open]. Don't forget to put a comma (,) before this selector: the entire textbox is one CSS selector, so each entry must be separated from the previous one.

All in all, your Remove HTML elements input option should look something like this:

...
[role="region"][aria-label*="skip" i],
[aria-modal="true"],
dialog[open]
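
For anyone building the input programmatically, here is a minimal Python sketch of the same idea. It assumes the "Remove HTML elements (CSS selector)" option is exposed in the Actor input as removeElementsCssSelector (please verify this against the Actor's input schema); append_selector is a hypothetical helper, and current holds only the two selectors visible above, not the elided defaults.

```python
def append_selector(existing: str, extra: str) -> str:
    """Append one CSS selector to a comma-separated selector list.

    The whole option is a single CSS selector list, so the new entry
    must be separated from the previous one by a comma.
    """
    # Drop trailing whitespace and any dangling comma from the existing list.
    base = existing.strip().rstrip(",").rstrip()
    # Drop a leading comma the caller may already have typed.
    extra = extra.strip().lstrip(",").strip()
    if not base:
        return extra
    return f"{base},\n{extra}"


# Only the selectors shown in this thread; the elided defaults are not reproduced.
current = '[role="region"][aria-label*="skip" i],\n[aria-modal="true"]'

run_input = {
    "removeCookieWarnings": True,
    "crawlerType": "playwright:firefox",
    # Assumed field name for "Remove HTML elements (CSS selector)".
    "removeElementsCssSelector": append_selector(current, "dialog[open]"),
}

print(run_input["removeElementsCssSelector"])
```

The helper deliberately avoids splitting the existing value on commas, since selectors such as :is(a, b) contain commas of their own.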

Check out my example run (and feel free to look at the input - or even reuse it in your Actor runs :))

I'll close this issue now, but feel free to reopen it if you run into any problems with the approach above. Thanks again!