Website Content Crawler
Automatically crawl and extract text content from websites with documentation, knowledge bases, help centers, or blogs. This Actor is designed to provide data to feed, fine-tune, or train large language models such as ChatGPT or LLaMA.
We noticed that many of the pages in one of our crawls had the "cookie warning" as the extracted text. Our actor input contains the following settings:
{ "removeCookieWarnings": true, "crawlerType": "playwright:firefox" }
Are we missing another setting or is this a limitation of the "I don't care about cookies" browser extension? Thank you!
https://addons.mozilla.org/en-US/firefox/addon/i-dont-care-about-cookies/
Hello and thank you for your interest in this Actor!
Indeed, this seems to be an issue with the browser extension we use for this. You can still remove the cookie warning from your content manually by adding its selector to the HTML Processing > Remove HTML elements (CSS selector) option. On this website, the cookie overlay can be targeted by the selector dialog[open]. Don't forget to prepend this selector with a comma (,): the entire textbox is one CSS selector!
All in all, your Remove HTML elements input option should look something like this:
...
[role="region"][aria-label*="skip" i],
[aria-modal="true"],
dialog[open]
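If you build the Actor input programmatically, the comma rule above can be sketched roughly as follows. This is only an illustration: the helper append_selectors is hypothetical, the starting selectors are just the two visible in the snippet above (the elided part is not reproduced), and the input field name removeElementsCssSelector is an assumption that should be checked against the Actor's input schema.

```python
import json


def append_selectors(existing: str, *extra: str) -> str:
    """Join CSS selectors into one comma-separated selector list.

    Hypothetical helper: strips a trailing comma from the existing value
    so the result never contains an empty selector.
    """
    parts = [existing.strip().rstrip(",").strip()]
    parts += [s.strip() for s in extra]
    return ",\n".join(p for p in parts if p)


# Only the selectors visible in the thread; not the Actor's full default list.
existing = '[role="region"][aria-label*="skip" i],\n[aria-modal="true"]'
selector = append_selectors(existing, "dialog[open]")

actor_input = {
    "crawlerType": "playwright:firefox",
    "removeCookieWarnings": True,
    # Assumed field name for "Remove HTML elements (CSS selector)".
    "removeElementsCssSelector": selector,
}

print(json.dumps(actor_input, indent=2))
```

Because the whole textbox is treated as a single selector list, a missing comma would merge two selectors into one that likely matches nothing, which is why the helper always joins with a comma.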
Check out my example run (and feel free to look at the input - or even reuse it in your Actor runs :))
I'll close this issue now, but feel free to reopen it, in case you encounter any issues with the approach above. Thanks again!