Website Content Crawler

apify/website-content-crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

crawled a website and it just returns messages about cookies, not the text of the site

Closed

jhonovich opened this issue
23 days ago

Most of the pages on axis.com return a generic cookie message instead of the actual page content. For example, https://www.axis.com/en-us/solutions/aviation returns: "Aviation | Axis Communications Cookie settings Axis uses cookies to remember your user preferences, for storing anonymized user statistics, for marketing, and to understand how people use our sites so that we can improve the quality of our services. We use cookies to track trends and patterns of how people use our sites. More information: Cookie policy and Cookie list."

I believe we are using the default settings. Are we doing anything wrong, or do you have any suggestions? Thanks.

jiri.spilka

Hi,

Thank you for using Website Content Crawler.

The following solution comes from @jindrich.bar, with all credit to him.

The cookie modal is present in a shadow root, and we are unable to close it with a script. This was a known limitation a while ago, though it may have changed since the last time we checked.

However, you can successfully scrape the content by changing the crawler type to Firefox (from adaptive) and setting the HTML processing option htmlTransformer to None (instead of Readable text). Please see Jindrich's example run (which he aborted to save resources).
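For anyone starting runs through the API rather than the Console, the two settings above could be passed roughly as follows. This is a minimal sketch, not an official recipe: the input values `"playwright:firefox"` and `"none"` are assumed to be the schema values behind the "Firefox" and "None" options, the helper names `build_run_input` and `run_crawler` are made up for illustration, and the token is a placeholder. Please check the values against the Actor's input documentation before relying on them.

```python
import json

# Assumed input-schema values for the "Firefox" crawler type and the "None"
# HTML transformer; verify them in the Actor's input documentation.
FIREFOX_CRAWLER = "playwright:firefox"
NO_TRANSFORM = "none"


def build_run_input(start_url: str) -> dict:
    """Build an Actor input that uses Firefox and skips HTML transformation."""
    return {
        "startUrls": [{"url": start_url}],
        "crawlerType": FIREFOX_CRAWLER,
        "htmlTransformer": NO_TRANSFORM,
    }


def run_crawler(token: str, start_url: str) -> list:
    """Run the Actor and return its dataset items.

    Requires `pip install apify-client`; the import is done lazily so that
    build_run_input() can be used without the package installed.
    """
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor("apify/website-content-crawler").call(
        run_input=build_run_input(start_url)
    )
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


if __name__ == "__main__":
    # Print the input only; call run_crawler("<YOUR_API_TOKEN>", url) to start a run.
    url = "https://www.axis.com/en-us/solutions/aviation"
    print(json.dumps(build_run_input(url), indent=2))
```

With the transformer disabled, the cookie text will likely still appear in the output alongside the page content, so some downstream filtering may be needed.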

I hope this helps! I’ll go ahead and close this issue now, but feel free to ask additional questions or raise a new issue if needed. Jiri
