Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Most of the pages on axis.com return a generic cookie message rather than the actual page content, e.g. https://www.axis.com/en-us/solutions/aviation returns: "Aviation | Axis Communications Cookie settings Axis uses cookies to remember your user preferences, for storing anonymized user statistics, for marketing, and to understand how people use our sites so that we can improve the quality of our services. We use cookies to track trends and patterns of how people use our sites. More information: Cookie policy and Cookie list."
We use the default settings, I believe. Is there anything we are doing wrong, or do you have any suggestions? Thanks.
Hi,
Thank you for using Website Content Crawler.
The following solution comes from @jindrich.bar, with all credit to him.
The cookie modal is present in a shadow root, and we are unable to close it with a script. This was a known limitation a while ago, though it may have changed since the last time we checked.
However, you can successfully scrape the content by changing the crawler type to Firefox (from adaptive) and setting the HTML processing option htmlTransformer to None (instead of Readable text). Please see Jindrich's example run (which he aborted to save resources).
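For anyone starting the Actor through the API rather than the web form, the two settings above might look roughly like the sketch below. It is only an illustration under a few assumptions: the input values "playwright:firefox" and "none" are my best understanding of how the Firefox and None choices are encoded in the Actor input, so please check them against the Actor's input schema. The helper names build_run_input and run_crawler are made up for this example, and run_crawler needs the apify-client package and a valid API token.

```python
import json

# Actor identifier on the Apify platform.
ACTOR_ID = "apify/website-content-crawler"


def build_run_input(start_url: str) -> dict:
    """Build an Actor input that applies the workaround from this thread.

    Assumption: "playwright:firefox" and "none" are the input values behind
    the "Firefox" crawler type and the "None" HTML transformer in the form.
    """
    return {
        "startUrls": [{"url": start_url}],
        # Use the Firefox browser crawler instead of the adaptive one.
        "crawlerType": "playwright:firefox",
        # Skip the readable-text transformation, which was keeping
        # only the cookie dialog text on this site.
        "htmlTransformer": "none",
    }


def run_crawler(token: str, start_url: str) -> list:
    """Start the Actor, wait for it to finish, and return its dataset items.

    Hypothetical helper; requires `pip install apify-client`.
    """
    from apify_client import ApifyClient  # imported lazily so the sketch loads without it

    client = ApifyClient(token)
    run = client.actor(ACTOR_ID).call(run_input=build_run_input(start_url))
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


if __name__ == "__main__":
    # Print the input only; no network call is made here.
    example = build_run_input("https://www.axis.com/en-us/solutions/aviation")
    print(json.dumps(example, indent=2))
```

With the transformer turned off, the output will likely contain more navigation and footer text than usual, so some post-processing on your side may be needed.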
I hope this helps! I’ll go ahead and close this issue now, but feel free to ask additional questions or raise a new issue if needed. Jiri
Actor Metrics
4k monthly users
840 stars
>99% runs succeeded
1 day response time
Created in Mar 2023
Modified 21 hours ago