Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
I keep getting the following response. It looks like the crawler is getting blocked by the cookie consent dialog:

1/0 Capital investor portfolio, rounds & team Cookies for app.dealroom.co Thank you for visiting our website! We use cookies to optimize your user experience, to analyze web traffic and for marketing purposes. Read more about how we use cookies and how you can manage them by clicking "Edit preferences". If you agree to our use of cookies, click "Accept all and continue".
Hi,
Thank you for using Website Content Crawler.
The following solution comes from @jindrich.bar, with all credit to him.
The cookie modal is rendered inside a shadow root, so we are unable to close it with a script. This was a known limitation a while ago, though it may have changed since we last checked.
However, you can successfully scrape the content by changing the crawler type from adaptive to Firefox and setting the HTML processing option (htmlTransformer) to None instead of Readable text. Please see my run.
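For anyone applying this through the API rather than the Console, the two settings above might look roughly like the input below. This is only a sketch: the helper name build_run_input is made up for illustration, the start URL is a placeholder, and the values "playwright:firefox" and "none" are my assumption of how the "Firefox" and "None" options are spelled in the input JSON. Please check them against the Actor's input schema before relying on them.

```python
import json


def build_run_input(start_url: str) -> dict:
    """Build a Website Content Crawler input that applies the workaround."""
    return {
        "startUrls": [{"url": start_url}],
        # Assumed input value for the "Firefox" crawler type (instead of adaptive).
        "crawlerType": "playwright:firefox",
        # Assumed input value for "None" (instead of "Readable text").
        "htmlTransformer": "none",
    }


if __name__ == "__main__":
    # Placeholder start URL; replace it with the page you want to crawl.
    print(json.dumps(build_run_input("https://app.dealroom.co/"), indent=2))
```

The printed JSON can be pasted into the Actor's JSON input editor, or the dictionary can be passed as the run input when starting the Actor from an API client.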
I hope this helps! I’ll go ahead and close this issue now, but feel free to ask additional questions or raise a new issue if needed.
Looks like it worked - thanks for that!
Actor Metrics
4k monthly users
840 stars
>99% runs succeeded
1 day response time
Created in Mar 2023
Modified 21 hours ago