Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Problem: Crawl requests for G2 never succeed; every run fails after 10 session rotations. This has been consistent across runs over the last few days.
I did not expect this, since there are G2 scraping Actors and the site overall does not seem to be anti-scraping. Thank you!
Hello,
Thanks for the report. Actually, g2.com has one of the strongest Cloudflare protections we have seen (at least on some types of pages). There is a customized approach to scraping it, and the team will look into how to incorporate it into this Actor.
Thank you! Looking forward to the update. G2 is a pretty key part of our project use case.
Hi, I apologize for the very late response. :(
I’m currently reviewing the issues, attempting to reproduce them, and providing solutions.
There have been many improvements to the Website Content Crawler and the underlying Crawlee library over time.
I tested it again, and while the protection remains strong, we are now able to crawl g2.com pages. However, the crawling process may take some time as session rotation is required when the crawler gets blocked.
2024-12-18T07:59:07.726Z WARN AdaptiveCrawler: Reclaiming failed request back to the list or queue. Detected a session error, rotating session...
2024-12-18T07:59:07.727Z Received blocked status code: 403
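To illustrate what the log above describes, here is a minimal, self-contained sketch of session rotation on a blocked response. It is not the crawler's actual implementation: `Session`, `fetch_with_session_rotation`, the `fetch` callback, and the set of blocked status codes are all hypothetical names and values chosen for the example. The default of 10 rotations mirrors the limit mentioned in the original report.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, Tuple

# Assumption: status codes treated as "blocked" in this sketch.
BLOCKED_STATUS_CODES = {401, 403, 429}


@dataclass
class Session:
    """Hypothetical stand-in for a crawler session (cookies, proxy IP, ...)."""
    id: int


def fetch_with_session_rotation(
    url: str,
    fetch: Callable[[str, Session], Tuple[int, str]],
    max_session_rotations: int = 10,
) -> Tuple[int, str, int]:
    """Retry `url` with a fresh session each time the response looks blocked.

    Returns (status, body, rotations_used). Raises RuntimeError once the
    rotation budget is exhausted.
    """
    session_ids = itertools.count(1)
    session = Session(next(session_ids))
    rotations = 0
    while True:
        status, body = fetch(url, session)
        if status not in BLOCKED_STATUS_CODES:
            return status, body, rotations
        if rotations >= max_session_rotations:
            raise RuntimeError(
                f"{url} still blocked (status {status}) "
                f"after {rotations} session rotations"
            )
        # Discard the blocked session and try again with a new one.
        rotations += 1
        session = Session(next(session_ids))
```

Each rotation costs one extra request, which is why a heavily protected site can take noticeably longer to crawl even when it eventually succeeds.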
Please see my example run with residential proxy settings. I aborted it to avoid wasting resources.
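For readers who want to try a similar run, the following sketch builds an input with residential proxy settings. The field names (`startUrls`, `proxyConfiguration`, `apifyProxyGroups`, `maxSessionRotations`) are my assumptions about the Actor's input schema and should be checked against the Actor's input documentation; `build_run_input` is a hypothetical helper, and this is not the exact configuration of the example run.

```python
from typing import Any, Dict


def build_run_input(start_url: str, max_session_rotations: int = 10) -> Dict[str, Any]:
    """Build a run input dict; field names are assumed, not confirmed here."""
    return {
        "startUrls": [{"url": start_url}],
        # Assumption: residential proxies are selected via the proxy group name.
        "proxyConfiguration": {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"],
        },
        # Assumption: how many times a blocked request may switch sessions.
        "maxSessionRotations": max_session_rotations,
    }


run_input = build_run_input("https://www.g2.com/", max_session_rotations=10)
```

The resulting dict could then be pasted into the Actor's JSON input or passed as the run input through an API client.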
I’ll close this issue for now, but feel free to ask any questions or raise a new issue.