Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗LangChain, LlamaIndex, and the wider LLM ecosystem.
Failed Crawling for G2 web pages
Open
Problem: Crawling requests for G2 never succeed in runs; they fail after 10 session rotations. This has been consistent across runs over the last few days.
We did not expect this, since there are G2 scraping Actors and the site overall does not seem to be anti-scraping. Thank you!
lukaskrivka
Hello,
Thanks for the report. Actually, g2.com has one of the strongest Cloudflare protections we have seen (at least on some types of pages). There is a customized approach to scraping it, and the team will look into how to incorporate it into this Actor.
motivated_leaflet
Thank you! Looking forward to the update. G2 is a pretty key part of our project use case.
- 2.8k monthly users
- 317 stars
- 100.0% runs succeeded
- 4 days response time
- Created in Mar 2023
- Modified 1 day ago