Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
This is the only output after 2 hours:
2024-06-14T19:02:29.136Z ACTOR: Pulling Docker image of build JJzdJWfkVbNeexCpB from repository.
2024-06-14T19:02:29.234Z ACTOR: Creating Docker container.
2024-06-14T19:02:29.731Z ACTOR: Starting Docker container.
2024-06-14T19:02:30.265Z Starting X virtual framebuffer using: Xvfb :99 -ac -screen 0 1920x1080x24+32 -nolisten tcp
2024-06-14T19:02:30.268Z Executing main command
2024-06-14T19:02:32.176Z INFO System info {"apifyVersion":"3.2.3","apifyClientVersion":"2.9.0","crawleeVersion":"3.10.3","osType":"Linux","nodeVersion":"v18.19.1"}
2024-06-14T19:02:32.396Z INFO Discovering possible sitemap files from the start URLs...
2024-06-14T21:07:19.516Z ACTOR: The Actor run was aborted by the user.
Closing as a duplicate of https://console.apify.com/actors/aYG0l9s7dbB7j3gbS/issues/5AfOIAxLtcJYZnDNy .
We are aware of this issue and have a non-blocking sitemap parser implementation semi-ready, but it's still missing some parts (and tests). Feel free to check out the PR and give us your opinion. We're planning to merge it into the crawling library this week or next; only then can we propagate the fix to Website Content Crawler.
Thank you for your patience.
- 3.8k monthly users
- 544 stars
- 99.9% runs succeeded
- 3.4 days response time
- Created in Mar 2023
- Modified 1 day ago