
Website Content Crawler
Pricing
Pay per usage

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
3.9 (41)
1537
Total users: 59K
Monthly users: 7.9K
Runs succeeded: >99%
Issues response: 7.8 days
Last modified: 2 days ago
Actor stalled discovering sitemaps
Closed
This is the only output after 2 hours:
2024-06-14T19:02:29.136Z ACTOR: Pulling Docker image of build JJzdJWfkVbNeexCpB from repository.
2024-06-14T19:02:29.234Z ACTOR: Creating Docker container.
2024-06-14T19:02:29.731Z ACTOR: Starting Docker container.
2024-06-14T19:02:30.265Z Starting X virtual framebuffer using: Xvfb :99 -ac -screen 0 1920x1080x24+32 -nolisten tcp
2024-06-14T19:02:30.268Z Executing main command
2024-06-14T19:02:32.176Z INFO System info {"apifyVersion":"3.2.3","apifyClientVersion":"2.9.0","crawleeVersion":"3.10.3","osType":"Linux","nodeVersion":"v18.19.1"}
2024-06-14T19:02:32.396Z INFO Discovering possible sitemap files from the start URLs...
2024-06-14T21:07:19.516Z ACTOR: The Actor run was aborted by the user.
Closing as a duplicate of https://console.apify.com/actors/aYG0l9s7dbB7j3gbS/issues/5AfOIAxLtcJYZnDNy .
We are aware of this issue and have a non-blocking sitemap parser implementation partly ready, but it is still missing some parts (and tests). Feel free to check out the PR and give us your opinion. We plan to merge it into the crawling library this week or next; only then can we propagate the fix to Website Content Crawler.
Thank you for your patience.
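
For anyone hitting the same stall, here is a minimal sketch of the general idea behind not letting a slow discovery step block a run: race the step against a timer and fall back to an empty result. It is not the implementation from the PR mentioned above, and it does not use any Crawlee or Apify API; `withTimeout` and `stalledDiscovery` are hypothetical names used only for illustration.

```typescript
// Hypothetical helper: resolves with `fallback` if `task` has not settled within `ms`.
async function withTimeout<T>(task: Promise<T>, ms: number, fallback: T): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  try {
    // Whichever promise settles first wins the race.
    return await Promise.race([task, timeout]);
  } finally {
    // Clear the timer so it does not keep the process alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Hypothetical stand-in for a sitemap discovery step that never finishes.
function stalledDiscovery(): Promise<string[]> {
  return new Promise<string[]>(() => {});
}

async function main(): Promise<void> {
  // Give discovery 100 ms, then continue with no sitemap URLs.
  const urls = await withTimeout(stalledDiscovery(), 100, []);
  console.log(`Discovered ${urls.length} sitemap URLs; continuing with the start URLs.`);
}

main();
```

Note that this only stops waiting for the slow step; it does not cancel the underlying work, which is one reason a proper non-blocking parser is the better fix.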