Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Actor stalled discovering sitemaps

Closed

MavenAGI opened this issue 4 months ago

This is the only output after 2 hours:

2024-06-14T19:02:29.136Z ACTOR: Pulling Docker image of build JJzdJWfkVbNeexCpB from repository.
2024-06-14T19:02:29.234Z ACTOR: Creating Docker container.
2024-06-14T19:02:29.731Z ACTOR: Starting Docker container.
2024-06-14T19:02:30.265Z Starting X virtual framebuffer using: Xvfb :99 -ac -screen 0 1920x1080x24+32 -nolisten tcp
2024-06-14T19:02:30.268Z Executing main command
2024-06-14T19:02:32.176Z INFO  System info {"apifyVersion":"3.2.3","apifyClientVersion":"2.9.0","crawleeVersion":"3.10.3","osType":"Linux","nodeVersion":"v18.19.1"}
2024-06-14T19:02:32.396Z INFO  Discovering possible sitemap files from the start URLs...
2024-06-14T21:07:19.516Z ACTOR: The Actor run was aborted by the user.
jindrich.bar

Closing as a duplicate of https://console.apify.com/actors/aYG0l9s7dbB7j3gbS/issues/5AfOIAxLtcJYZnDNy.

We are aware of this issue and have a non-blocking sitemap parser implementation partly ready, but it is still missing some parts (and tests). Feel free to check out the PR and give us your opinion. We plan to merge it into the crawling library this week or next; only then can we propagate the fix to Website Content Crawler.

Thank you for your patience.
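
For readers wondering what bounded, non-blocking sitemap discovery might look like in general, here is a minimal Python sketch. It is not the Crawlee implementation or the PR mentioned above. The function name `discover_sitemaps`, the `fetch` callback, the candidate paths, and both timeout values are hypothetical, chosen only to illustrate the idea: every probe gets its own timeout, and the whole discovery phase has an overall deadline, so a single unresponsive server cannot stall the run.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, Optional, Set
from urllib.parse import urljoin

# Hypothetical candidate paths; a real crawler may probe a different set.
CANDIDATE_PATHS = ("/sitemap.xml", "/sitemap_index.xml", "/sitemap.txt")

# Assumed fetch signature: returns the response body, or None if nothing is there.
Fetch = Callable[[str], Awaitable[Optional[str]]]


async def discover_sitemaps(
    start_urls: Iterable[str],
    fetch: Fetch,
    per_request_timeout: float = 10.0,
    overall_timeout: float = 60.0,
) -> List[str]:
    """Return the candidate sitemap URLs that answered before the deadlines."""
    candidates = [urljoin(s, p) for s in start_urls for p in CANDIDATE_PATHS]
    if not candidates:
        return []

    responded: Set[str] = set()

    async def probe(url: str) -> None:
        try:
            # A slow or hanging server only costs per_request_timeout seconds.
            body = await asyncio.wait_for(fetch(url), per_request_timeout)
        except Exception:
            return  # timeout or fetch error: treat as "no sitemap here"
        if body:
            responded.add(url)

    tasks = [asyncio.ensure_future(probe(u)) for u in candidates]
    # The overall deadline caps the whole discovery phase.
    _done, pending = await asyncio.wait(tasks, timeout=overall_timeout)
    for task in pending:
        task.cancel()
    # Let cancelled probes finish unwinding before returning.
    await asyncio.gather(*pending, return_exceptions=True)

    # Report results in candidate order so the output is deterministic.
    return [u for u in candidates if u in responded]
```

With this shape, the crawl can proceed with whatever sitemaps were found once the deadline passes, instead of waiting indefinitely at the discovery step.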
