Website Content Crawler

Pricing

Pay per usage

Developed by

Apify

Maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

3.9 (41)

1537

Total users: 59K
Monthly users: 7.9K
Runs succeeded: >99%
Issues response: 7.8 days
Last modified: 2 days ago

Actor stalled discovering sitemaps

Closed

MavenAGI opened this issue a year ago

This is the only output after 2 hours:

2024-06-14T19:02:29.136Z ACTOR: Pulling Docker image of build JJzdJWfkVbNeexCpB from repository.
2024-06-14T19:02:29.234Z ACTOR: Creating Docker container.
2024-06-14T19:02:29.731Z ACTOR: Starting Docker container.
2024-06-14T19:02:30.265Z Starting X virtual framebuffer using: Xvfb :99 -ac -screen 0 1920x1080x24+32 -nolisten tcp
2024-06-14T19:02:30.268Z Executing main command
2024-06-14T19:02:32.176Z INFO System info {"apifyVersion":"3.2.3","apifyClientVersion":"2.9.0","crawleeVersion":"3.10.3","osType":"Linux","nodeVersion":"v18.19.1"}
2024-06-14T19:02:32.396Z INFO Discovering possible sitemap files from the start URLs...
2024-06-14T21:07:19.516Z ACTOR: The Actor run was aborted by the user.
jindrich.bar

Closing as a duplicate of https://console.apify.com/actors/aYG0l9s7dbB7j3gbS/issues/5AfOIAxLtcJYZnDNy .

We are aware of this issue and have a non-blocking sitemap parser implementation partly ready, but it is still missing some parts (and tests). Feel free to check out the PR and give us your opinion. We plan to merge it into the crawling library this week or next; only then can we propagate the fix to Website Content Crawler.

Thank you for your patience.
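
Editorial note: the log above stops at the sitemap discovery step, so the run appears to have been waiting on a probe that never returned. As a rough illustration of what "non-blocking" discovery could look like, here is a minimal TypeScript sketch that probes candidate sitemap URLs concurrently and gives up on each one after a timeout. It is not the implementation from the PR mentioned above and does not use the crawling library's API. The names `Fetcher`, `sitemapCandidates`, `probeWithTimeout`, and `discoverSitemaps`, the list of candidate paths, and the default timeout are all assumptions made for this example.

```typescript
// Hypothetical sketch; none of these names come from the Actor or the crawling library.

/** Resolves to true if a sitemap exists at `url`. Injected so it can be stubbed. */
type Fetcher = (url: string, signal: AbortSignal) => Promise<boolean>;

/** Candidate sitemap locations for one start URL (assumed common conventions). */
function sitemapCandidates(startUrl: string): string[] {
    const origin = new URL(startUrl).origin;
    return [`${origin}/sitemap.xml`, `${origin}/sitemap.txt`, `${origin}/sitemap_index.xml`];
}

/** Probe one URL, but stop waiting after `timeoutMs` so a silent host cannot stall the run. */
async function probeWithTimeout(url: string, fetcher: Fetcher, timeoutMs: number): Promise<boolean> {
    const controller = new AbortController();
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<boolean>((resolve) => {
        timer = setTimeout(() => {
            controller.abort(); // ask the fetcher to cancel its request
            resolve(false);     // treat a timed-out probe as "no sitemap found"
        }, timeoutMs);
    });
    try {
        return await Promise.race([fetcher(url, controller.signal), timeout]);
    } catch {
        return false; // network errors also count as "no sitemap here"
    } finally {
        clearTimeout(timer);
    }
}

/** Probe all candidates concurrently and return the URLs that responded in time. */
async function discoverSitemaps(
    startUrls: string[],
    fetcher: Fetcher,
    timeoutMs = 5_000,
): Promise<string[]> {
    const candidates: string[] = [];
    for (const startUrl of startUrls) {
        for (const candidate of sitemapCandidates(startUrl)) {
            if (candidates.indexOf(candidate) === -1) candidates.push(candidate);
        }
    }
    const found = await Promise.all(
        candidates.map((url) => probeWithTimeout(url, fetcher, timeoutMs)),
    );
    return candidates.filter((_, i) => found[i]);
}
```

In a real run the `fetcher` argument might wrap an HTTP HEAD request that honours the abort signal. The point of the sketch is only that total discovery time is bounded by roughly `timeoutMs`, however the remote host behaves.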