

Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Rating: 3.9 (41)
Pricing: Pay per usage
Total users: 59K
Monthly users: 7.9K
Runs succeeded: >99%
Issues response: 7.8 days
Last modified: 2 days ago
My Runs do not end
Closed
Hi, my runs, which usually take 10 minutes, now time out after two hours. From the logs, it looks like each run proceeds as usual and finishes properly after 10 minutes, but then for the next two hours the following messages pop up once a minute.
```
2024-08-01T00:11:29.167Z INFO HttpCrawler:Statistics: HttpCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":5904,"requestsFinishedPerMinute":11,"requestsFailedPerMinute":0,"requestTotalDurationMillis":631724,"requestsTotal":107,"crawlerRuntimeMillis":607551,"retryHistogram":[107]}
2024-08-01T00:11:45.093Z INFO HttpCrawler:AutoscaledPool: state {"currentConcurrency":0,"desiredConcurrency":2,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.7,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
```

Hello, and thank you for your interest in Website Content Crawler. There are multiple things to consider. The messages you're seeing indicate that there are ongoing file downloads. Is that desirable? There are also some links that take a long time and time out before we manage to process them; these look like file downloads as well.
If the runs took less time before, it may be that you weren't saving downloads previously, or the pages became slower, or many new downloadable files appeared.
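If file downloads turn out to be the culprit, here is a minimal sketch of disabling them and tightening the per-request timeout when starting the Actor through the Apify Python client. The `saveFiles` and `requestTimeoutSecs` field names are assumptions based on the Actor's input schema; please verify them against the Actor's input tab before relying on this.

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")  # hypothetical placeholder token

# Assumed input fields: saveFiles turns off file downloads entirely,
# requestTimeoutSecs makes slow links fail faster instead of hanging.
run = client.actor("apify/website-content-crawler").call(
    run_input={
        "startUrls": [{"url": "https://example.com"}],
        "saveFiles": False,
        "requestTimeoutSecs": 30,
    },
)
print(run["status"])
```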
matthias.amberg
Hi,
unfortunately I didn't change anything in the crawler settings (I did now lower the overall timeout). There might have been changes to the website, but the website seems to work absolutely fine for me: there are no slow downloads or other timeouts. There seems to be an issue with the crawler. Also: let us configure per-link timeouts and the number of retries. The defaults are ridiculously high.
matthias.amberg
Also note: the crawler says it has ended and will shut down minutes, and sometimes hours, before the timeout. The crawler seems to fail to stop at the end of the crawl.
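One possible stopgap for runs that fail to shut down is to cap the run's total lifetime when starting it via the API, so a hung run is aborted instead of idling for two hours. The sketch below uses the Apify Python client's `timeout_secs` run parameter; the 15-minute cap is an assumption based on the roughly 10-minute runs described above.

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")  # hypothetical placeholder token

# Abort the run after 15 minutes even if the crawler fails to stop
# on its own (a healthy run finishes in about 10 minutes).
run = client.actor("apify/website-content-crawler").call(
    run_input={"startUrls": [{"url": "https://example.com"}]},
    timeout_secs=15 * 60,
)
print(run["status"])  # e.g. SUCCEEDED, or TIMED-OUT if it hung again
```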

Dušan Vystrčil (dusan.vystrcil)
Hi Matthias, unfortunately, we are unable to replicate the issue and therefore cannot provide further assistance at this time. If you encounter this error again, please let us know, and we will do our best to assist you.
We are closing this issue, but feel free to ask any further questions.