Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
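For context, the text extracted by the crawler is typically pulled out through the Apify API and then chunked into a vector database or RAG pipeline. A minimal sketch of that flow, assuming the Python apify-client and a hypothetical start URL (the run_input field names should be double-checked against the Actor's input schema):

```python
from apify_client import ApifyClient

# Authenticate with your Apify API token (placeholder value).
client = ApifyClient("<APIFY_API_TOKEN>")

# Start the Website Content Crawler and wait for it to finish.
run = client.actor("apify/website-content-crawler").call(
    run_input={
        "startUrls": [{"url": "https://docs.apify.com/"}],  # hypothetical example URL
    }
)

# Each dataset item holds the cleaned text of one crawled page,
# ready to be chunked and pushed into a vector store or RAG pipeline.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("url"), len(item.get("text", "")))
```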
Hi, my runs, which usually take 10 minutes, now time out after two hours. From the logs it looks like the run proceeds as usual and ends properly after 10 minutes, but then for the next 2 hours the following messages pop up once a minute:
2024-08-01T00:11:29.167Z INFO HttpCrawler:Statistics: HttpCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":5904,"requestsFinishedPerMinute":11,"requestsFailedPerMinute":0,"requestTotalDurationMillis":631724,"requestsTotal":107,"crawlerRuntimeMillis":607551,"retryHistogram":[107]}
2024-08-01T00:11:45.093Z INFO HttpCrawler:AutoscaledPool: state {"currentConcurrency":0,"desiredConcurrency":2,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.7,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
Hello, and thank you for your interest in Website Content Crawler. There are multiple things to consider. The messages you're seeing indicate that there are ongoing file downloads. Is that desirable? Also, some links take a long time and time out before we manage to process them; these look like file downloads as well.
If the runs took less time before, it may be that you weren't saving file downloads before, that the website became slower, or that many new downloadable files appeared on it.
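If the extra runtime really does come from file downloads, one workaround on the caller's side is to disable file saving and tighten the per-request limits in the run input. A rough sketch, assuming the Python apify-client; the field names saveFiles, requestTimeoutSecs, and maxRequestRetries are assumptions and should be verified against the Actor's input schema:

```python
from apify_client import ApifyClient

client = ApifyClient("<APIFY_API_TOKEN>")

run_input = {
    "startUrls": [{"url": "https://docs.apify.com/"}],  # hypothetical example URL
    # Field names below are assumptions; check the Actor's Input tab before use.
    "saveFiles": False,        # skip downloading linked files (PDFs, archives, ...)
    "requestTimeoutSecs": 30,  # give up on a single page or file after 30 seconds
    "maxRequestRetries": 1,    # retry a failed request at most once
}

run = client.actor("apify/website-content-crawler").call(run_input=run_input)
```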
Hi
unfortunately I didn't change anything in the crawler settings (I have now lowered the overall timeout). There might have been changes to the website, but it seems to work absolutely fine (for me), and there are no slow downloads or other timeouts. There seems to be an issue with the crawler. Also: let us configure per-link timeouts and the number of retries; the defaults are ridiculously high.
Also note: the crawler reported that it had finished and would shut down minutes, and sometimes hours, before the run finally timed out. The crawler seems to fail to stop at the end of the crawl.
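Until per-link timeouts and retry counts are configurable in the Actor itself, the run as a whole can at least be capped from the caller's side, so a hung crawl is aborted instead of idling until the default timeout. A sketch, again assuming the Python apify-client, whose call method accepts a timeout_secs argument for the run:

```python
from apify_client import ApifyClient

client = ApifyClient("<APIFY_API_TOKEN>")

# Abort the run if it has not finished within 15 minutes (900 seconds),
# rather than letting it idle for two extra hours.
run = client.actor("apify/website-content-crawler").call(
    run_input={"startUrls": [{"url": "https://docs.apify.com/"}]},  # hypothetical URL
    timeout_secs=900,
)
```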
Actor Metrics
3.9k monthly users
714 stars
>99% runs succeeded
2.2 days response time
Created in Mar 2023
Modified 20 hours ago