Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

My Runs do not end

Open

matthias.amberg opened this issue
3 months ago

Hi, my runs, which usually take 10 minutes, now time out after two hours. From the logs, it looks like the runs proceed as usual and end properly after 10 minutes, but then for the next two hours the following messages appear once a minute.

2024-08-01T00:11:29.167Z INFO HttpCrawler:Statistics: HttpCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":5904,"requestsFinishedPerMinute":11,"requestsFailedPerMinute":0,"requestTotalDurationMillis":631724,"requestsTotal":107,"crawlerRuntimeMillis":607551,"retryHistogram":[107]}
2024-08-01T00:11:45.093Z INFO HttpCrawler:AutoscaledPool: state {"currentConcurrency":0,"desiredConcurrency":2,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.7,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
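
For anyone who wants to check their own logs for the same pattern, here is a minimal sketch that pulls the JSON payload out of an AutoscaledPool state line and flags the idle condition shown above (zero running tasks while the system reports idle). The helper names `parse_pool_state` and `is_idle` are hypothetical and not part of Apify or Crawlee; the sketch only assumes the log line format quoted in this post.

```python
import json

# Sample AutoscaledPool line copied from the log excerpt in this post.
SAMPLE = (
    '2024-08-01T00:11:45.093Z INFO HttpCrawler:AutoscaledPool: state '
    '{"currentConcurrency":0,"desiredConcurrency":2,"systemStatus":'
    '{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},'
    '"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.7,"actualRatio":0},'
    '"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},'
    '"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}'
)

MARKER = "AutoscaledPool: state "


def parse_pool_state(line):
    """Return the JSON payload of an AutoscaledPool state line, or None.

    Hypothetical helper: assumes the payload follows the marker text
    and runs to the end of the line, as in the excerpt above.
    """
    idx = line.find(MARKER)
    if idx == -1:
        return None
    return json.loads(line[idx + len(MARKER):])


def is_idle(state):
    """True when no task is running and the system reports itself idle."""
    return state["currentConcurrency"] == 0 and state["systemStatus"]["isSystemIdle"]


if __name__ == "__main__":
    state = parse_pool_state(SAMPLE)
    print("idle:", is_idle(state))
```

A long run of consecutive idle lines after the crawl has finished would match the behavior described here: the pool has nothing to do, yet the run keeps going.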

janbuchar

Hello, and thank you for your interest in Website Content Crawler. There are multiple things to consider. The messages that you're seeing indicate that there are ongoing file downloads. Is that desirable? Also, there are some links that take a long time and time out before we manage to process them. These look like file downloads as well.

If the runs took less time before, it may be that you weren't saving downloads then, that the site has become slower, or that many new downloadable files have appeared.

matthias.amberg

3 months ago

Hi

Unfortunately, I didn't change anything in the crawler settings (though I have now lowered the overall timeout). There might have been changes to the website, but it seems to work absolutely fine for me: there are no slow downloads or other timeouts. There seems to be an issue with the crawler. Also, please let us configure per-link timeouts and the number of retries. The defaults are ridiculously high.

matthias.amberg

3 months ago

Also note: the crawler reports that it has finished and will shut down minutes, and sometimes hours, before the timeout. It seems to fail to stop at the end of the crawl.
