
Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
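For context, runs like the one discussed in this thread are typically started through the Apify API. A minimal sketch using the `apify-client` Python package follows; the Actor ID is real, but input fields beyond `startUrls` (such as `maxCrawlPages`) are illustrative and should be checked against the Actor's input schema:

```python
# Minimal sketch: start a Website Content Crawler run and read its results.
# Requires `pip install apify-client` and an Apify API token in APIFY_TOKEN.
import os


def build_run_input(urls, max_pages=1000):
    """Build the Actor input. startUrls uses the documented request format;
    maxCrawlPages is an illustrative limit."""
    return {
        "startUrls": [{"url": u} for u in urls],
        "maxCrawlPages": max_pages,
    }


def crawl(urls):
    # Imported lazily; calling this function requires network access.
    from apify_client import ApifyClient

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor("apify/website-content-crawler").call(
        run_input=build_run_input(urls)
    )
    # Each dataset item holds the cleaned text/Markdown for one crawled page,
    # ready to feed into a vector database or RAG pipeline.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

The returned items can be passed directly to LangChain or LlamaIndex document loaders.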
4.6 (38)
1.1k Monthly users
6k
>99% Runs succeeded
2.3 days Response time
7 days ago Last modified
Scraping is extremely slow after a few hours
Scraping starts off fast for the first hour, then begins slowing down. After a few hours, it's too slow. The "desired concurrency" is stuck at 1, and RAM usage is maxed out.

Hi,
Please try splitting the scraping task into multiple runs and increasing the initialConcurrency
value in line with the memory you've configured in the run options.
Let me know if this resolves the issue, and feel free to reply if you need any further assistance.
Jakub
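The workaround suggested above can be sketched as splitting one long URL list into several smaller runs, each with its own memory and concurrency settings. A hedged Python sketch using `apify-client`; the batch size and memory figures are examples, not recommendations from the Actor's authors:

```python
# Sketch: split a long URL list into batches and start one run per batch,
# setting initialConcurrency in line with the per-run memory.
import os


def chunk(urls, size):
    """Split the URL list into batches of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def run_in_batches(urls, batch_size=500):
    # Imported lazily; calling this function requires an API token and network.
    from apify_client import ApifyClient

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    dataset_ids = []
    for batch in chunk(urls, batch_size):
        run = client.actor("apify/website-content-crawler").call(
            run_input={
                "startUrls": [{"url": u} for u in batch],
                "initialConcurrency": 10,  # raise together with run memory
            },
            memory_mbytes=8192,  # per-run memory, as set in the run options
        )
        dataset_ids.append(run["defaultDatasetId"])
    return dataset_ids
```

Smaller runs also make it easier to see whether the slowdown correlates with run duration rather than with the total queue size.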
vnandan
Hi,
- I did set initialConcurrency to a high value.
- This happens even when the run has only been going for 1-2 hours. The number of links shouldn't matter, since not all of them are downloaded simultaneously; I'd expect the speed to stay more or less constant throughout. Further, it seems there is a memory leak in the Actor, because no matter how much RAM is assigned, it eventually uses all of it. Note that at the start of the run the RAM usage is very low and builds up over time. Finally, if this were due to the queue being long, RAM usage should be highest at the start and drop toward the end, but the observed usage is the opposite: it starts low, climbs to the maximum, and stays there.
- Let me know what the right size for a single run should be, in terms of number of URLs, and the cause of the high RAM usage.

Hi,
Thank you for using Website Content Crawler.
Jakub, splitting won't help; that's not the issue.
I checked the run, and you’re right. The speed fluctuates, which can sometimes happen due to the size of the webpages. However, there’s a strange pattern that needs further investigation.
I need to run this internally for debugging.
Sorry for any inconvenience. I'll keep you updated.
vnandan
Thanks for acknowledging the issue.
I don't think page size is the problem, at least not in my case. You can check the URLs I'm downloading; they're pretty tame, with no JavaScript-based rendering, just vanilla HTML and JS.
I suspect garbage collection in the script is not happening properly. This is also evident from the RAM usage, which builds up over time (1-2 hours) and stays at the maximum limit for the rest of the run.
Looking forward to your investigation.

Hi, thank you again for using Website Content Crawler. We've reviewed your run as well as others, and we don’t believe there is a memory leak or an issue with garbage collection. We have runs that span several days without observing such problems.
There is always a tradeoff between consistent speed and maximum speed. We strive to optimize it, but crawling speed can sometimes peak and other times slow down due to various factors that are out of our control. We are continuously working on improving this performance.
I'm sorry I don't have a definitive answer at this time, but this is the best information I can provide right now. I'll go ahead and close this issue, but please feel free to ask questions or raise a new one. Jiri
Pricing
Pricing model: Pay per usage
This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage.