Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
I'm just wondering why this simple job is taking 8 hrs to scrape 8k URLs. Other web scrapes of the same size have been much faster. Looking at the documentation, all I see is that some websites are "more complex", resulting in longer duration. It would be great if you could tell me how to speed this up. Maybe some of my input parameters could be changed?
Hi, thank you for using Website Content Crawler.
The issue seems to be caused by a memory limitation. Your Actor run is currently allocated only 1 GB of memory, which could be insufficient for this task. The logs show critical memory overload:
2024-12-01T14:18:03.984Z WARN CheerioCrawler:AutoscaledPool:Snapshotter: Memory is critically overloaded. Using 976 MB of 1024 MB (95%). Consider increasing available memory.
Lower memory also limits the concurrency to just 1 crawler instance, which significantly slows down the process.
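To spot this condition in your own run logs, here is a minimal sketch that pulls the used and total memory out of a Snapshotter warning like the one above. The regular expression and the `memory_usage` helper are illustrative assumptions, not part of the Actor or the Crawlee library; they only rely on the "Using X MB of Y MB" wording seen in the quoted log line.

```python
import re

# The warning quoted above, copied verbatim from the run log.
LOG_LINE = (
    "2024-12-01T14:18:03.984Z WARN CheerioCrawler:AutoscaledPool:Snapshotter: "
    "Memory is critically overloaded. Using 976 MB of 1024 MB (95%). "
    "Consider increasing available memory."
)

# Assumption: the message keeps the "Using <used> MB of <total> MB" wording.
USAGE_PATTERN = re.compile(r"Using (\d+) MB of (\d+) MB")


def memory_usage(line):
    """Return (used_mb, total_mb, ratio) for a memory warning, or None if absent."""
    match = USAGE_PATTERN.search(line)
    if match is None:
        return None
    used_mb, total_mb = int(match.group(1)), int(match.group(2))
    return used_mb, total_mb, used_mb / total_mb


if __name__ == "__main__":
    used, total, ratio = memory_usage(LOG_LINE)
    print(f"{used} MB of {total} MB in use ({ratio:.0%})")
```

A ratio this close to 1.0 leaves the crawler almost no headroom to run more requests in parallel.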
To resolve this, please increase the memory allocation to at least 4 GB. This should allow the Actor to run efficiently and complete the task much faster.
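If you start runs through the Apify API rather than the Console, the memory can be set per run. The sketch below only builds the HTTP request and does not send it. It assumes the run endpoint's `memory` query parameter (in megabytes), the `startUrls` input field, and Bearer-token authentication; the `build_run_request` helper and the power-of-two check are illustrative, so please verify them against the current API documentation.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
# In API URLs the "/" in "apify/website-content-crawler" is written as "~".
ACTOR_ID = "apify~website-content-crawler"


def build_run_request(token, start_urls, memory_mbytes=4096):
    """Build (but do not send) a request that starts a run with the given memory."""
    # Assumption: memory must be a power of two, at least 128 MB.
    if memory_mbytes < 128 or memory_mbytes & (memory_mbytes - 1):
        raise ValueError("memory_mbytes must be a power of two, at least 128")
    query = urllib.parse.urlencode({"memory": memory_mbytes})
    body = json.dumps({"startUrls": [{"url": url} for url in start_urls]})
    return urllib.request.Request(
        f"{API_BASE}/acts/{ACTOR_ID}/runs?{query}",
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    # "MY_TOKEN" and the start URL are placeholders.
    request = build_run_request("MY_TOKEN", ["https://example.com"])
    print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would actually start the run. In the Console, the same setting is available in the run options next to the Actor input.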
I’ll close this issue for now, but feel free to reply here or open a new issue if you have further questions.
Actor Metrics
3.9k monthly users
718 stars
>99% runs succeeded
2.2 days response time
Created in Mar 2023
Modified 15 hours ago