Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Why does this job take so long?

Closed

grossjo opened this issue
3 days ago

I'm just wondering why this simple job is taking 8 hrs to scrape 8k URLs. Other web scrapes of the same size have been much faster. Looking at the documentation, all I see is that some websites are "more complex", resulting in longer duration. It would be great if you could tell me how to speed this up. Maybe some of my input parameters could be changed?

dusan.vystrcil

Hi, thank you for using Website Content Crawler.

The issue seems to be caused by the memory limitation. Your actor is currently allocated only 1 GB of memory, which could be insufficient for this task. The logs show critical memory overload:

2024-12-01T14:18:03.984Z WARN CheerioCrawler:AutoscaledPool:Snapshotter: Memory is critically overloaded. Using 976 MB of 1024 MB (95%). Consider increasing available memory.
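To check how close a run is to its memory limit, the figures in the warning quoted above can be pulled out with a regular expression. This is only a sketch: the pattern is written against the single log line shown in this thread, and the helper name `memory_usage` is made up for illustration, so other log formats may need a different pattern.

```python
import re

# The warning line quoted in the reply above.
LOG_LINE = (
    "2024-12-01T14:18:03.984Z WARN CheerioCrawler:AutoscaledPool:Snapshotter: "
    "Memory is critically overloaded. Using 976 MB of 1024 MB (95%). "
    "Consider increasing available memory."
)

# Assumption: the used/total figures always appear as "Using <n> MB of <m> MB".
MEMORY_PATTERN = re.compile(r"Using (\d+) MB of (\d+) MB")


def memory_usage(line):
    """Return (used_mb, total_mb, ratio), or None if the line has no memory figures."""
    match = MEMORY_PATTERN.search(line)
    if match is None:
        return None
    used, total = int(match.group(1)), int(match.group(2))
    return used, total, used / total


used, total, ratio = memory_usage(LOG_LINE)
print(f"{used} MB of {total} MB in use ({ratio:.0%})")
```

For the line above this reports 976 MB of 1024 MB, roughly 95% of the allocation, which matches the percentage in the log.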

Lower memory also limits the concurrency to just 1 crawler instance, which significantly slows down the process.

To resolve this, please increase the memory allocation to at least 4 GB. This should allow the actor to run efficiently and complete the task much faster.
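If the run is started from code rather than from the Console, the memory can be set per run. The sketch below assumes the `apify-client` Python package and its `memory_mbytes` argument to `call()`, an API token in an `APIFY_TOKEN` environment variable, and a placeholder start URL. The helper names `build_call_kwargs` and `run_crawler` are hypothetical, and the rest of the input is not shown in this thread, so it is left out.

```python
import os

ACTOR_ID = "apify/website-content-crawler"
MEMORY_MBYTES = 4096  # 4 GB, the minimum suggested in the reply above


def build_call_kwargs(start_url, memory_mbytes=MEMORY_MBYTES):
    """Build keyword arguments for the Actor call (hypothetical helper)."""
    return {
        # Assumption: the Actor takes its start pages as a list of {"url": ...} objects.
        "run_input": {"startUrls": [{"url": start_url}]},
        "memory_mbytes": memory_mbytes,
    }


def run_crawler(start_url):
    """Start the Actor with the larger allocation and wait for the run to finish."""
    # Third-party dependency, imported lazily: pip install apify-client
    from apify_client import ApifyClient

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    return client.actor(ACTOR_ID).call(**build_call_kwargs(start_url))


# Placeholder URL; printing the arguments does not start a run.
print(build_call_kwargs("https://example.com"))
```

Calling `run_crawler("https://example.com")` would start a real run and consume platform credits, so the example only prints the arguments it would send.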

I’ll close this issue for now, but feel free to reply here or open a new issue if you have further questions.

Developer
Maintained by Apify

Actor Metrics

  • 3.9k monthly users

  • 718 stars

  • >99% runs succeeded

  • 2.2 days response time

  • Created in Mar 2023

  • Modified 15 hours ago