Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Hello, we're encountering an issue with duplicate URLs in our web crawling process. Our current setup, which uses LlamaIndex for web crawling, produces duplicate URLs, which wastes system resources and hurts performance. We need to implement a URL deduplication strategy to filter out duplicates and optimize our resource usage.
Hello, and thank you for your interest in Website Content Crawler! I looked into your last runs and it does seem that the usual deduplication is malfunctioning there. We will look into this and let you know.
Hello Jan Buchar, Is there any update on the above issue?
Hello, unfortunately we haven't yet been able to look into this.
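While the issue is open, one possible client-side workaround is to deduplicate URLs after the crawl. The sketch below is not the Actor's built-in deduplication and does not use any Apify or LlamaIndex API; `normalize_url` and `dedupe_urls` are hypothetical helper names, and the normalization rules (lowercase scheme and host, drop default ports and fragments, ignore a trailing slash, sort query parameters) are assumptions that may be too aggressive or too lenient for a given site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Default ports that can be dropped without changing the target resource.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Return a canonical form of `url` used only as a comparison key.

    Assumed rules (adjust per site): lowercase scheme and host, drop
    default ports, userinfo and fragments, ignore a trailing slash,
    and sort query parameters.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path.rstrip("/") or "/"
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The empty last element drops the fragment (#...).
    return urlunsplit((scheme, host, path, query, ""))


def dedupe_urls(urls):
    """Keep the first occurrence of each URL, comparing normalized forms."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


if __name__ == "__main__":
    crawled = [
        "https://Example.com/docs/",
        "https://example.com/docs#intro",
        "https://example.com/docs?b=2&a=1",
        "https://example.com/docs?a=1&b=2",
    ]
    for url in dedupe_urls(crawled):
        print(url)
```

The same key function could also be applied to the `url` field of crawled items before they are loaded into an index, so that only one document per normalized URL is kept.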
Actor Metrics
3.9k monthly users
711 stars
>99% runs succeeded
2.2 days response time
Created in Mar 2023
Modified 4 hours ago