Website Content Crawler

apify/website-content-crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Getting duplicate URLs in web crawling

Open

simpleworks opened this issue
4 months ago

Hello, we're encountering an issue with duplicate URLs in our web crawling process. The current setup, which uses LlamaIndex for web crawling, is producing duplicate URLs, which wastes system resources and hurts performance. We need to implement a URL deduplication strategy to filter out duplicates and optimize our resource usage.
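As a stopgap on the consumer side, duplicates can be filtered after the crawl by normalizing each URL and keeping only the first item per normalized form. The following is a minimal sketch, not the Actor's own deduplication logic. The function names `normalize_url` and `dedupe_by_url` are hypothetical, and it assumes each crawled item is a dict with a `url` key; the normalization rules (lowercasing the host, dropping default ports, fragments, and trailing slashes, sorting query parameters) are assumptions that may be too aggressive for some sites.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Return a canonical form of `url` for duplicate detection (hypothetical helper)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    is_default = (scheme == "http" and port == 80) or (scheme == "https" and port == 443)
    if port and not is_default:
        host = f"{host}:{port}"
    # Treat "/docs" and "/docs/" as the same page (an assumption).
    path = parts.path.rstrip("/") or "/"
    # Sort query parameters so their order does not matter.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Drop the fragment: it does not change the fetched document.
    return urlunsplit((scheme, host, path, query, ""))


def dedupe_by_url(items, url_key="url"):
    """Keep the first item for each normalized URL, preserving input order."""
    seen = set()
    unique = []
    for item in items:
        key = normalize_url(item[url_key])
        if key in seen:
            continue
        seen.add(key)
        unique.append(item)
    return unique
```

Calling `dedupe_by_url(items)` on the crawl results before they are indexed would drop repeated pages. This only hides duplicates downstream; it does not stop the crawler from fetching the same page twice.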

janbuchar

Hello, and thank you for your interest in Website Content Crawler! I looked into your last runs and it does seem that the usual deduplication is malfunctioning there. We will look into this and let you know.

simpleworks

4 months ago

Hello Jan Buchar, is there any update on the above issue?

janbuchar

Hello, unfortunately we haven't yet been able to look into this.

Developer
Maintained by Apify

Actor Metrics

  • 3.9k monthly users

  • 711 stars

  • >99% runs succeeded

  • 2.2 days response time

  • Created in Mar 2023

  • Modified 4 hours ago