
Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Avoid query parameters when crawling websites
Open
We have a complex website we would like to crawl, with URLs of the form <core_url>/welcome.php?action=none&show=belegung&view=week&raum_id[0]=5253&start_date=2025-11-05&dz=1746583360
This causes a very large number of pages to be crawled, and the crawl takes a significant amount of time. Since we only care about the root pages, not their parameter variations, we are wondering whether there is a way to crawl with query parameters excluded, or to mark specific parameters as "not to be used".
We are looking for guidance on how best to achieve this while still getting the important parts of the pages crawled.
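
To make the two options above concrete, here is a minimal Python sketch of what "excluding parameters" and "marking specific parameters as not to be used" could mean at the URL level. It is not the Actor's API: the function names, the `IGNORED_PARAMS` set, and the `https://example.com` host (standing in for <core_url>) are all assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical choice of parameters to ignore, picked from the example URL.
IGNORED_PARAMS = {"view", "start_date", "dz"}


def strip_query(url: str) -> str:
    """Option 1: drop the whole query string (and fragment), keeping the root page."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def drop_params(url: str, ignored=IGNORED_PARAMS) -> str:
    """Option 2: drop only the parameters listed in `ignored`, keep the rest."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    # urlencode percent-encodes reserved characters such as "[" and "]".
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def dedupe(urls, normalize=strip_query):
    """Return each normalized URL once, in first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


if __name__ == "__main__":
    base = "https://example.com/welcome.php?action=none&show=belegung&view=week"
    variants = [
        base + "&raum_id[0]=5253&start_date=2025-11-05&dz=1746583360",
        base + "&raum_id[0]=5253&start_date=2025-11-12&dz=1746583999",
    ]
    print(dedupe(variants))
    print(dedupe(variants, normalize=drop_params))
```

With either normalizer, the two date variants collapse into a single entry, which is the effect we are after. The open question is where such a rule can be applied inside the crawler itself.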