
Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Rating: 3.7 (41)
Pricing: Pay per usage
Total users: 58K
Monthly users: 8.1K
Runs succeeded: >99%
Issues response: 7.6 days
Last modified: 36 minutes ago
Avoid query parameters when crawling websites
Status: Open
We have a complex website we would like to crawl, with URLs of the form <core_url>/welcome.php?action=none&show=belegung&view=week&raum_id[0]=5253&start_date=2025-11-05&dz=1746583360
Because every combination of query parameters counts as a separate page, the crawler visits a very large number of URLs and takes a significant amount of time. Since we only care about the root pages, without the different parameter combinations, is there a way to crawl while excluding query parameters, or to mark specific parameters as "not to be used"?
We are looking for guidance on how best to achieve this while still crawling the important parts of the pages.
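Until the Actor supports this natively, one possible workaround is to normalize the crawled URLs yourself, stripping the query string so that pages differing only in parameters collapse into one. The sketch below is a general-purpose Python illustration, not part of the Website Content Crawler; the names `strip_query` and `dedupe_without_params` are hypothetical helpers.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url: str) -> str:
    """Return the URL with its query string and fragment removed."""
    scheme, netloc, path, _query, _fragment = urlsplit(url)
    return urlunsplit((scheme, netloc, path, "", ""))

def dedupe_without_params(urls):
    """Collapse URLs that differ only in their query parameters,
    preserving the order of first appearance."""
    seen = set()
    result = []
    for url in urls:
        root = strip_query(url)
        if root not in seen:
            seen.add(root)
            result.append(root)
    return result
```

Applied to the example above, `strip_query` reduces every `welcome.php?action=...&start_date=...` variant to the single root `<core_url>/welcome.php`, which can then be used to deduplicate results or to build a filtered list of start URLs.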

Hi, and thank you for using the Website Content Crawler. I apologize for the delayed response. I don’t have a solution at the moment, but I’ve reached out internally for support. Jiri
Hello, and thank you for your interest in this Actor!
This has become a frequently requested feature, so we decided to bump its priority. We're currently discussing the best approach to the implementation. Would you mind sharing the website you're scraping and which pages you want to access? It would help us tremendously in designing the feature so that it serves as many customers as possible.
Cheers!