Website Content Crawler

Pricing

Pay per usage

Developed by

Apify

Maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

4.0 (40)

1392

Total users

53K

Monthly users

7.9K

Runs succeeded

>99%

Issues response

6.8 days

Last modified

4 days ago

Avoid query parameters when crawling websites

Open

innovum_admin opened this issue
6 days ago

We have a complex website we would like to crawl, with URLs of the form <core_url>/welcome.php?action=none&show=belegung&view=week&raum_id[0]=5253&start_date=2025-11-05&dz=1746583360

This causes a very large number of pages to be crawled, and the crawl takes a significant amount of time. We only care about the root pages, without the parameter variations, so we were wondering whether there is a way to crawl while excluding query parameters, or to mark specific parameters as "not to be used".

We are looking for guidance on how best to achieve this while still getting the important parts of the pages crawled.
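
To make the request concrete, here is a minimal Python sketch of the URL normalization described above: either drop the query string entirely, or drop only selected parameters, so that many URL variants collapse to one page. It uses only the standard library and is not the crawler's own mechanism or configuration. The function names `strip_query` and `drop_params`, the host `https://example.com` (standing in for <core_url>), and the choice of which parameters to ignore are all assumptions made for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_query(url):
    """Return the URL without its query string or fragment (hypothetical helper)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def drop_params(url, ignored):
    """Return the URL with the named query parameters removed (hypothetical helper)."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


# Assumed example URLs; example.com stands in for <core_url>.
urls = [
    "https://example.com/welcome.php?show=belegung&view=week&dz=1746583360",
    "https://example.com/welcome.php?show=belegung&view=day&dz=1746583999",
    "https://example.com/welcome.php",
]

# Option 1: treat every variant as the same root page.
root_pages = {strip_query(u) for u in urls}

# Option 2: ignore only the parameters assumed not to change the content.
deduplicated = {drop_params(u, {"view", "dz"}) for u in urls}
```

With the first option all three URLs collapse to a single root page. With the second, the two parameterized URLs collapse to one entry that keeps `show=belegung`, and the bare URL stays separate.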