Website Content Crawler

Pricing

Pay per usage

Developed by

Apify

Maintained by Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

3.7 (41)

1499

Total users: 58K

Monthly users: 8.1K

Runs succeeded: >99%

Issues response: 7.6 days

Last modified: 36 minutes ago

Avoid query parameters when crawling websites

Open

innovum_admin opened this issue a month ago

We have a complex website that we would like to crawl, with URLs of the form <core_url>/welcome.php?action=none&show=belegung&view=week&raum_id[0]=5253&start_date=2025-11-05&dz=1746583360

This causes a very large number of pages to be crawled, and the crawl takes a significant amount of time. Since we only care about the root pages, without the varying parameters, we were wondering whether there is a way to crawl while excluding query parameters, or to mark specific parameters as "not to be used".

We are looking for guidance on how best to achieve this while still getting the important parts of the pages crawled.
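To make the request concrete, here is a minimal Python sketch of the two behaviours described above: dropping the query string entirely, and ignoring only selected parameters. It is an illustration only and not part of the Actor. The helper names (`strip_query`, `drop_params`, `unique_pages`) and the `example.com` host standing in for `<core_url>` are assumptions made for this example.

```python
from typing import Iterable, List
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return the URL with its query string and fragment removed."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def drop_params(url: str, ignored: Iterable[str]) -> str:
    """Return the URL without the query parameters named in `ignored`."""
    ignored_set = set(ignored)
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored_set
    ]
    # urlencode percent-encodes reserved characters such as "[" and "]".
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


def unique_pages(urls: Iterable[str]) -> List[str]:
    """Keep the first URL seen for each page, comparing pages without queries."""
    seen = set()
    result = []
    for url in urls:
        key = strip_query(url)
        if key not in seen:
            seen.add(key)
            result.append(url)
    return result


if __name__ == "__main__":
    # Hypothetical host; the path and parameters follow the form quoted above.
    base = "https://example.com/welcome.php"
    week = base + "?action=none&show=belegung&view=week&dz=1746583360"
    day = base + "?action=none&show=belegung&view=day&dz=1746583361"

    print(strip_query(week))
    print(drop_params(week, ["dz"]))
    print(unique_pages([week, day]))
```

With this kind of normalisation, the two example URLs count as the same page, so only one of them would be queued. Whether the whole query string or only selected parameters should be ignored depends on which parameters actually change the page content.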

jiri.spilka

Hi, and thank you for using the Website Content Crawler. I apologize for the delayed response. I don’t have a solution at the moment, but I’ve reached out internally for support. Jiri

jindrich.bar

Hello, and thank you for your interest in this Actor!

This has become a frequently requested feature, so we have decided to raise its priority. We're currently discussing the best approach for the implementation. Would you mind sharing the website you're scraping and which pages you want to access? It would help us tremendously in designing the feature so that it serves as many customers as possible.

Cheers!