Automatically crawl and extract text content from websites with documentation, knowledge bases, help centers, or blogs. This Actor is designed to provide data to feed, fine-tune, or train large language models such as ChatGPT or LLaMA.

Pages can get scraped multiple times if the page has query parameters


harlequin_broom opened this issue
a year ago

If a page contains a link that appends query parameters to its URL, the URL plus parameters is treated as a child page and scraped. This can produce duplicate results, because the page is scraped once for each query parameter variant.

Hello Marcus,

Thanks for the suggestion.

This is expected behavior: query parameters might change the content of the page, so we want to scrape those URLs as different pages. But we should be able to add some logic to find which query params lead to duplicate pages and should be stripped away. I'm adding this to our feature map.
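To make the idea of stripping selected query parameters concrete, here is a minimal Python sketch. It is not the Actor's implementation; the parameter names in IGNORED_PARAMS and the example.com URLs are hypothetical and only illustrate the approach.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of parameters assumed not to change the page content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "ref"}


def strip_params(url: str, ignored=IGNORED_PARAMS) -> str:
    """Return the URL with every ignored query parameter removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical URLs: three of them differ only in ignored parameters.
urls = [
    "https://example.com/docs/page",
    "https://example.com/docs/page?utm_source=newsletter",
    "https://example.com/docs/page?ref=footer",
    "https://example.com/docs/page?version=2",
]

# Normalizing before deduplication collapses the three variants into one.
unique = sorted({strip_params(u) for u in urls})
```

Parameters outside the ignored set (here, version) are kept, so pages whose content really depends on a parameter are still treated as distinct.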

We implemented canonical link deduplication and are now looking into the most common parameters we should strip away. This will come in a near-future update.
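For readers unfamiliar with canonical link deduplication, the following Python sketch shows the general technique: read the canonical link element from each page and keep only the first page seen per canonical URL. This is an illustration under assumptions, not the Actor's actual code; the helper names and example.com pages are hypothetical.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr.get("href"):
            self.canonical = attr["href"]


def canonical_url(html: str, fallback: str) -> str:
    """Return the page's canonical URL, or the fallback if none is declared."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical or fallback


def dedupe(pages):
    """Keep the first URL seen for each canonical URL.

    pages is an iterable of (url, html) pairs.
    """
    seen = set()
    kept = []
    for url, html in pages:
        key = canonical_url(html, url)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept


# Hypothetical pages: the second one declares the same canonical as the first.
CANONICAL_A = '<html><head><link rel="canonical" href="https://example.com/a"></head></html>'
pages = [
    ("https://example.com/a", CANONICAL_A),
    ("https://example.com/a?hl=de", CANONICAL_A),
    ("https://example.com/b", "<html><head></head></html>"),
]
kept = dedupe(pages)
```

This approach only works when the site declares canonical links that actually reflect which pages are duplicates.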

Hi @Lukáš Křivka

Any update on this? An example is: ... It seems the canonical link deduplication isn't working?

The hl parameter just maxed out my monthly credits crawling the same content in 30 different languages. Even with aggressive cleansing here:

Any way around this?

Hi Dave,

We have some ideas and have analyzed a few sites where this happens. There isn't really a one-size-fits-all solution. Funnily enough, on other websites the language variants have genuinely different content but share the same canonical link, so they get skipped even though we don't want to skip them.

In your case, the canonical link (as in the HTML element) on the page is really

The team will look into it further.

Thanks for the suggestion.

Hi Dave,

Hope you're doing well! We have a solution for the issue you're facing with the localized variants consuming your credits. You can resolve this by using the **/*\\?*hl=* exclude glob pattern. It prevents the localized versions from being included in the crawl, keeping only the base page (e.g. .../shopping-ads/reporting).
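As a rough illustration of what this exclude pattern is meant to catch, the Python sketch below uses a hand-written regular expression that approximates the glob. The crawler's real glob matching is done by its own library and may differ in details; the regex, the function name, and the example.com host are assumptions made for this example.

```python
import re

# Approximate reading of the exclude glob, segment by segment:
#   **/   any leading part of the URL up to a slash
#   *     any run of characters without a slash
#   \?    a literal question mark
#   *hl=* the text "hl=" somewhere in the query string
HL_VARIANT = re.compile(r"^.*/[^/]*\?[^/]*hl=[^/]*$")


def is_excluded(url: str) -> bool:
    """Return True if the URL looks like a localized (hl=...) variant."""
    return HL_VARIANT.match(url) is not None


# Hypothetical URLs modeled on the .../shopping-ads/reporting example.
base = "https://example.com/shopping-ads/reporting"
german = "https://example.com/shopping-ads/reporting?hl=de"
french = "https://example.com/shopping-ads/reporting?foo=1&hl=fr"

results = {url: is_excluded(url) for url in (base, german, french)}
```

Under this reading, the base page stays in the crawl and both hl variants are excluded, regardless of where hl appears in the query string.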

Also, by default, the crawler crawls only subpages of the initial URLs (in your case, starting from ... would crawl ..., but not ...). To tell the crawler which pages you want to crawl, you can set multiple URLs as start URLs. They don't even have to target an existing page. In your case, you can crawl all the shopping ads pages by using two start URLs:

  • ..., setting the initial page to crawl from.
  • ..., allowing the crawler to process all the pages in the shopping-ads directory (even though this URL itself returns 404).
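The scoping rule described above can be sketched in Python as a simple prefix check: a link is followed only if it sits under one of the start URLs. This is a simplified model of the described behavior, not the crawler's actual logic, and the example.com URLs are hypothetical stand-ins for the real ones.

```python
def in_scope(url: str, start_urls) -> bool:
    """Return True if the URL is one of the start URLs or sits under one."""
    return any(url.startswith(start) for start in start_urls)


# Hypothetical URLs standing in for the shopping ads pages.
reporting = "https://example.com/shopping-ads/reporting"
directory = "https://example.com/shopping-ads/"
sibling = "https://example.com/shopping-ads/setup"

# With only the reporting page as a start URL, sibling pages are out of scope.
single_start = in_scope(sibling, [reporting])

# Adding the directory URL widens the scope to the whole directory. The check
# compares strings only, so it does not matter that the directory URL itself
# may return 404 when requested.
two_starts = in_scope(sibling, [reporting, directory])
```

The second start URL acts purely as a scope marker here, which is why it does not need to point at an existing page.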

In addition to those, you can add as many start URLs as you wish.

We apologize for any inconvenience caused by the canonical link deduplication. Our team is here to assist you and ensure you have a smooth experience with our tool.

If you have any further questions or need additional assistance, please don't hesitate to let us know.

Unfortunately, this solution solves the original issue only partially, as it relies on a case-by-case treatment. I'll keep the issue open until we have a generic enough solution.

