Website Content Crawler
apify/website-content-crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗LangChain, LlamaIndex, and the wider LLM ecosystem.

Pages can get scraped multiple times if the page has query parameters

Open

harlequin_broom opened this issue
a year ago

If a page has a link that appends query parameters to its own URL, the URL plus parameters is treated as a child page and scraped. This can produce duplicate results, because the same page is scraped once for each query-parameter variant.

lukaskrivka

Hello Marcus,

Thanks for the suggestion.

This is expected behavior as query parameters might change the content of the page so we want to scrape those as different pages. But we should be able to add some logic to find which query params lead to duplicate pages and should be stripped away. I'm adding this to our feature map.
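To make the idea of stripping selected query parameters concrete, here is a minimal sketch. It is not the crawler's implementation: the names `strip_params`, `dedupe`, and the `IGNORED_PARAMS` set are made up for illustration, and which parameters are actually safe to drop depends on the site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of parameters assumed NOT to change page content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref"}

def strip_params(url: str, ignored=IGNORED_PARAMS) -> str:
    """Return the URL with the ignored query parameters removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in ignored]
    return urlunsplit(parts._replace(query=urlencode(kept)))

def dedupe(urls):
    """Keep one normalized URL per group of equivalent URLs, in first-seen order."""
    seen, unique = set(), []
    for url in urls:
        key = strip_params(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

Parameters that do change the content (for example a `page` parameter) stay in the URL, so those variants are still treated as separate pages.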

lukaskrivka

We implemented canonical link deduplication and are now looking into the most common parameters we should strip away. That will come in a near-future update.
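For readers unfamiliar with the technique, canonical link deduplication can be pictured roughly as below. This is only a sketch under assumptions, not the Actor's code: `CanonicalFinder`, `canonical_url`, and `dedupe_by_canonical` are hypothetical names, and pages are assumed to arrive as `(url, html)` pairs.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rels and self.canonical is None:
            self.canonical = attr.get("href")

def canonical_url(html: str, page_url: str) -> str:
    """Return the declared canonical URL, falling back to the page URL."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical or page_url

def dedupe_by_canonical(pages):
    """pages: iterable of (url, html). Keeps the first page per canonical URL."""
    seen, kept = set(), []
    for url, html in pages:
        key = canonical_url(html, url)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept
```

Note that this only merges pages that declare the same canonical URL. If each variant names itself, query string included, as canonical, nothing gets merged.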

davedavis

Hi @Lukáš Křivka

Any update on this? An example is https://developers.google.com/google-ads/api/docs/shopping-ads/reporting?hl=fa. It seems the canonical link deduplication isn't working?

The hl parameter just maxed out my monthly credits crawling the same content in 30 different languages. Even with aggressive cleansing here: https://developers.google.com/google-ads/api/docs/shopping-ads/reporting

Any way around this?

lukaskrivka

Hi Dave,

We have some ideas and have analyzed a few sites where this happens. There isn't really a one-size-fits-all solution. Funnily enough, on other websites the language variants have genuinely different content but share the same canonical link, so they get skipped even though we don't want to skip them.

In your case, the canonical link (as in the HTML element) on the page is really https://developers.google.com/google-ads/api/docs/shopping-ads/reporting?hl=fa.

The team will look into it further.

Thanks for the suggestion.

jindrich.bar

Hi Dave,

Hope you're doing well! We have a solution for the issue you're facing with the localized variants consuming your credits. You can resolve this by using the https://developers.google.com/**/*\\?*hl=* exclude glob pattern. It prevents the localized versions from being included in the crawl and keeps only the base URL (e.g. .../shopping-ads/reporting).
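As a rough illustration of how that exclude pattern might be wired up, here is a sketch of an Actor input written as a Python dict. The field names `startUrls` and `excludeUrlGlobs` are assumptions to verify against the Actor's input schema, and the regex is a hand-written approximation of the glob's intent, not how the crawler matches globs internally.

```python
import re

# Exclude glob copied verbatim from the comment above
# (the raw string keeps the backslashes exactly as written there).
EXCLUDE_GLOB = r"https://developers.google.com/**/*\\?*hl=*"

# Sketch of an Actor input; field names are assumptions, check the input schema.
run_input = {
    "startUrls": [
        {"url": "https://developers.google.com/google-ads/api/docs/shopping-ads/reporting"},
    ],
    "excludeUrlGlobs": [{"glob": EXCLUDE_GLOB}],
}

# Approximation of the glob's intent: any URL on this host whose
# query string contains "hl=".
HL_VARIANT = re.compile(r"^https://developers\.google\.com/.*\?.*hl=")

def is_excluded(url: str) -> bool:
    """True if the URL looks like a localized (hl=...) variant."""
    return HL_VARIANT.match(url) is not None
```

With this check, the `?hl=fa` variant from earlier in the thread is excluded while the parameter-free URL is kept.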

Also, by default, the crawler crawls only subpages of the initial URLs (in your case, starting from ...shopping-ads/reporting would crawl ...shopping-ads/reporting/xyz, but not ...shopping-ads/other-path). To tell the crawler which pages you want to crawl, you can set multiple URLs as start URLs - they don't even have to target an existing page. In your case, you can crawl all the shopping ads pages by using two start URLs:

  • https://developers.google.com/google-ads/api/docs/shopping-ads/reporting setting the initial page to crawl from.
  • https://developers.google.com/google-ads/api/docs/shopping-ads/, allowing the crawler to process all the pages in the shopping-ads directory (even though this URL itself returns 404).

In addition to those, you can add as many start URLs as you wish.
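The scoping rule described above can be sketched as a simple path-prefix check. This is an assumption about what "subpages" means here, not the crawler's actual logic, and `in_scope` is a hypothetical helper that ignores query strings and other edge cases.

```python
BASE = "https://developers.google.com/google-ads/api/docs/shopping-ads"

def in_scope(url: str, start_urls) -> bool:
    """True if url equals a start URL or lies under its path prefix."""
    for start in start_urls:
        prefix = start.rstrip("/")
        if url == prefix or url.startswith(prefix + "/"):
            return True
    return False

# With only the reporting page as a start URL, sibling paths are out of scope.
only_reporting = [BASE + "/reporting"]

# Adding the directory URL widens the scope to everything under shopping-ads.
both = [BASE + "/reporting", BASE + "/"]
```

Under this sketch, `.../shopping-ads/other-path` is out of scope for `only_reporting` but in scope for `both`, which mirrors the two-start-URL setup suggested above.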

We apologize for any inconvenience caused by the canonical link deduplication. Our team is here to assist you and ensure you have a smooth experience with our tool.

If you have any further questions or need additional assistance, please don't hesitate to let us know.


Unfortunately, this solution solves the original issue only partially, as it relies on a case-by-case treatment. I'll keep the issue open until we have a generic enough solution.
