Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
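For instance, a typical RAG setup runs this Actor through the LangChain integration and maps each crawled page to a document. A minimal sketch, assuming the langchain-community Apify wrapper and the Actor's default text/url output fields (adjust to your own pipeline):

```python
import os

from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

# The wrapper reads the Apify API token from the environment.
os.environ.setdefault("APIFY_API_TOKEN", "<your-apify-api-token>")

apify = ApifyWrapper()

# Run Website Content Crawler and map every dataset item to a LangChain Document.
loader = apify.call_actor(
    actor_id="apify/website-content-crawler",
    run_input={"startUrls": [{"url": "https://docs.apify.com/"}]},
    dataset_mapping_function=lambda item: Document(
        page_content=item.get("text") or "",
        metadata={"source": item.get("url")},
    ),
)

docs = loader.load()  # documents ready for a vector store or RAG pipeline
```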
If a page has a link that appends query parameters to the URL, then the URL plus parameters is considered a child page and scraped. This can cause duplicate results, as the page is scraped once for each query-parameter variant.
Hello Marcus,
Thanks for the suggestion.
This is expected behavior: query parameters might change the content of the page, so we scrape those as different pages. But we should be able to add some logic to detect which query parameters lead to duplicate pages and strip them away. I'm adding this to our roadmap.
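As an illustration of what that kind of logic could look like (not the Actor's actual implementation), one option is to normalize URLs by dropping parameters that only switch presentation and deduplicate on the normalized form; the parameter list below is a hypothetical example:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of parameters that usually switch presentation, not content.
STRIP_PARAMS = {"hl", "utm_source", "utm_medium", "utm_campaign", "ref"}

def normalize_url(url: str) -> str:
    """Drop presentation-only query parameters so URL variants share one key."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in STRIP_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

seen: set[str] = set()

def is_duplicate(url: str) -> bool:
    """Return True if a URL normalizing to the same key was already crawled."""
    key = normalize_url(url)
    if key in seen:
        return True
    seen.add(key)
    return False
```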
We've implemented canonical-link deduplication and are now looking into the most common parameters we should strip away. This will come in an upcoming update.
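In rough outline, canonical-link deduplication keys crawled pages on the page's declared link rel="canonical" element instead of the requested URL. A sketch with BeautifulSoup, again as an illustration rather than the crawler's internal code:

```python
from bs4 import BeautifulSoup

seen_canonicals: set[str] = set()

def canonical_key(request_url: str, html: str) -> str:
    """Use the page's declared canonical URL as the deduplication key, if present."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find("link", rel="canonical")
    if link and link.get("href"):
        return link["href"]
    return request_url

def should_skip(request_url: str, html: str) -> bool:
    """Skip a page whose canonical URL has already been processed."""
    key = canonical_key(request_url, html)
    if key in seen_canonicals:
        return True
    seen_canonicals.add(key)
    return False
```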
Hi @Lukáš Křivka
Any update on this? An example: https://developers.google.com/google-ads/api/docs/shopping-ads/reporting?hl=fa. It seems the canonical-link deduplication isn't working?
The hl parameter just maxed out my monthly credits crawling the same content in 30 different languages. Even with aggressive cleansing here: https://developers.google.com/google-ads/api/docs/shopping-ads/reporting
Any way around this?
Hi Dave,
We have some ideas and have analyzed a few sites where this happens, but there isn't really a one-size-fits-all solution. Funnily enough, on other websites the language variants have genuinely different content yet share the same canonical link, so they get skipped even though we don't want to skip them.
In your case, the canonical link (as in the HTML element) on the page is really https://developers.google.com/google-ads/api/docs/shopping-ads/reporting?hl=fa.
The team will look into it further.
Thanks for the suggestion.
Hi Dave,
Hope you're doing well! We have a solution for the issue you're facing with the localized variants consuming your credits.
You can resolve this by adding the following exclude glob pattern: https://developers.google.com/**/*\\?*hl=*
It will prevent the localized versions from being included during the crawl, keeping only the base page (e.g. .../shopping-ads/reporting).
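In the Actor's input this goes into the URL exclusion field. A sketch of the input fragment (the excludeUrlGlobs field name and the item shape are assumptions about the Actor's input schema, so verify them against the Input tab):

```python
# Assumed input fragment for Website Content Crawler -- verify the field name
# ("excludeUrlGlobs") and the {"glob": ...} item shape in the Actor's input schema.
exclude_localized_variants = {
    "excludeUrlGlobs": [
        # "\\?" escapes the "?" so the glob matches a literal query string,
        # skipping every localized variant such as ...reporting?hl=fa, ?hl=de, etc.
        {"glob": "https://developers.google.com/**/*\\?*hl=*"},
    ],
}
```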
Also, by default, the crawler crawls only subpages of the initial URLs (in your case, starting from ...shopping-ads/reporting would crawl ...shopping-ads/reporting/xyz, but not ...shopping-ads/other-path). To tell the crawler which pages you want to crawl, you can set multiple URLs as start URLs; they don't even have to target an existing page. In your case, you can crawl all the shopping-ads pages by using two start URLs:
https://developers.google.com/google-ads/api/docs/shopping-ads/reporting, setting the initial page to crawl from.
https://developers.google.com/google-ads/api/docs/shopping-ads/, allowing the crawler to process all the pages in the shopping-ads directory (even though this URL itself returns 404).
In addition to those, you can add as many start URLs as you wish.
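Putting the two start URLs and the exclude glob together, a run through the Apify Python client could look roughly like this (the client calls are the standard apify-client API; the input field names are, again, assumptions to check against the Actor's input schema):

```python
from apify_client import ApifyClient

client = ApifyClient("<APIFY_API_TOKEN>")

# "startUrls" and "excludeUrlGlobs" are assumed field names -- check them
# against the Actor's input schema before running.
run_input = {
    "startUrls": [
        # A real page that seeds the crawl.
        {"url": "https://developers.google.com/google-ads/api/docs/shopping-ads/reporting"},
        # Scopes the crawl to the whole shopping-ads/ directory (may itself return 404).
        {"url": "https://developers.google.com/google-ads/api/docs/shopping-ads/"},
    ],
    "excludeUrlGlobs": [
        {"glob": "https://developers.google.com/**/*\\?*hl=*"},
    ],
}

run = client.actor("apify/website-content-crawler").call(run_input=run_input)

# Each dataset item holds the cleaned content of one crawled page.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"])
```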
We apologize for any inconvenience caused by the canonical link deduplication. Our team is here to assist you and ensure you have a smooth experience with our tool.
If you have any further questions or need additional assistance, please don't hesitate to let us know.
Unfortunately, this s... [trimmed]
Actor Metrics
3.9k monthly users
711 stars
>99% runs succeeded
2.2 days response time
Created in Mar 2023
Modified 17 days ago