Website Content Crawler
Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.
Hi there 👋
I'm encountering an issue when trying to retrieve the content of pages in certain circumstances.
Here is the current behavior:
I'm providing an input that contains (among others), the exact following url:
```json
{
  "url": "https://www.skipass-laplagne.com/en/nature-&-ski-area"
}
```
The Actor runs successfully.
When retrieving the output, I can see that the crawler encoded the path of the URL, so it becomes https://www.skipass-laplagne.com/en/nature-%26-ski-area (notice how the & has been transformed into %26).
This is not an equivalent URL, and it prevents the system from correctly retrieving the content of the provided URL, since the website returns an empty page for the transformed URL.
Expected behavior: the URL is not transformed, so we retrieve the content of the website.
Is there anything I can do to fix this?
Thanks for your help
Hi, thank you for using Website Content Crawler.
Thank you for your detailed explanation — it really helped me quickly understand your issue.
I’m sorry, but in this case, the website is using the reserved "&" character incorrectly.
RFC 3986:
"Characters allowed in a URI are either reserved, unreserved, or part of a percent-encoding. Reserved characters sometimes have special meanings."
This means that when reserved characters like "&" are used in URI paths or query strings, they should be percent-encoded if their usage conflicts with their reserved purpose.
While web browsers typically handle such cases, the Website Content Crawler encodes URLs by default, and currently, there’s no way to disable this behavior.
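The encoding behavior described above can be sketched with Python's standard `urllib.parse` module. This is only an illustration of default percent-encoding of reserved characters, not the crawler's actual implementation:

```python
from urllib.parse import quote, unquote

path = "/en/nature-&-ski-area"

# quote() treats "&" as a reserved character and percent-encodes it
# by default (only "/" is kept safe in a path component).
encoded = quote(path)
print(encoded)  # /en/nature-%26-ski-area

# Keeping "&" literal requires explicitly marking it as safe --
# a toggle the crawler does not currently expose.
literal = quote(path, safe="/&")
print(literal)  # /en/nature-&-ski-area

# Decoding collapses both forms to the same characters, but a server
# may treat the raw and encoded paths differently before decoding.
print(unquote(encoded) == path)  # True
```

This also shows why the browser and the crawler disagree: both behaviors are defensible under RFC 3986, which is why the affected website's handling of the encoded form is the real issue.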
If you need to scrape just this particular page, I recommend using the RAG Web Browser, which can handle such URLs. Please see this example run.
I’m sorry I couldn’t be of more help. Jiri
Thanks for your quick response.
Your analysis makes perfect sense to me.
I'd been misled by the behavior of my web browser, which doesn't re-encode the path, as you mentioned. But the RFC you quoted is quite clear about what the URL should look like.
Anyway, I understand this issue is not a bug on your side.
Thanks for your time, have a nice week,
François-Xavier
Actor Metrics
4k monthly users
840 stars
>99% runs succeeded
1 day response time
Created in Mar 2023
Modified 20 hours ago