Website Content Crawler

apify/website-content-crawler

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

Input URL's path is re-encoded when it contains specific characters, breaking the URL

Closed

iadvize opened this issue
a month ago

Hi there 👋

I'm encountering an issue when trying to retrieve the content of pages in certain circumstances.

Here is the current behavior:

I'm providing an input that contains, among others, exactly the following URL:

{
  "url": "https://www.skipass-laplagne.com/en/nature-&-ski-area"
}

The Actor runs successfully.

When retrieving the output, I can see that the crawler encoded the path of the URL, so it becomes https://www.skipass-laplagne.com/en/nature-%26-ski-area (notice how the & has been transformed to %26).

This is not an equivalent URL, and it prevents the system from correctly retrieving the content of the provided URL, as the website returns an empty page for the transformed URL.
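
The difference between the two forms can be reproduced outside the crawler. The following is a minimal Python sketch, not the crawler's actual code: `encode_path` is a hypothetical helper that percent-encodes a URL path with the standard library's `urllib.parse.quote`, where the `safe` argument decides whether `&` is left alone.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_path(url: str, safe: str = "/") -> str:
    """Percent-encode the path of `url`, leaving characters in `safe` untouched.

    Hypothetical helper for illustration; it is not taken from the crawler.
    """
    parts = urlsplit(url)
    encoded_path = quote(parts.path, safe=safe)
    return urlunsplit(parts._replace(path=encoded_path))


original = "https://www.skipass-laplagne.com/en/nature-&-ski-area"

# With only "/" marked as safe, "&" is rewritten to "%26".
strict = encode_path(original)

# Adding "&" to the safe set keeps the path exactly as provided.
lenient = encode_path(original, safe="/&")

print(strict)
print(lenient)
```

Whether a server maps both paths to the same resource is up to the server; in this report, the encoded form returns an empty page.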

Expected behavior: I expect that the URL is not transformed, so that we retrieve the content of the website.

Is there anything I can do to fix this?

Thanks for your help

jiri.spilka

Hi, thank you for using Website Content Crawler.

Thank you for your detailed explanation — it really helped me quickly understand your issue.

I’m sorry, but in this case, the website is using the reserved "&" character incorrectly.

RFC 3986:

"Characters allowed in a URI are either reserved, unreserved, or part of a percent-encoding. Reserved characters sometimes have special meanings."

This means that when reserved characters like "&" are used in URI paths or query strings, they should be percent-encoded if their usage conflicts with their reserved purpose.
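
To check which characters of a given path fall into the reserved set, here is a small Python sketch based on the character classes listed in RFC 3986, section 2.2 (gen-delims and sub-delims) and section 2.3 (unreserved). The names `classify` and `reserved_in` are illustrative only and are not part of the crawler.

```python
import string

# Character classes as listed in RFC 3986, sections 2.2 and 2.3.
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def classify(ch: str) -> str:
    """Return the RFC 3986 character class of a single character."""
    if ch in UNRESERVED:
        return "unreserved"
    if ch in GEN_DELIMS:
        return "gen-delim"
    if ch in SUB_DELIMS:
        return "sub-delim"
    if ch == "%":
        return "percent"
    return "other"


def reserved_in(text: str) -> list[str]:
    """Return the sorted, de-duplicated reserved characters found in `text`."""
    reserved = GEN_DELIMS | SUB_DELIMS
    return sorted({ch for ch in text if ch in reserved})


path = "/en/nature-&-ski-area"
print(classify("&"))
print(reserved_in(path))
```

For the path from this issue, the only reserved characters are the `/` separators and the `&`.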

While web browsers typically handle such cases, the Website Content Crawler encodes URLs by default, and currently, there’s no way to disable this behavior.

If you need to scrape just this particular page, I recommend using the RAG Web Browser, which can handle such URLs. Please see this example run.

I’m sorry I couldn’t be of more help.

Jiri

iadvize

a month ago

Thanks for your quick response.

Your analysis makes perfect sense to me.

I was misled by the behavior of my web browser, which doesn't re-encode the path, as you mentioned. But the RFC you quoted is quite clear about what the URL should look like.

Anyway, I understand this issue is not a bug on your side.

Thanks for your time, and have a nice week,

François-Xavier

Developer
Maintained by Apify

Actor Metrics

  • 4k monthly users

  • 840 stars

  • >99% runs succeeded

  • 1 day response time

  • Created in Mar 2023

  • Modified 20 hours ago