Pricing

Pay per usage

Website Content Crawler

Developed by

Apify

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

4.6 (38)

1234

Monthly users: 6.4k

Runs succeeded: >99%

Response time: 4.6 days

Last modified: 14 hours ago

Input URL's path is re-encoded when it contains specific characters, breaking the URL

Closed

iadvize opened this issue

Hi there 👋

I'm encountering an issue when trying to retrieve the content of pages in certain circumstances.

Here is the current behavior:

I'm providing an input that contains (among others) exactly the following URL:

{
  "url": "https://www.skipass-laplagne.com/en/nature-&-ski-area"
}

The Actor runs successfully.

When retrieving the output, I can see that the crawler encoded the path of the URL, so it becomes https://www.skipass-laplagne.com/en/nature-%26-ski-area (notice how the & has been transformed to %26).

This is not an equivalent URL, and it prevents the system from correctly retrieving the content of the provided URL, as the website returns an empty page for the transformed URL.
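The transformation described above can be reproduced with a minimal sketch. It uses Python's standard `urllib.parse.quote` as a stand-in; the crawler's actual encoding routine is not shown in this thread, so the choice of `safe="/"` is an assumption made only to mirror the observed output.

```python
from urllib.parse import quote

# Path taken from the URL in the report above.
original_path = "/en/nature-&-ski-area"

# Assumption: everything except "/" and unreserved characters gets
# percent-encoded. This mimics the observed result, not the crawler's code.
encoded_path = quote(original_path, safe="/")

print(encoded_path)                   # /en/nature-%26-ski-area
print(encoded_path == original_path)  # False
```

The two strings differ, so a server that matches paths byte for byte may treat them as different resources.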

Expected behavior: the URL is not transformed, so the content of the website is retrieved.

Is there anything I can do to fix this?

Thanks for your help

Jiří Spilka (jiri.spilka)

Hi, thank you for using Website Content Crawler.

Thank you for your detailed explanation; it really helped me quickly understand your issue.

I’m sorry, but in this case, the website is using the reserved "&" character incorrectly.

RFC 3986:

"Characters allowed in a URI are either reserved, unreserved, or part of a percent-encoding. Reserved characters sometimes have special meanings."

This means that when reserved characters like "&" are used in URI paths or query strings, they should be percent-encoded if their usage conflicts with their reserved purpose.
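For readers who want to check the character classes themselves, the sets defined in RFC 3986 (sections 2.2, 2.3, and 3.3) can be written out directly. The sketch below only transcribes that grammar; the helper names `classify` and `allowed_literally_in_path_segment` are hypothetical, and the code says nothing about how a particular server or crawler treats these characters.

```python
import string

# Character classes as listed in RFC 3986, sections 2.2 and 2.3.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")

# Section 3.3: pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
# (pct-encoded triplets are not single characters, so they are omitted here).
PCHAR_LITERALS = UNRESERVED | SUB_DELIMS | set(":@")


def classify(ch: str) -> str:
    """Return the RFC 3986 class of a single character (hypothetical helper)."""
    if ch in UNRESERVED:
        return "unreserved"
    if ch in GEN_DELIMS:
        return "gen-delim"
    if ch in SUB_DELIMS:
        return "sub-delim"
    return "other"


def allowed_literally_in_path_segment(ch: str) -> bool:
    """True if the grammar permits ch unencoded inside a path segment."""
    return ch in PCHAR_LITERALS


for ch in "&?/ ":
    print(repr(ch), classify(ch), allowed_literally_in_path_segment(ch))
```

Whether an encoder leaves a given sub-delimiter untouched is an implementation choice, which is why tools can disagree on the same input URL.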

While web browsers typically handle such cases, the Website Content Crawler encodes URLs by default, and currently, there’s no way to disable this behavior.

If you need to scrape just this particular page, I recommend using the RAG Web Browser, which can handle such URLs. Please see this example run.
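Since the encoding cannot be switched off, one option is to flag affected URLs before a run and route only those to another tool. The following is a rough sketch under stated assumptions: `path_changes_when_encoded` is a hypothetical helper, the `safe="/%"` setting is a guess at the encoder's behavior, and the second URL in the list is made up purely for contrast.

```python
from urllib.parse import quote, urlsplit


def path_changes_when_encoded(url: str) -> bool:
    """Return True if percent-encoding would alter the URL's path.

    Assumption: "/" and existing "%" escapes are left alone; every other
    character outside the unreserved set is encoded.
    """
    path = urlsplit(url).path
    return quote(path, safe="/%") != path


urls = [
    "https://www.skipass-laplagne.com/en/nature-&-ski-area",  # from the report
    "https://www.skipass-laplagne.com/en/",                   # hypothetical
]

# URLs in this list are candidates for a tool that keeps the path as-is.
needs_other_tool = [u for u in urls if path_changes_when_encoded(u)]
print(needs_other_tool)
```

Only the first URL is flagged, because its path contains a character that this assumed encoder would rewrite.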

I’m sorry I couldn’t be of more help. Jiri

iadvize

Thanks for your quick response.

Your analysis makes perfect sense to me.

I was misled by the behavior of my web browser, which doesn't re-encode the path, as you mentioned. But the RFC you quoted is quite clear about what the URL should look like.

Anyway, I understand this issue is not a bug on your side.

Thanks for your time, and have a nice week,

François-Xavier

Pricing

Pricing model

Pay per usage

This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage.
