Crawl4AI Web to Markdown — URL to Clean Markdown for LLM & RAG

Convert any URL, sitemap, or whole website into clean Markdown for LLMs, RAG pipelines, and AI agents. Powered by the open-source Crawl4AI engine. Pay per page ($1/1,000), failed pages never charged. MCP-ready — call it from Claude or Cursor.

Developer: Bikram (maintained by Community)

Convert any URL to clean, LLM-ready Markdown — without installing or hosting anything. This Actor is a hosted Crawl4AI: it wraps the popular open-source Crawl4AI crawler (the most-starred LLM-friendly web crawler on GitHub) and runs it on Apify's infrastructure with a real Chromium browser, so JavaScript-heavy pages render correctly. Point it at a URL, a sitemap, or a whole site, and get back boilerplate-free Markdown ready for RAG pipelines, vector databases, fine-tuning datasets, or direct pasting into an LLM context window.

Features

  • URL to Markdown in one call — single pages, full sitemaps, or breadth-first site crawls (up to 1,000 pages per run)
  • Built on Crawl4AI — the same AsyncWebCrawler + pruning content filter you'd run locally, with zero setup
  • Boilerplate removal — navigation menus, footers, cookie banners and sidebars are stripped, leaving "fit markdown" optimized for token budgets
  • Real browser rendering — Chromium via Playwright, so SPAs and JavaScript-rendered content convert correctly
  • Three output formats — Markdown only, Markdown + cleaned HTML, or Markdown + metadata/links JSON
  • RAG-friendly dataset output — each page is one dataset item with url, title, markdown, wordCount, crawledAt; export as JSON, CSV, or via API
  • Respects robots.txt by default (configurable)
  • Fair pay-per-event pricing — you are charged only for pages that convert successfully; failed pages are free
  • MCP-ready — callable as a tool from Claude, Cursor, or any MCP client via Apify's MCP server

Input example

{
  "startUrls": [{ "url": "https://docs.crawl4ai.com" }],
  "crawlMode": "crawl",
  "maxPages": 50,
  "includeLinks": false,
  "outputFormat": "markdown",
  "removeBoilerplate": true,
  "respectRobotsTxt": true
}

| Field | Type | Default | Description |
|---|---|---|---|
| startUrls | array | (required) | URLs to convert |
| crawlMode | string | single | single (only listed URLs), sitemap (pages from sitemap.xml), crawl (follow same-domain links) |
| maxPages | integer | 10 | Max pages per run (1–1000) |
| includeLinks | boolean | false | Keep hyperlinks in the Markdown |
| outputFormat | string | markdown | markdown, markdown+html, or markdown+json |
| removeBoilerplate | boolean | true | Strip navigation/footer/cookie-banner noise ("fit markdown") |
| respectRobotsTxt | boolean | true | Skip pages disallowed by robots.txt (not charged) |
| proxyConfiguration | object | none | Optional Apify Proxy / custom proxy settings |
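
The defaults and allowed values in the table can be checked before a run is started. The following is a minimal sketch, not part of the Actor: `build_actor_input` is a hypothetical helper, and the only facts it relies on are the field names, defaults, and ranges listed above.

```python
from typing import Any

# Allowed values and defaults copied from the input table above.
CRAWL_MODES = {"single", "sitemap", "crawl"}
OUTPUT_FORMATS = {"markdown", "markdown+html", "markdown+json"}
DEFAULTS: dict[str, Any] = {
    "crawlMode": "single",
    "maxPages": 10,
    "includeLinks": False,
    "outputFormat": "markdown",
    "removeBoilerplate": True,
    "respectRobotsTxt": True,
}


def build_actor_input(start_urls: list[str], **overrides: Any) -> dict[str, Any]:
    """Hypothetical helper: merge overrides into the documented defaults and validate them."""
    if not start_urls:
        raise ValueError("startUrls is required and must not be empty")

    # proxyConfiguration is optional and passed through without validation.
    unknown = set(overrides) - set(DEFAULTS) - {"proxyConfiguration"}
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")

    actor_input = {**DEFAULTS, **overrides}
    if actor_input["crawlMode"] not in CRAWL_MODES:
        raise ValueError(f"crawlMode must be one of {sorted(CRAWL_MODES)}")
    if actor_input["outputFormat"] not in OUTPUT_FORMATS:
        raise ValueError(f"outputFormat must be one of {sorted(OUTPUT_FORMATS)}")

    max_pages = actor_input["maxPages"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_pages, bool) or not isinstance(max_pages, int) or not 1 <= max_pages <= 1000:
        raise ValueError("maxPages must be an integer from 1 to 1000")

    # The Actor expects startUrls as a list of {"url": ...} objects.
    actor_input["startUrls"] = [{"url": url} for url in start_urls]
    return actor_input
```

Calling `build_actor_input(["https://docs.crawl4ai.com"], crawlMode="crawl", maxPages=50)` yields the same settings as the input example, with every omitted field filled in from its default.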

Output example

Each successfully converted page becomes one dataset item:

{
  "url": "https://docs.crawl4ai.com/core/quickstart/",
  "title": "Quick Start - Crawl4AI Documentation",
  "markdown": "# Getting Started with Crawl4AI\n\nWelcome to Crawl4AI, an open-source LLM-friendly Web Crawler & Scraper...",
  "wordCount": 1183,
  "crawledAt": "2026-06-13T10:42:07.512345+00:00"
}

With outputFormat: "markdown+json", items additionally contain metadata (description, og tags, etc.) and links.internal / links.external arrays. With markdown+html, items contain the html field with cleaned HTML.
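
For RAG ingestion, each dataset item usually needs to be split into smaller pieces before embedding. The sketch below is one simple way to do that, not something the Actor does for you: `chunk_item`, the `max_words` budget, and the `chunkIndex` / `text` output keys are all assumptions made for illustration. Only `url`, `title`, and `markdown` come from the output format above.

```python
def chunk_item(item: dict, max_words: int = 200) -> list[dict]:
    """Group blank-line-separated Markdown blocks into chunks of roughly max_words words.

    A single block longer than max_words is kept whole as its own oversized chunk,
    so headings, paragraphs, and code blocks are never cut in the middle.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")

    chunks: list[str] = []
    current: list[str] = []
    current_words = 0

    for block in item["markdown"].split("\n\n"):
        block = block.strip()
        if not block:
            continue
        words = len(block.split())
        # Start a new chunk when adding this block would exceed the budget.
        if current and current_words + words > max_words:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(block)
        current_words += words

    if current:
        chunks.append("\n\n".join(current))

    # Carry the source URL and title along so each chunk can be cited after retrieval.
    return [
        {"url": item["url"], "title": item.get("title", ""), "chunkIndex": i, "text": text}
        for i, text in enumerate(chunks)
    ]
```

Splitting on blank lines keeps the chunker dependency-free. A token-based splitter from your embedding stack would give tighter control over context-window budgets.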

Pricing — about $1 per 1,000 pages

This Actor uses Apify's pay-per-event model with one simple event:

| Event | Price | When it's charged |
|---|---|---|
| page-converted | $0.001 | Once per page successfully converted to Markdown |

That's $1 per 1,000 pages, plus standard Apify platform usage for your runs (compute, proxy if enabled). Pages that fail to load, return an HTTP error, time out, or are blocked by robots.txt are never charged. You can also set a maximum cost per run in Apify Console — the Actor stops gracefully when your limit is reached.

Comparable webpage-to-markdown Actors on Apify Store charge up to $0.05 per page for the same job.
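
The event arithmetic is simple enough to sketch. The helpers below are hypothetical and cover only the page-converted event at the listed $0.001. Standard Apify platform usage (compute, proxy) is billed separately and is not modeled here.

```python
from decimal import Decimal

# USD per page-converted event, from the pricing table above.
PAGE_CONVERTED_PRICE = Decimal("0.001")


def estimate_event_cost(pages_converted: int) -> Decimal:
    """Event charges in USD for a number of successfully converted pages."""
    if pages_converted < 0:
        raise ValueError("pages_converted must not be negative")
    # Failed, timed-out, and robots.txt-blocked pages are not charged, so do not count them.
    return PAGE_CONVERTED_PRICE * pages_converted


def pages_within_budget(budget_usd: str) -> int:
    """Largest number of successful conversions a budget covers (event charges only)."""
    return int(Decimal(budget_usd) // PAGE_CONVERTED_PRICE)
```

`Decimal` is used instead of `float` so that fractions of a cent do not pick up rounding errors when summed over many pages.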

Use from Claude, Cursor & other AI agents (MCP)

This Actor works as a tool over the Model Context Protocol. Add Apify's MCP server to your client and your agent can convert URLs to Markdown on demand:

{
  "mcpServers": {
    "apify": {
      "url": "https://mcp.apify.com/sse?actors=bikram07/web-to-markdown-crawl4ai",
      "headers": {
        "Authorization": "Bearer YOUR_APIFY_TOKEN"
      }
    }
  }
}

Then ask your agent things like: "Fetch https://example.com/blog as Markdown and summarize it" — the agent calls this Actor, gets clean Markdown back, and works with it directly. This is ideal for agentic RAG: the agent decides what to read, this Actor handles rendering, extraction, and cleanup.

You can also call it from code via the Apify API:

curl -X POST "https://api.apify.com/v2/acts/bikram07~web-to-markdown-crawl4ai/run-sync-get-dataset-items?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"startUrls": [{"url": "https://example.com"}], "crawlMode": "single"}'
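
The same request can be made from Python with only the standard library. This is a rough equivalent of the curl call above, using the same endpoint, token query parameter, and JSON body. The function names and the 300-second timeout are assumptions for illustration.

```python
import json
import urllib.parse
import urllib.request

# Same Actor ID as in the curl call above.
ACTOR_ID = "bikram07~web-to-markdown-crawl4ai"


def build_run_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Build the POST request that runs the Actor and returns its dataset items."""
    query = urllib.parse.urlencode({"token": token})
    url = f"https://api.apify.com/v2/acts/{ACTOR_ID}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_actor(token: str, actor_input: dict, timeout: float = 300.0) -> list:
    """Send the request and parse the response body as a JSON list of dataset items."""
    request = build_run_request(token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `run_actor(token, {"startUrls": [{"url": "https://example.com"}], "crawlMode": "single"})` would block until the run finishes. Read the token from an environment variable rather than hard-coding it.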

Hosted vs. self-hosted Crawl4AI

Crawl4AI is open source — you can absolutely run it yourself. Self-hosting means managing a Python environment, Playwright browser binaries, OS dependencies, memory for Chromium, retries, and a server that's always on. This Actor is for the cases where that overhead isn't worth it: you pay roughly $1 per 1,000 pages, get an HTTPS API + MCP endpoint immediately, scale to parallel runs without provisioning anything, and your results land in queryable dataset storage. If you're converting millions of pages a month on dedicated hardware, self-hosting can be cheaper; for everything from prototypes to production RAG ingestion at moderate volume, hosted is simpler.

FAQ

How do I convert a website to Markdown for an LLM? Add the site URL to startUrls, set crawlMode to "crawl" (or "sitemap" if the site has a sitemap.xml), set maxPages, and run. Each page becomes a dataset item with clean Markdown you can chunk and embed for RAG.

Does it handle JavaScript-rendered pages and SPAs? Yes. Pages are rendered in headless Chromium via Playwright before conversion, so client-side rendered content is included — unlike simple HTML-to-markdown converters that only see the initial HTML.

What's the difference between this and running Crawl4AI locally? The conversion engine is the same library. The difference is operational: no Python/Playwright setup, no server to maintain, an instant REST API and MCP endpoint, parallel scaling, and dataset storage with JSON/CSV export. See "Hosted vs. self-hosted Crawl4AI" above.

Am I charged for pages that fail? No. The page-converted event is only charged for pages that successfully convert. Timeouts, HTTP errors, and robots.txt-blocked pages are logged and free. You can also cap the maximum total cost per run in Apify Console.

Can I keep links and raw HTML in the output? Yes. Set includeLinks: true to preserve hyperlinks in the Markdown, and outputFormat: "markdown+html" or "markdown+json" to additionally get cleaned HTML or metadata + link lists per page.

Built on Crawl4AI (Apache 2.0). This Actor is not affiliated with the Crawl4AI project; it packages the library as a hosted service.