Pricing

from $1.00 / 1,000 converted pages

Website to Markdown for LLM & RAG — Crawl4AI URL to Clean Markdown

Convert any URL, sitemap, or whole website into clean, LLM-ready Markdown for RAG, vector databases, and AI agents. This hosted Crawl4AI runs in a real Chromium browser: it renders JavaScript and SPAs, strips boilerplate, and exports JSON/CSV. Callable over MCP from Claude and Cursor.

Developer

Bikram

Last modified: 4 days ago

Website to Markdown for LLM & RAG — Crawl4AI URL to Clean Markdown

Convert any URL, sitemap, or whole website into clean, LLM-ready Markdown — without installing or hosting anything. This Actor is a hosted Crawl4AI: it wraps the popular open-source crawler and runs it on Apify with a real Chromium browser, so JavaScript-heavy pages render correctly. Point it at a page, a sitemap, or a whole site and get back boilerplate-free Markdown ready for RAG pipelines, vector databases, fine-tuning datasets, or pasting straight into an LLM context window.

What it does

Turns URLs → clean Markdown (or Markdown + cleaned HTML, or Markdown + metadata/links JSON)
Strips navigation, footers, cookie banners, and sidebars, leaving "fit markdown" optimized for token budgets
Renders pages in a real Chromium browser via Playwright, so SPAs and JavaScript-rendered content convert correctly
Works on single pages, full sitemaps, or breadth-first same-domain crawls (up to 1,000 pages per run)
Writes one queryable dataset item per page — export as JSON, CSV, Excel, or via the Apify API
Callable as an MCP tool from Claude, Cursor, or any MCP client
Charges only for pages that convert successfully — failed, errored, timed-out, and robots-blocked pages are free

How it works

You provide one or more Start URLs and a crawl mode (single, sitemap, or crawl).
The Actor opens each page in headless Chromium and waits for it to render.
Crawl4AI's pruning content filter removes boilerplate (when removeBoilerplate is on) to produce "fit markdown".
Each successfully converted page is pushed to the dataset and one result-item event is charged (plus a single actor-start event when the run begins).
Pages that fail to load, return an HTTP 4xx/5xx, time out, or are disallowed by robots.txt are logged and never charged.
The run stops at maxPages or when your configured max cost is reached, whichever comes first.
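
The charging and stop rules above can be modelled in a few lines. This is an illustrative sketch, not the Actor's own code: `simulate_run`, the outcome labels, and the rule that the run stops before a charge would push past the cap are all assumptions. The two prices are the ones quoted in the pricing section.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.01")   # charged once, when the run begins
RESULT_ITEM = Decimal("0.003")  # charged per successfully converted page


def simulate_run(outcomes, max_pages, max_cost=None):
    """Walk page outcomes in order and return (converted_pages, event_charges).

    `outcomes` is a list of labels such as "ok", "timeout", "http_error",
    or "robots_blocked" (hypothetical names used only for this sketch).
    """
    charged = ACTOR_START
    converted = 0
    for outcome in outcomes:
        if converted >= max_pages:
            break  # maxPages reached
        if max_cost is not None and charged + RESULT_ITEM > max_cost:
            break  # assumed behaviour: stop before exceeding the cost cap
        if outcome == "ok":
            converted += 1
            charged += RESULT_ITEM
        # Any other outcome is logged by the Actor and not charged.
    return converted, charged
```

For example, five attempted pages of which three convert would yield three dataset items and event charges of $0.019 under these assumptions.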

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `startUrls` | array | (required) | URLs to convert. In `single` mode each is converted as-is; in `sitemap` mode each is treated as, or resolved to, a `sitemap.xml`; in `crawl` mode each is a crawl starting point. |
| `crawlMode` | string | `single` | `single` (only the listed URLs), `sitemap` (pages from each site's sitemap.xml), or `crawl` (follow same-domain links, breadth-first). |
| `maxPages` | integer | `10` | Maximum number of pages converted across the whole run (1–1000). You're only charged for successful conversions. |
| `includeLinks` | boolean | `false` | Keep hyperlinks in the Markdown. Leave disabled for cleaner text aimed at embeddings/RAG chunking. |
| `outputFormat` | string | `markdown` | `markdown`, `markdown+html` (adds cleaned HTML), or `markdown+json` (adds page metadata + link lists). |
| `removeBoilerplate` | boolean | `true` | Strip nav/footer/cookie-banner noise to produce "fit markdown". |
| `respectRobotsTxt` | boolean | `true` | Skip pages disallowed by `robots.txt` (skipped pages are not charged). |
| `proxyConfiguration` | object | none | Optionally route browser traffic through Apify Proxy or custom proxies. Not needed for most public sites. |

Input example

{
    "startUrls": [{ "url": "https://docs.crawl4ai.com" }],
    "crawlMode": "crawl",
    "maxPages": 50,
    "outputFormat": "markdown",
    "removeBoilerplate": true,
    "respectRobotsTxt": true
}
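
If you assemble this input in code, a small builder can apply the documented defaults and reject out-of-range values before a run starts. The sketch below is a hypothetical helper, not part of the Actor: the name `build_input` and its keyword arguments are assumptions. The field names, defaults, and allowed values are taken from the table above.

```python
ALLOWED_MODES = {"single", "sitemap", "crawl"}
ALLOWED_FORMATS = {"markdown", "markdown+html", "markdown+json"}


def build_input(urls, crawl_mode="single", max_pages=10, output_format="markdown",
                include_links=False, remove_boilerplate=True, respect_robots_txt=True):
    """Return an Actor input dict using the defaults from the input table."""
    if not urls:
        raise ValueError("startUrls is required")
    if crawl_mode not in ALLOWED_MODES:
        raise ValueError(f"crawlMode must be one of {sorted(ALLOWED_MODES)}")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")
    if not 1 <= max_pages <= 1000:
        raise ValueError("maxPages must be between 1 and 1000")
    return {
        "startUrls": [{"url": url} for url in urls],
        "crawlMode": crawl_mode,
        "maxPages": max_pages,
        "includeLinks": include_links,
        "outputFormat": output_format,
        "removeBoilerplate": remove_boilerplate,
        "respectRobotsTxt": respect_robots_txt,
    }
```

Calling `build_input(["https://docs.crawl4ai.com"], crawl_mode="crawl", max_pages=50)` produces a dict equivalent to the example above, with `includeLinks` set explicitly to its default.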

Output fields

Each successfully converted page becomes one dataset item. These fields are always present:

| Field | Type | Description |
| --- | --- | --- |
| `url` | string | The final URL of the converted page. |
| `title` | string \| null | Page title from the page metadata (`null` if the page has none). |
| `markdown` | string | The clean Markdown. "Fit markdown" when `removeBoilerplate` is on, otherwise the full raw Markdown. |
| `wordCount` | integer | Word count of the `markdown` field. |
| `crawledAt` | string | ISO-8601 UTC timestamp of when the page was converted. |

Additional fields appear depending on outputFormat:

| Field | Appears when | Description |
| --- | --- | --- |
| `html` | `outputFormat: "markdown+html"` | The cleaned HTML of the page. |
| `metadata` | `outputFormat: "markdown+json"` | Page metadata object (description, Open Graph tags, etc.). |
| `links.internal` | `outputFormat: "markdown+json"` | Array of internal link URLs found on the page. |
| `links.external` | `outputFormat: "markdown+json"` | Array of external link URLs found on the page. |

Output example

{
    "url": "https://docs.crawl4ai.com/core/quickstart/",
    "title": "Quick Start - Crawl4AI Documentation",
    "markdown": "# Getting Started with Crawl4AI\n\nWelcome to Crawl4AI, an open-source LLM-friendly Web Crawler & Scraper...",
    "wordCount": 1183,
    "crawledAt": "2026-06-13T10:42:07.512345+00:00"
}
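
Downstream code usually wants these items as typed records. The following is a minimal sketch of one way to parse a dataset item; `ConvertedPage` and `parse_item` are hypothetical names. It also assumes that `links.internal` and `links.external` arrive as a nested `links` object, which the field table implies but does not state outright.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

REQUIRED_FIELDS = ("url", "title", "markdown", "wordCount", "crawledAt")


@dataclass
class ConvertedPage:
    url: str
    title: Optional[str]
    markdown: str
    word_count: int
    crawled_at: datetime
    html: Optional[str] = None       # only with outputFormat "markdown+html"
    metadata: Optional[dict] = None  # only with outputFormat "markdown+json"
    internal_links: list = field(default_factory=list)
    external_links: list = field(default_factory=list)


def parse_item(item: dict) -> ConvertedPage:
    """Validate the always-present fields and convert one dataset item."""
    missing = [key for key in REQUIRED_FIELDS if key not in item]
    if missing:
        raise KeyError(f"dataset item is missing fields: {missing}")
    links = item.get("links") or {}  # assumed nesting: {"internal": [...], "external": [...]}
    return ConvertedPage(
        url=item["url"],
        title=item["title"],
        markdown=item["markdown"],
        word_count=int(item["wordCount"]),
        crawled_at=datetime.fromisoformat(item["crawledAt"]),
        html=item.get("html"),
        metadata=item.get("metadata"),
        internal_links=list(links.get("internal", [])),
        external_links=list(links.get("external", [])),
    )
```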

Use cases

RAG / AI engineer — Ingest a documentation site or knowledge base into a vector database. Use sitemap or crawl mode with removeBoilerplate: true so chunks contain content, not nav menus.
AI agent builder — Give an agent a "read this page" tool over MCP. The agent passes a URL, gets clean Markdown back, and reasons over it — no scraping code in your app.
LLM app developer — Pull live web content into a prompt at request time via the Apify API instead of pasting HTML and burning tokens on boilerplate.
Data / ML team — Build fine-tuning or evaluation datasets from public web pages, exported as JSON/CSV from the run's dataset.
Researcher / analyst — Convert a batch of articles or report pages to Markdown for summarization, search, or archival in a single run.

Pricing — pay only for pages you convert

This Actor uses Apify's pay-per-event model with two events:

| Event | Price | When it's charged |
| --- | --- | --- |
| `actor-start` | $0.01 | Once per run, when the Actor starts |
| `result-item` | $0.003 | Once per page successfully converted to Markdown |

So a run costs $0.01 to start, then $0.003 per converted page (about $3 per 1,000 pages). Pages that fail to load, return an HTTP error, time out, or are blocked by robots.txt are never charged a result-item. Standard Apify platform usage (compute, and proxy if you enable it) applies to runs as usual. You can set a maximum cost per run in Apify Console — the Actor stops gracefully when that limit is reached.
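
The arithmetic is simple enough to check before a run. The sketch below computes event charges for a given number of converted pages and, going the other way, how many conversions fit under a budget. The function names are hypothetical, and platform usage (compute and proxy) is deliberately left out because it varies per run.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.01")   # actor-start event
RESULT_ITEM = Decimal("0.003")  # result-item event


def event_cost(converted_pages: int) -> Decimal:
    """Event charges for a run that converts `converted_pages` pages."""
    return ACTOR_START + RESULT_ITEM * converted_pages


def pages_within_budget(budget: Decimal) -> int:
    """Largest number of converted pages whose event charges fit in `budget`."""
    if budget < ACTOR_START:
        return 0
    return int((budget - ACTOR_START) // RESULT_ITEM)
```

Using `Decimal` rather than floats keeps the per-page price exact, so `event_cost(1000)` is exactly `Decimal("3.01")`.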

Use from Claude, Cursor & other AI agents (MCP)

This Actor works as a tool over the Model Context Protocol. Add Apify's MCP server to your client and your agent can convert URLs to Markdown on demand:

{
    "mcpServers": {
        "apify": {
            "url": "https://mcp.apify.com/sse?actors=bikram07/web-to-markdown-crawl4ai",
            "headers": {
                "Authorization": "Bearer YOUR_APIFY_TOKEN"
            }
        }
    }
}

Then ask your agent things like: "Fetch https://example.com/blog as Markdown and summarize it" — the agent calls this Actor, gets clean Markdown back, and works with it directly.

You can also call it from code via the Apify API:

curl -X POST "https://api.apify.com/v2/acts/bikram07~web-to-markdown-crawl4ai/run-sync-get-dataset-items?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"startUrls": [{"url": "https://example.com"}], "crawlMode": "single"}'
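
The same call can be made from Python with only the standard library. This is a sketch under a few assumptions: the helper names `build_request` and `convert` and the 300-second timeout are made up for illustration, and the response is assumed to be a JSON array of dataset items. The endpoint, Actor ID, and token query parameter are copied from the curl command above.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2/acts"
ACTOR_ID = "bikram07~web-to-markdown-crawl4ai"


def build_request(token, actor_input):
    """Build the POST request that the curl example sends."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/{ACTOR_ID}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def convert(token, actor_input, timeout=300):
    """Run the Actor synchronously and return the parsed response body."""
    request = build_request(token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as `convert("YOUR_APIFY_TOKEN", {"startUrls": [{"url": "https://example.com"}], "crawlMode": "single"})` blocks until the run finishes, so it suits single pages better than large crawls.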

FAQ

Is this a subscription? No. It's pay-per-event with no monthly fee. Each run charges a small $0.01 actor-start fee, then $0.003 for each page that successfully converts — nothing else from this Actor. Convert nothing successfully, and you pay only the start fee.

How does the pricing / billing work, and can I get a refund? You're charged one actor-start event ($0.01) per run and one result-item event ($0.003) per successfully converted page. Failed, errored, timed-out, and robots-blocked pages cost no result-item. Because per-page charges only accrue on successful output, there's nothing to refund for failures. To cap spend, set a maximum cost per run in Apify Console — the Actor stops cleanly when the limit is hit.

Does it use official APIs? There is no public "URL-to-Markdown API" to call — the Actor renders each page in a real Chromium browser (via Playwright) and converts the rendered content using the open-source Crawl4AI library. It respects robots.txt by default. Output is the real content of the pages you point it at.

Does it handle JavaScript-rendered pages and SPAs? Yes. Pages are rendered in headless Chromium before conversion, so client-side-rendered content is included — unlike simple HTML-to-Markdown converters that only see the initial HTML.

What's the difference between this and running Crawl4AI locally? The conversion engine is the same library. The difference is operational: no Python/Playwright setup, no server to maintain, an instant REST API and MCP endpoint, parallel scaling, and dataset storage with JSON/CSV export. If you convert millions of pages a month on dedicated hardware, self-hosting can be cheaper; for prototypes through moderate-volume production RAG ingestion, hosted is simpler.

What it does NOT do (limitations)

Not a search engine or content discoverer. It converts the URLs you give it (or links/sitemap entries it follows in crawl/sitemap mode) — it won't find pages from a keyword.
Crawl mode follows same-domain links only, breadth-first, up to depth 3 and up to maxPages. It does not crawl across external domains.
Sitemap mode needs a real sitemap. If a site exposes no sitemap.xml (and you didn't pass a sitemap URL directly), use single or crawl mode instead.
Hard cap of 1,000 pages per run. For larger jobs, split across multiple runs.
No login / form / paywall handling. Pages behind authentication or interactive walls won't convert.
robots.txt is respected by default. Disallowed pages are skipped (and not charged) unless you turn that off.
It does not extract data into custom schemas — output is Markdown (plus optional HTML/metadata/links), not arbitrary structured fields.
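
To make the crawl-mode limits concrete, here is a minimal breadth-first traversal that stays on the starting domain and stops at a depth and page limit. It illustrates the general technique and is not the Actor's implementation: `get_links` is a hypothetical callable that returns the links found on a page, and counting the start URL as depth 0 is an assumption.

```python
from collections import deque
from urllib.parse import urlparse


def bfs_same_domain(start_url, get_links, max_pages=10, max_depth=3):
    """Return URLs in breadth-first visit order, same domain only."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, depth); start URL assumed to be depth 0
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            # Skip external domains and anything already queued or visited.
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Passing a lookup over a small in-memory link graph as `get_links` is an easy way to see which pages a given depth and page limit would reach.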

Built on Crawl4AI (Apache 2.0). This Actor is not affiliated with the Crawl4AI project; it packages the library as a hosted service.
