Website to Markdown - Clean LLM-Ready Content
Website to Markdown — Clean, LLM-Ready Content Extraction
Convert any webpage or website into clean markdown, stripped of navigation, ads, sidebars, and boilerplate. Output drops straight into any RAG pipeline, LLM context window, or vector store without cleanup. Token counts included so you can plan your embedding budget.
What it does
Most web scrapers give you raw HTML or a wall of unstructured text. You then spend hours cleaning, reformatting, and fixing broken context. This Actor eliminates that step.
Give it a URL. It crawls the site, strips page chrome (navigation, sidebars, footers, cookie banners), and converts each page to clean markdown, preserving headings, code blocks, tables, lists, and links. Every page includes a token count (cl100k_base encoding), so you can estimate what it costs to embed or send to an LLM.
Output format
| Field | Type | Description |
|---|---|---|
| `url` | string | Source URL of the page |
| `title` | string | Page title |
| `content` | string | Clean markdown content |
| `token_count` | integer | Token count (cl100k_base encoding) |
| `content_length` | integer | Character count |
| `meta_description` | string | Page meta description (if available) |
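
Because every record carries a `token_count`, a consumer can plan a context window or embedding batch without re-tokenizing. The following is a minimal sketch, assuming records arrive as plain dictionaries with the field names from the table above; `PageRecord` and `select_within_budget` are hypothetical names for illustration and are not part of the Actor.

```python
from typing import Iterable, TypedDict


class PageRecord(TypedDict):
    """One dataset item, using the field names from the output table."""
    url: str
    title: str
    content: str
    token_count: int
    content_length: int
    meta_description: str


def select_within_budget(records: Iterable[PageRecord], max_tokens: int) -> list[PageRecord]:
    """Greedily keep records, in order, while the running token total fits."""
    selected: list[PageRecord] = []
    used = 0
    for record in records:
        if used + record["token_count"] > max_tokens:
            # Skip a page that would overflow; a later, smaller page may still fit.
            continue
        selected.append(record)
        used += record["token_count"]
    return selected
```

The greedy pass keeps the original crawl order. Sorting by `token_count` first would fit more pages into the same budget, at the cost of that order.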
Input parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `startUrl` | string | — | URL to start crawling from |
| `urls` | array | — | List of specific URLs to convert (batch mode) |
| `maxPages` | integer | 50 | Maximum pages to convert |
| `crawlSameDomain` | boolean | true | Stay within the start URL's domain |
| `pathPrefix` | string | "" | Only crawl paths starting with this prefix |
| `outputFormat` | string | "markdown" | "markdown" or "plain_text" |
| `includeMetadata` | boolean | true | Include token count and meta description |
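
To make the interaction of `crawlSameDomain` and `pathPrefix` concrete, here is a rough sketch of how a discovered link could be filtered. It is an illustration of the documented behavior, not the Actor's code: `should_crawl` is a hypothetical helper, and comparing exact hostnames is an assumption (the Actor may treat subdomains differently).

```python
from urllib.parse import urlparse


def should_crawl(
    url: str,
    start_url: str,
    crawl_same_domain: bool = True,
    path_prefix: str = "",
) -> bool:
    """Return True if `url` passes the domain and path-prefix filters."""
    target = urlparse(url)
    start = urlparse(start_url)
    # Ignore non-web links such as mailto: or javascript:.
    if target.scheme not in ("http", "https"):
        return False
    # Assumption: "same domain" means an identical hostname.
    if crawl_same_domain and target.hostname != start.hostname:
        return False
    # An empty prefix matches every path, mirroring the default of "".
    return (target.path or "/").startswith(path_prefix)
```

With `start_url` set to `https://fastapi.tiangolo.com/` and `path_prefix` set to `/tutorial/`, a link to `/tutorial/first-steps/` on the same host passes, while `/advanced/` does not.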
Example usage
Single page
{"startUrl": "https://docs.python.org/3/library/asyncio.html", "maxPages": 1}
Batch conversion
{"urls": ["https://example.com/page-1", "https://example.com/page-2", "https://example.com/page-3"], "maxPages": 3}
Full site crawl
{"startUrl": "https://fastapi.tiangolo.com/", "maxPages": 100, "pathPrefix": "/tutorial/"}
Pricing
This Actor uses the pay-per-event model: you are charged for each page successfully converted to markdown. Skipped pages (empty or non-content) are not charged.
How it works
- Crawl: Crawlee handles the URL queue, deduplication, rate limiting, and robots.txt compliance.
- Clean: Strips navigation, sidebars, footers, cookie banners, and boilerplate using curated selectors. Falls back to `<article>`, `<main>`, or `<body>`.
- Convert: Transforms clean HTML to structured markdown, preserving headings, code blocks, tables, lists, and links.
- Count: Uses `cl100k_base` (GPT-4 / modern embedding encoding) for accurate token counts.
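
The Clean and Convert steps can be sketched roughly as follows. This is a simplified illustration, not the Actor's implementation: it swaps BeautifulSoup for the standard-library `html.parser`, replaces the curated selectors with a small hypothetical `SKIP_TAGS` set, reads the `<article>`, `<main>`, `<body>` fallback as "use the first of these containers that exists", and handles only headings and paragraphs. Token counting is left out because it needs the third-party tiktoken package.

```python
from html.parser import HTMLParser

# Hypothetical stand-ins for the Actor's curated selectors.
SKIP_TAGS = {"nav", "aside", "footer", "header", "script", "style"}
HEADINGS = {"h1": "#", "h2": "##", "h3": "###"}
BLOCK_TAGS = set(HEADINGS) | {"p"}


class _TagFinder(HTMLParser):
    """First pass: record which tags appear in the document."""

    def __init__(self) -> None:
        super().__init__()
        self.seen: set[str] = set()

    def handle_starttag(self, tag, attrs):
        self.seen.add(tag)


class _MarkdownExtractor(HTMLParser):
    """Second pass: collect text inside `root_tag`, outside skipped tags."""

    def __init__(self, root_tag: str) -> None:
        super().__init__()
        self.root_tag = root_tag
        self.root_depth = 0
        self.skip_depth = 0
        self.blocks: list[str] = []
        self._prefix = ""
        self._buffer: list[str] = []

    def flush(self) -> None:
        text = " ".join(self._buffer).strip()
        if text:
            self.blocks.append(f"{self._prefix} {text}".strip())
        self._buffer = []
        self._prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag == self.root_tag:
            self.root_depth += 1
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        if tag in BLOCK_TAGS:
            self.flush()
            self._prefix = HEADINGS.get(tag, "")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        if tag == self.root_tag and self.root_depth > 0:
            self.root_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self.root_depth > 0 and self.skip_depth == 0:
            self._buffer.append(text)


def html_to_markdown(html: str) -> str:
    """Pick the content container, drop page chrome, emit simple markdown."""
    finder = _TagFinder()
    finder.feed(html)
    root = next((t for t in ("article", "main", "body") if t in finder.seen), None)
    if root is None:
        return ""
    extractor = _MarkdownExtractor(root)
    extractor.feed(html)
    extractor.close()
    extractor.flush()
    return "\n\n".join(extractor.blocks)
```

A production converter would also handle lists, tables, code blocks, and links, which this sketch ignores.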
Responsible use
- This Actor respects `robots.txt` by default (enforced by Crawlee).
- Crawlee's built-in autoscaling keeps request rates reasonable.
- You are responsible for ensuring your use complies with the target site's Terms of Service.
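
To check in advance which URLs a site's `robots.txt` permits, for example before submitting a batch through `urls`, something like the following may help. It is a sketch using Python's standard `urllib.robotparser`, not Crawlee's own enforcement logic; `allowed_urls` is a hypothetical helper, and the `"*"` user agent is an assumption, since the Actor's actual user agent is not documented here.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt: str, urls: list[str], user_agent: str = "*") -> list[str]:
    """Return the subset of `urls` that the given robots.txt text permits."""
    parser = RobotFileParser()
    # Parse the rules from text already in hand, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]
```

Passing the text of a `robots.txt` that disallows `/private/` would drop any URL under that path and keep the rest.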
Built with
- Crawlee for reliable crawling
- BeautifulSoup for HTML parsing
- tiktoken for token counting