Website to Markdown – Clean LLM & RAG Content Extractor
Pricing
from $2.00 / 1,000 results
Convert any public web page to clean, LLM-ready Markdown with metadata — by URL, a list of URLs, or a whole-site crawl. Strips nav/ads/boilerplate, keeps headings/lists/tables/code. Respects robots.txt. No API key.
Developer
Daniel Brenner
Maintained by Community
Actor stats
Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 2 days ago
Website to Markdown — Clean LLM & RAG Content Extractor
Convert any public web page to clean, LLM-ready Markdown, whether from a single URL, a list of URLs, or a whole-site crawl. No API key, no browser to manage. The tool fetches the page, strips navigation, ads and boilerplate, and returns tidy Markdown plus structured metadata. That makes it a good fit for feeding LLMs, building RAG pipelines, archiving articles, or assembling training datasets.
It's the no-fuss answer to "how do I turn a website into Markdown for ChatGPT / a vector database / an AI agent?" — give it URLs, get Markdown back as JSON, CSV or Excel.
What it does
- HTML → clean Markdown, with headings, lists, tables, code blocks and links preserved (GitHub-flavored).
- Main-content extraction — removes menus, headers, footers, sidebars, cookie banners and ads so the Markdown is just the article, not the chrome. (Or keep the full page if you prefer.)
- Structured metadata per page — title, description, author, published date, language, site name, canonical URL.
- RAG-ready extras — optional token-count estimate and overlapping chunks for retrieval-augmented generation.
- Three modes — a single page, a list of pages, or crawl a site (same-domain link following, depth- and page-limited).
- Respects robots.txt and sends a descriptive User-Agent. Public pages only: no logins, no paywalls.
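To make the robots.txt behaviour concrete, here is a minimal sketch of such a check using Python's standard `urllib.robotparser`. It is not the tool's actual implementation. The User-Agent string and the sample rules are hypothetical, since the listing does not state the real ones.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical User-Agent; the tool's real identifier is not documented here.
USER_AGENT = "example-markdown-bot"


def is_allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt text permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative rules, not taken from any real site.
robots = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(robots, "https://example.com/blog/post"))     # True
print(is_allowed(robots, "https://example.com/private/page"))  # False
```

A result like this would correspond to the robotsAllowed field in the output.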
What you get (per page)
| field | description |
|---|---|
| url / finalUrl | requested URL, and the final URL after redirects (if different) |
| title, description, author, publishedDate, lang, siteName | page metadata (honest-null when not present) |
| canonicalUrl | the page's canonical link, if declared |
| markdown | the clean, LLM-ready Markdown |
| wordCount, tokenEstimate | length, plus an approximate LLM token count (a heuristic estimate, not exact) |
| chunks | optional RAG chunks ({index, text}) when "Chunk for RAG" is on |
| links, images | optional lists of absolute links / image URLs on the page |
| httpStatus, fetchedAt, robotsAllowed | response status, fetch time, robots check result |
Anything a page doesn't expose comes back as null — never guessed.
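Because missing metadata arrives as null, downstream code should check for it explicitly rather than assume every field is a string. The sketch below builds a small header from whichever fields are present. The sample record is invented for illustration and is not real output from the tool. Only the field names come from the table above.

```python
import json
from typing import Any


def front_matter(record: dict[str, Any]) -> str:
    """Build a small header from the metadata fields that are present (not null)."""
    keys = ["title", "author", "publishedDate", "lang", "siteName", "canonicalUrl"]
    lines = [f"{key}: {record[key]}" for key in keys if record.get(key) is not None]
    return "\n".join(lines)


# Illustrative record; JSON null becomes Python None after json.loads.
raw = (
    '{"url": "https://example.com/blog/post", "title": "Hello", '
    '"author": null, "lang": "en", "markdown": "# Hello"}'
)
record = json.loads(raw)

print(front_matter(record))
# title: Hello
# lang: en
```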
How to use it
Single page or a list:
{ "startUrls": ["https://example.com/blog/post"], "mode": "single" }
{ "startUrls": ["https://a.com/p1", "https://a.com/p2"], "mode": "list" }
Crawl a site for a RAG dataset:
{ "startUrls": ["https://docs.example.com/"], "mode": "crawl", "maxCrawlDepth": 2, "maxPages": 100, "sameDomainOnly": true, "chunkForRag": true, "chunkSize": 2000, "chunkOverlap": 200 }
Set contentMode to main (default — just the article) or full (the whole page). Toggle includeLinks / includeImages to also collect a page's links and images.
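To show how chunkSize and chunkOverlap interact, here is a rough sketch of overlapping chunking. It assumes both values are measured in characters and that chunks are cut at fixed offsets. The listing does not say which unit or splitting strategy the tool actually uses, so treat this as an illustration of the idea only.

```python
def chunk_text(text: str, chunk_size: int = 2000, chunk_overlap: int = 200) -> list[dict]:
    """Split text into fixed-size chunks where each chunk repeats the tail of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: list[dict] = []
    start = 0
    while start < len(text):
        chunks.append({"index": len(chunks), "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this chunk already reached the end of the text
        start += step
    return chunks


# Synthetic 4,500-character input, purely for demonstration.
sample = "".join(str(i % 10) for i in range(4500))
chunks = chunk_text(sample)
print(len(chunks))  # 3
```

With the defaults, each chunk starts 1,800 characters after the previous one, so the last 200 characters of one chunk are also the first 200 of the next.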
Why this tool
- No API key, no headless browser to babysit — give it URLs, get Markdown.
- Clean output — boilerplate-stripped main content, not a raw HTML dump.
- LLM-first — Markdown + metadata + optional chunking and token estimate, the exact shape RAG and fine-tuning pipelines want.
- Polite & transparent: respects robots.txt, identifies itself, fetches only public pages one at a time.
Pricing is pay-per-result: $2 per 1,000 pages — you only pay for pages successfully converted. Export as JSON, CSV or Excel.
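The arithmetic is simple enough to sketch. The helper below assumes a flat rate of $2.00 per 1,000 successfully converted pages. The listing header says "from $2.00", so the effective rate on a given plan may differ.

```python
from decimal import Decimal

# Assumed flat rate, taken from the "$2 per 1,000 pages" figure above.
PRICE_PER_1000 = Decimal("2.00")


def estimated_cost(pages_converted: int) -> Decimal:
    """Estimate the charge in dollars for a number of successfully converted pages."""
    if pages_converted < 0:
        raise ValueError("pages_converted must be non-negative")
    return (PRICE_PER_1000 * pages_converted / 1000).quantize(Decimal("0.01"))


print(estimated_cost(100))   # 0.20
print(estimated_cost(2500))  # 5.00
```

So the 100-page crawl from the example above would come to roughly 20 cents if every page converts.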
Responsible use
This tool reformats the public pages you point it at — the same content a browser would load. You're responsible for how you use content from sites you don't own (respect each site's terms and applicable copyright). The tool honors robots.txt and never bypasses logins or paywalls.
FAQ
Do I need an API key?
No. Give it one or more URLs and run it — no key, no quota.
Can it convert a whole website, not just one page?
Yes — use crawl mode with a start URL. It follows same-domain links up to the depth and page limits you set, converting each page to Markdown.
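As an illustration of depth- and page-limited crawling, the sketch below walks an in-memory link map breadth-first. It is not the tool's code. Breadth-first order is an assumption, and so is treating "same domain" as an exact host match. The link map stands in for real page fetches.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_order(start_url: str, links: dict[str, list[str]],
                max_depth: int, max_pages: int) -> list[str]:
    """Return the pages a breadth-first, same-host crawl would visit, in order."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited: list[str] = []

    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in links.get(url, []):
            if link not in seen and urlparse(link).netloc == host:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical site structure used only for this example.
site = {
    "https://docs.example.com/": [
        "https://docs.example.com/a",
        "https://other.example.org/x",
        "https://docs.example.com/b",
    ],
    "https://docs.example.com/a": ["https://docs.example.com/a/deep"],
    "https://docs.example.com/a/deep": ["https://docs.example.com/a/deeper"],
}

print(crawl_order("https://docs.example.com/", site, max_depth=2, max_pages=100))
```

Here the off-site link is skipped, and the page three clicks from the start URL is never reached because it lies beyond the depth limit of 2.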
Is the Markdown ready for an LLM / RAG?
Yes — that's the point. You get clean Markdown plus metadata, an approximate token count, and optional overlapping chunks for retrieval-augmented generation.
How much does it cost?
Pay-per-result: $2 per 1,000 pages — you only pay for the pages you actually get.