Website to Markdown – Clean LLM & RAG Content Extractor

Pricing

from $2.00 / 1,000 results

Convert any public web page to clean, LLM-ready Markdown with metadata — by URL, a list of URLs, or a whole-site crawl. Strips nav/ads/boilerplate, keeps headings/lists/tables/code. Respects robots.txt. No API key.

Developer

Daniel Brenner

Maintained by Community

Website to Markdown — Clean LLM & RAG Content Extractor

Convert any public web page to clean, LLM-ready Markdown — by URL, a list of URLs, or a whole-site crawl. No API key, no browser to manage. The tool fetches the page, strips navigation, ads and boilerplate, and returns tidy Markdown plus structured metadata — ideal for feeding LLMs, building RAG pipelines, archiving articles, or training datasets.

It's the no-fuss answer to "how do I turn a website into Markdown for ChatGPT / a vector database / an AI agent?" — give it URLs, get Markdown back as JSON, CSV or Excel.

What it does

  • HTML → clean Markdown, with headings, lists, tables, code blocks and links preserved (GitHub-flavored).
  • Main-content extraction — removes menus, headers, footers, sidebars, cookie banners and ads so the Markdown is just the article, not the chrome. (Or keep the full page if you prefer.)
  • Structured metadata per page — title, description, author, published date, language, site name, canonical URL.
  • RAG-ready extras — optional token-count estimate and overlapping chunks for retrieval-augmented generation.
  • Three modes — a single page, a list of pages, or crawl a site (same-domain link following, depth- and page-limited).
  • Respects robots.txt and sends a descriptive User-Agent. Public pages only — no logins, no paywalls.
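As a rough illustration of the robots.txt check mentioned in the last bullet, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The User-Agent string and the `is_allowed` helper are hypothetical names chosen for this example; the tool's actual User-Agent and internal logic are not documented here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical User-Agent; the tool's real identifier is an assumption here.
USER_AGENT = "example-markdown-bot"


def is_allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A small robots.txt that blocks one directory for every crawler.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(ROBOTS, "https://example.com/blog/post"))
print(is_allowed(ROBOTS, "https://example.com/private/page"))
```

A page that fails this kind of check would presumably be the case where `robotsAllowed` is reported as false in the output.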

What you get (per page)

| Field | Description |
| --- | --- |
| url / finalUrl | requested URL, and the final URL after redirects (if different) |
| title, description, author, publishedDate, lang, siteName | page metadata (null when not present) |
| canonicalUrl | the page's canonical link, if declared |
| markdown | the clean, LLM-ready Markdown |
| wordCount, tokenEstimate | word count, plus an approximate LLM token count (a heuristic estimate, not exact) |
| chunks | optional RAG chunks ({index, text}) when "Chunk for RAG" is on |
| links, images | optional lists of absolute links / image URLs on the page |
| httpStatus, fetchedAt, robotsAllowed | response status, fetch time, robots check result |

Anything a page doesn't expose comes back as null — never guessed.
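Because missing metadata arrives as null rather than a guessed value, downstream code should treat those fields as optional. The sketch below shows one way to load a result item in Python. The `PageResult` class, `parse_item`, and `display_title` are hypothetical helpers for illustration, and the sample item is made up; only the field names come from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageResult:
    """A subset of the per-page fields; metadata may be None (JSON null)."""
    url: str
    markdown: str
    title: Optional[str] = None
    author: Optional[str] = None
    published_date: Optional[str] = None


def parse_item(item: dict) -> PageResult:
    """Map one result item (a decoded JSON object) onto PageResult."""
    return PageResult(
        url=item["url"],
        markdown=item.get("markdown") or "",
        title=item.get("title"),
        author=item.get("author"),
        published_date=item.get("publishedDate"),
    )


def display_title(page: PageResult) -> str:
    """Fall back to the URL when the page declared no title."""
    return page.title if page.title is not None else page.url


# Made-up sample item: the page exposed an author but no title or date.
sample = {
    "url": "https://example.com/blog/post",
    "markdown": "# Hello\n\nBody text.",
    "title": None,
    "author": "Jane Doe",
    "publishedDate": None,
}

page = parse_item(sample)
print(display_title(page))
```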

How to use it

Single page or a list:

{ "startUrls": ["https://example.com/blog/post"], "mode": "single" }
{ "startUrls": ["https://a.com/p1", "https://a.com/p2"], "mode": "list" }

Crawl a site for a RAG dataset:

{
  "startUrls": ["https://docs.example.com/"],
  "mode": "crawl",
  "maxCrawlDepth": 2,
  "maxPages": 100,
  "sameDomainOnly": true,
  "chunkForRag": true,
  "chunkSize": 2000,
  "chunkOverlap": 200
}
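To make the `chunkSize` / `chunkOverlap` settings concrete, here is a minimal sketch of overlapping chunking in Python. It assumes both values are measured in characters and that chunks are cut at fixed offsets; the tool may use different units or smarter boundaries, so treat `chunk_text` as a hypothetical stand-in rather than the actual implementation.

```python
def chunk_text(text: str, chunk_size: int = 2000, chunk_overlap: int = 200) -> list[dict]:
    """Split text into {index, text} chunks that overlap by chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks: list[dict] = []
    start = 0
    while start < len(text):
        chunks.append({"index": len(chunks), "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this chunk already reached the end of the text
        start += step
    return chunks


# 5,000 characters of sample text with a repeating digit pattern.
sample_text = "".join(str(i % 10) for i in range(5000))
chunks = chunk_text(sample_text, chunk_size=2000, chunk_overlap=200)
print([len(c["text"]) for c in chunks])
```

With these settings each chunk starts 1,800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next.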

Set contentMode to main (default — just the article) or full (the whole page). Toggle includeLinks / includeImages to also collect a page's links and images.

Why this tool

  • No API key, no headless browser to babysit — give it URLs, get Markdown.
  • Clean output — boilerplate-stripped main content, not a raw HTML dump.
  • LLM-first — Markdown + metadata + optional chunking and token estimate, the exact shape RAG and fine-tuning pipelines want.
  • Polite & transparent — respects robots.txt, identifies itself, fetches only public pages one at a time.

Pricing is pay-per-result: $2 per 1,000 pages — you only pay for pages successfully converted. Export as JSON, CSV or Excel.

Responsible use

This tool reformats the public pages you point it at — the same content a browser would load. You're responsible for how you use content from sites you don't own (respect each site's terms and applicable copyright). The tool honors robots.txt and never bypasses logins or paywalls.

FAQ

Do I need an API key?

No. Give it one or more URLs and run it — no key, no quota.

Can it convert a whole website, not just one page?

Yes — use crawl mode with a start URL. It follows same-domain links up to the depth and page limits you set, converting each page to Markdown.
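The interaction between the depth limit, the page limit, and same-domain filtering can be sketched as a breadth-first traversal. The example below is an assumption-laden illustration: `plan_crawl` is a hypothetical function, and an in-memory link map stands in for real page fetching, so the tool's actual crawl order may differ.

```python
from collections import deque
from urllib.parse import urlparse


def plan_crawl(start_url, links, max_depth=2, max_pages=100, same_domain_only=True):
    """Return the URLs a breadth-first crawl would visit, in order.

    links maps each URL to the URLs found on that page (a stand-in for fetching).
    """
    start_host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []

    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links from pages at the depth limit
        for target in links.get(url, []):
            if target in seen:
                continue
            if same_domain_only and urlparse(target).netloc != start_host:
                continue  # skip links that leave the start domain
            seen.add(target)
            queue.append((target, depth + 1))
    return visited


# Made-up link graph for illustration.
GRAPH = {
    "https://docs.example.com/": ["https://docs.example.com/a", "https://other.com/x"],
    "https://docs.example.com/a": ["https://docs.example.com/b"],
    "https://docs.example.com/b": ["https://docs.example.com/c"],
}

print(plan_crawl("https://docs.example.com/", GRAPH, max_depth=2))
```

Here the off-domain link is skipped, and the page three links away from the start is never reached because of the depth limit of 2.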

Is the Markdown ready for an LLM / RAG?

Yes — that's the point. You get clean Markdown plus metadata, an approximate token count, and optional overlapping chunks for retrieval-augmented generation.
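For a sense of what an approximate token count can look like, here is a sketch of one common rule of thumb: roughly four characters per token for English text. This ratio is an assumption for illustration only; the heuristic the tool actually uses is not specified, and real tokenizers will give different numbers.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate: character count divided by an assumed ratio, rounded up."""
    return math.ceil(len(text) / chars_per_token)


print(estimate_tokens("Convert any public web page to clean Markdown."))
```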

How much does it cost?

Pay-per-result: $2 per 1,000 pages — you only pay for the pages you actually get.
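A quick way to budget a run is to multiply the page count by the per-result rate. The sketch below assumes a flat rate of $2 per 1,000 pages; since the listing says "from $2.00", the actual rate may vary, and `estimate_cost` is a hypothetical helper.

```python
def estimate_cost(pages: int, usd_per_1000: float = 2.00) -> float:
    """Estimated cost in USD for a number of successfully converted pages."""
    return pages * usd_per_1000 / 1000


# A crawl capped at maxPages = 100 would cost at most about 20 cents at this rate.
print(f"${estimate_cost(100):.2f}")
```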