Website to Markdown - Clean LLM-Ready Content

Pricing

Pay per usage

Convert any webpage into clean markdown stripped of navigation, ads, and boilerplate. Perfect for RAG pipelines, LLM context, and content extraction. Token counts included.

Developer

C. K.

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 5 hours ago

Website to Markdown — Clean, LLM-Ready Content Extraction

Convert any webpage or website into clean markdown, stripped of navigation, ads, sidebars, and boilerplate. Output drops straight into any RAG pipeline, LLM context window, or vector store without cleanup. Token counts included so you can plan your embedding budget.

What it does

Most web scrapers give you raw HTML or a wall of unstructured text. You then spend hours cleaning, reformatting, and fixing broken context. This Actor eliminates that step.

Give it a URL. It crawls the site, strips page chrome (navigation, sidebars, footers, cookie banners), and converts each page to clean markdown that preserves headings, code blocks, tables, lists, and links. Every page includes a token count (cl100k_base encoding), so you can estimate what it costs to embed the page or send it to an LLM.

Output format

| Field | Type | Description |
| --- | --- | --- |
| url | string | Source URL of the page |
| title | string | Page title |
| content | string | Clean markdown content |
| token_count | integer | Token count (cl100k_base encoding) |
| content_length | integer | Character count |
| meta_description | string | Page meta description (if available) |
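
As an illustration of how the token_count field can be used downstream, here is a minimal Python sketch that keeps pages, in order, while their combined token count fits a context budget. The records are made up to match the fields listed above, and `select_within_budget` is a hypothetical helper, not part of the Actor. It assumes token_count is missing when includeMetadata is turned off.

```python
from typing import Iterable


def select_within_budget(pages: Iterable[dict], max_tokens: int) -> list[dict]:
    """Greedily keep pages, in order, while the running token total fits the budget."""
    selected = []
    used = 0
    for page in pages:
        tokens = page.get("token_count")
        if tokens is None:
            continue  # assumption: no token_count when includeMetadata is false
        if used + tokens > max_tokens:
            continue  # this page would overflow; a later, smaller page may still fit
        selected.append(page)
        used += tokens
    return selected


# Made-up records shaped like the Actor's output fields
pages = [
    {"url": "https://example.com/a", "title": "A", "content": "# A",
     "token_count": 1200, "content_length": 4800},
    {"url": "https://example.com/b", "title": "B", "content": "# B",
     "token_count": 7000, "content_length": 28000},
    {"url": "https://example.com/c", "title": "C", "content": "# C",
     "token_count": 900, "content_length": 3600},
]

chosen = select_within_budget(pages, max_tokens=4000)
print([p["url"] for p in chosen])            # ['https://example.com/a', 'https://example.com/c']
print(sum(p["token_count"] for p in chosen))  # 2100
```

A greedy pass like this is only one option; sorting by relevance before applying the budget may suit a RAG pipeline better.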

Input parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| startUrl | string | | URL to start crawling from |
| urls | array | | List of specific URLs to convert (batch mode) |
| maxPages | integer | 50 | Maximum pages to convert |
| crawlSameDomain | boolean | true | Stay within the start URL's domain |
| pathPrefix | string | "" | Only crawl paths starting with this prefix |
| outputFormat | string | "markdown" | "markdown" or "plain_text" |
| includeMetadata | boolean | true | Include token count and meta description |
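
To make the interaction between crawlSameDomain and pathPrefix concrete, the following Python sketch shows one plausible way a discovered link could be tested against those two settings. It is an assumption about the semantics, not the Actor's actual code: `in_scope` is a hypothetical function, and "same domain" is approximated here by comparing hostnames.

```python
from urllib.parse import urlparse


def in_scope(url: str, start_url: str,
             crawl_same_domain: bool = True, path_prefix: str = "") -> bool:
    """Return True if `url` would be followed under the assumed scoping rules."""
    start = urlparse(start_url)
    target = urlparse(url)
    if target.scheme not in ("http", "https"):
        return False  # ignore mailto:, javascript:, and similar links
    if crawl_same_domain and target.hostname != start.hostname:
        return False  # assumption: "same domain" means identical hostname
    # An empty prefix matches every path
    return (target.path or "/").startswith(path_prefix)


start = "https://fastapi.tiangolo.com/"
print(in_scope("https://fastapi.tiangolo.com/tutorial/first-steps/", start,
               path_prefix="/tutorial/"))  # True
print(in_scope("https://fastapi.tiangolo.com/advanced/", start,
               path_prefix="/tutorial/"))  # False
```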

Example usage

Single page

{
  "startUrl": "https://docs.python.org/3/library/asyncio.html",
  "maxPages": 1
}

Batch conversion

{
  "urls": [
    "https://example.com/page-1",
    "https://example.com/page-2",
    "https://example.com/page-3"
  ],
  "maxPages": 3
}

Full site crawl

{
  "startUrl": "https://fastapi.tiangolo.com/",
  "maxPages": 100,
  "pathPrefix": "/tutorial/"
}
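
Inputs like the ones above can also be sent from code. The sketch below uses the official apify-client Python package (`pip install apify-client`) and assumes an API token in the `APIFY_TOKEN` environment variable. `ACTOR_ID` is a placeholder, since this page does not state the Actor's ID, and `build_run_input` and `run_and_collect` are hypothetical helper names.

```python
import os

# Placeholder: copy the real "<username>/<actor-name>" ID from the Actor's Store page
ACTOR_ID = "<username>/<actor-name>"


def build_run_input(start_url: str, max_pages: int = 50, path_prefix: str = "") -> dict:
    """Assemble an input object using the parameter names documented above."""
    run_input = {"startUrl": start_url, "maxPages": max_pages}
    if path_prefix:
        run_input["pathPrefix"] = path_prefix
    return run_input


def run_and_collect(run_input: dict) -> list[dict]:
    """Start the Actor, wait for it to finish, and return its dataset items."""
    from apify_client import ApifyClient  # imported lazily so the helper above works without it

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


run_input = build_run_input("https://fastapi.tiangolo.com/", max_pages=100,
                            path_prefix="/tutorial/")
print(run_input)

# Uncomment once ACTOR_ID and APIFY_TOKEN are set:
# for page in run_and_collect(run_input):
#     print(page["url"], page.get("token_count"))
```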

Pricing

This Actor uses the pay-per-event model: you are charged for each page successfully converted to markdown. Skipped pages (empty or non-content) are not charged.

How it works

  1. Crawl — Crawlee handles the URL queue, deduplication, rate limiting, and robots.txt compliance.
  2. Clean — Strips navigation, sidebars, footers, cookie banners, and boilerplate using curated selectors. Falls back to <article>, <main>, or <body>.
  3. Convert — Transforms clean HTML to structured markdown, preserving headings, code blocks, tables, lists, and links.
  4. Count — Uses the cl100k_base encoding (used by GPT-4 and OpenAI's text-embedding models) to compute token counts. Counts are approximate for models that use a different tokenizer.
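
The Clean step's fallback can be sketched with Python's standard library alone. This is an illustrative approximation, not the Actor's implementation: it reads the fallback as "use the first of <article>, <main>, <body> that is present", the `CHROME_TAGS` set stands in for the curated selectors (which are not listed here), and it returns plain text rather than markdown.

```python
from html.parser import HTMLParser

FALLBACK_ORDER = ("article", "main", "body")  # order taken from the Clean step
# Assumed stand-in for the Actor's curated selectors
CHROME_TAGS = {"nav", "aside", "footer", "header", "script", "style"}


class _TagCollector(HTMLParser):
    """Record which tag names occur in the document."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def handle_starttag(self, tag, attrs):
        self.seen.add(tag)


class _TextExtractor(HTMLParser):
    """Collect text inside `root`, skipping anything nested in a chrome tag."""

    def __init__(self, root):
        super().__init__()
        self.root = root
        self.root_depth = 0
        self.chrome_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.root:
            self.root_depth += 1
        elif self.root_depth and tag in CHROME_TAGS:
            self.chrome_depth += 1

    def handle_endtag(self, tag):
        if tag == self.root and self.root_depth:
            self.root_depth -= 1
        elif self.root_depth and tag in CHROME_TAGS and self.chrome_depth:
            self.chrome_depth -= 1

    def handle_data(self, data):
        if self.root_depth and not self.chrome_depth and data.strip():
            self.chunks.append(data.strip())


def pick_root(html: str):
    """Return the first container from FALLBACK_ORDER present in the page, or None."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    return next((tag for tag in FALLBACK_ORDER if tag in collector.seen), None)


def extract_main_text(html: str) -> str:
    root = pick_root(html)
    if root is None:
        return ""
    extractor = _TextExtractor(root)
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


page = ("<html><body><nav>Home</nav><main><h1>Title</h1><aside>Ad</aside>"
        "<p>Body text.</p></main><footer>(c)</footer></body></html>")
print(pick_root(page))          # main
print(extract_main_text(page))  # Title Body text.
```

A production cleaner would more likely work on a parsed DOM with CSS selectors; the two-pass parser here only keeps the example dependency-free.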

Responsible use

  • This Actor respects robots.txt by default (enforced by Crawlee).
  • Crawlee's built-in autoscaling keeps request rates reasonable.
  • You are responsible for ensuring your use complies with the target site's Terms of Service.

Built with