AI Web to Markdown - LLM-Ready Extractor

Convert any URL into clean LLM-ready markdown. Strips ads, nav, footer. Preserves headings, lists, tables, code blocks. Returns token count. Perfect for RAG, fine-tuning, AI agents. 10x cheaper than Firecrawl.

  • Pricing: Pay per event
  • Rating: 0.0 (0 reviews)
  • Developer: Mohieldin Mohamed (Maintained by Community)
  • Actor stats: 0 bookmarked · 2 total users · 1 monthly active user · last modified 3 days ago

AI Web to Markdown — LLM-Ready Content Extractor

Convert any URL into clean markdown your LLM can actually read. 10x cheaper than Firecrawl, perfect for RAG, fine-tuning, and AI agent context.

This actor takes a list of URLs, fetches each one, strips out ads / navigation / footers / scripts, extracts the main article content using smart heuristics, and converts the result into beautifully clean markdown that's optimized for LLM consumption. Each output includes a token count so you can budget your context windows precisely.

What does AI Web to Markdown do?

You point it at any URL — a blog post, a documentation page, a Wikipedia article, a news story, a product page — and it returns:

  • The main content as clean markdown (headings, lists, tables, code blocks all preserved)
  • YAML frontmatter with the page's title, description, author, publish date, language, and source URL
  • Word count and estimated token count so you know exactly how much context window the page will consume

Try it: paste any URL into the Start URLs field and press Start. Within seconds you get back a structured row that's ready to drop straight into your RAG pipeline, your fine-tuning dataset, or your AI agent's context window.

Apify platform advantages include scheduled runs (re-extract every day to catch updates), API access (pull the dataset directly into your training pipeline), proxy rotation when needed, and parallel extraction of thousands of URLs in one run.

Why use AI Web to Markdown?

  • Build RAG systems on the cheap. Firecrawl charges $19+/month for similar functionality. This actor is pay-per-event at $0.005 per page, so a 1,000-page extraction costs about $5.
  • Fine-tune domain-specific LLMs. Convert thousands of niche-domain articles into clean training data in one batch.
  • Pre-process AI agent context. Don't waste tokens on ads and nav — feed only the content that matters.
  • Bulk content audit. Extract every page on a competitor's site and analyze with an LLM.
  • Backup your own content. Snapshot a website's articles into clean markdown for archival.
  • Migrate from old CMS to new. Get every blog post out of an old site as portable markdown.

How to use

  1. Click Try for free (or Start)
  2. Paste one or more URLs into Start URLs
  3. Optionally tweak settings (strip boilerplate, preserve links/images, max length)
  4. Click Start
  5. Download the dataset as JSON, CSV, or Excel — or pull it directly via the Apify API
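Step 5 can also be automated. A minimal sketch of pulling the run's dataset through Apify's public v2 REST API, using only the standard library (the dataset ID and token are placeholders you get from your run and account):

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"

def build_items_url(dataset_id: str, fmt: str = "json", clean: bool = True) -> str:
    """Build the v2 dataset-items URL for a run's default dataset."""
    return f"{API_BASE}/datasets/{dataset_id}/items?format={fmt}&clean={'true' if clean else 'false'}"

def fetch_items(dataset_id: str, token: str) -> list[dict]:
    """Download all dataset rows as a list of dicts (one per extracted URL)."""
    req = urllib.request.Request(
        build_items_url(dataset_id),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Each returned row carries the fields described in the Output section below (url, markdown, estimatedTokens, and so on), ready to feed into a training or RAG pipeline.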

Input

  • Start URLs — one or more URLs to convert (each becomes one dataset row)
  • Strip ads, nav, footer, boilerplate — recommended on for clean RAG output (default: yes)
  • Preserve links — keep [text](url) markdown links (default: yes)
  • Preserve images — keep ![alt](url) references (default: yes)
  • Include metadata — attach YAML frontmatter (default: yes)
  • Max length — truncate output to N characters (default: unlimited)
  • Proxy configuration — optional Apify Proxy for blocked sites
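A run input combining these options might look like the following. The field names here are an illustration inferred from the tips and FAQ sections (stripBoilerplate, preserveLinks, preserveImages, maxLength); check the actor's input schema for the exact keys:

```json
{
  "startUrls": [
    { "url": "https://en.wikipedia.org/wiki/Model_Context_Protocol" }
  ],
  "stripBoilerplate": true,
  "preserveLinks": true,
  "preserveImages": false,
  "maxLength": 40000
}
```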

Output

{
  "url": "https://en.wikipedia.org/wiki/Model_Context_Protocol",
  "sourceUrl": "https://en.wikipedia.org/wiki/Model_Context_Protocol",
  "title": "Model Context Protocol - Wikipedia",
  "description": "An open protocol for connecting AI agents to data sources and tools.",
  "author": null,
  "publishedAt": "2024-11-25T00:00:00Z",
  "siteName": "Wikipedia",
  "language": "en",
  "wordCount": 2147,
  "estimatedTokens": 2580,
  "markdown": "---\nurl: \"https://en.wikipedia.org/...\"\ntitle: \"Model Context Protocol - Wikipedia\"\n---\n\n# Model Context Protocol\n\nThe **Model Context Protocol** (MCP) is an open standard...",
  "extractedAt": "2026-04-15T19:00:00.000Z"
}
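When metadata is included, the markdown field starts with a YAML frontmatter block. A minimal sketch of separating it from the body downstream, using only the --- delimiters (no YAML parser required):

```python
def split_frontmatter(markdown: str) -> tuple[str, str]:
    """Split a '---'-delimited YAML frontmatter block from the markdown body.

    Returns (frontmatter, body); frontmatter is "" when no block is present.
    """
    if markdown.startswith("---\n"):
        end = markdown.find("\n---\n", 4)
        if end != -1:
            return markdown[4:end], markdown[end + 5:].lstrip("\n")
    return "", markdown

sample = '---\ntitle: "Model Context Protocol - Wikipedia"\n---\n\n# Model Context Protocol\n'
meta, body = split_frontmatter(sample)
```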

Data table

| Field | Type | Description |
| --- | --- | --- |
| url | string | The final URL after redirects |
| sourceUrl | string | The URL you provided |
| title | string | Page title (from <title> or OG tags) |
| description | string | Meta description |
| author | string | Author from meta tags or microdata |
| publishedAt | string | Publication date |
| siteName | string | Site name from og:site_name |
| language | string | Page language code |
| wordCount | number | Word count of the markdown output |
| estimatedTokens | number | Estimated token count (~4 chars/token) |
| markdown | string | Clean LLM-ready markdown |
| extractedAt | string | ISO timestamp of extraction |

Pricing

This actor uses Apify's pay-per-event pricing — you only pay for what you extract:

  • Actor start: $0.01 per run
  • Per page extracted: $0.005 per URL successfully converted

Example costs:

  • 100 blog posts → $0.51
  • 1,000 documentation pages → $5.01
  • 10,000 articles for fine-tuning → $50.01
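The examples above follow directly from the two events; the arithmetic is just a flat start fee plus a per-page fee:

```python
def run_cost(pages: int, start_fee: float = 0.01, per_page: float = 0.005) -> float:
    """Total cost of one run: flat actor-start fee plus per-page extraction fee."""
    return round(start_fee + pages * per_page, 2)
```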

Compare to Firecrawl at $19/month for 500 credits, or $99/month for 5,000 credits. Pay-per-event is dramatically cheaper for moderate use and dramatically simpler for one-off extractions.

Free Apify tier members get $5/month in platform credits, which covers ~1,000 pages of extraction per month.

Tips and advanced options

  • Disable preserveImages when building text-only training datasets to slim the output
  • Disable preserveLinks for pure plain-text RAG ingestion
  • Use maxLength to enforce a per-page token budget (useful for fixed-context RAG)
  • Combine with the Sitemap URL Extractor to ingest an entire website in two steps
  • Schedule daily runs to keep your RAG dataset fresh as content changes
  • Pipe into Pinecone / Weaviate / Qdrant via Apify webhooks for fully automated RAG ingestion
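The per-page token budget tip can also be approximated client-side after extraction. A sketch that converts a token budget into a character cap using the actor's ~4 chars/token estimate, cutting at a paragraph break so markdown stays readable (function name is illustrative):

```python
CHARS_PER_TOKEN = 4  # the actor's rule of thumb for GPT-4/Claude/Llama tokenizers

def truncate_to_token_budget(markdown: str, max_tokens: int) -> str:
    """Trim markdown to roughly max_tokens, cutting at the last paragraph break."""
    budget = max_tokens * CHARS_PER_TOKEN
    if len(markdown) <= budget:
        return markdown
    cut = markdown.rfind("\n\n", 0, budget)
    return markdown[:cut] if cut > 0 else markdown[:budget]
```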

FAQ and support

How accurate is the boilerplate stripping? Very good for typical blogs, news sites, documentation, and Wikipedia. Less good for heavily templated sites that use unusual class names. If you see junk in the output, disable stripBoilerplate and post-process yourself, or open an issue with the URL.

What's the token count based on? A reliable ~4 chars/token rule of thumb that matches GPT-4, Claude, and Llama tokenizers within ±10%.
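That heuristic is easy to reproduce when sanity-checking the estimatedTokens field against your own context budgeting:

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Approximate token count via the ~4 chars/token heuristic the actor reports."""
    return max(1, len(text) // chars_per_token) if text else 0
```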

Does it follow redirects? Yes. The url field shows the resolved URL.

Does it work on JavaScript-rendered sites? This is an HTTP-based extractor (no browser), so it works on server-rendered HTML. For SPAs and JavaScript-heavy sites, use a browser-based actor.

Found a bug? Open an issue on the Issues tab.