AI Web to Markdown - LLM-Ready Extractor
Pricing
Pay per event
Convert any URL into clean LLM-ready markdown. Strips ads, nav, footer. Preserves headings, lists, tables, code blocks. Returns token count. Perfect for RAG, fine-tuning, AI agents. 10x cheaper than Firecrawl.
Developer: Mohieldin Mohamed
Actor stats: 0 bookmarks · 2 total users · 1 monthly active user
Last modified: 3 days ago
AI Web to Markdown — LLM-Ready Content Extractor
Convert any URL into clean markdown your LLM can actually read. 10x cheaper than Firecrawl, perfect for RAG, fine-tuning, and AI agent context.
This actor takes a list of URLs, fetches each one, strips out ads / navigation / footers / scripts, extracts the main article content using smart heuristics, and converts the result into beautifully clean markdown that's optimized for LLM consumption. Each output includes a token count so you can budget your context windows precisely.
What does AI Web to Markdown do?
You point it at any URL — a blog post, a documentation page, a Wikipedia article, a news story, a product page — and it returns:
- The main content as clean markdown (headings, lists, tables, code blocks all preserved)
- YAML frontmatter with the page's title, description, author, publish date, language, and source URL
- Word count and estimated token count so you know exactly how much context window the page will consume
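The reported counts follow a simple heuristic: whitespace-delimited words, and roughly 4 characters per token. A minimal sketch of that estimate (the function names here are illustrative, not the actor's internals):

```python
def word_count(markdown: str) -> int:
    # Whitespace-delimited word count of the markdown output
    return len(markdown.split())

def estimate_tokens(markdown: str) -> int:
    # ~4 characters per token, the same rule of thumb the actor reports
    return round(len(markdown) / 4)
```

This lets you pre-filter rows that would blow a context budget before sending them anywhere.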
Try it: paste any URL into the Start URLs field and press Start. Within seconds you get back a structured row that's ready to drop straight into your RAG pipeline, your fine-tuning dataset, or your AI agent's context window.
Apify platform advantages include scheduled runs (re-extract every day to catch updates), API access (pull the dataset directly into your training pipeline), proxy rotation when needed, and parallel extraction of thousands of URLs in one run.
Why use AI Web to Markdown?
- Build RAG systems on the cheap. Firecrawl charges $19+/month for similar functionality. This actor is pay-per-event at $0.005/page — the entire Wikipedia AI articles set costs ~$5.
- Fine-tune domain-specific LLMs. Convert thousands of niche-domain articles into clean training data in one batch.
- Pre-process AI agent context. Don't waste tokens on ads and nav — feed only the content that matters.
- Bulk content audit. Extract every page on a competitor's site and analyze with an LLM.
- Backup your own content. Snapshot a website's articles into clean markdown for archival.
- Migrate from old CMS to new. Get every blog post out of an old site as portable markdown.
How to use
- Click Try for free (or Start)
- Paste one or more URLs into Start URLs
- Optionally tweak settings (strip boilerplate, preserve links/images, max length)
- Click Start
- Download the dataset as JSON, CSV, or Excel — or pull it directly via the Apify API
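Pulling results over the API can be a single HTTP call via Apify's run-sync-get-dataset-items endpoint, which starts the run, waits, and returns the dataset rows. A minimal stdlib sketch — the actor ID `mohieldin~ai-web-to-markdown` and the `startUrls` input shape are assumptions; check the actor's Input tab and API card:

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"

def build_endpoint(actor_id: str, token: str) -> str:
    # run-sync-get-dataset-items runs the actor and returns its dataset in one call
    return f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items?token={token}"

def extract(urls, actor_id="mohieldin~ai-web-to-markdown", token="YOUR_APIFY_TOKEN"):
    # Hypothetical actor ID above; startUrls follows Apify's usual request-list shape
    payload = json.dumps({"startUrls": [{"url": u} for u in urls]}).encode()
    req = urllib.request.Request(
        build_endpoint(actor_id, token),
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # one row per converted URL
```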
Input
- Start URLs — one or more URLs to convert (each becomes one dataset row)
- Strip ads, nav, footer, boilerplate — recommended to leave on for clean RAG output (default: yes)
- Preserve links — keep `[text](url)` markdown links (default: yes)
- Preserve images — keep `![alt](url)` image references (default: yes)
- Include metadata — attach YAML frontmatter (default: yes)
- Max length — truncate output to N characters (default: unlimited)
- Proxy configuration — optional Apify Proxy for blocked sites
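Taken together, a run input might look like this — the field names below are assumptions based on the options above; the actor's input schema is authoritative:

```json
{
  "startUrls": [{ "url": "https://en.wikipedia.org/wiki/Model_Context_Protocol" }],
  "stripBoilerplate": true,
  "preserveLinks": true,
  "preserveImages": true,
  "includeMetadata": true
}
```

`maxLength` and `proxyConfiguration` are optional and can be omitted for defaults.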
Output
```json
{
  "url": "https://en.wikipedia.org/wiki/Model_Context_Protocol",
  "sourceUrl": "https://en.wikipedia.org/wiki/Model_Context_Protocol",
  "title": "Model Context Protocol - Wikipedia",
  "description": "An open protocol for connecting AI agents to data sources and tools.",
  "author": null,
  "publishedAt": "2024-11-25T00:00:00Z",
  "siteName": "Wikipedia",
  "language": "en",
  "wordCount": 2147,
  "estimatedTokens": 2580,
  "markdown": "---\nurl: \"https://en.wikipedia.org/...\"\ntitle: \"Model Context Protocol - Wikipedia\"\n---\n\n# Model Context Protocol\n\nThe **Model Context Protocol** (MCP) is an open standard...",
  "extractedAt": "2026-04-15T19:00:00.000Z"
}
```
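Because the markdown field carries its YAML frontmatter inline, downstream code often needs to separate metadata from body before chunking or embedding. A small sketch using pure string handling (no YAML library):

```python
def split_frontmatter(markdown: str):
    # Returns (frontmatter, body); frontmatter is "" when absent
    if markdown.startswith("---\n"):
        end = markdown.find("\n---\n", 4)
        if end != -1:
            return markdown[4:end], markdown[end + 5:].lstrip("\n")
    return "", markdown

row_markdown = '---\nurl: "https://example.com"\ntitle: "Example"\n---\n\n# Example\n\nBody text.'
meta, body = split_frontmatter(row_markdown)
```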
Data table
| Field | Type | Description |
|---|---|---|
| url | string | The final URL after redirects |
| sourceUrl | string | The URL you provided |
| title | string | Page title (from `<title>` or Open Graph) |
| description | string | Meta description |
| author | string | Author from meta tags or microdata |
| publishedAt | string | Publication date |
| siteName | string | Site name from `og:site_name` |
| language | string | Page language code |
| wordCount | number | Word count of the markdown output |
| estimatedTokens | number | Estimated token count (~4 chars/token) |
| markdown | string | Clean LLM-ready markdown |
| extractedAt | string | ISO timestamp |
Pricing
This actor uses Apify's pay-per-event pricing — you only pay for what you extract:
- Actor start: $0.01 per run
- Per page extracted: $0.005 per URL successfully converted
Example costs:
- 100 blog posts → $0.51
- 1,000 documentation pages → $5.01
- 10,000 articles for fine-tuning → $50.01
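The arithmetic behind those examples is a one-liner, using the two event prices listed above:

```python
ACTOR_START_USD = 0.01  # charged once per run
PER_PAGE_USD = 0.005    # charged per successfully converted URL

def run_cost(pages: int, runs: int = 1) -> float:
    # Total USD cost of converting `pages` URLs across `runs` actor runs
    return round(runs * ACTOR_START_USD + pages * PER_PAGE_USD, 2)
```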
Compare to Firecrawl at $19/month for 500 credits, or $99/month for 5,000 credits. Pay-per-event is dramatically cheaper for moderate use and dramatically simpler for one-off extractions.
Free Apify tier members get $5/month in platform credits, which covers ~1,000 pages of extraction per month.
Tips and advanced options
- Disable `preserveImages` when building text-only training datasets to slim the output
- Disable `preserveLinks` for pure plain-text RAG ingestion
- Use `maxLength` to enforce a per-page token budget (useful for fixed-context RAG)
- Combine with the Sitemap URL Extractor to ingest an entire website in two steps
- Schedule daily runs to keep your RAG dataset fresh as content changes
- Pipe into Pinecone / Weaviate / Qdrant via Apify webhooks for fully automated RAG ingestion
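For fixed-context RAG, a character limit for maxLength can be derived from a token budget via the same ~4 chars/token heuristic. A sketch of the conversion, plus a local fallback that truncates at a word boundary (both functions are illustrative helpers, not part of the actor):

```python
def max_length_for_budget(token_budget: int, chars_per_token: int = 4) -> int:
    # Character limit to pass as maxLength for a given per-page token budget
    return token_budget * chars_per_token

def truncate_to_budget(markdown: str, token_budget: int) -> str:
    # Client-side fallback: cut at the last word boundary within the budget
    limit = max_length_for_budget(token_budget)
    if len(markdown) <= limit:
        return markdown
    return markdown[:limit].rsplit(None, 1)[0]
```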
FAQ and support
How accurate is the boilerplate stripping? Very reliable for typical blogs, news sites, documentation, and Wikipedia; less so for heavily templated sites that use unusual class names. If you see junk in the output, disable `stripBoilerplate` and post-process yourself, or open an issue with the URL.
What's the token count based on? A reliable ~4 chars/token rule of thumb that matches GPT-4, Claude, and Llama tokenizers within ±10%.
Does it follow redirects? Yes. The url field shows the resolved URL.
Does it work on JavaScript-rendered sites? This is an HTTP-based extractor (no browser), so it works on server-rendered HTML. For SPAs and JavaScript-heavy sites, use a browser-based actor.
Found a bug? Open an issue on the Issues tab.