AI Web to Markdown - LLM-Ready Extractor

Convert any URL into clean LLM-ready markdown. Strips ads, nav, footer. Preserves headings, lists, tables, code blocks. Returns token count. Perfect for RAG, fine-tuning, AI agents. 10x cheaper than Firecrawl.

  • Pricing: Pay per event
  • Rating: 0.0 (0 reviews)
  • Developer: Mohieldin Mohamed (Maintained by Community)
  • Actor stats: 0 bookmarked · 2 total users · 1 monthly active user · last modified 3 days ago

AI Web to Markdown — LLM-Ready Content Extractor

Convert any URL into clean markdown your LLM can actually read. 10x cheaper than Firecrawl, perfect for RAG, fine-tuning, and AI agent context.

This actor takes a list of URLs, fetches each one, strips out ads / navigation / footers / scripts, extracts the main article content using smart heuristics, and converts the result into beautifully clean markdown that's optimized for LLM consumption. Each output includes a token count so you can budget your context windows precisely.

What does AI Web to Markdown do?

You point it at any URL — a blog post, a documentation page, a Wikipedia article, a news story, a product page — and it returns:

  • The main content as clean markdown (headings, lists, tables, code blocks all preserved)
  • YAML frontmatter with the page's title, description, author, publish date, language, and source URL
  • Word count and estimated token count so you know exactly how much context window the page will consume

Try it: paste any URL into the Start URLs field and press Start. Within seconds you get back a structured row that's ready to drop straight into your RAG pipeline, your fine-tuning dataset, or your AI agent's context window.

Apify platform advantages include scheduled runs (re-extract every day to catch updates), API access (pull the dataset directly into your training pipeline), proxy rotation when needed, and parallel extraction of thousands of URLs in one run.

Why use AI Web to Markdown?

  • Build RAG systems on the cheap. Firecrawl charges $19+/month for similar functionality. This actor is pay-per-event at $0.005 per page, so a 1,000-page extraction costs about $5.
  • Fine-tune domain-specific LLMs. Convert thousands of niche-domain articles into clean training data in one batch.
  • Pre-process AI agent context. Don't waste tokens on ads and nav — feed only the content that matters.
  • Bulk content audit. Extract every page on a competitor's site and analyze with an LLM.
  • Backup your own content. Snapshot a website's articles into clean markdown for archival.
  • Migrate from old CMS to new. Get every blog post out of an old site as portable markdown.

How to use

  1. Click Try for free (or Start)
  2. Paste one or more URLs into Start URLs
  3. Optionally tweak settings (strip boilerplate, preserve links/images, max length)
  4. Click Start
  5. Download the dataset as JSON, CSV, or Excel — or pull it directly via the Apify API
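Step 5 can also be automated. A minimal sketch of pulling the run's dataset through Apify's public v2 REST API, using only the standard library (the dataset ID and token are placeholders you get from your run and account):

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"

def build_items_url(dataset_id: str, fmt: str = "json", clean: bool = True) -> str:
    """Build the v2 dataset-items URL for a run's default dataset."""
    return f"{API_BASE}/datasets/{dataset_id}/items?format={fmt}&clean={'true' if clean else 'false'}"

def fetch_items(dataset_id: str, token: str) -> list[dict]:
    """Download all dataset rows as a list of dicts (one per extracted URL)."""
    req = urllib.request.Request(
        build_items_url(dataset_id),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Each returned row carries the fields described in the Output section below (url, markdown, estimatedTokens, and so on), ready to feed into a training or RAG pipeline.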

Input

  • Start URLs — one or more URLs to convert (each becomes one dataset row)
  • Strip ads, nav, footer, boilerplate — recommended on for clean RAG output (default: yes)
  • Preserve links — keep [text](url) markdown links (default: yes)
  • Preserve images — keep ![alt](url) references (default: yes)
  • Include metadata — attach YAML frontmatter (default: yes)
  • Max length — truncate output to N characters (default: unlimited)
  • Proxy configuration — optional Apify Proxy for blocked sites
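A run input combining these options might look like the following. The field names here are an illustration inferred from the tips and FAQ sections (stripBoilerplate, preserveLinks, preserveImages, maxLength); check the actor's input schema for the exact keys:

```json
{
  "startUrls": [
    { "url": "https://en.wikipedia.org/wiki/Model_Context_Protocol" }
  ],
  "stripBoilerplate": true,
  "preserveLinks": true,
  "preserveImages": false,
  "maxLength": 40000
}
```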

Output

{
  "url": "https://en.wikipedia.org/wiki/Model_Context_Protocol",
  "sourceUrl": "https://en.wikipedia.org/wiki/Model_Context_Protocol",
  "title": "Model Context Protocol - Wikipedia",
  "description": "An open protocol for connecting AI agents to data sources and tools.",
  "author": null,
  "publishedAt": "2024-11-25T00:00:00Z",
  "siteName": "Wikipedia",
  "language": "en",
  "wordCount": 2147,
  "estimatedTokens": 2580,
  "markdown": "---\nurl: \"https://en.wikipedia.org/...\"\ntitle: \"Model Context Protocol - Wikipedia\"\n---\n\n# Model Context Protocol\n\nThe **Model Context Protocol** (MCP) is an open standard...",
  "extractedAt": "2026-04-15T19:00:00.000Z"
}
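When metadata is included, the markdown field starts with a YAML frontmatter block. A minimal sketch of separating it from the body downstream, using only the --- delimiters (no YAML parser required):

```python
def split_frontmatter(markdown: str) -> tuple[str, str]:
    """Split a '---'-delimited YAML frontmatter block from the markdown body.

    Returns (frontmatter, body); frontmatter is "" when no block is present.
    """
    if markdown.startswith("---\n"):
        end = markdown.find("\n---\n", 4)
        if end != -1:
            return markdown[4:end], markdown[end + 5:].lstrip("\n")
    return "", markdown

sample = '---\ntitle: "Model Context Protocol - Wikipedia"\n---\n\n# Model Context Protocol\n'
meta, body = split_frontmatter(sample)
```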

Data table

| Field | Type | Description |
| --- | --- | --- |
| url | string | The final URL after redirects |
| sourceUrl | string | The URL you provided |
| title | string | Page title (from <title> or OG tags) |
| description | string | Meta description |
| author | string | Author from meta tags or microdata |
| publishedAt | string | Publication date |
| siteName | string | Site name from og:site_name |
| language | string | Page language code |
| wordCount | number | Word count of the markdown output |
| estimatedTokens | number | Estimated token count (~4 chars/token) |
| markdown | string | Clean LLM-ready markdown |
| extractedAt | string | ISO timestamp of extraction |

Pricing

This actor uses Apify's pay-per-event pricing — you only pay for what you extract:

  • Actor start: $0.01 per run
  • Per page extracted: $0.005 per URL successfully converted

Example costs:

  • 100 blog posts → $0.51
  • 1,000 documentation pages → $5.01
  • 10,000 articles for fine-tuning → $50.01
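The examples above follow directly from the two events; the arithmetic is just a flat start fee plus a per-page fee:

```python
def run_cost(pages: int, start_fee: float = 0.01, per_page: float = 0.005) -> float:
    """Total cost of one run: flat actor-start fee plus per-page extraction fee."""
    return round(start_fee + pages * per_page, 2)
```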

Compare to Firecrawl at $19/month for 500 credits, or $99/month for 5,000 credits. Pay-per-event is dramatically cheaper for moderate use and dramatically simpler for one-off extractions.

Free Apify tier members get $5/month in platform credits, which covers ~1,000 pages of extraction per month.

Tips and advanced options

  • Disable preserveImages when building text-only training datasets to slim the output
  • Disable preserveLinks for pure plain-text RAG ingestion
  • Use maxLength to enforce a per-page token budget (useful for fixed-context RAG)
  • Combine with the Sitemap URL Extractor to ingest an entire website in two steps
  • Schedule daily runs to keep your RAG dataset fresh as content changes
  • Pipe into Pinecone / Weaviate / Qdrant via Apify webhooks for fully automated RAG ingestion
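The per-page token budget tip can also be approximated client-side after extraction. A sketch that converts a token budget into a character cap using the actor's ~4 chars/token estimate, cutting at a paragraph break so markdown stays readable (function name is illustrative):

```python
CHARS_PER_TOKEN = 4  # the actor's rule of thumb for GPT-4/Claude/Llama tokenizers

def truncate_to_token_budget(markdown: str, max_tokens: int) -> str:
    """Trim markdown to roughly max_tokens, cutting at the last paragraph break."""
    budget = max_tokens * CHARS_PER_TOKEN
    if len(markdown) <= budget:
        return markdown
    cut = markdown.rfind("\n\n", 0, budget)
    return markdown[:cut] if cut > 0 else markdown[:budget]
```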

FAQ and support

How accurate is the boilerplate stripping? Very good for typical blogs, news sites, documentation, and Wikipedia. Less good for heavily templated sites that use unusual class names. If you see junk in the output, disable stripBoilerplate and post-process yourself, or open an issue with the URL.

What's the token count based on? A reliable ~4 chars/token rule of thumb that matches GPT-4, Claude, and Llama tokenizers within ±10%.
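That heuristic is easy to reproduce when sanity-checking the estimatedTokens field against your own context budgeting:

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Approximate token count via the ~4 chars/token heuristic the actor reports."""
    return max(1, len(text) // chars_per_token) if text else 0
```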

Does it follow redirects? Yes. The url field shows the resolved URL.

Does it work on JavaScript-rendered sites? This is an HTTP-based extractor (no browser), so it works on server-rendered HTML. For SPAs and JavaScript-heavy sites, use a browser-based actor.

Found a bug? Open an issue on the Issues tab.