Web to Markdown for LLMs

Pricing

Pay per usage

Convert any URL to clean LLM-ready markdown. 60-70% fewer tokens than raw HTML. Built for AI agents and RAG pipelines.

Developer

George Kioko

Maintained by Community

Last modified

3 days ago

Convert any URL to clean, structured markdown optimized for LLM consumption. 85% average token savings vs raw HTML. The open-source Firecrawl alternative on Apify.

Why This Actor?

LLMs choke on raw HTML. Scripts, styles, navigation, ads — all noise that burns tokens and confuses models. This actor strips all that away and returns clean markdown that your AI can actually reason about.

Raw HTML: 67,841 tokens → costs $0.068 per page (GPT-4)
Markdown: 6,176 tokens → costs $0.006 per page (GPT-4)
91% savings
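
The arithmetic behind these figures can be checked with a short sketch. The per-token rate below is not a published model price; it is only the rate implied by the numbers above (roughly $1 per million tokens), and the function names are illustrative.

```python
# Rate implied by the figures above ($0.068 for 67,841 tokens).
# This is an assumption for illustration, not a quoted model price.
ASSUMED_USD_PER_TOKEN = 1e-6


def savings_percent(html_tokens: int, markdown_tokens: int) -> float:
    """Percentage of tokens removed by converting HTML to markdown."""
    return (1 - markdown_tokens / html_tokens) * 100


def page_cost(tokens: int, usd_per_token: float = ASSUMED_USD_PER_TOKEN) -> float:
    """Input cost in USD for one page at the assumed rate."""
    return tokens * usd_per_token


html_tokens, markdown_tokens = 67_841, 6_176
print(f"savings:  {savings_percent(html_tokens, markdown_tokens):.0f}%")
print(f"raw HTML: ${page_cost(html_tokens):.3f} per page")
print(f"markdown: ${page_cost(markdown_tokens):.3f} per page")
```

To apply the same calculation to a different model, replace the assumed rate with that model's actual input price.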

How It Works

```
┌──────────┐     ┌──────────────────┐     ┌──────────────┐
│ Any URL  │────▶│ Puppeteer        │────▶│ Clean        │
│          │     │ renders page     │     │ Markdown     │
└──────────┘     │ (JavaScript,     │     │ + metadata   │
                 │ SPAs, dynamic)   │     │ + stats      │
                 └────────┬─────────┘     └──────────────┘
                 ┌────────┴─────────┐
                 │ Cheerio parses   │
                 │ Turndown         │
                 │ converts to MD   │
                 └──────────────────┘

Noise removed: scripts, styles, nav, footer, ads, popups, modals
Kept: headings, paragraphs, lists, tables, links, images, code blocks
```
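
To make the "strip noise, keep structure" step concrete, here is a minimal sketch. It is not the actor's code: the actor uses Puppeteer, Cheerio, and Turndown, whereas this stand-in uses only Python's standard-library `html.parser`, handles just headings, paragraphs, and list items, and skips rendering entirely. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Elements treated as noise in this sketch (a subset of the list above).
NOISE_TAGS = {"script", "style", "nav", "footer"}
# Block elements this sketch converts, mapped to their markdown prefixes.
BLOCK_PREFIX = {f"h{n}": "#" * n + " " for n in range(1, 7)}
BLOCK_PREFIX.update({"p": "", "li": "- "})


class MarkdownSketch(HTMLParser):
    """Collects text from block elements and drops anything inside noise tags."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks: list[str] = []   # finished markdown blocks
        self.buffer: list[str] = []   # raw text of the block being built
        self.prefix = ""              # markdown prefix for the current block
        self.noise_depth = 0          # > 0 while inside a noise element

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag in BLOCK_PREFIX and self.noise_depth == 0:
            self.flush()
            self.prefix = BLOCK_PREFIX[tag]

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self.noise_depth = max(0, self.noise_depth - 1)
        elif tag in BLOCK_PREFIX and self.noise_depth == 0:
            self.flush()

    def handle_data(self, data):
        if self.noise_depth == 0:
            self.buffer.append(data)

    def flush(self):
        # Join inline fragments and collapse runs of whitespace.
        text = " ".join("".join(self.buffer).split())
        if text:
            self.blocks.append(self.prefix + text)
        self.buffer, self.prefix = [], ""


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n\n".join(parser.blocks)
```

Links, images, tables, and code blocks are deliberately left out here; a real converter such as Turndown handles those with per-element rules.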

What Data Does It Extract?

| Field | Description |
| --- | --- |
| markdown | Clean, structured markdown content |
| title | Page title |
| description | Meta description |
| author | Article author (when available) |
| publishDate | Publication date |
| language | Page language |
| wordCount | Total words in markdown |
| links | All links found (text + href) |
| images | All images (src + alt text) |
| tableOfContents | Heading structure for navigation |
| stats.htmlTokensEstimate | Original HTML token count |
| stats.markdownTokensEstimate | Markdown token count |
| stats.tokenSavingsPercent | Percentage of tokens saved |
| stats.renderTimeMs | Page render time |

Use Cases

  1. RAG Pipelines — Feed clean web content into vector databases (Pinecone, Weaviate, Chroma). 85% fewer tokens = 85% lower embedding costs.

  2. AI Agent Tool Use — Give your agent a "read the web" tool. Pass any URL, get structured content back. Works with LangChain, LlamaIndex, CrewAI, AutoGen.

  3. Content Repurposing — Convert any article/blog into markdown for your CMS, newsletter, or documentation site.

  4. Training Data — Build LLM training datasets from web content. Clean markdown = higher quality training data.

  5. Competitive Intelligence — Monitor competitor websites and extract structured content for analysis.
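
For the RAG use case, the returned markdown usually needs to be split into chunks before embedding. The sketch below shows one simple way to do that, by splitting on headings. It is an illustration only: `chunk_by_heading` is a hypothetical helper, not part of the actor, and the embedding and vector-database calls are left out.

```python
import re


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split markdown into chunks, one per heading, keeping the heading text.

    Caveat: lines starting with '# ' inside fenced code blocks are also
    treated as headings; a production splitter would track fences.
    """
    chunks: list[dict] = []
    heading, lines = "", []

    def emit() -> None:
        text = "\n".join(lines).strip()
        if heading or text:
            chunks.append({"heading": heading, "text": text})

    for line in markdown.splitlines():
        if re.match(r"#{1,6} ", line):
            emit()  # close the previous chunk before starting a new one
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    emit()
    return chunks


doc = "# Title\n\nIntro text.\n\n## Part A\n\nBody A.\n"
for chunk in chunk_by_heading(doc):
    print(chunk["heading"], "->", chunk["text"])
```

Keeping the heading alongside each chunk gives the retriever some context about where the text came from.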

Input Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | Yes* | | Single URL to convert |
| urls | string[] | Yes* | | Array of URLs for batch processing |
| includeLinks | boolean | No | true | Include extracted links in output |
| includeImages | boolean | No | true | Include image URLs in output |
| includeToc | boolean | No | false | Include table of contents |
| waitFor | number | No | 3000 | Wait time (ms) for JS rendering |

*Provide either url or urls
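
A small helper can assemble an input object from the parameters above and fill in the documented defaults. This is a sketch: `build_input` is a hypothetical name, and rejecting a call that supplies both `url` and `urls` is an assumption, since the listing does not say how the actor treats that case.

```python
# Defaults as listed in the parameter table above.
DEFAULTS = {
    "includeLinks": True,
    "includeImages": True,
    "includeToc": False,
    "waitFor": 3000,
}


def build_input(url=None, urls=None, **overrides) -> dict:
    """Return an actor input dict with exactly one of url / urls set."""
    # Assumption: supplying both is treated as an error in this sketch.
    if (url is None) == (urls is None):
        raise ValueError("provide exactly one of url or urls")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")

    payload = {"url": url} if url is not None else {"urls": list(urls)}
    payload.update(DEFAULTS)
    payload.update(overrides)
    return payload


print(build_input(url="https://example.com"))
print(build_input(urls=["https://a.example", "https://b.example"], includeToc=True))
```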

Output Example

```json
{
  "url": "https://blog.example.com/article",
  "sourceUrl": "https://blog.example.com/article",
  "title": "How AI Agents Read the Web",
  "description": "A guide to building web-reading capabilities for AI agents",
  "author": "Jane Doe",
  "publishDate": "2026-03-25T10:00:00.000Z",
  "language": "en",
  "markdown": "# How AI Agents Read the Web\n\n**Author:** Jane Doe\n**Published:** 2026-03-25\n\n---\n\nAI agents need structured data to reason about web content...",
  "wordCount": 2450,
  "links": [
    {"text": "LangChain docs", "href": "https://docs.langchain.com"},
    {"text": "Vector databases", "href": "https://www.pinecone.io"}
  ],
  "images": [
    {"src": "https://blog.example.com/diagram.png", "alt": "Architecture diagram"}
  ],
  "tableOfContents": [
    {"level": 1, "text": "How AI Agents Read the Web"},
    {"level": 2, "text": "The Problem with Raw HTML"},
    {"level": 2, "text": "The Markdown Solution"}
  ],
  "stats": {
    "htmlSize": 245000,
    "markdownSize": 12400,
    "htmlTokensEstimate": 61250,
    "markdownTokensEstimate": 3100,
    "tokenSavingsPercent": 95,
    "renderTimeMs": 4200
  }
}
```
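
The numbers in the `stats` block above are consistent with the ~4 chars/token heuristic mentioned in the FAQ: 245,000 / 4 = 61,250 and 12,400 / 4 = 3,100. The sketch below reproduces them. The function names are hypothetical, and both the integer division and the rounding of the percentage are assumptions about how the actor computes these fields.

```python
def estimate_tokens(size: int, chars_per_token: int = 4) -> int:
    """Rough token estimate from a character count (~4 chars per token)."""
    return size // chars_per_token


def build_stats(html_size: int, markdown_size: int) -> dict:
    """Recompute the size-derived stats fields from the two sizes."""
    html_tokens = estimate_tokens(html_size)
    markdown_tokens = estimate_tokens(markdown_size)
    # Guard against empty pages to avoid dividing by zero.
    savings = round((1 - markdown_tokens / html_tokens) * 100) if html_tokens else 0
    return {
        "htmlSize": html_size,
        "markdownSize": markdown_size,
        "htmlTokensEstimate": html_tokens,
        "markdownTokensEstimate": markdown_tokens,
        "tokenSavingsPercent": savings,
    }


print(build_stats(245_000, 12_400))
```

Because this is a character-count heuristic, the estimates will differ from what a real tokenizer reports, especially for non-English text or code-heavy pages.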

Performance Benchmarks

Tested across 60 diverse websites:

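| Site Type | Success Rate | Avg Token Savings | Avg Time |
| --- | --- | --- | --- |
| News (BBC, CNN, NYT) | 100% | 94% | 16s |
| Blogs/Articles | 100% | 91% | 8s |
| Documentation | 100% | 92% | 5s |
| Company websites | 100% | 100% | 12s |
| Wikipedia | 100% | 73% | 7s |
| E-commerce | 80% | 90% | 10s |
| Heavy SPAs | 60% | 54% | 6s |
| Overall | 80% | 85% | 10s |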

Comparison vs Firecrawl

| Feature | This Actor | Firecrawl |
| --- | --- | --- |
| Token savings | 85% avg | 67% avg |
| Price | $0.003/page | $0.0008-0.005/page |
| JS rendering | Puppeteer (full) | Playwright |
| Free tier | Apify free plan | 500 credits |
| Open source | Yes (Apify) | Partial |
| Batch processing | Yes (urls array) | Yes |
| Standby API | Yes (instant) | Yes |

Standby API (Instant Response)

This actor supports Apify Standby mode for instant HTTP responses:

```bash
# Health check
curl "https://george-the-developer--web-to-markdown-llm.apify.actor/" \
  -H "Authorization: Bearer YOUR_TOKEN"

# Convert a URL
curl "https://george-the-developer--web-to-markdown-llm.apify.actor/convert?url=https://example.com" \
  -H "Authorization: Bearer YOUR_TOKEN"
```
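
The same `/convert` call can be made from Python. The sketch below uses only the standard library and percent-encodes the target URL, which matters once that URL carries its own query string. It assumes the endpoint returns a JSON body shaped like the output example; `build_convert_request` and `convert` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://george-the-developer--web-to-markdown-llm.apify.actor"


def build_convert_request(target_url: str, token: str) -> urllib.request.Request:
    """Build a GET request for /convert with the target URL safely encoded."""
    query = urllib.parse.urlencode({"url": target_url})
    return urllib.request.Request(
        f"{BASE_URL}/convert?{query}",
        headers={"Authorization": f"Bearer {token}"},
    )


def convert(target_url: str, token: str, timeout: float = 60.0) -> dict:
    """Send the request and parse the response (assumed to be JSON)."""
    request = build_convert_request(target_url, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example (requires a valid Apify token and network access):
# result = convert("https://example.com", "YOUR_TOKEN")
# print(result["markdown"][:200])
```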

Pricing

Pay Per Event: $0.003 per page converted

| Volume | Cost | Savings vs Firecrawl |
| --- | --- | --- |
| 100 pages | $0.30 | |
| 1,000 pages | $3.00 | |
| 10,000 pages | $30.00 | |

No monthly subscription. Pay only for what you use.
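
The volume costs above are the per-page price multiplied by the page count. A minimal sketch, assuming every converted page is billed as one $0.003 event and nothing else is charged (`estimate_cost` is a hypothetical helper):

```python
from decimal import Decimal

# Per-page price from the pricing section above.
PRICE_PER_PAGE = Decimal("0.003")


def estimate_cost(pages: int) -> Decimal:
    """Estimated cost in USD for converting the given number of pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return PRICE_PER_PAGE * pages


for pages in (100, 1_000, 10_000):
    print(f"{pages:>6} pages: ${estimate_cost(pages):.2f}")
```

`Decimal` is used instead of `float` so that small per-page prices add up without binary rounding error.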

Integrations

Works with any tool that can call an HTTP API:

  • LangChain: Use as a custom tool in your agent chain
  • LlamaIndex: Feed markdown into document loaders
  • n8n / Make: HTTP request node → markdown output
  • Python: requests.get() → JSON with markdown
  • Node.js: fetch() → structured response

FAQ

Q: Does it handle JavaScript-rendered pages?
A: Yes. It uses Puppeteer with full Chrome to render JavaScript, SPAs, and dynamic content.

Q: What about pages behind logins?
A: It currently extracts public content only. Authenticated scraping is on the roadmap.

Q: How accurate is the token estimate?
A: It uses the ~4 chars/token heuristic for English text, so actual token counts vary by model and tokenizer.

Q: Can I process multiple URLs at once?
A: Yes. Pass a urls array for batch processing.

Support

Found a bug? Open an issue or DM on Twitter.