Web to Markdown for LLMs
Pricing
Pay per usage
Convert any URL to clean LLM-ready markdown. 60-70% fewer tokens than raw HTML. Built for AI agents and RAG pipelines.
Developer: George Kioko
Convert any URL to clean, structured markdown optimized for LLM consumption. 85% average token savings vs raw HTML. The open-source Firecrawl alternative on Apify.
Why This Actor?
LLMs choke on raw HTML. Scripts, styles, navigation, ads — all noise that burns tokens and confuses models. This actor strips all that away and returns clean markdown that your AI can actually reason about.
    Raw HTML: 67,841 tokens → costs $0.068 per page (GPT-4)
    Markdown:  6,176 tokens → costs $0.006 per page (GPT-4)
    ↑ 91% savings
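The arithmetic behind these figures can be checked with a short sketch. The per-token rate is not stated anywhere in this README. The value below is an assumption back-derived from the numbers above ($0.068 for 67,841 tokens is roughly $0.001 per 1,000 tokens), so substitute your model's actual input price.

```python
# Assumed flat rate, inferred from the figures above; not an official price.
ASSUMED_USD_PER_1K_TOKENS = 0.001


def savings_percent(html_tokens: int, markdown_tokens: int) -> float:
    """Percentage of tokens removed by converting HTML to markdown."""
    if html_tokens <= 0:
        raise ValueError("html_tokens must be positive")
    return (html_tokens - markdown_tokens) / html_tokens * 100


def cost_usd(tokens: int, usd_per_1k: float = ASSUMED_USD_PER_1K_TOKENS) -> float:
    """Input cost for a token count at a flat per-1K-token rate."""
    return tokens / 1000 * usd_per_1k


if __name__ == "__main__":
    print(f"savings:  {savings_percent(67_841, 6_176):.0f}%")
    print(f"raw HTML: ${cost_usd(67_841):.3f} per page")
    print(f"markdown: ${cost_usd(6_176):.3f} per page")
```

Because the savings percentage depends only on the two token counts, it holds regardless of which price you plug in.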
How It Works
    ┌──────────┐     ┌─────────────────┐     ┌──────────────┐
    │ Any URL  │────▶│ Puppeteer       │────▶│ Clean        │
    │          │     │ renders page    │     │ Markdown     │
    └──────────┘     │ (JavaScript,    │     │ + metadata   │
                     │ SPAs, dynamic)  │     │ + stats      │
                     └─────────────────┘     └──────────────┘
                              │
                      ┌───────┴────────┐
                      │ Cheerio parses │
                      │ Turndown       │
                      │ converts to MD │
                      └────────────────┘

Noise removed: scripts, styles, nav, footer, ads, popups, modals

Kept: headings, paragraphs, lists, tables, links, images, code blocks
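To make the "strip noise, keep structure" step concrete, here is a minimal Python sketch of the idea. It is not the actor's code: the actor uses Puppeteer, Cheerio, and Turndown, while this sketch swaps in Python's standard-library `html.parser`, does no JavaScript rendering, and handles only headings, paragraphs, and list items. The names `MarkdownSketch` and `html_to_markdown` are hypothetical.

```python
from html.parser import HTMLParser

# Elements whose whole subtree is treated as noise (an illustrative subset).
NOISE_TAGS = {"script", "style", "nav", "footer", "aside", "noscript"}
HEADING_PREFIX = {f"h{n}": "#" * n for n in range(1, 7)}
BLOCK_TAGS = {"p", "li"} | set(HEADING_PREFIX)


class MarkdownSketch(HTMLParser):
    """Collects headings, paragraphs, and list items; skips noise subtrees."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.blocks: list[str] = []   # finished markdown blocks
        self._noise_depth = 0         # > 0 while inside a noise element
        self._in_block = False        # True while a kept block is open
        self._prefix = ""             # markdown prefix for the open block
        self._text: list[str] = []    # text fragments of the open block

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1
        elif self._noise_depth == 0 and tag in BLOCK_TAGS:
            self._in_block = True
            self._text = []
            if tag in HEADING_PREFIX:
                self._prefix = HEADING_PREFIX[tag] + " "
            else:
                self._prefix = "- " if tag == "li" else ""

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self._noise_depth = max(0, self._noise_depth - 1)
        elif self._in_block and tag in BLOCK_TAGS:
            # Collapse runs of whitespace left over from the HTML source.
            text = " ".join("".join(self._text).split())
            if text:
                self.blocks.append(self._prefix + text)
            self._in_block = False

    def handle_data(self, data):
        if self._noise_depth == 0 and self._in_block:
            self._text.append(data)


def html_to_markdown(html: str) -> str:
    """Convert already-rendered HTML to a rough markdown string."""
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)
```

For example, calling `html_to_markdown` on a page whose body contains a `<nav>`, an `<h1>`, a `<p>`, and a `<footer>` returns only the heading and the paragraph. A production converter also needs tables, links, images, and code blocks, which this sketch leaves out.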
What Data Does It Extract?
| Field | Description |
|---|---|
| `markdown` | Clean, structured markdown content |
| `title` | Page title |
| `description` | Meta description |
| `author` | Article author (when available) |
| `publishDate` | Publication date |
| `language` | Page language |
| `wordCount` | Total words in markdown |
| `links` | All links found (text + href) |
| `images` | All images (src + alt text) |
| `tableOfContents` | Heading structure for navigation |
| `stats.htmlTokensEstimate` | Original HTML token count |
| `stats.markdownTokensEstimate` | Markdown token count |
| `stats.tokenSavingsPercent` | Percentage of tokens saved |
| `stats.renderTimeMs` | Page render time |
Use Cases
- RAG Pipelines: Feed clean web content into vector databases (Pinecone, Weaviate, Chroma). 85% fewer tokens = 85% lower embedding costs.
- AI Agent Tool Use: Give your agent a "read the web" tool. Pass any URL, get structured content back. Works with LangChain, LlamaIndex, CrewAI, AutoGen.
- Content Repurposing: Convert any article/blog into markdown for your CMS, newsletter, or documentation site.
- Training Data: Build LLM training datasets from web content. Clean markdown = higher quality training data.
- Competitive Intelligence: Monitor competitor websites and extract structured content for analysis.
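For the RAG use case, the returned markdown usually has to be split into chunks before embedding. The sketch below shows one possible approach, not something the actor does for you: split at headings, then pack sections under a token budget using the same rough 4-characters-per-token estimate mentioned in the FAQ. The function name `chunk_markdown` and the default budget are hypothetical.

```python
import re

CHARS_PER_TOKEN = 4  # rough heuristic; real token counts vary by model


def chunk_markdown(markdown: str, max_tokens: int = 500) -> list[str]:
    """Split markdown at headings and pack sections into budgeted chunks."""
    # Zero-width split before each ATX heading, so every section keeps its
    # heading. Caveat: a "# comment" line inside a code block also splits.
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    sections = [part.strip() for part in parts if part.strip()]

    max_chars = max_tokens * CHARS_PER_TOKEN
    chunks: list[str] = []
    current = ""
    for section in sections:
        candidate = f"{current}\n\n{section}" if current else section
        if not current or len(candidate) <= max_chars:
            # Either the first section of a chunk (kept even if it is
            # oversized) or the section still fits in the budget.
            current = candidate
        else:
            chunks.append(current)
            current = section
    if current:
        chunks.append(current)
    return chunks
```

Splitting at headings keeps each chunk topically coherent, which tends to matter more for retrieval quality than hitting an exact chunk size. A single section larger than the budget is returned as one oversized chunk rather than cut mid-paragraph.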
Input Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | Yes* | — | Single URL to convert |
| `urls` | string[] | Yes* | — | Array of URLs for batch processing |
| `includeLinks` | boolean | No | true | Include extracted links in output |
| `includeImages` | boolean | No | true | Include image URLs in output |
| `includeToc` | boolean | No | false | Include table of contents |
| `waitFor` | number | No | 3000 | Wait time (ms) for JS rendering |

*Provide either `url` or `urls`.
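A small helper can assemble a valid input object from this table. This is a sketch under one assumption: "either `url` or `urls`" is read as exactly one of the two, since the README does not say what happens when both are supplied. The helper name `build_input` is hypothetical; the parameter names and defaults come from the table above.

```python
from typing import Any, Optional

# Defaults copied from the Input Parameters table.
DEFAULTS: dict[str, Any] = {
    "includeLinks": True,
    "includeImages": True,
    "includeToc": False,
    "waitFor": 3000,
}


def build_input(
    url: Optional[str] = None,
    urls: Optional[list[str]] = None,
    **options: Any,
) -> dict[str, Any]:
    """Build an actor input dict, filling in documented defaults."""
    # Assumption: exactly one of url / urls must be given.
    if bool(url) == bool(urls):
        raise ValueError("provide exactly one of 'url' or 'urls'")
    unknown = set(options) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")

    payload: dict[str, Any] = {"url": url} if url else {"urls": list(urls or [])}
    payload.update({**DEFAULTS, **options})
    return payload
```

For instance, `build_input(urls=["https://example.com"], waitFor=5000)` yields a batch input with a longer render wait and every other option left at its documented default.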
Output Example
    {
      "url": "https://blog.example.com/article",
      "sourceUrl": "https://blog.example.com/article",
      "title": "How AI Agents Read the Web",
      "description": "A guide to building web-reading capabilities for AI agents",
      "author": "Jane Doe",
      "publishDate": "2026-03-25T10:00:00.000Z",
      "language": "en",
      "markdown": "# How AI Agents Read the Web\n\n**Author:** Jane Doe\n**Published:** 2026-03-25\n\n---\n\nAI agents need structured data to reason about web content...",
      "wordCount": 2450,
      "links": [
        {"text": "LangChain docs", "href": "https://docs.langchain.com"},
        {"text": "Vector databases", "href": "https://www.pinecone.io"}
      ],
      "images": [
        {"src": "https://blog.example.com/diagram.png", "alt": "Architecture diagram"}
      ],
      "tableOfContents": [
        {"level": 1, "text": "How AI Agents Read the Web"},
        {"level": 2, "text": "The Problem with Raw HTML"},
        {"level": 2, "text": "The Markdown Solution"}
      ],
      "stats": {
        "htmlSize": 245000,
        "markdownSize": 12400,
        "htmlTokensEstimate": 61250,
        "markdownTokensEstimate": 3100,
        "tokenSavingsPercent": 95,
        "renderTimeMs": 4200
      }
    }
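The `stats` values in this example are consistent with the ~4 characters per token heuristic described in the FAQ. The sketch below reproduces them. It assumes `htmlSize` and `markdownSize` are character counts and that the percentage is rounded to the nearest integer. Neither detail is confirmed by the README, and `estimate_stats` is a hypothetical name.

```python
CHARS_PER_TOKEN = 4  # heuristic from the FAQ; actual counts vary by model


def estimate_stats(html_size: int, markdown_size: int) -> dict[str, int]:
    """Estimate token counts and savings from content sizes."""
    html_tokens = html_size // CHARS_PER_TOKEN
    markdown_tokens = markdown_size // CHARS_PER_TOKEN
    # Assumption: savings is rounded to the nearest whole percent.
    savings = round((1 - markdown_tokens / html_tokens) * 100) if html_tokens else 0
    return {
        "htmlTokensEstimate": html_tokens,
        "markdownTokensEstimate": markdown_tokens,
        "tokenSavingsPercent": savings,
    }


if __name__ == "__main__":
    print(estimate_stats(245_000, 12_400))
```

With the sizes from the example (245,000 and 12,400) this gives 61,250 and 3,100 estimated tokens and 95% savings, matching the sample output. For billing-grade numbers, count tokens with your model's own tokenizer instead.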
Performance Benchmarks
Tested across 60 diverse websites:
| Site Type | Success Rate | Avg Token Savings | Avg Time |
|---|---|---|---|
| News (BBC, CNN, NYT) | 100% | 94% | 16s |
| Blogs/Articles | 100% | 91% | 8s |
| Documentation | 100% | 92% | 5s |
| Company websites | 100% | 100% | 12s |
| Wikipedia | 100% | 73% | 7s |
| E-commerce | 80% | 90% | 10s |
| Heavy SPAs | 60% | 54% | 6s |
| Overall | 80% | 85% | 10s |
Comparison vs Firecrawl
| Feature | This Actor | Firecrawl |
|---|---|---|
| Token savings | 85% avg | 67% avg |
| Price | $0.003/page | $0.0008-0.005/page |
| JS rendering | Puppeteer (full) | Playwright |
| Free tier | Apify free plan | 500 credits |
| Open source | Yes (Apify) | Partial |
| Batch processing | Yes (urls array) | Yes |
| Standby API | Yes (instant) | Yes |
Standby API (Instant Response)
This actor supports Apify Standby mode for instant HTTP responses:
    # Health check
    curl "https://george-the-developer--web-to-markdown-llm.apify.actor/" \
      -H "Authorization: Bearer YOUR_TOKEN"

    # Convert a URL
    curl "https://george-the-developer--web-to-markdown-llm.apify.actor/convert?url=https://example.com" \
      -H "Authorization: Bearer YOUR_TOKEN"
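The same `/convert` call can be made from Python. This sketch uses only the standard library and URL-encodes the target address, which matters once the page URL carries its own query string. It assumes the endpoint returns a JSON body shaped like the Output Example. The helper names `build_convert_request` and `convert` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://george-the-developer--web-to-markdown-llm.apify.actor"


def build_convert_request(page_url: str, token: str) -> urllib.request.Request:
    """Build the authenticated GET request for the /convert endpoint."""
    query = urllib.parse.urlencode({"url": page_url})  # percent-encodes page_url
    return urllib.request.Request(
        f"{BASE_URL}/convert?{query}",
        headers={"Authorization": f"Bearer {token}"},
    )


def convert(page_url: str, token: str, timeout: float = 60.0) -> dict:
    """Call the Standby API and return the parsed JSON response."""
    request = build_convert_request(page_url, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Assumption: the body is JSON like the Output Example.
        return json.load(response)
```

A typical call would be `convert("https://example.com", "YOUR_TOKEN")["markdown"]`. The 60-second timeout is an arbitrary choice meant to leave headroom over the render times listed in the benchmarks.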
Pricing
Pay Per Event: $0.003 per page converted
| Volume | Cost | Savings vs Firecrawl |
|---|---|---|
| 100 pages | $0.30 | — |
| 1,000 pages | $3.00 | — |
| 10,000 pages | $30.00 | — |
No monthly subscription. Pay only for what you use.
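To budget for a given volume, the table reduces to one multiplication. The sketch below uses `Decimal` so that fractions of a cent are not distorted by floating-point rounding. It covers only the per-page event price quoted above and assumes no other platform charges apply. `conversion_cost` is a hypothetical name.

```python
from decimal import Decimal

PRICE_PER_PAGE_USD = Decimal("0.003")  # pay-per-event price quoted above


def conversion_cost(pages: int) -> Decimal:
    """Total cost in USD for converting a number of pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return PRICE_PER_PAGE_USD * pages


if __name__ == "__main__":
    for volume in (100, 1_000, 10_000):
        print(f"{volume:>6,} pages: ${conversion_cost(volume):.2f}")
```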
Integrations
Works with any tool that can call an HTTP API:
- LangChain: Use as a custom tool in your agent chain
- LlamaIndex: Feed markdown into document loaders
- n8n / Make: HTTP request node → markdown output
- Python: `requests.get()` → JSON with markdown
- Node.js: `fetch()` → structured response
FAQ
Q: Does it handle JavaScript-rendered pages?
A: Yes. It uses Puppeteer with full Chrome to render JavaScript, SPAs, and dynamic content.

Q: What about pages behind logins?
A: It currently extracts public content only. Authenticated scraping is on the roadmap.

Q: How accurate is the token estimate?
A: It uses the ~4 chars/token heuristic for English text. Actual token counts may vary by model.

Q: Can I process multiple URLs at once?
A: Yes. Pass a `urls` array to convert multiple pages in one batch run.
Support
- GitHub: the-ai-entrepreneur-ai-hub
- Apify Store: george.the.developer
- Twitter: @ai_in_it
Found a bug? Open an issue or DM on Twitter.