Pricing

Pay per usage

Webpage Content Scraper to Markdown

Cost-focused: scrape any webpage content into LLM-ready Markdown for RAG. Uses a smart hybrid six-tier engine: Apify for crawling plus the Cloudflare Browser Rendering API for extraction. Automatically saves costs by detecting native Markdown support.

Developer

Søren Riisager

3 months ago

Last modified

🚀 The Ultimate Web-to-Markdown Converter (Cloudflare + Apify)

Turn any website into clean, LLM-ready Markdown while saving 90% on scraping costs.

This Actor uses a smart Sextuple-Tier Architecture to intelligently handle everything from raw files (PDFs/Docs) and free extraction to Cloudflare Browser Rendering ($0.005/page), Apify's Default Browsers, and a Stealth Residential Browser.

It is designed specifically for RAG Pipelines, AI Agents, and Dataset Creation where quality, speed, and cost efficiency are paramount.

🔥 The Perfect Alternative To...

Firecrawl or ScrapingBee (Too expensive for bulk extractions)
Jina.ai (Often hits rate limits on the free tier)
Crawl4AI (Requires self-hosting and server management)

This Actor gives you the same high-quality LLM-ready markdown, but runs effortlessly on Apify's scalable infrastructure for a fraction of the cost.

💡 Why This Actor?

1. 🧹 Pure Signal, Zero Noise (LLM-Ready)

Language models suffer when fed raw HTML full of menus, footer links, cookie banners, and ads. This Actor uses Mozilla Readability to strip out the clutter. You get clean, focused Markdown, which cuts LLM input tokens and reduces AI hallucinations caused by bad context.

Most scrapers are either too simple (fail on JavaScript) or too expensive (always use heavy browsers). We solve this with a "Cost-First, Robustness-Last" strategy, heavily micro-optimized for maximum throughput:

2. 💰 Smart Cost Optimization

We don't just blindly launch a browser. We try the cheapest methods first:

Tier 0: Extracts content from regular websites (HTML), and also automatically detects and parses PDF, Word (DOCX), Excel (XLSX), PPTX, CSV, JSON, and XML files. Includes a Safe-Guard to skip heavy binaries (images, video) instantly.
Tier 1: Checks whether the site already serves native Markdown or plain text, which needs no conversion.
Tier 2: Uses a local Readability engine (no browser overhead).
Tier 3: Uses Cloudflare Browser Rendering with TLS Connection Pooling (Keep-Alive) to save handshake-time.
Tier 4: Uses a single, shared Apify Default Browser stripped of UI/GPU overhead to keep memory usage tight.
- Note: Defaults to Datacenter Proxies to keep costs low.
Tier 5: Unleashes a Puppeteer Stealth Browser forcing Residential Proxies to absolutely obliterate Enterprise Bot Protection (Cloudflare, Akamai, Datadome).
- Note: Enabled by default to ensure near 100% success rate on hard domains, but it requires Residential Proxies.

Result: Based on real-world production runs, the blended Apify compute cost is exceptionally low—averaging around $0.27 per 1,000 pages (~$0.00027 per page). You pay pennies for easy sites, and get unparalleled speed / RAM utilization when the Actor falls back to "Heavy Artillery".
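The cheapest-first ordering described above can be sketched as a simple cascade. This is a minimal illustration, not the Actor's actual code: the `Tier` class, `scrape_cost_first`, and the three stand-in extractors are hypothetical names invented here, and real extractors would perform network requests.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# An extractor returns Markdown on success, or None if it cannot handle the page.
Extractor = Callable[[str], Optional[str]]


@dataclass
class Tier:
    number: int          # lower number = cheaper method
    name: str
    extract: Extractor
    enabled: bool = True


def scrape_cost_first(url: str, tiers: List[Tier]) -> Tuple[int, str]:
    """Try enabled tiers from cheapest to most expensive; return (tier number, markdown)."""
    for tier in sorted(tiers, key=lambda t: t.number):
        if not tier.enabled:
            continue
        markdown = tier.extract(url)
        if markdown:
            return tier.number, markdown
    raise RuntimeError(f"All enabled tiers failed for {url}")


# Hypothetical stand-ins for the real tiers (no network access).
def native_markdown(url: str) -> Optional[str]:
    return "# Raw" if url.endswith(".md") else None


def local_readability(url: str) -> Optional[str]:
    return None if "spa" in url else "# Article"


def cloudflare_browser(url: str) -> Optional[str]:
    return "# Rendered"


TIERS = [
    Tier(3, "cloudflare_browser", cloudflare_browser),
    Tier(1, "native_markdown", native_markdown),
    Tier(2, "local_readability", local_readability),
]
```

With these stand-ins, a `.md` URL is resolved by Tier 1, a static blog page by Tier 2, and only the JavaScript-heavy page reaches the browser tier.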

3. 🛡️ Anti-Block Handling

If a website blocks our cheap requests (returning 403 Forbidden or 429 Too Many Requests), the Actor automatically fights back:

Detects the Block.
Retries with Tier 3 (Cloudflare) to see if a simple browser pass works.
Escalates to Tier 4 (Apify Datacenter Proxy) to bypass simple blocks.
Escalates to Tier 5 (Puppeteer Stealth + Residential Proxy) if enabled, to beat the toughest WAFs in the world.

Result: Near 100% Success Rate.
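The escalation steps above could look roughly like the following. It is a sketch under assumptions: the function name `next_tier_after_block` is hypothetical, and only the two status codes named in this section (403 and 429) are treated as blocks; the Actor may use additional signals.

```python
from typing import Iterable, Optional

# The two block signals named above: 403 Forbidden and 429 Too Many Requests.
BLOCK_STATUSES = {403, 429}

# Browser-based tiers used for escalation: Cloudflare, Apify browser, stealth browser.
FIRST_BROWSER_TIER = 3


def next_tier_after_block(
    status_code: int, current_tier: int, enabled_tiers: Iterable[int]
) -> Optional[int]:
    """Return the next tier to retry with, or None if no escalation applies."""
    if status_code not in BLOCK_STATUSES:
        return None  # not a block, nothing to escalate
    candidates = [
        tier
        for tier in sorted(enabled_tiers)
        if tier >= FIRST_BROWSER_TIER and tier > current_tier
    ]
    # None here means every stronger tier is disabled or already exhausted.
    return candidates[0] if candidates else None
```

A 403 at Tier 2 escalates to Tier 3; if Tier 3 is disabled, the next enabled browser tier is used instead.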

🏗️ The Sextuple-Tier Architecture

Tier	Method	Compute Cost	Speed	Best For
0	File Routing & Guard	Low	🚄 Fast	PDF, Office, CSV, JSON, XML. Skips images/video.
1	Native Markdown	Virtually Zero	⚡ Instant	Markdown, TXT, and sites serving raw text.
2	Local Readability	Extremely Low	🚀 Very Fast	Blogs, News, Static HTML sites.
3	Cloudflare Browser	Low	🚄 Fast	SPAs (React/Vue), JS-heavy sites.
4	Apify Browser	Medium	🐢 Slow	Medium Protection, JavaScript-Heavy Forms.
5	Stealth Puppeteer (Resi)	💸 High	🐢🐢 Very Slow	Stubborn Enterprise Sites, Heavy Anti-Bot Protection. (Default On)
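As an illustration of the Tier 0 row, the sketch below routes a URL by its file extension. How the Actor actually detects file types is not documented here (it may inspect Content-Type headers instead), so the extension-based approach, the function name `route_by_extension`, and the exact skip list are assumptions.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# File types Tier 0 is described as parsing.
PARSEABLE = {".pdf", ".docx", ".xlsx", ".pptx", ".csv", ".json", ".xml"}

# Illustrative examples of heavy binaries the Safe-Guard would skip (assumed list).
SKIPPED = {".jpg", ".jpeg", ".png", ".gif", ".mp4", ".mov", ".zip"}


def route_by_extension(url: str) -> str:
    """Return 'skip', 'parse_file', or 'html' based on the URL path's extension."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix in SKIPPED:
        return "skip"
    if suffix in PARSEABLE:
        return "parse_file"
    return "html"  # fall through to the regular HTML tiers
```

The query string is ignored, so a link such as `report.PDF?download=1` is still routed to the file parser.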

⚙️ Configuration

You have full control. Toggle tiers on/off to fit your budget and needs.

Field	Description
Start URLs	List of URLs to scrape.
Cloudflare Settings	Account ID & API Token (Required for Tier 3).
Enable Tier 0	Extract text from PDF, Word, Excel, PPTX, CSV, JSON, and XML.
Enable Safe-Guard	Skip heavy binary files (Images, Videos, Archives) to save compute.
Enable Tier 1-5	Toggle specific tiers on/off as needed.
Custom User-Agent	Optional: Bypass specific blocks by forcing a custom User-Agent across browsers.
Debug Screenshots	Takes a screenshot on failure (Default: False, to save storage costs).
Proxy Configuration	Choose proxies. Residential Proxies are strongly recommended if relying on Tier 5.
Max Concurrency	Parallel pages. Note: Tier 4 eats RAM, keep low (1-2) if using it heavily.
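A run input for the fields above might be assembled like this. Every key name in the dictionary (`startUrls`, `enableTier3`, `cloudflareAccountId`, and so on) is a guess made for illustration; check the Actor's input schema for the real names. Only the `proxyConfiguration` shape follows Apify's standard proxy input format.

```python
from typing import Any, Dict, List, Optional


def build_run_input(
    start_urls: List[str],
    cf_account_id: Optional[str] = None,
    cf_api_token: Optional[str] = None,
    use_stealth_tier: bool = True,
    max_concurrency: int = 2,
) -> Dict[str, Any]:
    """Build a run input dict. All key names are hypothetical placeholders."""
    has_cloudflare = bool(cf_account_id and cf_api_token)
    run_input: Dict[str, Any] = {
        "startUrls": [{"url": url} for url in start_urls],
        "enableTier0": True,
        "enableSafeGuard": True,
        # Tier 3 needs Cloudflare credentials, so disable it without them.
        "enableTier3": has_cloudflare,
        "enableTier5": use_stealth_tier,
        "debugScreenshots": False,
        # Kept low because the browser tiers are memory-hungry.
        "maxConcurrency": max_concurrency,
    }
    if has_cloudflare:
        run_input["cloudflareAccountId"] = cf_account_id
        run_input["cloudflareApiToken"] = cf_api_token
    if use_stealth_tier:
        # Tier 5 is described as requiring Residential Proxies.
        run_input["proxyConfiguration"] = {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"],
        }
    return run_input
```

With the `apify-client` package, a dictionary like this would typically be passed as `ApifyClient(token).actor("<actor-id>").call(run_input=...)`.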

🔑 Getting Cloudflare Credentials (Required for Tier 3)

To use the Cost-Saving Tier 3, you need a Cloudflare Workers Paid Plan ($5/mo).

Account ID: Found in your Cloudflare Dashboard URL.
API Token: Create a token with Account > Browser Rendering > Edit permissions.

Note: You can disable Tier 3 if you don't have Cloudflare, but you lose the "Cheap Browser" advantage.
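To check that your Account ID and API Token work before running the Actor, you could build a request against Cloudflare's Browser Rendering REST API. The sketch below only constructs the request and does not send it. The endpoint path reflects our understanding of Cloudflare's API and should be verified against their current documentation; whether the Actor itself calls this endpoint is not stated here, and `build_render_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_render_request(
    account_id: str, api_token: str, page_url: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request asking Cloudflare to render a page."""
    # Assumed endpoint path; confirm against Cloudflare's Browser Rendering docs.
    endpoint = (
        "https://api.cloudflare.com/client/v4/accounts/"
        f"{account_id}/browser-rendering/content"
    )
    body = json.dumps({"url": page_url}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )
```

Sending it with `urllib.request.urlopen(request)` should return the rendered page if the token has the Browser Rendering permission; an authentication error usually points to a wrong Account ID or missing permission.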

📊 Output Format

We provide clean JSON ready for your Vector Database or LLM. The content.source field tells you which Tier succeeded:

{
  "url": "https://example.com/blog/ai-revolution",
  "meta": {
    "title": "The AI Revolution",
    "description": "How AI is changing the web...",
    "keywords": "AI, LLM, RAG"
  },
  "content": {
    "markdown": "# The AI Revolution\n\nFull article content...",
    "title": "The AI Revolution",
    "source": "cloudflare_browser",
    "estimatedTokens": 540
  },
  "scrapedAt": "2023-10-27T10:00:00.000Z"
}
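Once results are downloaded, the source and estimatedTokens fields make it easy to see which tiers did the work and how many tokens you are about to send to an LLM. The following is a small sketch over items shaped like the example above; the `summarize` helper and the second sample item (including its `local_readability` source value) are made up for illustration.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize(items: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Count pages per successful tier and sum the estimated token counts."""
    pages_by_source = Counter(item["content"]["source"] for item in items)
    total_tokens = sum(item["content"]["estimatedTokens"] for item in items)
    return {
        "pagesBySource": dict(pages_by_source),
        "totalEstimatedTokens": total_tokens,
    }


# Sample items: the first mirrors the example output, the second is hypothetical.
SAMPLE_ITEMS = [
    {
        "url": "https://example.com/blog/ai-revolution",
        "content": {"source": "cloudflare_browser", "estimatedTokens": 540},
    },
    {
        "url": "https://example.com/blog/static-post",
        "content": {"source": "local_readability", "estimatedTokens": 300},
    },
]
```

A summary dominated by browser-tier sources suggests most of your target pages need JavaScript rendering, which is worth knowing when estimating costs.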

🤝 Support & Contact

If you need any custom scraping datasets, spot a missing feature, or require troubleshooting, we are here to help!

Issues: Please open a ticket in the Apify Issues tab.
Custom AI & Scraping Solutions: Reach out at Tulabot.com.

Built with pain by Søren Riisager - Powering the next generation of AI Agents.
