Webpage Content Scraper to Markdown
Pricing
Pay per usage
Cost-focused: scrape any webpage into LLM-ready Markdown for RAG. Uses a smart hybrid six-tier engine: Apify for crawling plus Cloudflare Browser Rendering for extraction. Automatically saves costs by detecting native Markdown support.
Developer

Søren Riisager
🚀 The Ultimate Web-to-Markdown Converter (Cloudflare + Apify)
Turn any website into clean, LLM-ready Markdown while saving 90% on scraping costs.
This Actor uses a smart Sextuple-Tier Architecture to intelligently handle everything from raw files (PDFs/Docs) and free extraction to Cloudflare Browser Rendering ($0.005/page), Apify's Default Browsers, and a Stealth Residential Browser.
It is designed specifically for RAG Pipelines, AI Agents, and Dataset Creation where quality, speed, and cost efficiency are paramount.
🔥 The Perfect Alternative To...
- Firecrawl or ScrapingBee (Too expensive for bulk extractions)
- Jina.ai (Often hit rate limits on free tiers)
- Crawl4AI (Requires self-hosting and server management)
This Actor gives you the same high-quality LLM-ready markdown, but runs effortlessly on Apify's scalable infrastructure for a fraction of the cost.
💡 Why This Actor?
1. 🧹 Pure Signal, Zero Noise (LLM-Ready)
Language models suffer when fed raw HTML full of menus, footer links, cookie banners, and ads. This Actor uses Mozilla Readability to strip out the clutter. You get clean, focused Markdown, which cuts LLM input tokens and reduces hallucinations caused by bad context.
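To illustrate the idea of dropping page chrome before conversion, here is a deliberately naive sketch. It is not Mozilla Readability and not this Actor's code: it simply skips text nested inside a hand-picked set of tags (`NOISE_TAGS`, an assumption made for this example), whereas Readability scores content blocks heuristically.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as clutter in this toy example (assumption).
NOISE_TAGS = {"nav", "header", "footer", "aside", "script", "style", "form"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any NOISE_TAGS element."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # how many noise elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<article><h1>Title</h1><p>Body text.</p></article>"
    "<footer>Cookie settings</footer></body></html>"
)
print(extract_main_text(page))
```

The navigation link and the footer text are dropped, and only the article text survives. Fewer characters in means fewer tokens billed by the LLM.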
Most scrapers are either too simple (they fail on JavaScript) or too expensive (they always launch heavy browsers). We solve this with a "Cost-First, Robustness-Last" strategy, micro-optimized for throughput: cheap methods run first, and the most robust (and most expensive) ones run only as a last resort.
2. 💰 Smart Cost Optimization
We don't just blindly launch a browser. We try the cheapest methods first:
- Tier 0: Automatically detects and extracts content from PDF, Word (DOCX), Excel (XLSX), PPTX, CSV, JSON, and XML files. Includes a Safe-Guard that skips heavy binaries (images, video) immediately.
- Tier 1: Checks for native Markdown headers.
- Tier 2: Uses a local Readability engine (no browser overhead).
- Tier 3: Uses Cloudflare Browser Rendering with TLS connection pooling (Keep-Alive) to save handshake time.
- Tier 4: Uses a single, shared Apify Default Browser stripped of UI/GPU overhead to keep memory usage tight.
  - Note: Defaults to Datacenter Proxies to keep costs low.
- Tier 5: Unleashes a Puppeteer Stealth Browser that forces Residential Proxies to get past Enterprise Bot Protection (Cloudflare, Akamai, Datadome).
  - Note: Enabled by default to ensure a near 100% success rate on hard domains, but it requires Residential Proxies.
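The first routing decision in the list above can be sketched as follows. This is a minimal illustration, not the Actor's code: it looks only at the URL's file extension, while the real Actor may also inspect response headers. The function name `first_tier_for`, the tier labels, and the exact contents of `BINARY_EXTS` and `TEXT_EXTS` are assumptions for this example.

```python
import posixpath
from urllib.parse import urlparse

# Document types named for Tier 0 in the list above.
DOCUMENT_EXTS = {".pdf", ".docx", ".xlsx", ".pptx", ".csv", ".json", ".xml"}
# Heavy binaries for the Safe-Guard to skip (illustrative selection).
BINARY_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".mp4", ".zip"}
# Plain-text formats that need no HTML extraction (illustrative selection).
TEXT_EXTS = {".md", ".txt"}


def first_tier_for(url: str) -> str:
    """Pick the cheapest starting point for a URL based on its extension."""
    ext = posixpath.splitext(urlparse(url).path)[1].lower()
    if ext in BINARY_EXTS:
        return "skip"  # Safe-Guard: never download heavy binaries
    if ext in DOCUMENT_EXTS:
        return "tier0_file"
    if ext in TEXT_EXTS:
        return "tier1_markdown"
    return "tier2_readability"  # ordinary pages start at the cheapest HTML tier


for url in (
    "https://example.com/report.PDF?download=1",
    "https://example.com/video.mp4",
    "https://example.com/README.md",
    "https://example.com/blog/post",
):
    print(url, "->", first_tier_for(url))
```

Query strings are ignored and extensions are compared case-insensitively, so `report.PDF?download=1` still routes to the file tier.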
Result: Based on real-world production runs, the blended Apify compute cost averages around $0.27 per 1,000 pages (~$0.00027 per page). You pay pennies for easy sites, and the browser tiers stay fast and memory-efficient when the Actor falls back to the "Heavy Artillery".
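As a rough budgeting aid, the quoted average can be scaled to a planned crawl size. The sketch below uses only the $0.27 per 1,000 pages figure from the text. It covers Apify compute only, and actual costs depend on how many pages escalate to the browser tiers.

```python
# Blended Apify compute cost quoted above, in USD per 1,000 pages.
BLENDED_COST_PER_1000_PAGES = 0.27


def estimated_compute_cost(pages: int) -> float:
    """Linear estimate of Apify compute cost in USD for a number of pages."""
    return pages * BLENDED_COST_PER_1000_PAGES / 1000


for pages in (1_000, 50_000, 1_000_000):
    print(f"{pages:>9,} pages -> ${estimated_compute_cost(pages):,.2f}")
```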
3. 🛡️ Anti-Block Handling
If a website blocks our cheap requests (returning 403 Forbidden or 429 Too Many Requests), the Actor automatically fights back:
- Detects the Block.
- Retries with Tier 3 (Cloudflare) to see if a simple browser pass works.
- Escalates to Tier 4 (Apify Datacenter Proxy) to bypass simple blocks.
- Escalates to Tier 5 (Puppeteer Stealth + Residential Proxy) if enabled, to beat the toughest WAFs in the world.
Result: Near 100% Success Rate.
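The escalation steps above can be sketched as a loop over tiers ordered from cheapest to most expensive. This is an illustration under stated assumptions, not the Actor's implementation: the fetchers are offline stand-ins, and the tier names and the `scrape_with_escalation` helper are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

BLOCK_STATUSES = {403, 429}  # Forbidden, Too Many Requests

# A fetcher takes a URL and returns (http_status, body).
Fetcher = Callable[[str], Tuple[int, str]]


def scrape_with_escalation(
    url: str, tiers: List[Tuple[str, Fetcher]]
) -> Optional[Tuple[str, str]]:
    """Try each tier in order; return (tier_name, body) for the first success."""
    for name, fetch in tiers:
        status, body = fetch(url)
        if status in BLOCK_STATUSES:
            print(f"{name}: blocked with {status}, escalating")
            continue
        if status == 200:
            return name, body
        print(f"{name}: failed with {status}, escalating")
    return None  # every enabled tier was blocked or failed


# Stand-in fetchers so the example runs without network access.
def fake_readability(url: str) -> Tuple[int, str]:
    return 403, ""


def fake_cloudflare(url: str) -> Tuple[int, str]:
    return 429, ""


def fake_apify_browser(url: str) -> Tuple[int, str]:
    return 200, "<html>ok</html>"


def fake_stealth(url: str) -> Tuple[int, str]:
    return 200, "<html>stealth</html>"


tiers = [
    ("tier2_readability", fake_readability),
    ("tier3_cloudflare", fake_cloudflare),
    ("tier4_apify_browser", fake_apify_browser),
    ("tier5_stealth_residential", fake_stealth),
]
result = scrape_with_escalation("https://example.com", tiers)
print(result)
```

Disabling a tier corresponds to leaving it out of the `tiers` list, so the most expensive tier only runs when every cheaper one has been blocked.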
🏗️ The Sextuple-Tier Architecture
| Tier | Method | Compute Cost | Speed | Best For |
|---|---|---|---|---|
| 0 | File Routing & Guard | Low | 🚄 Fast | PDF, Office, CSV, JSON, XML. Skips images/video. |
| 1 | Native Markdown | Virtually Zero | ⚡ Instant | Markdown, TXT, and sites serving raw text. |
| 2 | Local Readability | Extremely Low | 🚀 Very Fast | Blogs, News, Static HTML sites. |
| 3 | Cloudflare Browser | Low | 🚄 Fast | SPAs (React/Vue), JS-heavy sites. |
| 4 | Apify Browser | Medium | 🐢 Slow | Medium protection, JavaScript-heavy forms. |
| 5 | Stealth Puppeteer (Resi) | 💸 High | 🐢🐢 Very Slow | Stubborn Enterprise Sites, Heavy Anti-Bot Protection. (Default On) |
⚙️ Configuration
You have full control. Toggle tiers on/off to fit your budget and needs.
| Field | Description |
|---|---|
| Start URLs | List of URLs to scrape. |
| Cloudflare Settings | Account ID & API Token (Required for Tier 3). |
| Enable Tier 0 | Extract text from PDF, Word, Excel, PPTX, CSV, JSON, and XML. |
| Enable Safe-Guard | Skip heavy binary files (Images, Videos, Archives) to save compute. |
| Enable Tier 1-5 | Toggle specific tiers on/off as needed. |
| Custom User-Agent | Optional: Bypass specific blocks by forcing a custom User-Agent across browsers. |
| Debug Screenshots | Takes a screenshot on failure (Default: False, to save storage costs). |
| Proxy Configuration | Choose proxies. Residential Proxies are strongly recommended if you rely on Tier 5. |
| Max Concurrency | Number of pages processed in parallel. Note: Tier 4 is RAM-hungry, so keep this low (1-2) if you use it heavily. |
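The dependencies between these fields can be checked before starting a run. The sketch below is hypothetical: the table lists display names only, so every key used here (`startUrls`, `enableTier3`, `cloudflareAccountId`, `cloudflareApiToken`, `enableTier4`, `enableTier5`, `proxyGroups`, `maxConcurrency`) is an assumption, not the Actor's confirmed input schema. Check the Actor's Input tab for the real names.

```python
from typing import List


def check_input(cfg: dict) -> List[str]:
    """Return human-readable problems found in a run configuration."""
    problems = []
    if not cfg.get("startUrls"):
        problems.append("Start URLs must not be empty.")
    # Tier 3 needs both Cloudflare credentials.
    if cfg.get("enableTier3", True) and not (
        cfg.get("cloudflareAccountId") and cfg.get("cloudflareApiToken")
    ):
        problems.append("Tier 3 is enabled but Cloudflare credentials are missing.")
    # Tier 5 requires Residential Proxies.
    if cfg.get("enableTier5", True) and "RESIDENTIAL" not in cfg.get("proxyGroups", []):
        problems.append("Tier 5 is enabled but no Residential Proxy group is selected.")
    # Tier 4 is RAM-hungry; the table suggests a concurrency of 1-2.
    if cfg.get("enableTier4", True) and cfg.get("maxConcurrency", 1) > 2:
        problems.append("Tier 4 is enabled; consider a Max Concurrency of 1-2.")
    return problems


cfg_ok = {
    "startUrls": ["https://example.com"],
    "enableTier3": False,
    "enableTier5": True,
    "proxyGroups": ["RESIDENTIAL"],
    "maxConcurrency": 2,
}
cfg_bad = {
    "startUrls": ["https://example.com"],
    "enableTier3": True,
    "enableTier5": True,
    "proxyGroups": [],
    "maxConcurrency": 8,
}
print(check_input(cfg_ok))
for problem in check_input(cfg_bad):
    print("-", problem)
```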
🔑 Getting Cloudflare Credentials (Required for Tier 3)
To use the Cost-Saving Tier 3, you need a Cloudflare Workers Paid Plan ($5/mo).
- Account ID: Found in your Cloudflare Dashboard URL.
- API Token: Create a token with Account > Browser Rendering > Edit permissions.
Note: You can disable Tier 3 if you don't have Cloudflare, but you lose the "Cheap Browser" advantage.
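For orientation, here is how those two credentials might be used in a direct call to Cloudflare's Browser Rendering REST API. This is a sketch under assumptions: the `/browser-rendering/markdown` endpoint path and the `{"url": ...}` request body reflect my understanding of Cloudflare's public API and should be verified against Cloudflare's documentation. The source does not say how the Actor itself calls Cloudflare. The request is only built here, not sent, and the credential values are placeholders.

```python
import json
import urllib.request


def build_markdown_request(
    account_id: str, api_token: str, page_url: str
) -> urllib.request.Request:
    """Build (but do not send) a Browser Rendering request for one page."""
    # Assumed endpoint path; confirm against Cloudflare's API reference.
    endpoint = (
        "https://api.cloudflare.com/client/v4/accounts/"
        f"{account_id}/browser-rendering/markdown"
    )
    payload = json.dumps({"url": page_url}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )


req = build_markdown_request(
    "YOUR_ACCOUNT_ID", "YOUR_API_TOKEN", "https://example.com"
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req) (requires valid credentials).
```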
📊 Output Format
We provide clean JSON ready for your Vector Database or LLM:
    {
      "url": "https://example.com/blog/ai-revolution",
      "meta": {
        "title": "The AI Revolution",
        "description": "How AI is changing the web...",
        "keywords": "AI, LLM, RAG"
      },
      "content": {
        "markdown": "# The AI Revolution\n\nFull article content...",
        "title": "The AI Revolution",
        "source": "cloudflare_browser", // Tells you which Tier succeeded
        "estimatedTokens": 540
      },
      "scrapedAt": "2023-10-27T10:00:00.000Z"
    }
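A short sketch of consuming these records downstream: it counts pages per `content.source` and sums `content.estimatedTokens`, which gives a quick view of which tiers did the work and how large the corpus is. The field names follow the sample record above. The second item and its `"readability"` source value are invented for illustration, since only `"cloudflare_browser"` appears in the sample.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def summarize(items: Iterable[dict]) -> Tuple[Dict[str, int], int]:
    """Return (pages per source tier, total estimated tokens)."""
    pages_by_source = Counter()
    total_tokens = 0
    for item in items:
        content = item.get("content", {})
        pages_by_source[content.get("source", "unknown")] += 1
        total_tokens += content.get("estimatedTokens", 0)
    return dict(pages_by_source), total_tokens


items = [
    {
        "url": "https://example.com/blog/ai-revolution",
        "content": {
            "markdown": "# The AI Revolution\n\nFull article content...",
            "source": "cloudflare_browser",
            "estimatedTokens": 540,
        },
    },
    {
        # Hypothetical second record; the source value is an assumption.
        "url": "https://example.com/blog/static-post",
        "content": {
            "markdown": "# Static Post\n\nText...",
            "source": "readability",
            "estimatedTokens": 120,
        },
    },
]

by_source, tokens = summarize(items)
print(by_source, tokens)
```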
🤝 Support & Contact
If you need any custom scraping datasets, spot a missing feature, or require troubleshooting, we are here to help!
- Issues: Please open a ticket in the Apify Issues tab.
- Custom AI & Scraping Solutions: Reach out at Tulabot.com.
Built with pain by Søren Riisager - Powering the next generation of AI Agents.