Webpage Content Scraper to Markdown

Pricing: Pay per usage

Cost-focused: scrape any webpage's content into LLM-ready Markdown for RAG. Uses a hybrid six-tier engine: Apify for crawling plus Cloudflare Browser Rendering for extraction. Saves costs automatically by detecting native Markdown support.

Rating: 0.0 (0)

Developer: Søren Riisager
Maintained by Community

Actor stats

Bookmarked: 1
Total users: 1
Monthly active users: 1
Last modified: 2 days ago

🚀 The Ultimate Web-to-Markdown Converter (Cloudflare + Apify)

Turn any website into clean, LLM-ready Markdown while saving 90% on scraping costs.

This Actor uses a smart Sextuple-Tier Architecture to intelligently handle everything from raw files (PDFs/Docs) and free extraction to Cloudflare Browser Rendering ($0.005/page), Apify's Default Browsers, and a Stealth Residential Browser.

It is designed specifically for RAG Pipelines, AI Agents, and Dataset Creation where quality, speed, and cost efficiency are paramount.


🔥 The Perfect Alternative To...

  • Firecrawl or ScrapingBee (Too expensive for bulk extractions)
  • Jina.ai (Often hits rate limits on free tiers)
  • Crawl4AI (Requires self-hosting and server management)

This Actor gives you the same high-quality LLM-ready markdown, but runs effortlessly on Apify's scalable infrastructure for a fraction of the cost.


💡 Why This Actor?

1. 🧹 Pure Signal, Zero Noise (LLM-Ready)

Language models suffer when fed raw HTML full of menus, footer links, cookie banners, and ads. This Actor uses Mozilla Readability to strip out the clutter. You get clean, focused Markdown, which saves a large number of LLM input tokens and helps prevent AI hallucinations caused by bad context.
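To make the idea of noise stripping concrete, here is a minimal sketch in Python. It is not Mozilla Readability and not the Actor's code: it is a simplified stand-in that drops text nested inside a few boilerplate tags. The `NOISE_TAGS` set and the function names are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Tags treated as page chrome in this simplified sketch (an assumption,
# far cruder than the scoring Mozilla Readability performs).
NOISE_TAGS = {"nav", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside any NOISE_TAGS element."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0  # greater than 0 while inside a noise element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    """Return the visible text outside noise elements, one chunk per line."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><body><nav>Home | About</nav>"
    "<article><h1>The AI Revolution</h1><p>Full article content.</p></article>"
    "<footer>Cookie settings</footer></body></html>"
)
print(extract_main_text(sample))
```

The menu and footer text never reach the output, which is the effect that keeps token counts down when the result is passed to an LLM.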

Most scrapers are either too simple (fail on JavaScript) or too expensive (always use heavy browsers). We solve this with a "Cost-First, Robustness-Last" strategy, heavily micro-optimized for maximum throughput:

2. 💰 Smart Cost Optimization

We don't just blindly launch a browser. We try the cheapest methods first:

  • Tier 0: Automatically detects and extracts content from PDF, Word (DOCX), Excel (XLSX), PPTX, CSV, JSON, and XML files. Includes a Safe-Guard to skip heavy binaries (images, video) instantly.
  • Tier 1: Checks for native Markdown headers.
  • Tier 2: Uses a local Readability engine (no browser overhead).
  • Tier 3: Uses Cloudflare Browser Rendering with TLS Connection Pooling (Keep-Alive) to save handshake-time.
  • Tier 4: Uses a single, shared Apify Default Browser stripped of UI/GPU overhead to keep memory usage low.
    • Note: Defaults to Datacenter Proxies to keep costs low.
  • Tier 5: Unleashes a Puppeteer Stealth Browser forcing Residential Proxies to absolutely obliterate Enterprise Bot Protection (Cloudflare, Akamai, Datadome).
    • Note: Enabled by default to ensure near 100% success rate on hard domains, but it requires Residential Proxies.

Result: Based on real-world production runs, the blended Apify compute cost is exceptionally low, averaging around $0.27 per 1,000 pages (~$0.00027 per page). You pay pennies for easy sites, and get unparalleled speed and RAM utilization when the Actor falls back to the "Heavy Artillery" tiers.
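The cheapest-first ordering above can be sketched as a simple loop. This is an illustration under stated assumptions, not the Actor's implementation: the tier names, the stub extractors, and the `enabled` mapping are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# An extractor takes a URL and returns Markdown, or None if the tier failed.
Extractor = Callable[[str], Optional[str]]


def scrape_cheapest_first(
    url: str,
    tiers: List[Tuple[str, Extractor]],
    enabled: Dict[str, bool],
) -> Tuple[Optional[str], Optional[str]]:
    """Try enabled tiers in order; return (tier_name, markdown) for the first success."""
    for name, extract in tiers:
        if not enabled.get(name, True):  # tiers default to enabled
            continue
        try:
            markdown = extract(url)
        except Exception:
            markdown = None  # a crash counts as "this tier failed"
        if markdown:
            return name, markdown
    return None, None


# Hypothetical stubs standing in for the real tiers, cheapest first.
def native_markdown(url: str) -> Optional[str]:
    return None  # pretend the site does not serve Markdown


def local_readability(url: str) -> Optional[str]:
    return "# Example\n\nBody text."


def cloudflare_browser(url: str) -> Optional[str]:
    return "# Example (rendered)"


TIERS = [
    ("tier1", native_markdown),
    ("tier2", local_readability),
    ("tier3", cloudflare_browser),
]

print(scrape_cheapest_first("https://example.com", TIERS, {}))
```

Disabling a tier simply skips it, so the same loop covers the "toggle tiers on/off" behaviour: with `{"tier2": False}` the stub for Tier 3 would answer instead.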

3. 🛡️ Anti-Block Handling

If a website blocks our cheap requests (returning 403 Forbidden or 429 Too Many Requests), the Actor automatically fights back:

  1. Detects the Block.
  2. Retries with Tier 3 (Cloudflare) to see if a simple browser pass works.
  3. Escalates to Tier 4 (Apify Datacenter Proxy) to bypass simple blocks.
  4. Escalates to Tier 5 (Puppeteer Stealth + Residential Proxy) if enabled, to beat the toughest WAFs in the world.

Result: Near 100% Success Rate.
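The escalation steps above can be sketched as follows. The status codes and the Tier 3, 4, 5 order come from the description; the function names and the idea of passing a set of enabled tier numbers are assumptions made for this example.

```python
from typing import Optional, Set

# Status codes the description treats as a block.
BLOCK_STATUSES = {403, 429}

# Browser tiers in escalation order: Cloudflare, Apify datacenter, stealth residential.
ESCALATION_ORDER = [3, 4, 5]


def is_blocked(status_code: int) -> bool:
    """True when the response looks like an anti-bot block."""
    return status_code in BLOCK_STATUSES


def next_tier(current_tier: int, enabled_tiers: Set[int]) -> Optional[int]:
    """Return the next enabled tier above current_tier, or None if none is left."""
    for tier in ESCALATION_ORDER:
        if tier > current_tier and tier in enabled_tiers:
            return tier
    return None


# A cheap Tier 2 request came back 403, with Tier 5 switched off.
status, tier = 403, 2
enabled = {3, 4}
while is_blocked(status) and tier is not None:
    tier = next_tier(tier, enabled)
    print("escalating to tier", tier)
    if tier == 4:
        status = 200  # pretend the datacenter-proxy browser got through
```

Here the loop moves from Tier 2 to Tier 3, then Tier 4, and stops once the simulated response is no longer a block.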


🏗️ The Sextuple-Tier Architecture

| Tier | Method | Compute Cost | Speed | Best For |
|------|--------|--------------|-------|----------|
| 0 | File Routing & Guard | Low | 🚄 Fast | PDF, Office, CSV, JSON, XML. Skips images/video. |
| 1 | Native Markdown | Virtually Zero | ⚡ Instant | Markdown, TXT, and sites serving raw text. |
| 2 | Local Readability | Extremely Low | 🚀 Very Fast | Blogs, News, Static HTML sites. |
| 3 | Cloudflare Browser | Low | 🚄 Fast | SPAs (React/Vue), JS-heavy sites. |
| 4 | Apify Browser | Medium | 🐢 Slow | Medium Protection, JavaScript-Heavy Forms. |
| 5 | Stealth Puppeteer (Resi) | 💸 High | 🐢🐢 Very Slow | Stubborn Enterprise Sites, Heavy Anti-Bot Protection. (Default On) |
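As a rough illustration of Tier 0's file routing and Safe-Guard, the sketch below routes a URL by its file extension. Matching on the extension is an assumption (the Actor may inspect response headers instead), and the `SKIPPED_EXTS` list is only an illustrative sample of "heavy binaries".

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Formats the description lists for Tier 0 extraction.
DOCUMENT_EXTS = {".pdf", ".docx", ".xlsx", ".pptx", ".csv", ".json", ".xml"}

# Illustrative sample of heavy binaries the Safe-Guard would skip.
SKIPPED_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".mp4", ".zip"}


def route_url(url: str) -> str:
    """Return 'skip', 'tier0', or 'continue' based on the URL path's extension."""
    ext = PurePosixPath(urlparse(url).path).suffix.lower()
    if ext in SKIPPED_EXTS:
        return "skip"      # Safe-Guard: do not spend compute on binaries
    if ext in DOCUMENT_EXTS:
        return "tier0"     # hand off to document text extraction
    return "continue"      # fall through to Tier 1 and above


for candidate in (
    "https://example.com/report.PDF?download=1",
    "https://example.com/media/intro.mp4",
    "https://example.com/blog/ai-revolution",
):
    print(candidate, "->", route_url(candidate))
```

Parsing the URL first keeps query strings out of the extension check, so `report.PDF?download=1` is still recognised as a PDF.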

⚙️ Configuration

You have full control. Toggle tiers on/off to fit your budget and needs.

| Field | Description |
|-------|-------------|
| Start URLs | List of URLs to scrape. |
| Cloudflare Settings | Account ID & API Token (required for Tier 3). |
| Enable Tier 0 | Extract text from PDF, Word, Excel, PPTX, CSV, JSON, and XML. |
| Enable Safe-Guard | Skip heavy binary files (images, videos, archives) to save compute. |
| Enable Tier 1-5 | Toggle specific tiers on/off as needed. |
| Custom User-Agent | Optional: bypass specific blocks by forcing a custom User-Agent across browsers. |
| Debug Screenshots | Takes a screenshot on failure (default: false, to save storage costs). |
| Proxy Configuration | Choose proxies. Residential Proxies are strongly recommended if relying on Tier 5. |
| Max Concurrency | Parallel pages. Note: Tier 4 eats RAM; keep this low (1-2) if using it heavily. |
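The constraints in the table can be checked before a run starts. The sketch below is only an illustration: every key name in it (`startUrls`, `enableTier3`, `cloudflare`, `maxConcurrency`, and so on) is hypothetical, since the listing gives field labels rather than the Actor's real input schema. The `apifyProxyGroups` key follows Apify's usual proxy input shape.

```python
from typing import Any, Dict, List


def validate_input(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means the input looks usable."""
    problems = []
    if not cfg.get("startUrls"):
        problems.append("startUrls is empty")

    # Tier 3 needs Cloudflare credentials.
    if cfg.get("enableTier3"):
        cf = cfg.get("cloudflare", {})
        if not (cf.get("accountId") and cf.get("apiToken")):
            problems.append("Tier 3 needs a Cloudflare account ID and API token")

    # Tier 4 is RAM-heavy, so concurrency should stay at 1-2.
    if cfg.get("enableTier4") and cfg.get("maxConcurrency", 1) > 2:
        problems.append("Tier 4 is RAM-heavy; consider maxConcurrency of 1-2")

    # Tier 5 relies on residential proxies.
    if cfg.get("enableTier5"):
        groups = cfg.get("proxyConfiguration", {}).get("apifyProxyGroups", [])
        if "RESIDENTIAL" not in groups:
            problems.append("Tier 5 expects Residential Proxies")
    return problems


example_input = {
    "startUrls": [{"url": "https://example.com/blog/ai-revolution"}],
    "enableTier3": True,
    "cloudflare": {"accountId": "", "apiToken": ""},
    "enableTier4": True,
    "maxConcurrency": 8,
    "enableTier5": False,
}
for problem in validate_input(example_input):
    print("-", problem)
```

With the example input, the check reports the missing Cloudflare credentials and the high concurrency, and stays silent about Tier 5 because it is switched off.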

🔑 Getting Cloudflare Credentials (Required for Tier 3)

To use the Cost-Saving Tier 3, you need a Cloudflare Workers Paid Plan ($5/mo).

  1. Account ID: Found in your Cloudflare Dashboard URL.
  2. API Token: Create a token with Account > Browser Rendering > Edit permissions.

Note: You can disable Tier 3 if you don't have Cloudflare, but you lose the "Cheap Browser" advantage.
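If it helps, the two credentials can be wired together as in the sketch below. It rests on assumptions that should be checked against Cloudflare's documentation: that the account ID appears as a 32-character hexadecimal path segment in the dashboard URL, and that the Browser Rendering REST API exposes a `/browser-rendering/markdown` endpoint under the account. Whether this Actor calls that endpoint is not stated in the listing. The code only builds the request and does not send it.

```python
import json
import re
import urllib.request
from typing import Optional
from urllib.parse import urlparse


def account_id_from_dashboard_url(dashboard_url: str) -> Optional[str]:
    """Return the first 32-hex-character path segment, or None if there is none."""
    for segment in urlparse(dashboard_url).path.split("/"):
        if re.fullmatch(r"[0-9a-f]{32}", segment):
            return segment
    return None


def build_markdown_request(
    account_id: str, api_token: str, target_url: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request asking Cloudflare to render a page."""
    # Endpoint path is an assumption; verify it in Cloudflare's API reference.
    endpoint = (
        "https://api.cloudflare.com/client/v4/accounts/"
        f"{account_id}/browser-rendering/markdown"
    )
    body = json.dumps({"url": target_url}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values, not real credentials.
acct = account_id_from_dashboard_url(
    "https://dash.cloudflare.com/0123456789abcdef0123456789abcdef/workers"
)
req = build_markdown_request(acct, "YOUR_API_TOKEN", "https://example.com")
print(req.get_method(), req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` would require a valid token with the Browser Rendering permission described above.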


📊 Output Format

We provide clean JSON ready for your Vector Database or LLM:

{
  "url": "https://example.com/blog/ai-revolution",
  "meta": {
    "title": "The AI Revolution",
    "description": "How AI is changing the web...",
    "keywords": "AI, LLM, RAG"
  },
  "content": {
    "markdown": "# The AI Revolution\n\nFull article content...",
    "title": "The AI Revolution",
    "source": "cloudflare_browser",
    "estimatedTokens": 540
  },
  "scrapedAt": "2023-10-27T10:00:00.000Z"
}

The content.source field tells you which tier succeeded.
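A downstream consumer might summarise a batch of these records before loading them into a vector database. The sketch below assumes the record shape shown above; the second record and its `local_readability` source value are hypothetical, added only so the summary has two tiers to count.

```python
import json
from collections import Counter

# The first record mirrors the example output; the second is hypothetical.
records_json = """
[
  {
    "url": "https://example.com/blog/ai-revolution",
    "content": {
      "markdown": "# The AI Revolution\\n\\nFull article content...",
      "source": "cloudflare_browser",
      "estimatedTokens": 540
    }
  },
  {
    "url": "https://example.com/about",
    "content": {
      "markdown": "# About\\n\\nShort page.",
      "source": "local_readability",
      "estimatedTokens": 120
    }
  }
]
"""


def summarize(records):
    """Count records per winning tier and total the estimated tokens."""
    by_source = Counter(r["content"]["source"] for r in records)
    total_tokens = sum(r["content"]["estimatedTokens"] for r in records)
    return by_source, total_tokens


records = json.loads(records_json)
by_source, total_tokens = summarize(records)
print(dict(by_source), total_tokens)
```

Counting by `source` shows how often the cheap tiers were enough, and the token total gives a rough upper bound on what the batch would cost to embed or send to an LLM.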

🤝 Support & Contact

If you need any custom scraping datasets, spot a missing feature, or require troubleshooting, we are here to help!

  • Issues: Please open a ticket in the Apify Issues tab.
  • Custom AI & Scraping Solutions: Reach out at Tulabot.com.

Built with pain by Søren Riisager - Powering the next generation of AI Agents.