Pricing

$30.00/month + usage

Markdown Scraper (Stealth Browser)

Scrape any website to clean markdown for LLMs, RAG, and AI agents. Stealth browser with ad blocking. Only visible content, no cookie banners or hidden menus. Split into header, body, and footer with links. Token-optimized compact output included.

Rating

0.0

(0)

Developer

Thodor

Last modified

a month ago

What you get

{
  "url": "https://apify.com",
  "title": "Apify: Full-stack web scraping and data extraction platform",
  "header": {
    "markdown": "Contact sales\n\nLog in\n\nGet started",
    "compact": "Contact sales\nLog in\nGet started",
    "links": [
      { "text": "Contact sales", "url": "https://apify.com/contact-sales" },
      { "text": "Log in", "url": "https://console.apify.com/sign-in" },
      { "text": "Get started", "url": "https://console.apify.com/sign-up" }
    ]
  },
  "body": {
    "markdown": "# Get real-time web data for your AI\n\nApify Actors scrape up-to-date web data from any website for AI apps and agents...\n\n### TikTok Scraper\n\nclockworks / tiktok-scraper\n\nExtract data from TikTok videos, hashtags, and users...",
    "compact": "Get real-time web data for your AI\nApify Actors scrape up-to-date web data from any website for AI apps and agents...\nTikTok Scraper\nclockworks / tiktok-scraper\nExtract data from TikTok videos, hashtags, and users...",
    "links": [
      { "text": "TikTok Scraper", "url": "https://apify.com/clockworks/tiktok-scraper" },
      { "text": "Google Maps Scraper", "url": "https://apify.com/compass/crawler-google-places" },
      { "text": "Instagram Scraper", "url": "https://apify.com/apify/instagram-scraper" }
    ]
  },
  "footer": {
    "markdown": "Product\n\n- [Apify Store]\n- [Integrations]\n- [Proxy]\n\nDevelopers\n\n- [Documentation]\n- [Code templates]\n- [API reference]\n\n...",
    "compact": "Product\nApify Store\nIntegrations\nProxy\nDevelopers\nDocumentation\nCode templates\nAPI reference\n...",
    "links": [
      { "text": "Apify Store", "url": "https://apify.com/store" },
      { "text": "Documentation", "url": "https://docs.apify.com/" },
      { "text": "Terms of Use", "url": "https://docs.apify.com/legal/general-terms-and-conditions" }
    ]
  }
}

The page is split into header, body, and footer. Each section has its own markdown, a token-optimized compact version (no formatting, minimal whitespace), and a list of links. Use markdown when you need structure, compact when you want to save tokens.
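As a rough illustration of picking between the two text fields, the sketch below reads one section from a record shaped like the sample output above. The helper names `section_text` and `section_links` are hypothetical and not part of the Actor; the record is an abbreviated copy of the sample.

```python
# Abbreviated record in the shape shown in the sample output above.
record = {
    "url": "https://apify.com",
    "header": {
        "markdown": "Contact sales\n\nLog in",
        "compact": "Contact sales\nLog in",
        "links": [],
    },
    "body": {
        "markdown": "# Get real-time web data for your AI\n\nApify Actors scrape...",
        "compact": "Get real-time web data for your AI\nApify Actors scrape...",
        "links": [
            {"text": "TikTok Scraper", "url": "https://apify.com/clockworks/tiktok-scraper"}
        ],
    },
    "footer": {"markdown": "Product", "compact": "Product", "links": []},
}


def section_text(record, section="body", compact=True):
    """Return one section's text: the compact field to save tokens,
    the markdown field when structure matters. Hypothetical helper."""
    field = "compact" if compact else "markdown"
    return record.get(section, {}).get(field, "")


def section_links(record, section="body"):
    """Return (text, url) pairs for one section. Hypothetical helper."""
    links = record.get(section, {}).get("links", [])
    return [(link["text"], link["url"]) for link in links]


prompt_context = section_text(record)                   # compact body for an LLM prompt
structured_doc = section_text(record, compact=False)    # markdown body, headings kept
```

Missing sections fall back to an empty string here, which is a design choice of this sketch rather than documented Actor behavior.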

Why this one?

Only visible content. Opens the page in a real browser and extracts only what a visitor would actually see. Hidden dropdown menus, cookie consent popups, and invisible elements are filtered out.
Ads and trackers blocked. uBlock Origin is built into the browser. Ads, tracking pixels, and cookie walls never make it into your output.
Stealth browser. Uses anti-fingerprinting so the browser looks like a regular visitor. Enable residential proxies for extra-stubborn sites.
Clean heading structure. Heading levels are compressed to remove gaps and stay within 1-6. If a page uses h1, h4, h5, they become h1, h2, h3.
Links extracted separately. Links are pulled out as structured data (text + URL) per section, deduplicated and capped at a configurable limit.
Multiple URLs in one run. Pass a list of URLs to process them all in a single run. The browser session is reused, so you only pay for one startup instead of one per page.
Token-optimized output. Each section includes a compact field with all markdown formatting stripped. Same content, fewer tokens.
Low cost per run. Images are swapped for tiny placeholders instead of downloaded. Fonts and media are skipped entirely.
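The heading compression described above (h1, h4, h5 becoming h1, h2, h3) can be approximated with a rank-based remap over the markdown text. This is only a sketch of one way to get that result, not the Actor's actual implementation. It assumes ATX-style `#` headings and does not skip fenced code blocks.

```python
import re

# ATX heading: one to six '#' characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})[ \t]+(.*)$")


def compress_headings(markdown: str) -> str:
    """Remap the heading levels that occur in the text to 1, 2, 3, ...
    in ascending order, so gaps between levels disappear."""
    lines = markdown.split("\n")
    matches = [HEADING.match(line) for line in lines]

    # Distinct levels in use, e.g. [1, 4, 5], mapped to their rank: 1, 2, 3.
    levels = sorted({len(m.group(1)) for m in matches if m})
    remap = {old: new for new, old in enumerate(levels, start=1)}

    out = []
    for line, m in zip(lines, matches):
        if m:
            out.append("#" * remap[len(m.group(1))] + " " + m.group(2))
        else:
            out.append(line)
    return "\n".join(out)


print(compress_headings("# A\n\n#### B\n\ntext\n\n##### C"))
```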

Under the hood

Stealth browser. Runs on Camoufox, a hardened Firefox fork with anti-fingerprinting. Sites see a normal visitor, not a bot.
uBlock Origin built in. Ads, trackers, and cookie walls are blocked before the page even renders.
Visibility filtering. Uses the browser's own visibility engine to skip hidden elements (collapsed menus, cookie popups, off-screen content, display: none blocks). Only what's actually on screen gets extracted.
Image placeholders. Images are replaced with a tiny SVG instead of downloaded. Keeps page layouts intact while cutting bandwidth and cost.
Layout table handling. Old-school sites that use <table> for layout (like Hacker News) are detected automatically. Content comes out as clean text, not broken markdown tables.

Input

Field	Type	Description
`url`	string	A single URL to convert.
`urls`	string[]	Multiple URLs to convert in one run. Saves cost by reusing the browser session.
`proxyConfiguration`	object	Enable proxies to avoid blocks. Use residential proxies for sites that block datacenter IPs.
`waitForSelector`	string	Wait for a specific element to appear before extracting. Useful for pages that load content dynamically.
`removeSelectors`	string[]	Elements to remove before extraction (e.g. sidebar, comments, ad banners).
`maxLinks`	integer	Maximum number of links to extract across all sections. Default: `100`. Set to `0` for no links.
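A minimal sketch of assembling this input and starting a run with the `apify-client` Python package follows. The field names come from the table above. The Actor ID placeholder, the helper names, and the shape of the proxy object (the common Apify `useApifyProxy` / `apifyProxyGroups` form) are assumptions, so check them against the Actor's input schema before relying on them.

```python
def build_run_input(urls, wait_for_selector=None, remove_selectors=None,
                    max_links=100, proxy=None):
    """Assemble the Actor input from the documented fields. Hypothetical helper."""
    if not urls:
        raise ValueError("at least one URL is required")
    if max_links < 0:
        raise ValueError("max_links must be 0 or greater")

    run_input = {"urls": list(urls), "maxLinks": max_links}
    if wait_for_selector:
        run_input["waitForSelector"] = wait_for_selector
    if remove_selectors:
        run_input["removeSelectors"] = list(remove_selectors)
    if proxy:
        # Assumed shape, e.g. {"useApifyProxy": True, "apifyProxyGroups": ["RESIDENTIAL"]}
        run_input["proxyConfiguration"] = proxy
    return run_input


def run_actor(token, actor_id, run_input):
    """Start the Actor and return its dataset items.
    actor_id is a placeholder such as "<username>/<actor-name>"."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return client.dataset(run["defaultDatasetId"]).list_items().items


run_input = build_run_input(
    ["https://apify.com"],
    remove_selectors=[".sidebar", "#comments"],
    max_links=50,
)
```

Following the tip to try without proxies first, the `proxy` argument is left out unless a site is blocking requests.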

Tips

Try without proxies first. The stealth browser gets through most anti-bot protections on its own. Only enable proxies if you're getting blocked. Start with datacenter proxies, and switch to residential only if needed.
How it waits. The Actor waits for the page DOM to load, then gives JavaScript 3 seconds to render dynamic content. If you set waitForSelector, it will also wait up to 10 seconds for that element to appear. For most sites, the default behavior is enough.
Multiple URLs. You can use url and urls together. Duplicates are removed automatically. Pages are processed one at a time to keep things stable.
Automatic retries. If a page fails to load, it retries up to 3 times before moving on to the next URL.
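The URL merging and retry behavior described in these tips can be sketched as follows. This is an illustration, not the Actor's code: `merge_urls`, `process_all`, and the `fetch` callback are hypothetical names, and the sketch reads "retries up to 3 times" as one initial attempt plus up to three retries.

```python
def merge_urls(url=None, urls=None):
    """Combine the single `url` and the `urls` list, dropping duplicates
    while keeping first-seen order."""
    candidates = ([url] if url else []) + list(urls or [])
    return list(dict.fromkeys(candidates))


def process_all(targets, fetch, max_retries=3):
    """Process pages one at a time. A page that still fails after the
    retries is recorded and the loop moves on to the next URL."""
    results, failed = {}, []
    for target in targets:
        for _attempt in range(1 + max_retries):
            try:
                results[target] = fetch(target)
                break
            except Exception:
                continue  # try again until the attempts are used up
        else:
            # The inner loop ended without a successful fetch.
            failed.append(target)
    return results, failed
```

`fetch` stands in for whatever loads and converts a page. It only needs to raise an exception on failure.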

Use cases

RAG pipelines. Feed clean page content into vector databases.
AI agents. Give agents focused page content with navigation separated from main content.
Content monitoring. Track changes in the body without nav/footer noise.
Competitive analysis. Extract and compare landing pages across competitors.
Dataset building. Create clean text datasets from web pages.
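For the content-monitoring case, one simple approach is to fingerprint only the body's compact text, so that changes in the header or footer do not trigger an alert. The sketch below assumes records in the output shape shown earlier. The function names are hypothetical.

```python
import hashlib


def body_fingerprint(record):
    """SHA-256 hex digest of the body's compact text (empty if missing)."""
    text = record.get("body", {}).get("compact", "")
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def body_changed(previous, current):
    """True when the main content differs between two scrapes of a page."""
    return body_fingerprint(previous) != body_fingerprint(current)


old = {"body": {"compact": "Pricing starts at $10"}, "footer": {"compact": "2023"}}
new = {"body": {"compact": "Pricing starts at $12"}, "footer": {"compact": "2023"}}
print(body_changed(old, new))
```

Storing only the digest between runs keeps the comparison cheap. Keeping the previous compact text as well would allow a line-level diff.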

Missing something?

This is a solo project and I'm actively building on it. If you need a feature, have a bug, or want something changed, just open an issue or message me directly. I read everything and ship fast.
