Markdown Scraper (Stealth Browser)

Pricing

$30.00/month + usage

Scrape any website to clean markdown for LLMs, RAG, and AI agents. Stealth browser with ad blocking. Only visible content, no cookie banners or hidden menus. Split into header, body, and footer with links. Token-optimized compact output included.

Developer

Thodor

Maintained by Community

Last modified

4 days ago

What you get

{
  "url": "https://apify.com",
  "title": "Apify: Full-stack web scraping and data extraction platform",
  "header": {
    "markdown": "Contact sales\n\nLog in\n\nGet started",
    "compact": "Contact sales\nLog in\nGet started",
    "links": [
      { "text": "Contact sales", "url": "https://apify.com/contact-sales" },
      { "text": "Log in", "url": "https://console.apify.com/sign-in" },
      { "text": "Get started", "url": "https://console.apify.com/sign-up" }
    ]
  },
  "body": {
    "markdown": "# Get real-time web data for your AI\n\nApify Actors scrape up-to-date web data from any website for AI apps and agents...\n\n### TikTok Scraper\n\nclockworks / tiktok-scraper\n\nExtract data from TikTok videos, hashtags, and users...",
    "compact": "Get real-time web data for your AI\nApify Actors scrape up-to-date web data from any website for AI apps and agents...\nTikTok Scraper\nclockworks / tiktok-scraper\nExtract data from TikTok videos, hashtags, and users...",
    "links": [
      { "text": "TikTok Scraper", "url": "https://apify.com/clockworks/tiktok-scraper" },
      { "text": "Google Maps Scraper", "url": "https://apify.com/compass/crawler-google-places" },
      { "text": "Instagram Scraper", "url": "https://apify.com/apify/instagram-scraper" }
    ]
  },
  "footer": {
    "markdown": "Product\n\n- [Apify Store]\n- [Integrations]\n- [Proxy]\n\nDevelopers\n\n- [Documentation]\n- [Code templates]\n- [API reference]\n\n...",
    "compact": "Product\nApify Store\nIntegrations\nProxy\nDevelopers\nDocumentation\nCode templates\nAPI reference\n...",
    "links": [
      { "text": "Apify Store", "url": "https://apify.com/store" },
      { "text": "Documentation", "url": "https://docs.apify.com/" },
      { "text": "Terms of Use", "url": "https://docs.apify.com/legal/general-terms-and-conditions" }
    ]
  }
}

The page is split into header, body, and footer. Each section has its own markdown, a token-optimized compact version (no formatting, minimal whitespace), and a list of links. Use markdown when you need structure, compact when you want to save tokens.
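
To make the relationship between the two fields concrete, here is a minimal Python sketch that derives a compact-style string from a markdown string. The `to_compact` function is hypothetical and only approximates the "no formatting, minimal whitespace" behavior described above. The Actor's real stripping rules are not documented here and may differ.

```python
import re


def to_compact(markdown: str) -> str:
    """Approximate a compact field: drop markdown syntax and blank lines.

    Hypothetical helper, not the Actor's actual implementation.
    """
    compact_lines = []
    for raw in markdown.splitlines():
        line = raw.strip()
        # Remove heading markers such as "# " or "### ".
        line = re.sub(r"^#{1,6}\s+", "", line)
        # Remove bullet or numbered list markers.
        line = re.sub(r"^(?:[-*+]|\d+\.)\s+", "", line)
        # Turn "[text]" or "[text](url)" into plain "text".
        line = re.sub(r"\[([^\]]*)\](?:\([^)]*\))?", r"\1", line)
        # Drop bold, italic, and inline-code markers.
        line = re.sub(r"\*\*|\*|`", "", line)
        # Collapse runs of whitespace inside the line.
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            compact_lines.append(line)
    return "\n".join(compact_lines)


header_markdown = "Contact sales\n\nLog in\n\nGet started"
print(to_compact(header_markdown))
```

Run on the sample header above, this yields the same three lines as its compact field. In practice you would read `compact` straight from the output rather than recompute it.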

Why this one?

  • Only visible content. Opens the page in a real browser and extracts only what a visitor would actually see. Hidden dropdown menus, cookie consent popups, and invisible elements are filtered out.
  • Ads and trackers blocked. uBlock Origin is built into the browser. Ads, tracking pixels, and cookie walls never make it into your output.
  • Stealth browser. Uses anti-fingerprinting so the browser looks like a regular visitor. Enable residential proxies for extra-stubborn sites.
  • Clean heading structure. Heading levels are renumbered to close gaps while staying within h1 to h6. For example, if a page uses only h1, h4, and h5, they become h1, h2, and h3.
  • Links extracted separately. Links are pulled out as structured data (text + URL) per section, deduplicated and capped at a configurable limit.
  • Multiple URLs in one run. Pass a list of URLs to process them all in a single run. The browser session is reused, so you only pay for one startup instead of one per page.
  • Token-optimized output. Each section includes a compact field with all markdown formatting stripped. Same content, fewer tokens.
  • Low cost per run. Images are swapped for tiny placeholders instead of downloaded. Fonts and media are skipped entirely.
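
Two of the points above, heading compression and link deduplication, are easy to picture in code. The Python sketch below is one plausible way to do each, not the Actor's actual implementation. `compress_headings` and `dedupe_links` are hypothetical names. The rank-based renumbering and the choice of (text, URL) as the deduplication key are assumptions, and the sketch does not skip headings inside fenced code.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})(\s+.*)$")


def compress_headings(markdown: str) -> str:
    """Renumber heading levels so the levels in use become 1, 2, 3, ...

    Assumption: each distinct level is mapped to its rank among the
    levels that actually appear on the page.
    """
    lines = markdown.splitlines()
    used = sorted({len(m.group(1)) for line in lines if (m := HEADING.match(line))})
    remap = {level: rank for rank, level in enumerate(used, start=1)}
    out = []
    for line in lines:
        m = HEADING.match(line)
        if m:
            line = "#" * remap[len(m.group(1))] + m.group(2)
        out.append(line)
    return "\n".join(out)


def dedupe_links(links: List[Dict[str, str]], max_links: int = 100) -> List[Dict[str, str]]:
    """Drop repeated links (same text and URL), keep order, cap the count.

    Assumption: the deduplication key is the (text, url) pair.
    """
    seen = set()
    unique = []
    for link in links:
        key = (link["text"], link["url"])
        if key not in seen:
            seen.add(key)
            unique.append(link)
    return unique[:max_links]


print(compress_headings("# Title\n\n#### Section\n\n##### Detail"))
```

With `max_links=0` the second helper returns an empty list, which mirrors the documented "set to 0 for no links" behavior.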

Under the hood

  • Stealth browser. Runs on Camoufox, a hardened Firefox fork with anti-fingerprinting. Sites see a normal visitor, not a bot.
  • uBlock Origin built in. Ads, trackers, and cookie walls are blocked before the page even renders.
  • Visibility filtering. Uses the browser's own visibility engine to skip hidden elements (collapsed menus, cookie popups, off-screen content, display: none blocks). Only what's actually on screen gets extracted.
  • Image placeholders. Images are replaced with a tiny SVG instead of downloaded. Keeps page layouts intact while cutting bandwidth and cost.
  • Layout table handling. Old-school sites that use <table> for layout (like Hacker News) are detected automatically. Content comes out as clean text, not broken markdown tables.

Input

| Field | Type | Description |
| --- | --- | --- |
| url | string | A single URL to convert. |
| urls | string[] | Multiple URLs to convert in one run. Saves cost by reusing the browser session. |
| proxyConfiguration | object | Enable proxies to avoid blocks. Use residential proxies for sites that block datacenter IPs. |
| waitForSelector | string | Wait for a specific element to appear before extracting. Useful for pages that load content dynamically. |
| removeSelectors | string[] | Elements to remove before extraction (e.g. sidebar, comments, ad banners). |
| maxLinks | integer | Maximum number of links to extract across all sections. Default: 100. Set to 0 for no links. |
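
As an illustration of how these fields fit together, here is a small Python sketch that assembles a run input. The field names come from the table above. The `build_run_input` helper, its validation rules (at least one URL, non-negative `maxLinks`), and the example selectors are assumptions made for this sketch, not documented behavior.

```python
from typing import Any, Dict, List, Optional


def build_run_input(
    url: Optional[str] = None,
    urls: Optional[List[str]] = None,
    wait_for_selector: Optional[str] = None,
    remove_selectors: Optional[List[str]] = None,
    max_links: int = 100,
    proxy_configuration: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble an input object using the field names from the table.

    Hypothetical helper: unset optional fields are simply left out.
    """
    if not url and not urls:
        raise ValueError("Provide at least one of url or urls")
    if max_links < 0:
        raise ValueError("max_links must be 0 or greater")

    run_input: Dict[str, Any] = {"maxLinks": max_links}
    if url:
        run_input["url"] = url
    if urls:
        run_input["urls"] = list(urls)
    if wait_for_selector:
        run_input["waitForSelector"] = wait_for_selector
    if remove_selectors:
        run_input["removeSelectors"] = list(remove_selectors)
    if proxy_configuration:
        # Passed through untouched; the exact shape this Actor expects
        # is not specified in this README.
        run_input["proxyConfiguration"] = proxy_configuration
    return run_input


run_input = build_run_input(
    urls=["https://apify.com", "https://news.ycombinator.com"],
    wait_for_selector="main",               # example selector, an assumption
    remove_selectors=[".sidebar", "#comments"],
    max_links=50,
)
print(run_input)
```

The resulting dictionary is what you would paste into the Actor's input editor as JSON, or pass as the run input when starting the Actor through the Apify API.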

Tips

  • Try without proxies first. The stealth browser gets through most anti-bot protections on its own. Only enable proxies if you're getting blocked. Start with datacenter proxies, and switch to residential only if needed.
  • How it waits. The Actor waits for the page DOM to load, then gives JavaScript 3 seconds to render dynamic content. If you set waitForSelector, it will also wait up to 10 seconds for that element to appear. For most sites, the default behavior is enough.
  • Multiple URLs. You can use url and urls together. Duplicates are removed automatically. Pages are processed one at a time to keep things stable.
  • Automatic retries. If a page fails to load, it retries up to 3 times before moving on to the next URL.
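
The URL merging and retry behavior from the tips above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Actor's code: `merge_urls` and `fetch_with_retries` are hypothetical names, `fetch` stands in for whatever loads a page, and "up to 3 times" is read here as three attempts in total.

```python
from typing import Callable, Iterable, List, Optional


def merge_urls(url: Optional[str], urls: Optional[Iterable[str]]) -> List[str]:
    """Combine url and urls, dropping duplicates while keeping order."""
    candidates = ([url] if url else []) + list(urls or [])
    # dict.fromkeys keeps the first occurrence of each URL, in order.
    return list(dict.fromkeys(candidates))


def fetch_with_retries(
    fetch: Callable[[str], str],
    url: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Call fetch(url) until it succeeds or the attempts run out.

    Returns None after the final failure so the caller can move on
    to the next URL.
    """
    for _ in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            continue
    return None


def process_all(fetch: Callable[[str], str], url: Optional[str], urls: Optional[Iterable[str]]) -> dict:
    """Process pages one at a time, as the tips describe."""
    return {u: fetch_with_retries(fetch, u) for u in merge_urls(url, urls)}


print(merge_urls("https://apify.com", ["https://apify.com", "https://docs.apify.com/"]))
```

Processing sequentially keeps the sketch simple and matches the "one at a time" note. A failed page yields `None` instead of stopping the whole run.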

Use cases

  • RAG pipelines. Feed clean page content into vector databases.
  • AI agents. Give agents focused page content with navigation separated from main content.
  • Content monitoring. Track changes in the body without nav/footer noise.
  • Competitive analysis. Extract and compare landing pages across competitors.
  • Dataset building. Create clean text datasets from web pages.

Missing something?

This is a solo project and I'm actively building on it. If you need a feature, have a bug, or want something changed, just open an issue or message me directly. I read everything and ship fast.