Stealth Website Crawler
Crawl websites protected by Cloudflare, DataDome, and other anti-bot systems. Extract clean text or markdown for AI/LLM pipelines. Like Website Content Crawler, but for sites that block you.
Developer: Nocturne

What does Stealth Website Crawler do?
Stealth Website Crawler extracts content from websites that block standard scrapers. It uses a binary-patched Chromium browser (Patchright) running in headed mode on a virtual display with full fingerprint spoofing, human behavior simulation, and container marker hiding.
If Website Content Crawler fails on a site because of Cloudflare, DataDome, or other anti-bot protection, this actor is the solution.
- Extract clean markdown, HTML, or plain text from any web page
- Bypass Cloudflare, DataDome, Akamai, PerimeterX, and other anti-bot systems
- Take screenshots of any page, element, or full page
- Run custom JavaScript on protected pages
- Perform interactive actions (click, fill, scroll, hover) without writing code
- Capture network responses (intercept JSON/API calls the page makes)
- Crawl entire sites with link following, URL filters, and depth control
- Extract metadata (title, description, author, canonical URL, language, Open Graph)
What can it bypass?
| Protection | Status |
|---|---|
| Cloudflare Turnstile & Challenge | Bypassed |
| DataDome | Bypassed |
| Akamai | Bypassed |
| PerimeterX / HUMAN | Bypassed |
| Fingerprint.com | Bypassed |
| CreepJS | Bypassed |
| Pixelscan | Bypassed |
| navigator.webdriver detection | Bypassed |
| CDP (Chrome DevTools Protocol) detection | Bypassed |
| Canvas fingerprinting | Session-stable noise |
| AudioContext fingerprinting | Session-stable noise |
| WebGL renderer detection | Spoofed |
How to use Stealth Website Crawler
- Create a free Apify account
- Open Stealth Website Crawler in Apify Console
- Enter the URLs you want to crawl
- Click Start and wait for the run to finish
- Download your data as JSON, CSV, Excel, or use the API
What data can you extract?
| Field | Description |
|---|---|
| url | Original URL requested |
| loadedUrl | Final URL after redirects |
| title | Page title |
| content | Clean extracted content (markdown, HTML, or text) |
| contentLength | Length of extracted content |
| statusCode | HTTP status code |
| metadata.description | Meta description or Open Graph description |
| metadata.author | Author meta tag |
| metadata.keywords | Keywords meta tag |
| metadata.canonicalUrl | Canonical URL |
| metadata.languageCode | Language from HTML lang attribute |
| metadata.ogTitle | Open Graph title |
| metadata.ogImage | Open Graph image URL |
| crawl.depth | How many links deep from the start URL |
| crawl.referrerUrl | The page that linked to this one |
| screenshotKey | Key-value store key for screenshot (if enabled) |
| scrapedAt | ISO timestamp of when the page was scraped |
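The fields above map directly onto downstream processing. As a minimal sketch, here is one way to split each item's `content` into overlapping chunks for a RAG pipeline; the field names match the output table, but the chunk size and overlap are arbitrary illustrative choices, not actor defaults:

```python
# Sketch: turn crawler output items into overlapping text chunks with
# source metadata attached, ready for embedding in a vector database.
def chunk_items(items, chunk_size=1000, overlap=100):
    """Split each item's content into overlapping chunks, keeping url/title."""
    chunks = []
    for item in items:
        text = item.get("content") or ""
        start = 0
        while start < len(text):
            chunks.append({
                "text": text[start:start + chunk_size],
                "url": item.get("url"),
                "title": item.get("title"),
            })
            start += chunk_size - overlap  # step forward, keeping overlap
    return chunks
```

Each chunk carries its source `url` and `title`, so answers generated from the chunks can cite the page they came from.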
Two modes of operation
Crawl mode (default)
Give it URLs and it crawls pages automatically, following same-domain links. No code required.
Example input:
```json
{
  "startUrls": [{ "url": "https://cloudflare-protected-site.com" }],
  "maxCrawlPages": 50,
  "outputFormat": "markdown",
  "followLinks": true,
  "maxCrawlDepth": 3
}
```
Example output:
```json
{
  "url": "https://example.com/about",
  "loadedUrl": "https://example.com/about",
  "title": "About Us - Example Company",
  "content": "# About Us\n\nExample Company was founded in 2020...\n\n## Our Mission\n\nWe believe in making data accessible...",
  "contentLength": 4523,
  "statusCode": 200,
  "metadata": {
    "description": "Learn about Example Company, our mission, and our team.",
    "author": "Example Company",
    "keywords": "about, company, mission",
    "canonicalUrl": "https://example.com/about",
    "languageCode": "en",
    "ogTitle": "About Us - Example Company",
    "ogImage": "https://example.com/images/team.jpg"
  },
  "crawl": {
    "depth": 1,
    "referrerUrl": "https://example.com"
  },
  "scrapedAt": "2026-03-20T09:30:00.000Z"
}
```
Interactive mode
Provide an actions array to click buttons, fill forms, take screenshots, run JavaScript, and extract specific data. Actions are executed in order on each URL.
Example: Search and extract results
```json
{
  "startUrls": [{ "url": "https://protected-site.com/search" }],
  "actions": [
    { "type": "fill", "selector": "#search-input", "value": "web scraping" },
    { "type": "click", "selector": "button[type=submit]" },
    { "type": "wait", "selector": ".results" },
    { "type": "screenshot", "fullPage": true },
    { "type": "extractContent", "format": "markdown" }
  ]
}
```
Example: Extract rendered JavaScript content
```json
{
  "startUrls": [{ "url": "https://spa-app.com/dashboard" }],
  "initialCookies": [
    { "name": "session", "value": "abc123", "domain": "spa-app.com", "path": "/" }
  ],
  "actions": [
    { "type": "wait", "selector": ".data-loaded" },
    { "type": "javascript", "expression": "JSON.stringify([...document.querySelectorAll('.item')].map(el => ({name: el.querySelector('.name').textContent, price: el.querySelector('.price').textContent})))" },
    { "type": "screenshot" }
  ]
}
```
Example: Capture API responses from the page
```json
{
  "startUrls": [{ "url": "https://site-with-internal-api.com" }],
  "captureNetwork": true,
  "actions": [
    { "type": "scroll", "pages": 3 },
    { "type": "captureNetwork" }
  ]
}
```
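Once a run with network capture finishes, you can filter the captured responses down to the API calls you care about. A sketch under assumptions: the `networkResponses` field name and its `{url, body}` shape are illustrative guesses, so check your run's actual dataset schema before relying on them:

```python
# Sketch: pull JSON payloads out of crawl results when captureNetwork is on.
# The "networkResponses" field name and entry shape are assumptions here,
# used only to illustrate the filtering pattern.
def extract_api_payloads(items, url_substring):
    """Collect captured response bodies whose request URL contains url_substring."""
    payloads = []
    for item in items:
        for resp in item.get("networkResponses", []):
            if url_substring in resp.get("url", ""):
                payloads.append(resp.get("body"))
    return payloads
```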
Available actions
| Action | Parameters | What it does |
|---|---|---|
| click | selector | Click an element |
| fill | selector, value | Set input value instantly |
| type | selector, value | Type with human-like keystroke delays |
| scroll | pages (default 1) | Scroll with realistic behavior |
| hover | selector | Hover over an element |
| select | selector, value | Select a dropdown option |
| wait | selector / time (ms) / navigation | Wait for a condition |
| screenshot | fullPage, selector, key | Save a screenshot to the key-value store |
| javascript | expression | Run JS and return the result |
| extractHtml | selector (optional) | Get rendered DOM HTML |
| extractContent | format (markdown/html/text) | Clean content extraction |
| captureNetwork | — | Return intercepted JSON/API responses |
| humanActivity | — | Simulate scrolling and Bezier-curve mouse movement |
| mouseMove | x, y | Bezier-curve mouse movement |
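Actions can be chained freely within a single run. As an illustrative sketch (the site URL and selectors are placeholders, not real values), this input dismisses a cookie banner, scrolls through an infinite feed, then extracts markdown:

```json
{
  "startUrls": [{ "url": "https://protected-site.com/feed" }],
  "actions": [
    { "type": "click", "selector": "#accept-cookies" },
    { "type": "scroll", "pages": 3 },
    { "type": "wait", "time": 2000 },
    { "type": "extractContent", "format": "markdown" }
  ]
}
```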
How can I use the scraped data?
- AI and LLM pipelines: Feed content from anti-bot sites into RAG pipelines, vector databases, or LLM fine-tuning. Works with LangChain, LlamaIndex, and other frameworks.
- Competitor monitoring: Track content changes on Cloudflare-protected competitor websites.
- Price tracking: Extract product prices from e-commerce sites with aggressive anti-bot protection (Nike, Amazon, Booking.com).
- Lead generation: Scrape business directories and review sites behind anti-bot protection.
- SEO research: Crawl competitor sites to analyze content structure, metadata, and internal linking.
- Academic research: Extract data from government portals, news sites, and academic databases behind Cloudflare.
- Brand monitoring: Track mentions and content on social platforms with browser fingerprinting.
- Market research: Scrape real estate listings, job postings, and travel prices from protected sites.
Input configuration
| Field | Default | Description |
|---|---|---|
| startUrls | required | URLs to process |
| actions | none | Actions array (enables interactive mode) |
| maxCrawlPages | 10 | Max pages to crawl |
| maxCrawlDepth | 20 | Max link depth from start URLs |
| outputFormat | markdown | Content format: markdown, html, or text |
| followLinks | true | Follow same-domain links |
| includeUrlGlobs | none | Only crawl URLs matching these glob patterns |
| excludeUrlGlobs | none | Skip URLs matching these glob patterns |
| waitForSelector | none | CSS selector to wait for before extraction |
| takeScreenshots | false | Screenshot each page to the key-value store |
| blockResources | false | Block images/fonts/CSS to save proxy bandwidth |
| initialCookies | none | Cookies for authenticated scraping: [{name, value, domain, path}] |
| captureNetwork | false | Record JSON/API responses made by the page |
| maxConcurrency | 3 | Concurrent pages (lower = safer stealth, higher = faster) |
| requestTimeoutSecs | 30 | Page load timeout in seconds |
| maxRetries | 1 | Retry attempts for failed pages |
| proxyConfig | Apify residential | Proxy settings (residential strongly recommended) |
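Putting several of these options together, a focused, bandwidth-conscious crawl might look like the sketch below. The field names come from the table above; the URL, glob values, and the exact shape of the proxy object are illustrative and should be checked against the actor's input schema:

```json
{
  "startUrls": [{ "url": "https://example.com/blog" }],
  "maxCrawlPages": 100,
  "maxCrawlDepth": 2,
  "outputFormat": "markdown",
  "includeUrlGlobs": ["https://example.com/blog/*"],
  "excludeUrlGlobs": ["https://example.com/blog/tag/*"],
  "blockResources": true,
  "maxConcurrency": 2,
  "proxyConfig": { "useApifyProxy": true, "apifyProxyGroups": ["RESIDENTIAL"] }
}
```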
How it compares to Website Content Crawler
| Feature | Website Content Crawler | Stealth Website Crawler |
|---|---|---|
| Anti-bot bypass | JS-level fingerprints | Binary-patched Chromium + full stealth stack |
| Cloudflare sites | Often blocked | Works |
| DataDome / Akamai sites | Usually blocked | Works |
| Browser mode | Headless | Headed on virtual display (more stealthy) |
| Screenshots | Firefox only | Any page, element, or full page |
| Custom JS execution | Page function (code required) | javascript action (no code needed) |
| Interactive actions | Click to expand only | 14 action types with no code |
| Network response capture | No | Intercepts JSON/API responses |
| Human behavior simulation | No | Bezier-curve mouse, variable typing, random scrolling |
| Content extraction | 5 HTML transformers | Readability + Markdown/HTML/Text |
| Metadata extraction | Title, description, language | Title, description, author, keywords, canonical, language, OG tags |
| Pricing | Free + platform usage | Free + platform usage |
Integrations
You can connect Stealth Website Crawler with your existing tools and workflows:
- API: Run the actor programmatically via the Apify API using Node.js or Python clients
- Webhooks: Get notified when a run finishes
- Make (Integromat): Automate workflows with scraped data
- Zapier: Connect to 5,000+ apps
- Google Sheets: Export results directly to spreadsheets
- Slack: Send notifications about scraping results
- GitHub: Trigger runs from CI/CD pipelines
- Airbyte: Sync data to databases and warehouses
Use via API
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("nocturne/stealth-website-crawler").call(run_input={
    "startUrls": [{"url": "https://cloudflare-protected-site.com"}],
    "maxCrawlPages": 10,
    "outputFormat": "markdown",
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["title"], item["url"])
    print(item["content"][:200])
```
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('nocturne/stealth-website-crawler').call({
    startUrls: [{ url: 'https://cloudflare-protected-site.com' }],
    maxCrawlPages: 10,
    outputFormat: 'markdown',
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach(item => console.log(item.title, item.url));
```
Frequently asked questions
How is this different from Website Content Crawler? Website Content Crawler uses standard Playwright with JavaScript-level fingerprint injection. Stealth Website Crawler uses Patchright, which patches Chromium at the binary level to remove automation detection leaks. It also runs in headed mode on a virtual display (Xvfb) with canvas/audio/WebGL fingerprint noise, human behavior simulation, and container marker hiding. It works on sites where WCC gets blocked.
Do I need residential proxies? Strongly recommended. The actor defaults to Apify residential proxies. Datacenter IPs are blocked by most anti-bot systems regardless of how good the browser stealth is. Residential proxies add ~$0.002/page in bandwidth cost.
Can it solve CAPTCHAs? The stealth browser avoids triggering CAPTCHAs in most cases by appearing as a real user. If a CAPTCHA does appear, the page is automatically retried with a new proxy IP. Explicit CAPTCHA solving (reCAPTCHA, hCaptcha) is not currently included.
Does it work on login-required pages?
You can pass initialCookies for authenticated sessions. Use your browser's developer tools to copy session cookies, then pass them as input. The actor does not handle login flows (username/password entry) automatically, but you can use interactive mode with fill and click actions to automate login.
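An automated login can be sketched with the `fill`, `click`, and `wait` actions described above. The selectors and credentials below are placeholders; adapt them to the target site's login form:

```json
{
  "startUrls": [{ "url": "https://protected-site.com/login" }],
  "actions": [
    { "type": "fill", "selector": "#email", "value": "you@example.com" },
    { "type": "fill", "selector": "#password", "value": "YOUR_PASSWORD" },
    { "type": "click", "selector": "button[type=submit]" },
    { "type": "wait", "navigation": true },
    { "type": "extractContent", "format": "markdown" }
  ]
}
```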
How much does it cost? The actor itself is free. You pay only for Apify platform usage (compute and proxy bandwidth). A typical crawl of 1,000 pages costs approximately $3-8 depending on page complexity and proxy usage. Check the Apify pricing page for details.
Can I use it with Make, Zapier, or other integrations? Yes. The actor works with all standard Apify integrations including Make, Zapier, Slack, Google Sheets, webhooks, and the Apify API (Node.js and Python clients).
Can I use it via the API? Yes. You can run the actor programmatically using the Apify API, the Python client, or the Node.js client. See the API usage examples above.
What's the success rate? Varies by target site and protection system. Typical success rates with residential proxies:
- Cloudflare-protected sites: 90-98%
- DataDome sites: 85-95%
- Akamai sites: 85-95%
- PerimeterX sites: 80-90%
Can I scrape JavaScript-rendered (SPA) pages?
Yes. The actor runs a full Chromium browser that renders JavaScript completely before extracting content. Use waitForSelector to wait for dynamic content to load, or use javascript actions to extract data from the rendered DOM.
Is it legal to scrape websites?
Scraping publicly available data is generally considered legal based on the US Ninth Circuit Court ruling (hiQ Labs v. LinkedIn). However:
- Always respect the website's Terms of Service
- Do not scrape personal data without a lawful basis under GDPR/CCPA
- Do not overload target servers with excessive request rates
- Consider using the `maxConcurrency` setting to limit parallel requests
We recommend consulting a legal professional if you have questions about scraping specific websites. Read Apify's blog post on the legality of web scraping for more context.
Tips for best results
- Use residential proxies: Always use Apify residential proxies for anti-bot sites. Datacenter IPs are detected and blocked regardless of browser stealth.
- Lower concurrency for harder sites: Set `maxConcurrency` to 1-2 for sites with aggressive anti-bot protection. Higher concurrency increases detection risk.
- Use `waitForSelector`: For JavaScript-heavy sites, specify a CSS selector that appears only after the content loads.
- Block resources to save bandwidth: Enable `blockResources` if you don't need images, fonts, or CSS. This reduces residential proxy costs significantly.
- Use include/exclude globs: Focus your crawl on relevant pages. For example, `includeUrlGlobs: ["https://example.com/blog/*"]` avoids crawling unrelated sections.
Feedback and support
If you encounter any issues or have suggestions, please open an issue in the Issues tab. We actively monitor and respond to all reports.
Found a bug? Have a feature request? We want to hear from you.