Stealth Website Crawler
Crawl websites protected by Cloudflare, DataDome, and other anti-bot systems. Extract clean text or markdown for AI/LLM pipelines. Like Website Content Crawler, but for sites that block you.
Developer: Nocturne

What does Stealth Website Crawler do?
Stealth Website Crawler extracts content from websites that block standard scrapers. It uses a binary-patched Chromium browser (Patchright) running in headed mode on a virtual display with full fingerprint spoofing, human behavior simulation, and container marker hiding.
If Website Content Crawler fails on a site because of Cloudflare, DataDome, or other anti-bot protection, this actor is the solution.
- Extract clean markdown, HTML, or plain text from any web page
- Bypass Cloudflare, DataDome, Akamai, PerimeterX, and other anti-bot systems
- Take screenshots of any page, element, or full page
- Run custom JavaScript on protected pages
- Perform interactive actions (click, fill, scroll, hover) without writing code
- Capture network responses (intercept JSON/API calls the page makes)
- Crawl entire sites with link following, URL filters, and depth control
- Extract metadata (title, description, author, canonical URL, language, Open Graph)
What can it bypass?
| Protection | Status |
|---|---|
| Cloudflare Turnstile & Challenge | Bypassed |
| DataDome | Bypassed |
| Akamai | Bypassed |
| PerimeterX / HUMAN | Bypassed |
| Fingerprint.com | Bypassed |
| CreepJS | Bypassed |
| Pixelscan | Bypassed |
| navigator.webdriver detection | Bypassed |
| CDP (Chrome DevTools Protocol) detection | Bypassed |
| Canvas fingerprinting | Session-stable noise |
| AudioContext fingerprinting | Session-stable noise |
| WebGL renderer detection | Spoofed |
How to use Stealth Website Crawler
- Create a free Apify account
- Open Stealth Website Crawler in Apify Console
- Enter the URLs you want to crawl
- Click Start and wait for the run to finish
- Download your data as JSON, CSV, Excel, or use the API
What data can you extract?
| Field | Description |
|---|---|
| url | Original URL requested |
| loadedUrl | Final URL after redirects |
| title | Page title |
| content | Clean extracted content (markdown, HTML, or text) |
| contentLength | Length of extracted content |
| statusCode | HTTP status code |
| metadata.description | Meta description or Open Graph description |
| metadata.author | Author meta tag |
| metadata.keywords | Keywords meta tag |
| metadata.canonicalUrl | Canonical URL |
| metadata.languageCode | Language from HTML lang attribute |
| metadata.ogTitle | Open Graph title |
| metadata.ogImage | Open Graph image URL |
| crawl.depth | How many links deep from the start URL |
| crawl.referrerUrl | The page that linked to this one |
| screenshotKey | Key-value store key for screenshot (if enabled) |
| scrapedAt | ISO timestamp of when the page was scraped |
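The fields above map directly onto downstream processing. As a minimal sketch, here is one way to split each item's `content` into overlapping chunks for a RAG pipeline; the field names match the output table, but the chunk size and overlap are arbitrary illustrative choices, not actor defaults:

```python
# Sketch: turn crawler output items into overlapping text chunks with
# source metadata attached, ready for embedding in a vector database.
def chunk_items(items, chunk_size=1000, overlap=100):
    """Split each item's content into overlapping chunks, keeping url/title."""
    chunks = []
    for item in items:
        text = item.get("content") or ""
        start = 0
        while start < len(text):
            chunks.append({
                "text": text[start:start + chunk_size],
                "url": item.get("url"),
                "title": item.get("title"),
            })
            start += chunk_size - overlap  # step forward, keeping overlap
    return chunks
```

Each chunk carries its source `url` and `title`, so answers generated from the chunks can cite the page they came from.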
Two modes of operation
Crawl mode (default)
Give it URLs and it crawls pages automatically, following same-domain links. No code required.
Example input:
```json
{
  "startUrls": [{ "url": "https://cloudflare-protected-site.com" }],
  "maxCrawlPages": 50,
  "outputFormat": "markdown",
  "followLinks": true,
  "maxCrawlDepth": 3
}
```
Example output:
```json
{
  "url": "https://example.com/about",
  "loadedUrl": "https://example.com/about",
  "title": "About Us - Example Company",
  "content": "# About Us\n\nExample Company was founded in 2020...\n\n## Our Mission\n\nWe believe in making data accessible...",
  "contentLength": 4523,
  "statusCode": 200,
  "metadata": {
    "description": "Learn about Example Company, our mission, and our team.",
    "author": "Example Company",
    "keywords": "about, company, mission",
    "canonicalUrl": "https://example.com/about",
    "languageCode": "en",
    "ogTitle": "About Us - Example Company",
    "ogImage": "https://example.com/images/team.jpg"
  },
  "crawl": {
    "depth": 1,
    "referrerUrl": "https://example.com"
  },
  "scrapedAt": "2026-03-20T09:30:00.000Z"
}
```
Interactive mode
Provide an actions array to click buttons, fill forms, take screenshots, run JavaScript, and extract specific data. Actions are executed in order on each URL.
Example: Search and extract results
```json
{
  "startUrls": [{ "url": "https://protected-site.com/search" }],
  "actions": [
    { "type": "fill", "selector": "#search-input", "value": "web scraping" },
    { "type": "click", "selector": "button[type=submit]" },
    { "type": "wait", "selector": ".results" },
    { "type": "screenshot", "fullPage": true },
    { "type": "extractContent", "format": "markdown" }
  ]
}
```
Example: Extract rendered JavaScript content
```json
{
  "startUrls": [{ "url": "https://spa-app.com/dashboard" }],
  "initialCookies": [
    { "name": "session", "value": "abc123", "domain": "spa-app.com", "path": "/" }
  ],
  "actions": [
    { "type": "wait", "selector": ".data-loaded" },
    { "type": "javascript", "expression": "JSON.stringify([...document.querySelectorAll('.item')].map(el => ({name: el.querySelector('.name').textContent, price: el.querySelector('.price').textContent})))" },
    { "type": "screenshot" }
  ]
}
```
Example: Capture API responses from the page
```json
{
  "startUrls": [{ "url": "https://site-with-internal-api.com" }],
  "captureNetwork": true,
  "actions": [
    { "type": "scroll", "pages": 3 },
    { "type": "captureNetwork" }
  ]
}
```
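Once a run with network capture finishes, you can filter the captured responses down to the API calls you care about. A sketch under assumptions: the `networkResponses` field name and its `{url, body}` shape are illustrative guesses, so check your run's actual dataset schema before relying on them:

```python
# Sketch: pull JSON payloads out of crawl results when captureNetwork is on.
# The "networkResponses" field name and entry shape are assumptions here,
# used only to illustrate the filtering pattern.
def extract_api_payloads(items, url_substring):
    """Collect captured response bodies whose request URL contains url_substring."""
    payloads = []
    for item in items:
        for resp in item.get("networkResponses", []):
            if url_substring in resp.get("url", ""):
                payloads.append(resp.get("body"))
    return payloads
```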
Available actions
| Action | Parameters | What it does |
|---|---|---|
| click | selector | Click an element |
| fill | selector, value | Set input value instantly |
| type | selector, value | Type with human-like keystroke delays |
| scroll | pages (default 1) | Scroll with realistic behavior |
| hover | selector | Hover over an element |
| select | selector, value | Select a dropdown option |
| wait | selector / time (ms) / navigation | Wait for a condition |
| screenshot | fullPage, selector, key | Save a screenshot to the key-value store |
| javascript | expression | Run JS and return the result |
| extractHtml | selector (optional) | Get rendered DOM HTML |
| extractContent | format (markdown/html/text) | Clean content extraction |
| captureNetwork | — | Return intercepted JSON/API responses |
| humanActivity | — | Simulate scrolling and Bezier-curve mouse movement |
| mouseMove | x, y | Bezier-curve mouse movement |
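Actions can be chained freely within a single run. As an illustrative sketch (the site URL and selectors are placeholders, not real values), this input dismisses a cookie banner, scrolls through an infinite feed, then extracts markdown:

```json
{
  "startUrls": [{ "url": "https://protected-site.com/feed" }],
  "actions": [
    { "type": "click", "selector": "#accept-cookies" },
    { "type": "scroll", "pages": 3 },
    { "type": "wait", "time": 2000 },
    { "type": "extractContent", "format": "markdown" }
  ]
}
```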
How can I use the scraped data?
- AI and LLM pipelines: Feed content from anti-bot sites into RAG pipelines, vector databases, or LLM fine-tuning. Works with LangChain, LlamaIndex, and other frameworks.
- Competitor monitoring: Track content changes on Cloudflare-protected competitor websites.
- Price tracking: Extract product prices from e-commerce sites with aggressive anti-bot protection (Nike, Amazon, Booking.com).
- Lead generation: Scrape business directories and review sites behind anti-bot protection.
- SEO research: Crawl competitor sites to analyze content structure, metadata, and internal linking.
- Academic research: Extract data from government portals, news sites, and academic databases behind Cloudflare.
- Brand monitoring: Track mentions and content on social platforms with browser fingerprinting.
- Market research: Scrape real estate listings, job postings, and travel prices from protected sites.
Input configuration
| Field | Default | Description |
|---|---|---|
| startUrls | required | URLs to process |
| actions | none | Actions array (enables interactive mode) |
| maxCrawlPages | 10 | Max pages to crawl |
| maxCrawlDepth | 20 | Max link depth from start URLs |
| outputFormat | markdown | Content format: markdown, html, or text |
| followLinks | true | Follow same-domain links |
| includeUrlGlobs | none | Only crawl URLs matching these glob patterns |
| excludeUrlGlobs | none | Skip URLs matching these glob patterns |
| waitForSelector | none | CSS selector to wait for before extraction |
| takeScreenshots | false | Screenshot each page to the key-value store |
| blockResources | false | Block images/fonts/CSS to save proxy bandwidth |
| initialCookies | none | Cookies for authenticated scraping: [{name, value, domain, path}] |
| captureNetwork | false | Record JSON/API responses made by the page |
| maxConcurrency | 3 | Concurrent pages (lower = safer stealth, higher = faster) |
| requestTimeoutSecs | 30 | Page load timeout in seconds |
| maxRetries | 1 | Retry attempts for failed pages |
| proxyConfig | Apify residential | Proxy settings (residential strongly recommended) |
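Putting several of these options together, a focused, bandwidth-conscious crawl might look like the sketch below. The field names come from the table above; the URL, glob values, and the exact shape of the proxy object are illustrative and should be checked against the actor's input schema:

```json
{
  "startUrls": [{ "url": "https://example.com/blog" }],
  "maxCrawlPages": 100,
  "maxCrawlDepth": 2,
  "outputFormat": "markdown",
  "includeUrlGlobs": ["https://example.com/blog/*"],
  "excludeUrlGlobs": ["https://example.com/blog/tag/*"],
  "blockResources": true,
  "maxConcurrency": 2,
  "proxyConfig": { "useApifyProxy": true, "apifyProxyGroups": ["RESIDENTIAL"] }
}
```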
How it compares to Website Content Crawler
| Feature | Website Content Crawler | Stealth Website Crawler |
|---|---|---|
| Anti-bot bypass | JS-level fingerprints | Binary-patched Chromium + full stealth stack |
| Cloudflare sites | Often blocked | Works |
| DataDome / Akamai sites | Usually blocked | Works |
| Browser mode | Headless | Headed on virtual display (more stealthy) |
| Screenshots | Firefox only | Any page, element, or full page |
| Custom JS execution | Page function (code required) | javascript action (no code needed) |
| Interactive actions | Click to expand only | 14 action types with no code |
| Network response capture | No | Intercepts JSON/API responses |
| Human behavior simulation | No | Bezier-curve mouse, variable typing, random scrolling |
| Content extraction | 5 HTML transformers | Readability + Markdown/HTML/Text |
| Metadata extraction | Title, description, language | Title, description, author, keywords, canonical, language, OG tags |
| Pricing | Free + platform usage | Free + platform usage |
Integrations
You can connect Stealth Website Crawler with your existing tools and workflows:
- API: Run the actor programmatically via the Apify API using Node.js or Python clients
- Webhooks: Get notified when a run finishes
- Make (Integromat): Automate workflows with scraped data
- Zapier: Connect to 5,000+ apps
- Google Sheets: Export results directly to spreadsheets
- Slack: Send notifications about scraping results
- GitHub: Trigger runs from CI/CD pipelines
- Airbyte: Sync data to databases and warehouses
Use via API
```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

run = client.actor("nocturne/stealth-website-crawler").call(run_input={
    "startUrls": [{"url": "https://cloudflare-protected-site.com"}],
    "maxCrawlPages": 10,
    "outputFormat": "markdown",
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["title"], item["url"])
    print(item["content"][:200])
```
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

const run = await client.actor('nocturne/stealth-website-crawler').call({
    startUrls: [{ url: 'https://cloudflare-protected-site.com' }],
    maxCrawlPages: 10,
    outputFormat: 'markdown',
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach(item => console.log(item.title, item.url));
```
Frequently asked questions
How is this different from Website Content Crawler? Website Content Crawler uses standard Playwright with JavaScript-level fingerprint injection. Stealth Website Crawler uses Patchright, which patches Chromium at the binary level to remove automation detection leaks. It also runs in headed mode on a virtual display (Xvfb) with canvas/audio/WebGL fingerprint noise, human behavior simulation, and container marker hiding. It works on sites where WCC gets blocked.
Do I need residential proxies? Strongly recommended. The actor defaults to Apify residential proxies. Datacenter IPs are blocked by most anti-bot systems regardless of how good the browser stealth is. Residential proxies add ~$0.002/page in bandwidth cost.
Can it solve CAPTCHAs? The stealth browser avoids triggering CAPTCHAs in most cases by appearing as a real user. If a CAPTCHA does appear, the page is automatically retried with a new proxy IP. Explicit CAPTCHA solving (reCAPTCHA, hCaptcha) is not currently included.
Does it work on login-required pages?
You can pass initialCookies for authenticated sessions. Use your browser's developer tools to copy session cookies, then pass them as input. The actor does not handle login flows (username/password entry) automatically, but you can use interactive mode with fill and click actions to automate login.
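An automated login can be sketched with the `fill`, `click`, and `wait` actions described above. The selectors and credentials below are placeholders; adapt them to the target site's login form:

```json
{
  "startUrls": [{ "url": "https://protected-site.com/login" }],
  "actions": [
    { "type": "fill", "selector": "#email", "value": "you@example.com" },
    { "type": "fill", "selector": "#password", "value": "YOUR_PASSWORD" },
    { "type": "click", "selector": "button[type=submit]" },
    { "type": "wait", "navigation": true },
    { "type": "extractContent", "format": "markdown" }
  ]
}
```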
How much does it cost? The actor itself is free. You pay only for Apify platform usage (compute and proxy bandwidth). A typical crawl of 1,000 pages costs approximately $3-8 depending on page complexity and proxy usage. Check the Apify pricing page for details.
Can I use it with Make, Zapier, or other integrations? Yes. The actor works with all standard Apify integrations including Make, Zapier, Slack, Google Sheets, webhooks, and the Apify API (Node.js and Python clients).
Can I use it via the API? Yes. You can run the actor programmatically using the Apify API, the Python client, or the Node.js client. See the API usage examples above.
What's the success rate? Varies by target site and protection system. Typical success rates with residential proxies:
- Cloudflare-protected sites: 90-98%
- DataDome sites: 85-95%
- Akamai sites: 85-95%
- PerimeterX sites: 80-90%
Can I scrape JavaScript-rendered (SPA) pages?
Yes. The actor runs a full Chromium browser that renders JavaScript completely before extracting content. Use waitForSelector to wait for dynamic content to load, or use javascript actions to extract data from the rendered DOM.
Is it legal to scrape websites?
Scraping publicly available data is generally considered legal based on the US Ninth Circuit Court ruling (hiQ Labs v. LinkedIn). However:
- Always respect the website's Terms of Service
- Do not scrape personal data without a lawful basis under GDPR/CCPA
- Do not overload target servers with excessive request rates
- Consider using the `maxConcurrency` setting to limit parallel requests
We recommend consulting a legal professional if you have questions about scraping specific websites. Read Apify's blog post on the legality of web scraping for more context.
Tips for best results
- Use residential proxies: Always use Apify residential proxies for anti-bot sites. Datacenter IPs are detected and blocked regardless of browser stealth.
- Lower concurrency for harder sites: Set `maxConcurrency` to 1-2 for sites with aggressive anti-bot protection. Higher concurrency increases detection risk.
- Use `waitForSelector`: For JavaScript-heavy sites, specify a CSS selector that appears only after the content loads.
- Block resources to save bandwidth: Enable `blockResources` if you don't need images, fonts, or CSS. This reduces residential proxy costs significantly.
- Use include/exclude globs: Focus your crawl on relevant pages. For example, `includeUrlGlobs: ["https://example.com/blog/*"]` avoids crawling unrelated sections.
Feedback and support
If you encounter any issues or have suggestions, please open an issue in the Issues tab. We actively monitor and respond to all reports.
Found a bug? Have a feature request? We want to hear from you.