Stealth Website Scraper

Pricing

from $1.50 / 1,000 results

Extract text, links, metadata, HTML, markdown, and structured page data with HTTP-first crawling and stealth-aware browser fallback.

Developer

Solutions Smart

Maintained by Community

Stealth Website Scraper extracts text, links, metadata, HTML, markdown, and structured page data from websites using a fast HTTP-first crawl with browser fallback when plain requests are not enough.

It is designed for production scraping and analysis workflows where cost, speed, and reliability all matter. The actor starts with lightweight HTTP crawling through CheerioCrawler, then falls back to a browser flow when the site blocks requests, returns thin content, or depends on JavaScript rendering.

What does Stealth Website Scraper do?

Stealth Website Scraper extracts data from both server-rendered and JavaScript-heavy pages. It tries plain HTTP requests first and switches to a browser only when needed, which keeps success rates high and costs low.

Stealth Website Scraper can extract:

  • Clean text content and markdown from web pages
  • Page metadata such as title, description, and canonical URL
  • Headings such as H1 and H2
  • Internal and external links
  • HTML source code
  • HTTP status codes and content type information
  • Crawl depth and crawl method information

Why scrape websites?

Websites contain publicly available data that can support AI pipelines, market research, competitive analysis, and business intelligence.

  • AI and RAG pipelines: Feed clean text content into machine learning models and retrieval-augmented generation systems.
  • Business intelligence: Extract metadata, pricing, and product information from competitor websites.
  • Content extraction: Build datasets for training, analysis, and enrichment workflows.
  • Testing and QA: Verify site rendering across different network conditions and browser types.
  • Market research: Gather structured data from public websites at scale.
  • Fingerprint testing: Compare stealth browser behavior against standard browser automation.

How to scrape websites with Stealth Website Scraper

  1. Click Try for free to open the actor.
  2. Enter one or more Start URLs.
  3. Configure optional settings:
    • Max Pages: Limit the number of pages to crawl.
    • Max Depth: Control how deep internal link crawling goes.
    • Crawling Mode: Choose between http-first and browser-only.
    • Stealth Browser: Select cloak to attempt CloakBrowser, or playwright for standard Playwright.
    • Extraction Mode: Choose what to extract: all, text, markdown, html, or links.
  4. Click Run.
  5. When the run completes, preview or download your data from the Dataset tab.

How much will it cost to scrape websites?

Apify gives you $5 in free usage credits every month on the Apify Free plan. Because HTTP scraping is much cheaper than browser automation, the HTTP-first strategy lets you extract many pages at low cost.

The Stealth Website Scraper uses HTTP requests first whenever possible. This means:

  • Lower compute costs
  • Faster execution
  • Higher throughput
  • Less browser overhead

Browser fallback only engages when necessary, keeping costs down while maintaining reliability.

For regular large-scale scraping, review current Apify pricing and set maxPages, maxDepth, and maxConcurrency to match your budget.
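As a rough illustration of the budgeting arithmetic, the sketch below estimates result-based cost from the listed starting price of $1.50 per 1,000 results. The function names are hypothetical, and the assumption that every scraped page is billed as exactly one result at the starting rate may not match your plan; check current Apify pricing before relying on the numbers.

```python
def estimate_result_cost(pages: int, price_per_1000: float = 1.50) -> float:
    """Estimate cost in USD, assuming one billed result per scraped page."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return round(pages * price_per_1000 / 1000, 2)


def max_pages_for_budget(budget_usd: float, price_per_1000: float = 1.50) -> int:
    """Largest maxPages value whose estimated cost stays within the budget."""
    return int(budget_usd * 1000 // price_per_1000)


# Assumed scenario: a 100-page crawl, and a $5 budget at the starting rate.
cost_100_pages = estimate_result_cost(100)
pages_for_5_usd = max_pages_for_budget(5.0)
```

Under these assumptions, a value such as pages_for_5_usd is a reasonable upper bound to pass as maxPages; compute and proxy usage are not included in the estimate.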

Results

Each processed page produces one clean dataset item.

Example output

{
  "url": "https://example.com",
  "loadedUrl": "https://example.com/",
  "domain": "example.com",
  "title": "Example Domain",
  "metaDescription": "A reserved-use domain in DNS",
  "canonicalUrl": "https://example.com/",
  "h1": ["Example Domain"],
  "h2": ["More information"],
  "text": "Example Domain This domain is for use in illustrative examples in documents...",
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples in documents...",
  "html": "<!doctype html>\n<html>\n<head>...</head>...",
  "links": ["https://example.com/more-info", "https://example.com/about"],
  "externalLinks": ["https://www.iana.org/"],
  "statusCode": 200,
  "contentType": "text/html; charset=UTF-8",
  "depth": 0,
  "crawlMethod": "http",
  "fallbackUsed": false,
  "fallbackReason": "",
  "timestamp": "2026-05-17T21:00:00.000Z"
}
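Once a dataset has been exported as a JSON array, items like the one above can be post-processed with a few lines of code. The following is a minimal sketch, not part of the actor: the function name is hypothetical, the second sample item is invented for illustration, and the "browser" value for crawlMethod and the fallbackReason string are assumptions, since the example output only shows "http" and an empty reason.

```python
import json
from collections import Counter


def summarize_items(items: list[dict]) -> dict:
    """Count items per crawl method and list the URLs that needed fallback."""
    methods = Counter(item.get("crawlMethod", "unknown") for item in items)
    fallback_urls = [item["url"] for item in items if item.get("fallbackUsed")]
    return {"byMethod": dict(methods), "fallbackUrls": fallback_urls}


# Two illustrative items; the second one's values are made up for this sketch.
items = json.loads("""
[
  {"url": "https://example.com", "crawlMethod": "http",
   "fallbackUsed": false, "fallbackReason": ""},
  {"url": "https://example.com/app", "crawlMethod": "browser",
   "fallbackUsed": true, "fallbackReason": "status_403"}
]
""")
report = summarize_items(items)
```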

The actor also stores a final summary in the key-value store under OUTPUT:

{
  "pagesScraped": 25,
  "httpPages": 22,
  "browserPages": 3,
  "cloakPages": 0,
  "failedPages": 0,
  "fallbacks": 3,
  "uniqueUrlsQueued": 27,
  "startedAt": "2026-05-17T21:00:00.000Z",
  "finishedAt": "2026-05-17T21:05:30.000Z",
  "durationSeconds": 330
}
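The summary makes it straightforward to compute how often browser fallback was needed. The sketch below uses the field names from the example summary; the function name is hypothetical, and treating fallbacks divided by pagesScraped as the "fallback rate" is one reasonable definition, not one the actor itself documents.

```python
def fallback_rate(summary: dict) -> float:
    """Share of scraped pages that needed browser fallback (0.0 to 1.0)."""
    scraped = summary.get("pagesScraped", 0)
    if scraped == 0:
        return 0.0  # avoid dividing by zero on empty runs
    return summary.get("fallbacks", 0) / scraped


# Values copied from the example summary.
example_summary = {"pagesScraped": 25, "httpPages": 22, "browserPages": 3, "fallbacks": 3}
rate = fallback_rate(example_summary)
```

A persistently high rate may suggest that browser-only mode is a better fit for the target site.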

HTTP-first vs. browser mode

HTTP-first mode

The actor starts with lightweight HTTP requests using CheerioCrawler. This is the fastest and cheapest approach.

Browser fallback triggers when:

  • The site returns 403, 429, or 503
  • The HTTP response body is empty or below minTextLengthForSuccess
  • The page appears JavaScript-heavy
  • Text extraction returns minimal content
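The trigger conditions above can be sketched as a single decision function. This is an illustration only, not the actor's implementation: the function name, the reason strings, and especially the script-tag heuristic for "JavaScript-heavy" pages are assumptions. The defaults mirror the documented fallbackOnStatusCodes and minTextLengthForSuccess values.

```python
import re

DEFAULT_FALLBACK_CODES = (403, 429, 503)


def fallback_reason(
    status_code: int,
    text: str,
    html: str = "",
    fallback_codes=DEFAULT_FALLBACK_CODES,
    min_text_length: int = 300,
) -> str:
    """Return a reason string if browser fallback seems needed, else ''."""
    if status_code in fallback_codes:
        return f"status_{status_code}"
    text_length = len(text.strip())
    if text_length < min_text_length:
        return "thin_content"
    # Assumed heuristic: many <script> tags and comparatively little text.
    script_count = len(re.findall(r"<script\b", html, flags=re.IGNORECASE))
    if script_count >= 10 and text_length < 5 * min_text_length:
        return "javascript_heavy"
    return ""
```

Returning an empty string when no fallback is needed matches the empty fallbackReason shown in the example output.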

Browser-only mode

Skip HTTP entirely and crawl exclusively with a browser. Useful for:

  • JavaScript-heavy single-page applications
  • Sites with stronger bot protection
  • Pages requiring browser rendering

Stealth browser options

  • cloak: Attempts CloakBrowser, a fingerprint-aware Chromium fork with source-level stealth patches. Falls back to standard Playwright if CloakBrowser is unavailable.
  • playwright: Uses standard Playwright Chromium.

Input parameters

Essential parameters

  • startUrls: Array of URLs or objects with a url property. Required.
  • maxPages: Maximum pages to scrape. Default: 100.
  • maxDepth: Maximum link depth for crawling. Default: 2.
  • crawlingMode: http-first or browser-only. Default: http-first.

Crawling options

  • scrapeInternalLinks: Enable internal link crawling. Default: true.
  • sameDomainOnly: Limit crawling to the start domain. Default: true.
  • maxConcurrency: Concurrent request limit. Default: 5.
  • requestTimeoutSecs: Request timeout in seconds. Default: 30.
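To make the interaction between maxDepth and sameDomainOnly concrete, here is a minimal sketch of a link filter. It is not the actor's code: the function name is hypothetical, and comparing hostnames exactly (so that subdomains count as a different domain) is an assumption, because the listing does not say how the start domain is matched.

```python
from typing import Optional
from urllib.parse import urljoin, urlparse


def should_enqueue(
    link: str,
    page_url: str,
    start_domain: str,
    depth: int,
    max_depth: int = 2,
    same_domain_only: bool = True,
) -> Optional[str]:
    """Return the absolute URL to enqueue, or None if the link is out of scope."""
    absolute = urljoin(page_url, link)  # resolve relative links
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return None  # skip mailto:, javascript:, and similar
    if depth + 1 > max_depth:
        return None  # the linked page would exceed the depth limit
    host = (parsed.hostname or "").lower()
    if same_domain_only and host != start_domain.lower():
        return None
    return absolute
```

For example, a link found on a depth-2 page is dropped when max_depth is 2, because the linked page would sit at depth 3.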

Extraction options

  • extractionMode: all, text, markdown, html, or links. Default: all.
  • includeHtml: Include full HTML source. Default: false.
  • includeLinks: Extract internal links. Default: true.
  • includeExternalLinks: Extract external links. Default: false.

Browser options

  • stealthBrowser: cloak or playwright. Default: cloak.
  • waitUntil: domcontentloaded, load, or networkidle. Default: domcontentloaded.
  • blockResources: Block images, fonts, media, and stylesheets to speed up rendering. Default: true.
  • fallbackOnStatusCodes: Status codes that trigger browser fallback. Default: [403, 429, 503].
  • minTextLengthForSuccess: Minimum text length to avoid fallback. Default: 300.

Proxy and headers

  • proxyConfiguration: Proxy setup, including Apify Proxy.
  • customUserAgent: Custom User-Agent header.
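The documented defaults and allowed values can be captured in a small helper that builds and checks an input object before a run. This is a hedged sketch: the helper name and its validation rules are hypothetical, and the actor's own input schema remains the authority on what is accepted.

```python
import copy

DEFAULTS = {
    "maxPages": 100, "maxDepth": 2, "crawlingMode": "http-first",
    "scrapeInternalLinks": True, "sameDomainOnly": True,
    "maxConcurrency": 5, "requestTimeoutSecs": 30,
    "extractionMode": "all", "includeHtml": False,
    "includeLinks": True, "includeExternalLinks": False,
    "stealthBrowser": "cloak", "waitUntil": "domcontentloaded",
    "blockResources": True, "fallbackOnStatusCodes": [403, 429, 503],
    "minTextLengthForSuccess": 300,
}
ALLOWED = {
    "crawlingMode": {"http-first", "browser-only"},
    "extractionMode": {"all", "text", "markdown", "html", "links"},
    "stealthBrowser": {"cloak", "playwright"},
    "waitUntil": {"domcontentloaded", "load", "networkidle"},
}
OPTIONAL_KEYS = {"proxyConfiguration", "customUserAgent"}  # no documented default


def build_input(start_urls: list, **overrides) -> dict:
    """Merge overrides into the documented defaults and validate enum fields."""
    if not start_urls:
        raise ValueError("startUrls is required")
    unknown = set(overrides) - set(DEFAULTS) - OPTIONAL_KEYS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    actor_input = {**copy.deepcopy(DEFAULTS), **overrides}
    # Accept both plain strings and objects with a url property.
    actor_input["startUrls"] = [
        {"url": u} if isinstance(u, str) else u for u in start_urls
    ]
    for key, allowed in ALLOWED.items():
        if actor_input[key] not in allowed:
            raise ValueError(f"{key} must be one of {sorted(allowed)}")
    return actor_input
```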

Example input

{
  "startUrls": [
    { "url": "https://example.com" },
    "https://example.org/docs"
  ],
  "maxPages": 50,
  "maxDepth": 2,
  "sameDomainOnly": true,
  "scrapeInternalLinks": true,
  "extractionMode": "all",
  "crawlingMode": "http-first",
  "stealthBrowser": "cloak",
  "fallbackOnStatusCodes": [403, 429, 503],
  "minTextLengthForSuccess": 300,
  "waitUntil": "domcontentloaded",
  "blockResources": true,
  "includeHtml": false,
  "includeLinks": true,
  "includeExternalLinks": false,
  "maxConcurrency": 5,
  "requestTimeoutSecs": 30,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}

Tips for scraping websites

  • Start with http-first mode: Most websites serve useful HTML on the initial request. HTTP-first saves money and runs faster.
  • Set appropriate depth limits: Use maxPages and maxDepth to control crawl scope and stay within budget.
  • Use domain filtering: Enable sameDomainOnly to prevent crawling into unrelated domains.
  • Adjust timeout settings: Increase requestTimeoutSecs for slow or distant servers.
  • Enable resource blocking: Keep blockResources set to true to skip heavy browser resources.
  • Monitor fallback rates: Check the run summary to see how many pages needed browser fallback.
  • Test stealth options: Use stealthBrowser: "cloak" if standard Playwright gets blocked.
  • Respect robots.txt: Review website policies before scraping.
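As one way to act on the robots.txt tip, the sketch below checks a URL against robots.txt rules with Python's standard urllib.robotparser before adding it to startUrls. The listing does not say whether the actor checks robots.txt itself, so this is presented as a separate pre-flight step; the function name and the sample rules are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against robots.txt content that was fetched beforehand."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative rules, not taken from any real site.
sample_rules = """User-agent: *
Disallow: /private/
"""
docs_ok = is_allowed(sample_rules, "https://example.com/docs")
private_ok = is_allowed(sample_rules, "https://example.com/private/report")
```

Passing the robots.txt text in as a string keeps the check independent of how the file is downloaded.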

Limitations

  • Browser mode is more expensive than HTTP mode.
  • Some websites require authentication, session warmup, or custom logic.
  • Very aggressive protection systems may still throttle or block requests.
  • CloakBrowser requires a binary download at runtime if it is not preinstalled.
  • External links can be extracted, but crawl expansion stays focused on start domains by default.

Is it legal to scrape websites?

Scraping is legal in many jurisdictions, but you still need to follow applicable laws and website policies.

  • Respect robots.txt: Check the website's robots.txt file and follow its rules where appropriate.
  • Review Terms of Service: Some sites explicitly prohibit scraping in their terms.
  • Protect personal data: Personal data may be protected by GDPR and similar laws. Only scrape it when you have a lawful basis.
  • Do not overload servers: Use appropriate concurrency and crawl limits.
  • Respect copyright: Do not republish copyrighted content without permission.

If you are unsure whether scraping a specific website is legal for your use case, consult a lawyer. For more information, read "Is web scraping legal?"