Pricing

from $10.00 / 1,000 pages scraped

Modern Web Crawler — Adaptive + Stealth + Analytics

Modern replacement for the Legacy PhantomJS Crawler. Auto HTTP/Browser detection, basic anti-bot stealth, built-in analytics, data quality scoring, captcha solver integration. Modern Chrome + Cheerio engine — no PhantomJS, no abandoned tech. Proxies included.

Developer

Yuliia Kulakova

Last modified

2 months ago

🔥 Why migrate from PhantomJS Crawler?

The Legacy PhantomJS Crawler ships with a browser engine abandoned in 2018. It supports only ES5.1, can't render modern HTML5/CSS3, is easily fingerprinted by anti-bot systems, and receives no security updates.

This actor is what PhantomJS Crawler users have been asking for: a modern crawler with a familiar input shape (startUrls, pageFunction, globs/excludeGlobs), backed by real Chrome and Cheerio.

✨ What you get

🎯 Adaptive crawler mode

Pick HTTP, Browser, or let the crawler decide:

HTTP (Cheerio) — fast and cheap, ~10–20× lighter than browser mode. Great for static pages, news sites, blogs, REST APIs.
Browser (Playwright + Chromium) — full JavaScript rendering. Required for SPAs, React/Vue/Next.js, infinite scroll, sites that fetch content via XHR.
Adaptive (default) — fetches every page with HTTP first. If the page is detected as JavaScript-rendered (empty content, SPA shell, inline var data=[...] patterns, fetch/XHR-driven rendering), the crawler remembers the domain and switches it to Browser mode for the rest of the crawl. Best of both worlds — no wasted browser launches on static content.
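
The adaptive decision described above can be sketched roughly as follows. This is a minimal illustration, not the actor's actual detection code: the function names, the 200-character threshold, and the specific patterns checked are all assumptions chosen for the example.

```javascript
// Hypothetical heuristic, NOT the actor's real detection logic.
// Domains that were detected as JavaScript-rendered earlier in the crawl.
const jsRenderedDomains = new Set();

function looksJsRendered(html) {
  // Estimate how much visible text the raw HTML carries without running any JS.
  const visibleText = html
    .replace(/<script[\s\S]*?<\/script>/gi, ' ')
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
  // An empty framework mount point is a typical SPA shell.
  const hasEmptyMountPoint =
    /<div[^>]+id=["'](root|app|__next)["'][^>]*>\s*<\/div>/i.test(html);
  const hasJsNotice = /enable javascript/i.test(html);
  // The 200-character threshold is an arbitrary value for this sketch.
  return visibleText.length < 200 || hasEmptyMountPoint || hasJsNotice;
}

function chooseMode(url, html) {
  const { hostname } = new URL(url);
  // Once a domain is flagged, every later page on it goes straight to the browser.
  if (jsRenderedDomains.has(hostname)) return 'browser';
  if (looksJsRendered(html)) {
    jsRenderedDomains.add(hostname);
    return 'browser';
  }
  return 'http';
}
```

Remembering the decision per domain is what keeps static sites on the cheap HTTP path while avoiding a repeated HTTP-then-browser round trip on sites already known to need rendering.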

🥷 Anti-bot stealth (for sites with basic protection)

Built-in patches that defeat common detection patterns:

navigator.webdriver flag masking
Realistic plugin / language / permission fingerprints
Rotating, browser-realistic User-Agent + full Sec-Fetch-* header set
Session pool with cookie persistence across requests
Per-request random delays with human-like jitter
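
As an illustration of the last item, a jittered delay between requests might look like the sketch below. The helper names, the uniform distribution, and the default values are assumptions for the example; the actor's own delay logic is not documented here and may differ.

```javascript
// Hypothetical helpers; names and defaults are illustrative only.
function jitteredDelayMs(baseMs, jitterRatio = 0.5, random = Math.random) {
  // Spread the delay uniformly over [base * (1 - ratio), base * (1 + ratio)].
  const spread = baseMs * jitterRatio;
  return Math.round(baseMs - spread + random() * 2 * spread);
}

function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Fetch URLs one by one, pausing a slightly different amount of time after each.
// `fetchPage` is any async function supplied by the caller.
async function politeFetchAll(urls, fetchPage, baseMs = 1000) {
  const results = [];
  for (const url of urls) {
    results.push(await fetchPage(url));
    await sleep(jitteredDelayMs(baseMs));
  }
  return results;
}
```

Passing the random source as a parameter keeps the delay calculation testable with a fixed value.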

What this defeats: sites with basic to moderate detection — most corporate sites, news portals, documentation pages, marketplaces without dedicated bot defence.

What this does not defeat: premium anti-bot services like Cloudflare Pro Challenge, PerimeterX / HUMAN, DataDome, Akamai Bot Manager, Imperva Incapsula. For those sites, configure the captcha solver below or use commercial unblocking services.

🧠 Captcha solver integration

Plug in your 2Captcha / Anti-Captcha / CapMonster API key in captchaSolverApiKey and the crawler will automatically detect and solve:

Cloudflare Turnstile
reCAPTCHA v2 / v3
hCaptcha
Image / text challenges

Captcha-protected pages get a second pass after the solver returns the token.

📊 Built-in analytics

Every crawl produces a comprehensive report saved to the run's Key-Value store as ANALYTICS_REPORT:

Success rate, error breakdown by category
Response time percentiles (avg, median, p95, p99)
HTTP vs Browser distribution + adaptive switch count
Per-domain crawl speed
Total bandwidth consumed
Top 10 errors with URLs (for debugging)
Depth distribution histogram

No extra setup — just one toggle (enableAnalytics, default on).
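
For readers who want to reproduce the response-time figures from raw timings, here is a minimal sketch. It assumes the nearest-rank percentile method; the report may use a different interpolation, and the function names are made up for this example.

```javascript
// Nearest-rank percentile over an ascending list of response times (ms).
// With this method the median of an even-sized list is the lower middle value.
function percentile(sortedValues, p) {
  if (sortedValues.length === 0) return null;
  const rank = Math.ceil((p / 100) * sortedValues.length);
  return sortedValues[Math.max(rank, 1) - 1];
}

function summarizeResponseTimes(timesMs) {
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...timesMs].sort((a, b) => a - b);
  const sum = sorted.reduce((acc, t) => acc + t, 0);
  return {
    avg: sorted.length ? sum / sorted.length : null,
    median: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
  };
}
```

Comparing p95 and p99 with the median shows whether a slow crawl is caused by a handful of outlier pages or by uniformly slow responses.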

✅ Data quality score

Every item gets a dataQuality block — a 0–100 score plus a breakdown of what's missing:

{
  "isValid": true,
  "score": 100,
  "missingFields": [],
  "emptyFields": [],
  "warnings": []
}

score — 0–100 completeness rating. With requiredFields set, it's the percentage of those fields that are present and non-empty. Without requiredFields, the score reflects general data issues (encoding, whitespace-only values, etc.).
isValid — true when all required fields are present and non-empty.
missingFields / emptyFields — which required fields didn't make it.
warnings — encoding issues, replacement characters, whitespace-only values.

Validate against any field from the merged item (top-level and your page function's return value). Filter or sort your dataset by dataQuality.score to surface the best results first.
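
The requiredFields rule described above can be sketched like this. It is a hypothetical re-implementation for illustration only: the function name is invented, warnings are not modelled, and the no-requiredFields case simply returns 100 rather than checking for general data issues.

```javascript
// Hypothetical re-implementation of the documented scoring rule.
function scoreItem(item, requiredFields) {
  const missingFields = [];
  const emptyFields = [];
  for (const field of requiredFields) {
    const value = item[field];
    if (value === undefined || value === null) {
      missingFields.push(field);
    } else if (
      (typeof value === 'string' && value.trim() === '') ||
      (Array.isArray(value) && value.length === 0)
    ) {
      emptyFields.push(field);
    }
  }
  const failed = missingFields.length + emptyFields.length;
  // Score is the percentage of required fields that are present and non-empty.
  const score = requiredFields.length === 0
    ? 100 // general data checks are not modelled in this sketch
    : Math.round(((requiredFields.length - failed) / requiredFields.length) * 100);
  return { isValid: failed === 0, score, missingFields, emptyFields, warnings: [] };
}
```

Under this rule, an item with one of three required fields filled in scores 33 and is reported as invalid.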

🪲 Error debugging

When a page fails:

Categorized error type (Timeout, Blocked, RateLimited, Navigation, JsError, …)
Optional HTML snapshot saved to Key-Value store (saveSnapshots: true, default on)
Optional screenshot in browser mode (saveScreenshotPerPage: true)

Makes "why did this page fail?" answerable in seconds.

💡 Use cases

Audience	Example use
Migration from PhantomJS Crawler	Near drop-in replacement: familiar `pageFunction` shape, same `startUrls`/`globs` filters
SEO research	Crawl a domain, extract Open Graph + Twitter + JSON-LD metadata at scale
Lead generation	Extract contacts (emails, phones, social profiles) from company sites
Content monitoring	Watch a list of URLs for changes — adaptive rendering catches client-side updates
AI training data	Crawl reference sites, output clean text per page for embeddings
QA / compliance	Validate every page in a site against a schema (e.g. "every product page must have `price`, `availability`, `sku`")

🚀 Quick start

Minimum input — start from a URL, accept all defaults:

{
  "startUrls": [{"url": "https://example.com/"}]
}

This crawls example.com in adaptive mode with the defaults: up to 100 pages, maximum depth 10, metadata extraction on, and output validation on. The default page function returns { url, title, description, h1, canonicalUrl }.

📥 Common inputs

Limit the crawl

{
  "startUrls": [{"url": "https://example.com/"}],
  "maxCrawlPages": 50,
  "maxCrawlDepth": 3,
  "maxConcurrency": 10
}

Filter URLs with globs

{
  "startUrls": [{"url": "https://example.com/"}],
  "globs": [{"glob": "https://example.com/products/**"}],
  "excludeGlobs": [
    {"glob": "**/login**"},
    {"glob": "**/account/**"}
  ]
}
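
To get a feel for which URLs the filters above would let through, here is a deliberately simplified matcher. It is an assumption-laden sketch, not the actor's matcher: it only understands `*` and `**`, treats a URL as eligible when it matches at least one glob and no exclude glob, and real glob libraries handle more syntax and edge cases.

```javascript
// Simplified glob matcher for illustration; real matching may differ on edge cases.
function globToRegExp(glob) {
  let pattern = '';
  for (let i = 0; i < glob.length; i++) {
    const ch = glob[i];
    if (ch === '*') {
      if (glob[i + 1] === '*') {
        pattern += '.*'; // "**" matches any characters, including "/"
        i++;
      } else {
        pattern += '[^/]*'; // "*" stays within one path segment
      }
    } else {
      // Escape characters that have a special meaning in regular expressions.
      pattern += ch.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
    }
  }
  return new RegExp(`^${pattern}$`);
}

function shouldEnqueue(url, globs, excludeGlobs) {
  const matches = (list) => list.some(({ glob }) => globToRegExp(glob).test(url));
  return matches(globs) && !matches(excludeGlobs);
}

const globs = [{ glob: 'https://example.com/products/**' }];
const excludeGlobs = [{ glob: '**/login**' }, { glob: '**/account/**' }];
console.log(shouldEnqueue('https://example.com/products/shoes/42', globs, excludeGlobs));
console.log(shouldEnqueue('https://example.com/products/account/settings', globs, excludeGlobs));
```

In this sketch exclusions take priority over inclusions, so a product URL that also contains /account/ is skipped.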

Custom page function (extract structured data)

{
  "startUrls": [{"url": "https://quotes.toscrape.com/"}],
  "maxCrawlPages": 10,
  "pageFunction": "async function pageFunction(context) {\n  const { $, request } = context;\n  const quotes = [];\n  $('div.quote').each((i, el) => {\n    quotes.push({\n      text: $(el).find('span.text').text(),\n      author: $(el).find('small.author').text()\n    });\n  });\n  return { url: request.loadedUrl, quotes };\n}"
}
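
The pageFunction above is a JavaScript function serialized into a JSON string, which is awkward to escape by hand. One way to avoid that, assuming you build the input from a script, is to write the function normally and let JSON.stringify do the escaping. The sketch below reuses the same function; only the way the input is assembled is new.

```javascript
// Written as a normal function so editors can lint and format it.
async function pageFunction(context) {
  const { $, request } = context;
  const quotes = [];
  $('div.quote').each((i, el) => {
    quotes.push({
      text: $(el).find('span.text').text(),
      author: $(el).find('small.author').text(),
    });
  });
  return { url: request.loadedUrl, quotes };
}

// Function.prototype.toString() returns the function's source text;
// JSON.stringify then escapes the newlines and quotes for the input JSON.
const input = {
  startUrls: [{ url: 'https://quotes.toscrape.com/' }],
  maxCrawlPages: 10,
  pageFunction: pageFunction.toString(),
};

console.log(JSON.stringify(input, null, 2));
```

The printed JSON can be pasted into the actor's input as-is.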

Force browser mode (for JS-heavy sites)

{
  "startUrls": [{"url": "https://app.example.com/"}],
  "crawlerMode": "browser",
  "waitUntil": "networkidle"
}

Premium anti-bot site (captcha solver)

{
  "startUrls": [{"url": "https://protected-site.com/"}],
  "crawlerMode": "browser",
  "stealthMode": true,
  "captchaSolverApiKey": "YOUR_2CAPTCHA_KEY"
}

📤 Output

One item per crawled page. Top-level fields the actor always populates:

Field	Type	Description
`url`	string	Original request URL
`loadedUrl`	string	Final URL after redirects
`httpStatus`	number	Final HTTP status code
`title`	string	`<title>` of the page
`type`	string	`"StartUrl"` or `"FoundLink"`
`referrerUrl`	string or null	Page that linked to this one (`null` for start URLs)
`crawlDepth`	number	Link distance from a start URL
`crawlerMode`	string	`"http"` or `"browser"` for this specific page
`loadTimeMs`	number	Page load time
`downloadedBytes`	number	Response body size
`requestedAt` / `loadingFinishedAt` / `timestamp`	ISO date	Per-request timing
`requestId`	string	Unique ID for cross-reference
`retryCount`	number	Retries before success
`method`	string	HTTP method used
`responseHeaders`	object	All response headers
`pageFunctionResult`	object	Whatever your `pageFunction` returned
`metadata`	object	`og`, `twitter`, `jsonLd`, `meta` (if `extractMetadata: true`)
`contacts`	object	`emails`, `phones`, `socialLinks` (if `extractContacts: true`)
`cleanText`	string	Boilerplate-free body text (if `extractCleanText: true`)
`dataQuality`	object	`{isValid, score, missingFields, emptyFields, warnings}` (if `dataQualityChecks: true`)

💰 Pricing

Pay only for what you use:

Event	Price
Actor start	$0.01 per run
Page scraped	$0.01 per page saved to the dataset

Example runs:

50 pages → $0.51
500 pages → $5.01
5,000 pages → $50.01
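
The example figures follow directly from the two event prices in the table above. A small estimator, with a function name chosen for this sketch, makes the arithmetic explicit; it covers the actor's own events only, not platform compute or storage.

```javascript
// Event prices from the pricing table, in whole cents to avoid floating-point drift.
const ACTOR_START_CENTS = 1;  // $0.01 per run
const PAGE_SCRAPED_CENTS = 1; // $0.01 per page saved to the dataset

function estimateCostUsd(pagesSaved) {
  const cents = ACTOR_START_CENTS + PAGE_SCRAPED_CENTS * pagesSaved;
  return (cents / 100).toFixed(2);
}

for (const pages of [50, 500, 5000]) {
  console.log(`${pages} pages: $${estimateCostUsd(pages)}`);
}
```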

Proxies included. No configuration required — the actor handles proxy setup automatically so the scraper works out of the box. You can supply your own proxy URLs in proxyConfiguration if you need a specific country, ISP, or your own provider.

Platform compute and storage are billed separately by Apify at standard rates.

❓ FAQ

Is this a drop-in replacement for the Legacy PhantomJS Crawler? Input shape is intentionally close — startUrls, pageFunction, globs, excludeGlobs, pseudoUrls, linkSelector all map directly. The main difference is that pageFunction now runs in a Node.js / Playwright context with full ES2022+ support instead of PhantomJS's ES5.1.

Does adaptive mode always pick the right renderer? It catches the vast majority of cases (pure SPAs, partial SPAs with inline data, framework root mount points, sites with "enable JavaScript" notices). Rare edge cases that look static but render content via deep XHR chains may still slip through; explicitly set crawlerMode: "browser" for those.

Does the stealth mode bypass Cloudflare / DataDome / PerimeterX? No. The built-in stealth defeats basic detection (webdriver flag, fingerprint patterns, header analysis). Premium anti-bot services use TLS fingerprinting, behavioural analysis, and challenge pages that need either a captcha solver (configure captchaSolverApiKey) or a commercial unblocking service. This limit is stated deliberately: most crawlers in this price range share it.

Why are there two ways to define link patterns (globs vs pseudoUrls)? globs is the modern, recommended format. pseudoUrls exists for backward compatibility with the Legacy PhantomJS Crawler so existing input templates still work.

Can I send POST requests? Yes — give each start URL a method and payload:

{
  "startUrls": [
    {"url": "https://api.example.com/search", "method": "POST", "payload": "{\"query\": \"x\"}"}
  ]
}

Does it respect robots.txt? Yes, by default. Set ignoreRobotsTxt: true to override; you then take responsibility for complying with the target site's terms of service.

Can I run scheduled crawls? Yes — use Apify's built-in Schedules. Great for daily site monitoring or weekly SEO audits.

Where do screenshots and snapshots go? To the run's default Key-Value store. Keys are snapshot-<requestId>.html and screenshot-<requestId>.png. Open the run in Apify Console → Storage → Key-Value store to view them.

🛠️ Maintained by

brilliant_gum — modernising legacy Apify infrastructure since 2025.

Bug reports, feature requests, custom scraping needs — open an issue on this actor's page.

If this actor saved you a migration headache, leave a ⭐ — it helps other ex-PhantomJS users find it.
