Modern Web Crawler - Adaptive + Stealth + Analytics
Pricing
from $10.00 / 1,000 pages scraped
Modern replacement for the Legacy PhantomJS Crawler. Auto HTTP/Browser detection, basic anti-bot stealth, built-in analytics, data quality scoring, captcha solver integration. Modern Chrome + Cheerio engine: no PhantomJS, no abandoned tech. Proxies included.
Developer
Yuliia Kulakova
Maintained by
Community
Actor stats
Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 2 days ago
Auto HTTP/Browser detection. Anti-bot stealth. Captcha solver. Proxies included. Production-grade replacement for the Legacy PhantomJS Crawler: no PhantomJS, no abandoned browser engine.

Why migrate from PhantomJS Crawler?
The Legacy PhantomJS Crawler ships with a browser engine whose development was abandoned in 2018. It only supports ES5.1, can't render modern HTML5/CSS3, is easily fingerprinted by anti-bot systems, and receives no security updates.
This actor is what PhantomJS Crawler users have been asking for: a modern crawler with a familiar input shape (startUrls, pageFunction, globs/excludeGlobs), backed by real Chrome + Cheerio under the hood.
What you get
Adaptive crawler mode
Pick HTTP, Browser, or let the crawler decide:
- HTTP (Cheerio): fast and cheap, ~10-20× lighter than browser mode. Great for static pages, news sites, blogs, REST APIs.
- Browser (Playwright + Chromium): full JavaScript rendering. Required for SPAs, React/Vue/Next.js, infinite scroll, sites that fetch content via XHR.
- Adaptive (default): fetches each page with HTTP first. If the page is detected as JavaScript-rendered (empty content, SPA shell, inline var data=[...] patterns, fetch/XHR-driven rendering), the crawler remembers the domain and switches it to Browser mode for the rest of the crawl. Best of both worlds: no wasted browser launches on static content.
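The adaptive decision described above can be sketched roughly as follows. This is an illustrative heuristic only, not the actor's actual implementation: the names looksJsRendered and chooseMode, the 200-character threshold, and the regular expressions are all assumptions made for the example.

```javascript
// Hypothetical sketch of adaptive mode. Names, thresholds, and patterns are
// assumptions for illustration, not the actor's real internals.
const browserDomains = new Set(); // domains already switched to Browser mode

function looksJsRendered(html) {
  // Strip scripts, styles, and tags to estimate how much visible text exists.
  const visibleText = html
    .replace(/<script[\s\S]*?<\/script>/gi, ' ')
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
  const emptyContent = visibleText.length < 200; // assumed threshold
  // An empty framework mount point such as <div id="root"></div>.
  const spaShell = /<div[^>]+id=["'](root|app|__next)["'][^>]*>\s*<\/div>/i.test(html);
  // Inline data that a client-side script renders later.
  const inlineData = /var\s+data\s*=\s*\[/.test(html);
  return emptyContent || spaShell || inlineData;
}

function chooseMode(url, html) {
  const domain = new URL(url).hostname;
  if (browserDomains.has(domain)) return 'browser';
  if (looksJsRendered(html)) {
    browserDomains.add(domain); // remember the domain for the rest of the crawl
    return 'browser';
  }
  return 'http';
}
```

The key design point is the per-domain memory: once one page of a domain looks JavaScript-rendered, later pages on that domain skip the HTTP attempt entirely.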
Anti-bot stealth (for sites with basic protection)
Built-in patches that defeat common detection patterns:
- navigator.webdriver flag masking
- Realistic plugin / language / permission fingerprints
- Rotating, browser-realistic User-Agent + full Sec-Fetch-* header set
- Session pool with cookie persistence across requests
- Per-request random delays with human-like jitter
What this defeats: sites with basic to moderate detection, such as most corporate sites, news portals, documentation pages, and marketplaces without dedicated bot defence.
What this does not defeat: premium anti-bot services like Cloudflare Pro Challenge, PerimeterX / HUMAN, DataDome, Akamai Bot Manager, Imperva Incapsula. For those sites, configure the captcha solver below or use commercial unblocking services.
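Of the stealth measures listed above, per-request random delays with jitter are the easiest to illustrate. The sketch below is an assumption about how such a delay could be computed; jitteredDelayMs, sleep, and crawlPolitely are hypothetical names, and the actor's real delay distribution is not documented here.

```javascript
// Hypothetical helper: a base delay plus uniform random jitter, in milliseconds.
// The injectable `random` parameter only exists to make the function testable.
function jitteredDelayMs(baseMs, jitterRatio = 0.5, random = Math.random) {
  // Spread the delay uniformly over [base * (1 - ratio), base * (1 + ratio)].
  const spread = baseMs * jitterRatio;
  return Math.round(baseMs - spread + random() * 2 * spread);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch URLs one by one, pausing for a jittered interval between requests.
// `fetchOne` is a caller-supplied async function (an assumption of this sketch).
async function crawlPolitely(urls, fetchOne, baseMs = 1000) {
  const results = [];
  for (const url of urls) {
    results.push(await fetchOne(url));
    await sleep(jitteredDelayMs(baseMs));
  }
  return results;
}
```

With the defaults, each pause lands somewhere between 500 ms and 1500 ms, which avoids the fixed request cadence that simple rate-based detectors look for.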
Captcha solver integration
Plug in your 2Captcha / Anti-Captcha / CapMonster API key in captchaSolverApiKey and the crawler will automatically detect and solve:
- Cloudflare Turnstile
- reCAPTCHA v2 / v3
- hCaptcha
- Image / text challenges
Captcha-protected pages get a second pass after the solver returns the token.
Built-in analytics
Every crawl produces a comprehensive report saved to the run's Key-Value store as ANALYTICS_REPORT:
- Success rate, error breakdown by category
- Response time statistics (avg, median, p95, p99)
- HTTP vs Browser distribution + adaptive switch count
- Per-domain crawl speed
- Total bandwidth consumed
- Top 10 errors with URLs (for debugging)
- Depth distribution histogram
No extra setup: just one toggle (enableAnalytics, default on).
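To make the response-time figures in the report concrete, here is a minimal sketch of how avg, median, p95, and p99 can be derived from a list of load times. It uses the nearest-rank percentile method, which is an assumption; the report may compute percentiles differently. The function names are illustrative.

```javascript
// Nearest-rank percentile over an ascending list of response times (ms).
function percentile(sortedValues, p) {
  if (sortedValues.length === 0) return null;
  const rank = Math.ceil((p / 100) * sortedValues.length);
  return sortedValues[Math.max(rank, 1) - 1];
}

// Summarize response times the way the report lists them (assumed method).
function summarizeResponseTimes(timesMs) {
  const sorted = [...timesMs].sort((a, b) => a - b);
  const sum = sorted.reduce((acc, t) => acc + t, 0);
  return {
    avg: sorted.length ? sum / sorted.length : null,
    median: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
  };
}
```

The high percentiles matter more than the average when tuning timeouts, because a handful of slow pages can dominate total crawl time.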
Data quality score
Every item gets a dataQuality block with a 0-100 score plus a breakdown of what's missing:
{"isValid": true,"score": 100,"missingFields": [],"emptyFields": [],"warnings": []}
- score: 0-100 completeness rating. With requiredFields set, it's the percentage of those fields that are present and non-empty. Without requiredFields, the score reflects general data issues (encoding, whitespace-only values, etc.).
- isValid: true when all required fields are present and non-empty.
- missingFields / emptyFields: which required fields didn't make it.
- warnings: encoding issues, replacement characters, whitespace-only values.
Validate against any field from the merged item (top-level and your page function's return value). Filter or sort your dataset by dataQuality.score to surface the best results first.
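The required-fields rule above can be approximated in a few lines. This is a simplified sketch, not the actor's scoring code: scoreItem is a hypothetical name, warnings are left empty, and the no-requiredFields case simply returns 100 rather than checking for general data issues.

```javascript
// Hypothetical re-implementation of the required-fields scoring rule.
// Simplifications: no warnings, and a fixed score of 100 without required fields.
function scoreItem(item, requiredFields) {
  const missingFields = [];
  const emptyFields = [];
  for (const field of requiredFields) {
    const value = item[field];
    if (value === undefined || value === null) {
      missingFields.push(field); // field is absent altogether
    } else if (typeof value === 'string' && value.trim() === '') {
      emptyFields.push(field); // field exists but holds only whitespace
    }
  }
  const failed = missingFields.length + emptyFields.length;
  const score = requiredFields.length === 0
    ? 100
    : Math.round(((requiredFields.length - failed) / requiredFields.length) * 100);
  return { isValid: failed === 0, score, missingFields, emptyFields, warnings: [] };
}
```

For a product item with four required fields where one is absent and one is blank, this yields a score of 50 and isValid set to false.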
Error debugging
When a page fails:
- Categorized error type (Timeout, Blocked, RateLimited, Navigation, JsError, …)
- Optional HTML snapshot saved to Key-Value store (saveSnapshots: true, default on)
- Optional screenshot in browser mode (saveScreenshotPerPage: true)
Makes "why did this page fail?" answerable in seconds.
Use cases
| Audience | Example use |
|---|---|
| Migration from PhantomJS Crawler | Drop-in replacement: same pageFunction shape, same startUrls/globs filters |
| SEO research | Crawl a domain, extract Open Graph + Twitter + JSON-LD metadata at scale |
| Lead generation | Extract contacts (emails, phones, social profiles) from company sites |
| Content monitoring | Watch a list of URLs for changes; adaptive rendering catches client-side updates |
| AI training data | Crawl reference sites, output clean text per page for embeddings |
| QA / compliance | Validate every page in a site against a schema (e.g. "every product page must have price, availability, sku") |
Quick start
Minimum input (start from a URL, accept all defaults):
{"startUrls": [{"url": "https://example.com/"}]}
This crawls example.com in adaptive mode (max 100 pages, max depth 10), extracts metadata, and validates the output. The default page function returns { url, title, description, h1, canonicalUrl }.
Common inputs
Limit the crawl
{"startUrls": [{"url": "https://example.com/"}],"maxCrawlPages": 50,"maxCrawlDepth": 3,"maxConcurrency": 10}
Filter URLs with globs
{"startUrls": [{"url": "https://example.com/"}],"globs": [{"glob": "https://example.com/products/**"}],"excludeGlobs": [{"glob": "**/login**"},{"glob": "**/account/**"}]}
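To see how the globs and excludeGlobs in the input above interact, here is a deliberately simplified matcher. It assumes that "**" matches any characters and "*" matches any characters except "/"; the glob library the actor actually uses may treat edge cases differently. globToRegExp and shouldEnqueue are hypothetical names.

```javascript
// Simplified glob matcher (an assumption, not the actor's real matcher):
// "**" matches anything, "*" matches anything except "/".
function globToRegExp(glob) {
  let pattern = '';
  for (let i = 0; i < glob.length; i++) {
    const ch = glob[i];
    if (ch === '*' && glob[i + 1] === '*') {
      pattern += '.*';
      i++; // consume the second "*"
    } else if (ch === '*') {
      pattern += '[^/]*';
    } else {
      // Escape characters that have a special meaning in regular expressions.
      pattern += ch.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
    }
  }
  return new RegExp(`^${pattern}$`);
}

// A URL is followed when it matches an include glob and no exclude glob.
function shouldEnqueue(url, globs, excludeGlobs) {
  const matches = (list) => list.some(({ glob }) => globToRegExp(glob).test(url));
  return matches(globs) && !matches(excludeGlobs);
}
```

Under these assumptions, a product URL is followed, while any URL containing /login or /account/ is skipped even when it sits under /products/.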
Custom page function (extract structured data)
{"startUrls": [{"url": "https://quotes.toscrape.com/"}],"maxCrawlPages": 10,"pageFunction": "async function pageFunction(context) {\n const { $, request } = context;\n const quotes = [];\n $('div.quote').each((i, el) => {\n quotes.push({\n text: $(el).find('span.text').text(),\n author: $(el).find('small.author').text()\n });\n });\n return { url: request.loadedUrl, quotes };\n}"}
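Escaping a function into a JSON string by hand is error-prone. One possible workflow, shown as a sketch: write the page function from the input above as ordinary JavaScript, then let Function.prototype.toString() and JSON.stringify() produce the escaped string. The variable name quotesInput is illustrative.

```javascript
// The same page function as in the JSON input above, written as ordinary code.
// `$` is the Cheerio-style selector and `request` the current request,
// both taken from the context object as in the original example.
async function pageFunction(context) {
  const { $, request } = context;
  const quotes = [];
  $('div.quote').each((i, el) => {
    quotes.push({
      text: $(el).find('span.text').text(),
      author: $(el).find('small.author').text(),
    });
  });
  return { url: request.loadedUrl, quotes };
}

// toString() returns the function's source text; JSON.stringify escapes it.
const quotesInput = {
  startUrls: [{ url: 'https://quotes.toscrape.com/' }],
  maxCrawlPages: 10,
  pageFunction: pageFunction.toString(),
};

console.log(JSON.stringify(quotesInput, null, 2));
```

Keeping the function in a real source file also means a linter or editor can check it before it is pasted into the actor input.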
Force browser mode (for JS-heavy sites)
{"startUrls": [{"url": "https://app.example.com/"}],"crawlerMode": "browser","waitUntil": "networkidle"}
Premium anti-bot site (captcha solver)
{"startUrls": [{"url": "https://protected-site.com/"}],"crawlerMode": "browser","stealthMode": true,"captchaSolverApiKey": "YOUR_2CAPTCHA_KEY"}
Output
One item per crawled page. Top-level fields the actor always populates:
| Field | Type | Description |
|---|---|---|
| url | string | Original request URL |
| loadedUrl | string | Final URL after redirects |
| httpStatus | number | Final HTTP status code |
| title | string | <title> of the page |
| type | string | "StartUrl" or "FoundLink" |
| referrerUrl | string | Page that linked to this one (null for start URLs) |
| crawlDepth | number | Link distance from a start URL |
| crawlerMode | string | "http" or "browser" for this specific page |
| loadTimeMs | number | Page load time |
| downloadedBytes | number | Response body size |
| requestedAt / loadingFinishedAt / timestamp | ISO date | Per-request timing |
| requestId | string | Unique ID for cross-reference |
| retryCount | number | Retries before success |
| method | string | HTTP method used |
| responseHeaders | object | All response headers |
| pageFunctionResult | object | Whatever your pageFunction returned |
| metadata | object | og, twitter, jsonLd, meta (if extractMetadata: true) |
| contacts | object | emails, phones, socialLinks (if extractContacts: true) |
| cleanText | string | Boilerplate-free body text (if extractCleanText: true) |
| dataQuality | object | {isValid, score, missingFields, emptyFields, warnings} (if dataQualityChecks: true) |
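Once items are downloaded from the dataset, the fields in the table above can be post-processed with plain array operations. The sketch below assumes items is an array of output objects already loaded into memory; bestItems, countByMode, and the minScore threshold of 80 are hypothetical.

```javascript
// Keep valid items above a score threshold, ordered best-first.
function bestItems(items, minScore = 80) {
  return items
    .filter((item) => item.dataQuality
      && item.dataQuality.isValid
      && item.dataQuality.score >= minScore)
    .sort((a, b) => b.dataQuality.score - a.dataQuality.score);
}

// Count how many pages were fetched in each crawlerMode ("http" or "browser").
function countByMode(items) {
  const counts = { http: 0, browser: 0 };
  for (const item of items) {
    if (item.crawlerMode in counts) counts[item.crawlerMode] += 1;
  }
  return counts;
}
```

Items without a dataQuality block (for example when dataQualityChecks is off) are dropped by bestItems, so adjust the filter if that is not the desired behaviour.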
Pricing
Pay only for what you use:
| Event | Price |
|---|---|
| Actor start | $0.01 per run |
| Page scraped | $0.01 per page saved to the dataset |
Example runs:
- 50 pages → $0.51
- 500 pages → $5.01
- 5,000 pages → $50.01
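The example totals above follow directly from the two event prices: one actor start plus one charge per saved page. A small sketch of that arithmetic, with estimateCostUsd as a hypothetical helper name; it covers only the two events in the table, not platform compute or storage.

```javascript
// Event prices from the pricing table, kept in integer cents
// to avoid floating-point drift when adding many small amounts.
const ACTOR_START_CENTS = 1; // $0.01 per run
const PAGE_SCRAPED_CENTS = 1; // $0.01 per page saved to the dataset

// Estimated actor charge in USD for a single run, as a two-decimal string.
function estimateCostUsd(pages) {
  const cents = ACTOR_START_CENTS + pages * PAGE_SCRAPED_CENTS;
  return (cents / 100).toFixed(2);
}

console.log(estimateCostUsd(50)); // 0.51
console.log(estimateCostUsd(500)); // 5.01
console.log(estimateCostUsd(5000)); // 50.01
```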
Proxies included. No configuration required: the actor handles proxy setup automatically so the scraper works out of the box. You can supply your own proxy URLs in proxyConfiguration if you need a specific country, ISP, or your own provider.
Platform compute and storage are billed separately by Apify at standard rates.
FAQ
Is this a drop-in replacement for the Legacy PhantomJS Crawler?
Input shape is intentionally close: startUrls, pageFunction, globs, excludeGlobs, pseudoUrls, linkSelector all map directly. The main difference is that pageFunction now runs in a Node.js / Playwright context with full ES2022+ support instead of PhantomJS's ES5.1.
Does adaptive mode always pick the right renderer?
It catches the vast majority of cases (pure SPAs, partial SPAs with inline data, framework root mount points, sites with "enable JavaScript" notices). Rare edge cases that look static but render content via deep XHR chains may still slip through; explicitly set crawlerMode: "browser" for those.
Does the stealth mode bypass Cloudflare / DataDome / PerimeterX?
No. The built-in stealth defeats basic detection (webdriver flag, fingerprint patterns, header analysis). Premium anti-bot services use TLS fingerprinting, behavioural analysis, and challenge pages that need either a captcha solver (configure captchaSolverApiKey) or commercial unblocking services. This is honest scope: most crawlers in this price range have the same limitation.
Why are there two ways to define link patterns (globs vs pseudoUrls)?
globs is the modern, recommended format. pseudoUrls exists for backward compatibility with the Legacy PhantomJS Crawler so existing input templates still work.
Can I send POST requests?
Yes. Give each start URL a method and payload:
{"startUrls": [{"url": "https://api.example.com/search", "method": "POST", "payload": "{\"query\": \"x\"}"}]}
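Note that payload in the input above is a string containing JSON, so the body is escaped twice once the whole input is serialized. A sketch of building it programmatically instead of escaping by hand; postStartUrl and postInput are hypothetical names.

```javascript
// Build a POST start URL. The payload field is itself a JSON string,
// so the request body is stringified before it goes into the input object.
function postStartUrl(url, body) {
  return { url, method: 'POST', payload: JSON.stringify(body) };
}

const postInput = {
  startUrls: [postStartUrl('https://api.example.com/search', { query: 'x' })],
};

// Serializing the input escapes the inner quotes automatically.
console.log(JSON.stringify(postInput));
```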
Does it respect robots.txt?
Yes by default. Set ignoreRobotsTxt: true to override (you take responsibility for compliance with the target site's terms of service).
Can I run scheduled crawls?
Yes. Use Apify's built-in Schedules. Great for daily site monitoring or weekly SEO audits.
Where do screenshots and snapshots go?
To the run's default Key-Value store. Keys are snapshot-<requestId>.html and screenshot-<requestId>.png. Open the run in Apify Console → Storage → Key-Value store to view them.
Maintained by
brilliant_gum: modernising legacy Apify infrastructure since 2025.
Bug reports, feature requests, custom scraping needs: open an issue on this actor's page.
If this actor saved you a migration headache, leave a star rating; it helps other ex-PhantomJS users find it.