Modern Web Crawler: Adaptive + Stealth + Analytics

Pricing: from $10.00 / 1,000 pages scraped

Modern replacement for the Legacy PhantomJS Crawler. Auto HTTP/Browser detection, basic anti-bot stealth, built-in analytics, data quality scoring, captcha solver integration. Modern Chrome + Cheerio engine: no PhantomJS, no abandoned tech. Proxies included.

Developer: Yuliia Kulakova (maintained by Community)

Auto HTTP/Browser detection. Anti-bot stealth. Captcha solver. Proxies included. Production-grade replacement for the Legacy PhantomJS Crawler: no PhantomJS, no abandoned browser engine.


🔥 Why migrate from PhantomJS Crawler?

The Legacy PhantomJS Crawler ships with a browser engine abandoned in 2018. It only speaks ES5.1, can't render modern HTML5/CSS3, is trivially fingerprinted by every anti-bot system, and receives no security updates.

This actor is what PhantomJS Crawler users have been asking for: a modern crawler with a familiar input shape (startUrls, pageFunction, globs/excludeGlobs), backed by real Chrome + Cheerio under the hood.


✨ What you get

🎯 Adaptive crawler mode

Pick HTTP, Browser, or let the crawler decide:

  • HTTP (Cheerio): fast and cheap, ~10–20× lighter than browser mode. Great for static pages, news sites, blogs, REST APIs.
  • Browser (Playwright + Chromium): full JavaScript rendering. Required for SPAs, React/Vue/Next.js, infinite scroll, sites that fetch content via XHR.
  • Adaptive (default): fetches every page with HTTP first. If the page is detected as JavaScript-rendered (empty content, SPA shell, inline var data=[...] patterns, fetch/XHR-driven rendering), the crawler remembers the domain and switches it to Browser mode for the rest of the crawl. Best of both worlds: no wasted browser launches on static content.
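
The adaptive decision described above can be pictured with a small sketch. This is not the actor's actual detection code: the function and class names (looksJsRendered, AdaptiveRouter), the 50-character threshold, and the regular expressions are all assumptions chosen to mirror the signals listed in the Adaptive bullet.

```javascript
// Hypothetical heuristic: does an HTTP-fetched page look JavaScript-rendered?
// Thresholds and patterns are illustrative assumptions, not the actor's real rules.
function looksJsRendered(html) {
  // Strip scripts, styles and tags to estimate how much visible text exists.
  const visibleText = html
    .replace(/<script[\s\S]*?<\/script>/gi, ' ')
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();

  const emptyContent = visibleText.length < 50; // assumed threshold
  const spaShell = /<div[^>]+id=["'](root|app|__next)["'][^>]*>\s*<\/div>/i.test(html);
  const inlineData = /\bvar\s+data\s*=\s*\[/.test(html);
  return emptyContent || spaShell || inlineData;
}

// Remembers which domains have been switched to browser mode for the rest of the crawl.
class AdaptiveRouter {
  constructor() {
    this.browserDomains = new Set();
  }

  // Mode to use for a URL, based on what was learned about its domain so far.
  modeFor(url) {
    return this.browserDomains.has(new URL(url).hostname) ? 'browser' : 'http';
  }

  // Call after an HTTP fetch; returns the mode to use for this domain from now on.
  recordHttpResult(url, html) {
    if (looksJsRendered(html)) {
      this.browserDomains.add(new URL(url).hostname);
    }
    return this.modeFor(url);
  }
}
```

The per-domain Set is what avoids repeated browser launches on static sites: a domain pays for a browser only after one of its pages has looked JavaScript-rendered.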

🥷 Anti-bot stealth (for sites with basic protection)

Built-in patches that defeat common detection patterns:

  • navigator.webdriver flag masking
  • Realistic plugin / language / permission fingerprints
  • Rotating, browser-realistic User-Agent + full Sec-Fetch-* header set
  • Session pool with cookie persistence across requests
  • Per-request random delays with human-like jitter
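
As an illustration of the last bullet, per-request delays with jitter might look like the sketch below. The helper names and the baseMs / jitterMs parameters are hypothetical and are not actor inputs; the actor's real timing logic is not documented here.

```javascript
// Hypothetical helper: a base delay plus a random extra, so requests are not evenly spaced.
// baseMs and jitterMs are illustrative parameters, not actor inputs.
function jitteredDelayMs(baseMs, jitterMs, random = Math.random) {
  return Math.round(baseMs + random() * jitterMs);
}

function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Visit URLs one by one, pausing for a randomised interval after each request.
async function visitWithDelays(urls, visit, { baseMs = 500, jitterMs = 1500 } = {}) {
  const results = [];
  for (const url of urls) {
    results.push(await visit(url));
    await sleep(jitteredDelayMs(baseMs, jitterMs));
  }
  return results;
}
```

Passing the random source as a parameter keeps the delay calculation deterministic in tests.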

What this defeats: sites with basic to moderate detection, such as most corporate sites, news portals, documentation pages, and marketplaces without dedicated bot defence.

What this does not defeat: premium anti-bot services like Cloudflare Pro Challenge, PerimeterX / HUMAN, DataDome, Akamai Bot Manager, Imperva Incapsula. For those sites, configure the captcha solver below or use commercial unblocking services.

🧠 Captcha solver integration

Plug in your 2Captcha / Anti-Captcha / CapMonster API key in captchaSolverApiKey and the crawler will automatically detect and solve:

  • Cloudflare Turnstile
  • reCAPTCHA v2 / v3
  • hCaptcha
  • Image / text challenges

Captcha-protected pages get a second pass after the solver returns the token.

📊 Built-in analytics

Every crawl produces a comprehensive report saved to the run's Key-Value store as ANALYTICS_REPORT:

  • Success rate, error breakdown by category
  • Response time statistics (average, median, p95, p99)
  • HTTP vs Browser distribution + adaptive switch count
  • Per-domain crawl speed
  • Total bandwidth consumed
  • Top 10 errors with URLs (for debugging)
  • Depth distribution histogram

No extra setup: just one toggle (enableAnalytics, default on).
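
To make the response-time figures in the report concrete, here is a minimal sketch of how average, median, p95 and p99 can be derived from a list of load times. It uses the nearest-rank percentile definition, which is an assumption; the actor may interpolate differently, and the function names are hypothetical.

```javascript
// Nearest-rank percentile over an ascending-sorted array. This is one common
// definition; the actor's report may use a different interpolation.
function percentile(sortedValues, p) {
  if (sortedValues.length === 0) return null;
  const rank = Math.ceil((p * sortedValues.length) / 100);
  return sortedValues[Math.max(rank, 1) - 1];
}

// Summarise load times (milliseconds) the way the report lists them.
function summarizeResponseTimes(loadTimesMs) {
  const sorted = [...loadTimesMs].sort((a, b) => a - b);
  const total = sorted.reduce((sum, value) => sum + value, 0);
  return {
    avg: sorted.length ? total / sorted.length : null,
    median: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
  };
}
```

The same calculation can be run over the loadTimeMs field of dataset items to cross-check a report.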

✅ Data quality score

Every item gets a dataQuality block with a 0–100 score and a breakdown of what's missing:

{
  "isValid": true,
  "score": 100,
  "missingFields": [],
  "emptyFields": [],
  "warnings": []
}

  • score: 0–100 completeness rating. With requiredFields set, it's the percentage of those fields that are present and non-empty. Without requiredFields, the score reflects general data issues (encoding, whitespace-only values, etc.).
  • isValid: true when all required fields are present and non-empty.
  • missingFields / emptyFields: which required fields didn't make it.
  • warnings: encoding issues, replacement characters, whitespace-only values.

Validate against any field from the merged item (top-level and your page function's return value). Filter or sort your dataset by dataQuality.score to surface the best results first.
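
The requiredFields rule can be sketched as follows. This is an illustration under stated assumptions, not the actor's implementation: treating null/undefined as "missing" and blank strings or empty arrays as "empty" is a guess, warnings are left empty, and the case without requiredFields is simplified to a score of 100. The function names are hypothetical.

```javascript
// Sketch of the requiredFields scoring rule. The missing/empty distinction and the
// score of 100 when no fields are required are assumptions of this example.
function checkDataQuality(item, requiredFields) {
  const missingFields = [];
  const emptyFields = [];
  for (const field of requiredFields) {
    const value = item[field];
    if (value === undefined || value === null) {
      missingFields.push(field);
    } else if (
      (typeof value === 'string' && value.trim() === '') ||
      (Array.isArray(value) && value.length === 0)
    ) {
      emptyFields.push(field);
    }
  }
  const failed = missingFields.length + emptyFields.length;
  const score = requiredFields.length === 0
    ? 100
    : Math.round(((requiredFields.length - failed) / requiredFields.length) * 100);
  return { isValid: failed === 0, score, missingFields, emptyFields, warnings: [] };
}

// Sort dataset items so the most complete ones come first (does not mutate the input).
function sortByQuality(items) {
  return [...items].sort((a, b) => b.dataQuality.score - a.dataQuality.score);
}
```

For example, an item with one of four required fields filled in would score 25 under this rule.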

🪲 Error debugging

When a page fails:

  • Categorized error type (Timeout, Blocked, RateLimited, Navigation, JsError, …)
  • Optional HTML snapshot saved to Key-Value store (saveSnapshots: true, default on)
  • Optional screenshot in browser mode (saveScreenshotPerPage: true)

Makes "why did this page fail?" answerable in seconds.


💡 Use cases

| Audience | Example use |
| --- | --- |
| Migration from PhantomJS Crawler | Drop-in replacement: same pageFunction shape, same startUrls/globs filters |
| SEO research | Crawl a domain, extract Open Graph + Twitter + JSON-LD metadata at scale |
| Lead generation | Extract contacts (emails, phones, social profiles) from company sites |
| Content monitoring | Watch a list of URLs for changes; adaptive rendering catches client-side updates |
| AI training data | Crawl reference sites, output clean text per page for embeddings |
| QA / compliance | Validate every page in a site against a schema (e.g. "every product page must have price, availability, sku") |

🚀 Quick start

Minimum input: start from a URL, accept all defaults:

{
  "startUrls": [{"url": "https://example.com/"}]
}

This crawls example.com in adaptive mode (up to 100 pages, maximum depth 10), extracts metadata, and validates the output. The default page function returns { url, title, description, h1, canonicalUrl }.


📥 Common inputs

Limit the crawl

{
  "startUrls": [{"url": "https://example.com/"}],
  "maxCrawlPages": 50,
  "maxCrawlDepth": 3,
  "maxConcurrency": 10
}

Filter URLs with globs

{
  "startUrls": [{"url": "https://example.com/"}],
  "globs": [{"glob": "https://example.com/products/**"}],
  "excludeGlobs": [
    {"glob": "**/login**"},
    {"glob": "**/account/**"}
  ]
}
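
To show how include and exclude patterns like the ones above interact, here is a simplified matcher. It is a sketch only: it supports just "**" (any characters) and "*" (any characters except "/"), whereas real glob libraries handle more syntax, and the assumptions that exclusions win and that an empty globs list includes everything are this example's, not documented actor behaviour.

```javascript
// Minimal glob-to-RegExp conversion for illustration: "**" matches any characters,
// "*" matches any characters except "/". Other glob syntax is not supported here.
function globToRegExp(glob) {
  let pattern = '';
  for (let i = 0; i < glob.length; i++) {
    const ch = glob[i];
    if (ch === '*') {
      if (glob[i + 1] === '*') {
        pattern += '.*';
        i++; // consume the second "*"
      } else {
        pattern += '[^/]*';
      }
    } else {
      // Escape characters that are special inside a regular expression.
      pattern += ch.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
    }
  }
  return new RegExp(`^${pattern}$`);
}

// Assumed semantics: a URL must match at least one include glob (if any are given)
// and must not match any exclude glob.
function shouldEnqueue(url, globs, excludeGlobs) {
  const matches = (list) => list.some(({ glob }) => globToRegExp(glob).test(url));
  const included = globs.length === 0 || matches(globs);
  return included && !matches(excludeGlobs);
}
```

With the input above, a product detail URL would be enqueued, while any URL containing /login or /account/ would be skipped even under /products/.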

Custom page function (extract structured data)

{
  "startUrls": [{"url": "https://quotes.toscrape.com/"}],
  "maxCrawlPages": 10,
  "pageFunction": "async function pageFunction(context) {\n const { $, request } = context;\n const quotes = [];\n $('div.quote').each((i, el) => {\n quotes.push({\n text: $(el).find('span.text').text(),\n author: $(el).find('small.author').text()\n });\n });\n return { url: request.loadedUrl, quotes };\n}"
}
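
Because pageFunction is passed as a JSON string, it is easier to read and maintain as ordinary source code. The sketch below shows the same function unescaped and one possible way to generate the input JSON from it; the variable names (input, inputJson) are just for illustration.

```javascript
// The same page function as in the input above, written as normal source code.
async function pageFunction(context) {
  const { $, request } = context;
  const quotes = [];
  $('div.quote').each((i, el) => {
    quotes.push({
      text: $(el).find('span.text').text(),
      author: $(el).find('small.author').text(),
    });
  });
  return { url: request.loadedUrl, quotes };
}

// Build the actor input; JSON.stringify takes care of escaping newlines and quotes.
const input = {
  startUrls: [{ url: 'https://quotes.toscrape.com/' }],
  maxCrawlPages: 10,
  pageFunction: pageFunction.toString(),
};
const inputJson = JSON.stringify(input, null, 2);
console.log(inputJson);
```

Generating the string this way avoids hand-writing \n escapes and makes the function lintable before it is sent to the actor.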

Force browser mode (for JS-heavy sites)

{
  "startUrls": [{"url": "https://app.example.com/"}],
  "crawlerMode": "browser",
  "waitUntil": "networkidle"
}

Premium anti-bot site (captcha solver)

{
  "startUrls": [{"url": "https://protected-site.com/"}],
  "crawlerMode": "browser",
  "stealthMode": true,
  "captchaSolverApiKey": "YOUR_2CAPTCHA_KEY"
}

📤 Output

One item per crawled page. Top-level fields the actor always populates:

| Field | Type | Description |
| --- | --- | --- |
| url | string | Original request URL |
| loadedUrl | string | Final URL after redirects |
| httpStatus | number | Final HTTP status code |
| title | string | <title> of the page |
| type | string | "StartUrl" or "FoundLink" |
| referrerUrl | string | Page that linked to this one (null for start URLs) |
| crawlDepth | number | Link distance from a start URL |
| crawlerMode | string | "http" or "browser" for this specific page |
| loadTimeMs | number | Page load time in milliseconds |
| downloadedBytes | number | Response body size in bytes |
| requestedAt / loadingFinishedAt / timestamp | ISO date | Per-request timing |
| requestId | string | Unique ID for cross-reference |
| retryCount | number | Retries before success |
| method | string | HTTP method used |
| responseHeaders | object | All response headers |
| pageFunctionResult | object | Whatever your pageFunction returned |
| metadata | object | og, twitter, jsonLd, meta (if extractMetadata: true) |
| contacts | object | emails, phones, socialLinks (if extractContacts: true) |
| cleanText | string | Boilerplate-free body text (if extractCleanText: true) |
| dataQuality | object | {isValid, score, missingFields, emptyFields, warnings} (if dataQualityChecks: true) |

💰 Pricing

Pay only for what you use:

| Event | Price |
| --- | --- |
| Actor start | $0.01 per run |
| Page scraped | $0.01 per page saved to the dataset |

Example runs:

  • 50 pages → $0.51
  • 500 pages → $5.01
  • 5,000 pages → $50.01
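
The example figures follow from one actor start plus one page event per saved page. A small sketch of that arithmetic, with hypothetical constant and function names, and excluding platform compute and storage:

```javascript
// Event prices from the pricing table, kept in cents to avoid floating-point drift.
const ACTOR_START_CENTS = 1;
const PAGE_SCRAPED_CENTS = 1;

// Estimated actor charges in USD for a number of saved pages across a number of runs.
function estimateCostUsd(pagesSaved, runs = 1) {
  const cents = runs * ACTOR_START_CENTS + pagesSaved * PAGE_SCRAPED_CENTS;
  return cents / 100;
}

for (const pages of [50, 500, 5000]) {
  console.log(`${pages} pages: $${estimateCostUsd(pages).toFixed(2)}`);
}
```

Splitting the same crawl over several runs adds one start fee per run, so 500 pages over 10 runs would come to $5.10 under this formula.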

Proxies included. No configuration required: the actor handles proxy setup automatically so the scraper works out of the box. You can supply your own proxy URLs in proxyConfiguration if you need a specific country, ISP, or your own provider.

Platform compute and storage are billed separately by Apify at standard rates.


❓ FAQ

Is this a drop-in replacement for the Legacy PhantomJS Crawler? Input shape is intentionally close: startUrls, pageFunction, globs, excludeGlobs, pseudoUrls, linkSelector all map directly. The main difference is that pageFunction now runs in a Node.js / Playwright context with full ES2022+ support instead of PhantomJS's ES5.1.

Does adaptive mode always pick the right renderer? It catches the vast majority of cases (pure SPAs, partial-SPAs with inline data, framework root mount points, sites with "enable JavaScript" notices). Rare edge cases that look static but render content via deep XHR chains may still slip through; explicitly set crawlerMode: "browser" for those.

Does the stealth mode bypass Cloudflare / DataDome / PerimeterX? No. The built-in stealth defeats basic detection (webdriver flag, fingerprint patterns, header analysis). Premium anti-bot services use TLS fingerprinting, behavioural analysis, and challenge pages that need either a captcha solver (configure captchaSolverApiKey) or commercial unblocking services. This is honest scope: most crawlers in this price range have the same limitation.

Why are there two ways to define link patterns (globs vs pseudoUrls)? globs is the modern, recommended format. pseudoUrls exists for backward compatibility with the Legacy PhantomJS Crawler so existing input templates still work.

Can I send POST requests? Yes. Give each start URL a method and payload:

{
  "startUrls": [
    {"url": "https://api.example.com/search", "method": "POST", "payload": "{\"query\": \"x\"}"}
  ]
}
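
Since payload is a string containing JSON, hand-escaping it is error-prone. One way to avoid that, sketched here with a hypothetical helper (postStartUrl), is to serialise the body programmatically:

```javascript
// Hypothetical helper: build a POST start URL entry with a JSON-encoded payload.
function postStartUrl(url, body) {
  return { url, method: 'POST', payload: JSON.stringify(body) };
}

const postInput = {
  startUrls: [postStartUrl('https://api.example.com/search', { query: 'x' })],
};
console.log(JSON.stringify(postInput, null, 2));
```

The printed input carries the same payload as the hand-written example, with the inner quotes escaped automatically.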

Does it respect robots.txt? Yes by default. Set ignoreRobotsTxt: true to override (you take responsibility for compliance with the target site's terms of service).

Can I run scheduled crawls? Yes, use Apify's built-in Schedules. Great for daily site monitoring or weekly SEO audits.

Where do screenshots and snapshots go? To the run's default Key-Value store. Keys are snapshot-<requestId>.html and screenshot-<requestId>.png. Open the run in Apify Console → Storage → Key-Value store to view them.
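
For scripted access, the key pattern above can be combined with the requestId from a dataset item. The sketch below builds the record keys and a download URL using the Apify API v2 key-value store record endpoint; the helper names are hypothetical, the store ID must come from your own run, and private stores additionally need an API token, which is not shown here.

```javascript
// Record keys follow the naming pattern stated in the answer above.
function snapshotKey(requestId) {
  return `snapshot-${requestId}.html`;
}

function screenshotKey(requestId) {
  return `screenshot-${requestId}.png`;
}

// URL of a record in a key-value store (Apify API v2). storeId is the ID of the
// run's default key-value store; authentication is left out of this sketch.
function recordUrl(storeId, key) {
  return `https://api.apify.com/v2/key-value-stores/${encodeURIComponent(storeId)}/records/${encodeURIComponent(key)}`;
}
```

Given a failed item's requestId, recordUrl(storeId, snapshotKey(requestId)) points at its saved HTML snapshot.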


🛠️ Maintained by

brilliant_gum, modernising legacy Apify infrastructure since 2025.

Bug reports, feature requests, custom scraping needs: open an issue on this actor's page.

If this actor saved you a migration headache, leave a ⭐. It helps other ex-PhantomJS users find it.