Smart Page Fetcher

Pricing

from $0.43 / 1,000 pages fetched (basic tier)

Fetch a batch of URLs adaptively: cheap HTTP for static pages, browser render for JavaScript pages, stealth+residential proxy only for actively defended pages. Pay per URL by the difficulty that actually worked, with browser launches amortized across the batch.

Developer

Scott Helvick

Maintained by Community

Fetch a batch of URLs adaptively. The Actor tries the cheapest method that works on each URL — plain HTTP first, then a real browser if JavaScript is needed, then a stealth + residential-proxy path only for actively defended pages. You pay only for the difficulty each URL actually needed, with browser startup amortized across the whole batch.

What this does

Submit a list of URLs. The Actor walks each URL up an escalation chain until something works, then pushes one dataset record per URL with the requested output formats.

  • Tier 1 — basic HTTP. Plain GET, no JavaScript, no proxy. Fast and cheap. Good for static pages, JSON-LD-heavy product pages, documentation, RSS-style content.
  • Tier 2 — JavaScript render. Real browser, no stealth shims. Loads the page, runs JS, captures the rendered DOM. Good for SPAs and lazy-rendered content.
  • Tier 3 — stealth + residential proxy. Hardened browser session routed through residential IPs in the country of your choice. Used only when the cheaper tiers can't get past bot defenses.

Each tier can be locked on or off per request. The default is auto on all three — escalate from cheapest, stop as soon as a tier returns usable content.
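
As a rough illustration of how the three flags could translate into an escalation chain, here is a minimal Python sketch. It is an interpretation of the flag descriptions, not the Actor's actual code: `build_chain` is a hypothetical helper, and the rule for several `true` flags (start at the most expensive one) is an assumption.

```python
from typing import Dict, List

TIERS = ["basic", "js", "stealth"]  # cheapest to most expensive


def build_chain(flags: Dict[str, str]) -> List[str]:
    """Return the ordered tiers to try for one request.

    Assumed semantics: "false" removes a tier, "true" makes it the starting
    point, "auto" leaves it in the chain. If several tiers are "true", this
    sketch starts at the most expensive of them (an assumption).
    """
    values = {tier: flags.get(tier, "auto") for tier in TIERS}
    allowed = [tier for tier in TIERS if values[tier] != "false"]
    if not allowed:
        raise ValueError("no tier allowed: all three flags are 'false'")
    forced = [tier for tier in allowed if values[tier] == "true"]
    if forced:
        # Skip every tier cheaper than the forced starting tier.
        return allowed[allowed.index(forced[-1]):]
    return allowed
```

With no flags set the chain is all three tiers in order; setting `basic` to `"false"` yields `["js", "stealth"]`, and setting `stealth` to `"true"` yields `["stealth"]` only.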

Output formats are derived from the same fetched HTML, no extra fetch charge per format:

  • html — raw HTML, returned as a Key-Value Store URL (so the dataset record stays small). Byte-for-byte what the target returned — no scripts, banners, or wrappers injected into the response.
  • text — boilerplate-stripped visible text (LLM-friendly)
  • markdown — page content as Markdown
  • links — every anchor as {url, text, title} with relative hrefs resolved
  • json_ld — every <script type="application/ld+json"> block, parsed
  • og — OpenGraph values (title, description, image, url, type, site_name)
  • meta — other meta tags as a flat dict (description, canonical, viewport, twitter:*, etc.)
  • a11y — browser accessibility tree as JSON (tier 2/3 only)
  • screenshot — full-page PNG (tier 2/3 only)

html, a11y, and screenshot are uploaded to the Apify Key-Value Store and the dataset record stores a public URL. Smaller structured outputs stay inline.

Why adaptive tiering matters

The cost gap between fetch methods is enormous. A plain HTTP request takes about 100 ms and costs roughly $0.0005 in resources. A full stealth render against a Cloudflare-protected site can take 30 seconds and burn an entire residential proxy session worth about $0.03, around 60 times the cost. If you pre-commit to one method, you either overpay for easy pages or fail on hard ones.

A worse failure mode: the page renders but is silently wrong. Bot defenses sometimes serve a 200 with a JavaScript challenge interstitial that looks like a page to a naive HTTP client — the LLM downstream gets a string of obfuscated JS instead of the article body and has no way to know.

This Actor solves both ends. It picks the right method per URL — paying basic-tier rates for the 90% of URLs that don't need anything fancy, and only escalating to the expensive paths when the cheaper ones return content that fails an escalation check (known anti-bot markers, JS-required signals, the typical 403/429/503 status codes from bot defenses). The customer never has to think about which tier is right; they just submit URLs.
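
The escalation check described above could look roughly like the following sketch. The status codes come from the text; the marker strings and the function name `should_escalate` are hypothetical placeholders, since the Actor's real marker list is not documented here.

```python
# Status codes the text names as typical bot-defense responses.
BLOCK_STATUSES = {403, 429, 503}

# Hypothetical marker strings, for illustration only.
ANTI_BOT_MARKERS = ("just a moment...", "checking your browser", "captcha")
JS_REQUIRED_MARKERS = ("enable javascript", "javascript is required")


def should_escalate(status: int, html: str) -> bool:
    """Return True when a response looks unusable and a costlier tier should be tried."""
    if status in BLOCK_STATUSES:
        return True
    lowered = html.lower()
    # A 200 response can still be a challenge page or an empty JS shell.
    return any(marker in lowered for marker in ANTI_BOT_MARKERS + JS_REQUIRED_MARKERS)
```

The second check is what catches the "silently wrong" case: a 200 response whose body is a challenge interstitial rather than the page.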

Batches amortize the fixed costs too. A real browser takes 3-5 seconds to launch — paid once per batch, not once per URL. Pass 50 URLs and the browser cost spreads across all of them.

Use cases

  • Fetch a mixed batch of public pages where you don't know which will be easy and which will be defended
  • Build a corpus of pages for downstream LLM extraction (use outputs: ["markdown"] and let the Actor pick the cheapest tier that returns the rendered content)
  • Refresh structured data (JSON-LD, OpenGraph) across a list of product / article / event URLs
  • Capture screenshots and accessibility trees across a list of pages for QA or audit purposes
  • Pull pages from a specific country (residential proxy, tier 3, via the country field)

How it compares to fixed-method scrapers

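| Approach | Static HTML | JavaScript-rendered | Bot-defended | Cost on easy pages |
| --- | --- | --- | --- | --- |
| Plain HTTP fetch | yes | no | no | cheapest |
| Always-stealth fetcher | yes | yes | yes | overpaying for easy pages |
| Smart Page Fetcher | yes | yes | yes | basic-tier price |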

You don't pay tier-2 rates on a tier-1 page and you don't fail on a tier-3 page. One callable surface, one batch input, one set of output formats — regardless of which tier each URL ended up on.

Input

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| urls | array | yes | (none) | List of URLs to fetch, 1-500 per batch. Each entry is either a plain URL string or an object {"url": "...", "headers": {...}}. URLs must start with http:// or https://. |
| basic | enum: auto / true / false | no | auto | Controls the basic HTTP tier. auto: included in the escalation chain. true: starts here. false: skipped. |
| js | enum: auto / true / false | no | auto | Controls the JS render tier. |
| stealth | enum: auto / true / false | no | auto | Controls the stealth tier. |
| outputs | array of strings | no | ["html", "markdown"] | Any combination of html, text, markdown, a11y, screenshot, links, json_ld, og, meta. |
| runtime_budget_ms | integer (30000–3600000) | no | 270000 | Total wall-clock budget for the whole batch. Unprocessed URLs come back as deferred (zero charge). The default keeps synchronous callers under Apify's 5-minute sync API limit with headroom. |
| country | string (ISO-2) | no | (none) | Optional country code (e.g. US, GB, DE) forwarded to the stealth tier's residential proxy. Ignored by basic and JS tiers. |

The convenience input url: "<single-url>" is accepted silently as syntactic sugar for urls: ["<single-url>"].
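
To make the accepted input shapes concrete, here is a small sketch that normalizes them into one form. `normalize_urls` is a hypothetical helper, and preferring `urls` when both `urls` and `url` are present is an assumption, not documented behavior.

```python
from typing import Any, Dict, List


def normalize_urls(run_input: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn urls / url input into a uniform list of {"url", "headers"} entries."""
    entries = list(run_input.get("urls") or [])
    if not entries and run_input.get("url"):
        # The single-url form is sugar for a one-element urls list.
        entries = [run_input["url"]]
    normalized = []
    for entry in entries:
        if isinstance(entry, str):
            normalized.append({"url": entry, "headers": {}})
        else:
            normalized.append({"url": entry["url"], "headers": dict(entry.get("headers", {}))})
    return normalized
```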

Per-URL request headers

When a urls entry uses the object form, the headers map sets request headers for that URL only — they don't leak to other URLs in the same batch. Same effect across all three tiers (httpx GET, Playwright context, stealth backend).

Allowed header names (case-insensitive):

  • Accept
  • Accept-Language
  • Accept-Encoding
  • User-Agent
  • Referer
  • Content-Type

Anything else — Cookie, Authorization, Proxy-Authorization, any X-* header, Origin, etc. — is rejected at input validation time. The Actor is a general-purpose unauthenticated fetcher; allowing credential or session headers would turn it into an authenticated-session proxy on demand. Use a purpose-built Actor for authenticated scraping.
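
If you want to catch a rejected header before submitting a run, a client-side check along these lines may help. The allowlist is copied from the list above; `rejected_headers` is a hypothetical helper, not part of the Actor.

```python
from typing import Dict, List

# Allowlist from the list above, lower-cased for case-insensitive matching.
ALLOWED_HEADERS = {
    "accept",
    "accept-language",
    "accept-encoding",
    "user-agent",
    "referer",
    "content-type",
}


def rejected_headers(headers: Dict[str, str]) -> List[str]:
    """Return the header names that fall outside the allowlist, sorted."""
    return sorted(name for name in headers if name.lower() not in ALLOWED_HEADERS)
```

An empty result means the headers map should pass this part of input validation.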

Example: route Reddit-style listings through the JSON variant by setting Accept:

{
  "urls": [
    "https://example.com",
    {
      "url": "https://old.reddit.com/r/programming/.json",
      "headers": { "Accept": "application/json" }
    }
  ]
}

Output

One dataset record per URL, in input order.

Success record (a tier returned usable content):

{
  "url": "https://example.com",
  "status": "success",
  "realized_tier": "basic",
  "attempted_tiers": ["basic"],
  "final_url": "https://example.com",
  "response_status": 200,
  "outputs": {
    "html": "https://api.apify.com/v2/key-value-stores/.../records/html-0.html",
    "markdown": "This domain is for use in documentation examples..."
  }
}

Failure record (every allowed tier was tried and errored):

{
  "url": "https://example.com",
  "status": "failed",
  "attempted_tiers": ["basic", "js", "stealth"],
  "tier_errors": {
    "basic": "http_404",
    "js": "navigation_failed",
    "stealth": "upstream_unsolved"
  }
}

Deferred record (runtime budget exhausted before the URL was attempted):

{
  "url": "https://example.com",
  "status": "deferred",
  "reason": "runtime_budget_exhausted",
  "attempted_tiers": []
}

Only success records trigger a fetch charge. failed and deferred are zero-charge.
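
A consumer of the dataset will usually branch on `status`. The sketch below groups records and builds a follow-up input containing only the deferred URLs. Both helper names are hypothetical, and because the records shown above do not include per-URL headers, a retry built this way would drop them.

```python
from collections import defaultdict
from typing import Any, Dict, List


def split_by_status(records: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group dataset records under their status: success, failed, or deferred."""
    groups = defaultdict(list)
    for record in records:
        groups[record["status"]].append(record)
    return dict(groups)


def retry_input(records: List[Dict[str, Any]], outputs: List[str]) -> Dict[str, Any]:
    """Build a new run input that covers only the deferred URLs."""
    deferred = [r["url"] for r in records if r["status"] == "deferred"]
    return {"urls": deferred, "outputs": outputs}
```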

The run's OUTPUT.json (visible in the run summary) is a small batch-level object:

{
  "batch_size": 50,
  "by_tier": { "basic": 41, "js": 7, "stealth": 1 },
  "failed": 1,
  "deferred": 0,
  "duration_ms": 28430,
  "runtime_budget_exhausted": false
}

Example

{
  "urls": [
    "https://example.com",
    "https://news.ycombinator.com",
    "https://www.python.org/"
  ],
  "outputs": ["markdown", "links", "og"]
}

Via the API:

curl -X POST "https://api.apify.com/v2/acts/shelvick~smart-page-fetcher/run-sync-get-dataset-items?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"], "outputs": ["markdown"]}'

run-sync-get-dataset-items blocks until the run finishes and returns the dataset records directly. Use the async run endpoint for batches over ~50 URLs (Apify's sync endpoint has a 5-minute cap).

Calling from an AI agent

The Actor is designed for agent discovery and invocation.

Apify MCP server (mcp.apify.com): the Actor surfaces as a callable tool. The input schema is self-documenting, so an LLM can construct correct calls from the tool description without external context. Pay per call via x402 USDC on Base or Skyfire managed tokens.

Apify SDK (Python):

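import os
from apify_client import ApifyClient

# APIFY_TOKEN is an assumed environment variable name; avoid hard-coding the token.
API_TOKEN = os.environ["APIFY_TOKEN"]
client = ApifyClient(token=API_TOKEN)
# call() starts the Actor and waits for the run to finish.
run = client.actor("shelvick/smart-page-fetcher").call(
    run_input={"urls": ["https://example.com"], "outputs": ["markdown"]}
)
# One dataset record per input URL.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], item["status"], item.get("outputs", {}).get("markdown"))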

REST API: /run-sync-get-dataset-items for batches that fit under 5 minutes, the async /runs endpoint for larger batches.
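
For larger batches, the asynchronous path might be sketched as follows with the Python client. This is one possible arrangement, not a prescribed one: `chunked` and `fetch_async` are hypothetical helpers, `APIFY_TOKEN` is an assumed environment variable name, and the 500-URL chunk size comes from the Input table.

```python
import os
from typing import Any, Dict, Iterator, List

MAX_BATCH = 500  # upper bound on urls per run, from the Input table


def chunked(urls: List[str], size: int = MAX_BATCH) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def fetch_async(urls: List[str], outputs: List[str], budget_ms: int = 3_600_000) -> List[Dict[str, Any]]:
    """Start one run per chunk, wait for each to finish, and collect the records."""
    from apify_client import ApifyClient  # lazy import; requires `pip install apify-client`

    client = ApifyClient(token=os.environ["APIFY_TOKEN"])
    actor = client.actor("shelvick/smart-page-fetcher")
    items: List[Dict[str, Any]] = []
    for batch in chunked(urls):
        run = actor.start(
            run_input={"urls": batch, "outputs": outputs, "runtime_budget_ms": budget_ms}
        )
        # Poll until the run reaches a terminal state.
        client.run(run["id"]).wait_for_finish()
        items.extend(client.dataset(run["defaultDatasetId"]).iterate_items())
    return items
```

Runs are processed one after another here for simplicity; starting all chunks first and waiting afterwards would shorten total wall-clock time.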

Pricing

Pay-per-event, billed only on success. Each URL is charged once at the tier that actually produced its content — basic, js, or stealth. failed and deferred URLs are free. A single Actor-start event is amortized across the whole batch. Higher tiers cost more because they involve more infrastructure (a real browser at tier 2, a real browser plus residential proxy at tier 3); on a typical mixed batch most URLs land on tier 1 and the effective per-URL cost is dominated by that floor.

See the Pricing tab on this Store page for the current per-event rates and any active subscriber discounts.
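
To estimate what a batch might cost, you can multiply the per-tier counts from OUTPUT.json by the per-event rates. In the sketch below only the basic rate ($0.43 per 1,000 pages) comes from this page; the js and stealth rates and the start fee are placeholder values for illustration and must be replaced with the figures from the Pricing tab. `batch_cost` is a hypothetical helper.

```python
from typing import Dict


def batch_cost(by_tier: Dict[str, int], rates: Dict[str, float], start_fee: float = 0.0) -> float:
    """Sum per-URL charges for successful fetches plus one Actor-start fee."""
    return start_fee + sum(count * rates[tier] for tier, count in by_tier.items())


# by_tier as reported in the OUTPUT.json example above.
by_tier = {"basic": 41, "js": 7, "stealth": 1}

rates = {
    "basic": 0.43 / 1000,  # from this page: $0.43 per 1,000 pages
    "js": 0.002,           # PLACEHOLDER, not a real rate
    "stealth": 0.02,       # PLACEHOLDER, not a real rate
}

estimate = batch_cost(by_tier, rates, start_fee=0.0)  # start fee also a placeholder
print(f"estimated batch cost: ${estimate:.4f}")
```

Failed and deferred URLs do not appear in `by_tier`, which matches the rule that only successes are charged.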

Errors

The run itself is marked FAILED only on input validation problems:

  • urls empty or missing, with no url either
  • A URL doesn't start with http:// or https://
  • A urls object-form entry carries a header name outside the allowlist (see Per-URL request headers above), or an unknown key
  • All three tier flags set to false (no tier allowed)
  • screenshot or a11y requested but the tier chain excludes JS and stealth
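
The rules above can be mirrored client-side to catch a bad input before a run is started. This is a sketch of those rules as listed, not the Actor's validator: `validation_errors` is a hypothetical helper, the messages are invented, and header and unknown-key checks are left out.

```python
from typing import Any, Dict, List

TIERS = ("basic", "js", "stealth")
BROWSER_ONLY_OUTPUTS = {"a11y", "screenshot"}


def validation_errors(run_input: Dict[str, Any]) -> List[str]:
    """Collect run-level input problems; an empty list means the input looks valid."""
    errors = []
    urls = run_input.get("urls") or []
    if not urls and run_input.get("url"):
        urls = [run_input["url"]]
    if not urls:
        errors.append("urls is empty or missing")
    for entry in urls:
        url = entry if isinstance(entry, str) else entry.get("url", "")
        if not url.startswith(("http://", "https://")):
            errors.append(f"URL must start with http:// or https://: {url!r}")
    flags = {tier: run_input.get(tier, "auto") for tier in TIERS}
    if all(value == "false" for value in flags.values()):
        errors.append("all three tier flags are false")
    outputs = set(run_input.get("outputs", ["html", "markdown"]))
    needs_browser = outputs & BROWSER_ONLY_OUTPUTS
    if needs_browser and flags["js"] == "false" and flags["stealth"] == "false":
        errors.append(f"{sorted(needs_browser)} require the js or stealth tier")
    return errors
```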

Per-URL fetch failures don't fail the run — they land in the dataset as failed records with a tier_errors map. Common reasons:

  • http_404 / http_410 / http_5xx — target returned a terminal HTTP error
  • non_html_content_type: <type> — target returned binary, JSON, or PDF; we only handle HTML/XML
  • navigation_failed — browser couldn't reach the page (DNS, TLS, timeout)
  • upstream_unsolved — stealth tier couldn't bypass the target's bot defenses
  • upstream_policy_block — stealth backend refused the URL by policy (target on its denylist); not retried
  • upstream_rate_limited — backend hit a rate limit; retried internally before giving up
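
When a batch has many failures, tallying the reasons per tier is a quick way to see what went wrong. The sketch below assumes the failure record shape shown in the Output section; `failure_reasons` is a hypothetical helper.

```python
from collections import Counter
from typing import Any, Dict, List


def failure_reasons(records: List[Dict[str, Any]]) -> Counter:
    """Count (tier, reason) pairs across all failed records."""
    counts: Counter = Counter()
    for record in records:
        if record.get("status") != "failed":
            continue
        for tier, reason in record.get("tier_errors", {}).items():
            counts[(tier, reason)] += 1
    return counts
```

Calling `failure_reasons(records).most_common(5)` then lists the most frequent failure causes first.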

Performance expectations

Latency depends on the realized tier:

  • Tier 1: 100–500 ms per URL, fired in parallel up to 50 at a time
  • Tier 2: 3–10 s per URL, ~3 in parallel sharing one browser (3-5 s of browser startup amortized across the batch)
  • Tier 3: 15–90 s per URL, ~5 in parallel; the long tail is bot-defense challenge solving

A typical 50-URL mixed batch finishes in 30-90 seconds. Pure tier-1 batches finish in 2-5 seconds. The Actor's runtime budget defaults to 4 minutes 30 seconds (270000 ms) — raise it up to 60 minutes for very large batches.

FAQ

What if I know all my URLs need JavaScript? Set basic: "false". The chain starts at JS and skips the wasted tier-1 attempt.

What if I know my URLs are heavily defended? Set stealth: "true". The chain starts directly at the stealth tier, which saves the time spent on two failed lower-tier attempts.

Can I cap costs at JS-tier price and refuse stealth? Yes: stealth: "false". URLs that fail JS will come back as failed records (zero charge) instead of escalating to the expensive tier.

What if my batch is too big to finish in 5 minutes? Raise runtime_budget_ms up to 3600000 (60 minutes) and use Apify's async run endpoint. Anything not finished before the budget runs out returns as deferred and you can retry just those.

Do I get charged if a fetch fails? No. Charges fire only after a success record is pushed to the dataset. Failed and deferred URLs cost you nothing per-URL — the actor-start fee is the only baseline.

Do screenshot and a11y work on tier 1? No — they require a real browser. Requesting them with js: "false" and stealth: "false" is a configuration error and the run will fail at validation time. With auto, the start tier is bumped automatically.

Can I crawl a site with this? Not directly. The Actor takes a list of URLs and fetches them; it does not follow links. Use the links output to feed a second invocation if you want a one-level crawl.
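
A one-level crawl along those lines could collect the links from a first run and turn them into the next batch. The sketch assumes that `links` arrives inline under `outputs` as a list of {url, text, title} objects, as described in the output formats; `next_batch` is a hypothetical helper and the 500-URL limit comes from the Input table.

```python
from typing import Any, Dict, Iterable, List, Optional


def next_batch(
    records: List[Dict[str, Any]],
    seen: Optional[Iterable[str]] = None,
    limit: int = 500,
) -> List[str]:
    """Collect unseen http(s) link targets from success records, up to `limit`."""
    visited = set(seen or ())
    urls: List[str] = []
    for record in records:
        if record.get("status") != "success":
            continue
        for link in record.get("outputs", {}).get("links", []):
            url = link.get("url", "")
            if not url.startswith(("http://", "https://")) or url in visited:
                continue
            visited.add(url)
            urls.append(url)
            if len(urls) >= limit:
                return urls
    return urls
```

Passing the first run's input URLs as `seen` keeps the second invocation from fetching the same pages again.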

Is the returned HTML unmodified? Yes — byte-for-byte what the target server returned, with no scripts or wrappers injected by us or by the storage layer. Useful when feeding the HTML to a DOM parser, a diff tool, or an LLM that's particular about its inputs.

What this doesn't do

  • No authentication. Read-only, unauthenticated. Per-URL headers are limited to content-negotiation and polite-identification headers (Accept, Accept-Language, User-Agent, Referer, etc.); credential or session headers (Cookie, Authorization, X-*) are rejected at input validation. The fetch is anonymous from the target's perspective.
  • No forms or interactions. This is a fetcher, not a browser-automation tool. Use a dedicated Actor for clicking, scrolling, or form submission.
  • No automatic pagination. Pass the paginated URLs as a batch yourself.
  • No PDF or binary content. HTML/XML only. Non-HTML responses come back as failed with non_html_content_type.
  • No retries on terminal failures. The stealth tier retries internally on transient backend errors, but a 404 or unsolved challenge is final — we don't re-queue it.

For workflows that need any of these, this Actor is the right primitive to build on — call it from your own orchestrator and handle the higher-level loop there.