Pricing

From $0.43 / 1,000 pages fetched (basic tier)

Smart Page Fetcher — HTML, Markdown & Text

Fetch a batch of URLs and get each page as HTML, Markdown, or clean text. The Actor tries plain HTTP first, renders JavaScript in a real browser when needed, and escalates to stealth + residential proxy for Cloudflare-protected or otherwise bot-defended pages. The decision is made per URL, so you pay only for the difficulty each URL needed.

Developer

Scott Helvick

Last modified: 11 days ago

What this does

Submit a list of URLs. The Actor walks each URL up an escalation chain until something works, then pushes one dataset record per URL with the requested output formats.

Tier 1 — basic HTTP. Plain GET, no JavaScript, no proxy. Fast and cheap. Good for static pages, JSON-LD-heavy product pages, documentation, RSS-style content.
Tier 2 — JavaScript render. Real browser, no stealth shims. Loads the page, runs JS, captures the rendered DOM. Good for SPAs and lazy-rendered content.
Tier 3 — stealth + residential proxy. Hardened browser session routed through residential IPs in the country of your choice. Used only when the cheaper tiers can't get past bot defenses.

Each tier can be locked on or off per request. The default is auto on all three — escalate from cheapest, stop as soon as a tier returns usable content.
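The flag semantics can be sketched as a small resolver. This is an illustration, not the Actor's actual code: `resolve_chain` is a hypothetical name, and the rule for several flags set to `true` at once (start at the cheapest of them) is an assumption the description does not spell out.

```python
TIERS = ("basic", "js", "stealth")  # cheapest to most expensive


def resolve_chain(basic="auto", js="auto", stealth="auto"):
    """Return the ordered list of tiers a URL may be tried on."""
    flags = dict(zip(TIERS, (basic, js, stealth)))
    for name, value in flags.items():
        if value not in ("auto", "true", "false"):
            raise ValueError(f"{name} must be 'auto', 'true' or 'false'")

    # "false" removes a tier from the chain entirely.
    allowed = [t for t in TIERS if flags[t] != "false"]
    if not allowed:
        raise ValueError("all three tiers are disabled")

    # "true" means the chain starts at that tier; cheaper tiers are dropped.
    # Assumption: with several "true" flags, start at the cheapest of them.
    forced = [t for t in allowed if flags[t] == "true"]
    if forced:
        allowed = allowed[allowed.index(forced[0]):]
    return allowed


print(resolve_chain())                 # ['basic', 'js', 'stealth']
print(resolve_chain(basic="false"))    # ['js', 'stealth']
print(resolve_chain(stealth="true"))   # ['stealth']
print(resolve_chain(stealth="false"))  # ['basic', 'js']
```

The Actor then walks the returned list in order and stops at the first tier that yields usable content.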

Output formats are derived from the same fetched HTML, no extra fetch charge per format:

html — raw HTML, returned as a Key-Value Store URL (so the dataset record stays small). Byte-for-byte what the target returned — no scripts, banners, or wrappers injected into the response.
cleaned_html — HTML with scripts, styles, tracking pixels, comments, and hidden elements stripped. Preserves semantic structure (headings, paragraphs, lists, tables, links, images). Use when you want to parse or process the DOM without the typical 50-200KB of non-content overhead.
text — boilerplate-stripped visible text (LLM-friendly)
markdown — page content as Markdown
links — every anchor as {url, text, title} with relative hrefs resolved
media — every image, video, and audio element as {url, type, alt} with relative URLs resolved
headings — document heading outline as [{level, text}] in document order (h1-h6)
tables — every <table> as {headers, rows, caption?} — headers from <thead> or first-row <th> elements, rows as arrays of cell text
json_ld — every <script type="application/ld+json"> block, parsed
og — OpenGraph values (title, description, image, url, type, site_name)
meta — other meta tags as a flat dict (description, canonical, viewport, twitter:*, etc.)
a11y — browser accessibility tree as JSON (tier 2/3 only)
screenshot — full-page PNG (tier 2/3 only)

html, a11y, and screenshot are uploaded to the Apify Key-Value Store and the dataset record stores a public URL. Smaller structured outputs stay inline.
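On the consumer side, that split means some output values are content and others are links to content. A minimal sketch of separating the two, assuming only the three URL-backed names listed above (`split_outputs` is a hypothetical helper, not part of the Actor):

```python
# Output names that are uploaded to the Key-Value Store and returned as URLs.
URL_BACKED = {"html", "a11y", "screenshot"}


def split_outputs(outputs):
    """Separate inline values from Key-Value Store URLs in one record's outputs."""
    inline = {k: v for k, v in outputs.items() if k not in URL_BACKED}
    stored = {k: v for k, v in outputs.items() if k in URL_BACKED}
    return inline, stored


outputs = {
    "html": "https://api.apify.com/v2/key-value-stores/.../records/html-0.html",
    "markdown": "This domain is for use in documentation examples...",
}
inline, stored = split_outputs(outputs)
print(sorted(inline))  # names whose value is the content itself
print(sorted(stored))  # names whose value is a URL to download
```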

Common workflows this enables:

Fetch a mixed batch of public pages where you don't know which will be easy and which will be defended
Build a corpus of pages for downstream LLM extraction (use outputs: ["markdown"] and let the Actor pick the cheapest tier that returns the rendered content)
Refresh structured data (JSON-LD, OpenGraph) across a list of product / article / event URLs
Capture screenshots and accessibility trees across a list of pages for QA or audit purposes
Pull pages from a specific country (residential proxy, tier 3, via the country field)

Why adaptive tiering matters

The cost gap between fetch methods is enormous. A plain HTTP request takes about 100 ms and a fraction of a cent in resources. A full stealth render against a Cloudflare-protected site can take 30 seconds and consume a residential proxy session that costs orders of magnitude more. If you pre-commit to one method, you either overpay for easy pages or fail on hard ones.

A worse failure mode: the page renders but is silently wrong. Bot defenses sometimes serve a 200 with a JavaScript challenge interstitial that looks like a page to a naive HTTP client — the LLM downstream gets a string of obfuscated JS instead of the article body and has no way to know.

This Actor solves both ends. It picks the right method per URL — paying basic-tier rates for the 90% of URLs that don't need anything fancy, and only escalating to the expensive paths when the cheaper ones return content that fails an escalation check (known anti-bot markers, JS-required signals, the typical 403/429/503 status codes from bot defenses). The customer never has to think about which tier is right; they just submit URLs.
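To make the escalation check concrete, here is a rough sketch of what such a test could look like. The status codes come from the paragraph above; the marker strings and the returned reason names are illustrative assumptions, not the Actor's real list.

```python
# Status codes named above as typical bot-defense responses.
BLOCK_STATUSES = {403, 429, 503}

# Illustrative markers only; the Actor's actual marker list is not published here.
ANTI_BOT_MARKERS = ("just a moment...", "captcha")
JS_REQUIRED_MARKERS = ("please enable javascript", "you need to enable javascript")


def needs_escalation(status, body):
    """Return a reason string if the response looks unusable, else None."""
    if status in BLOCK_STATUSES:
        return f"http_{status}"
    lowered = body.lower()
    if any(marker in lowered for marker in ANTI_BOT_MARKERS):
        return "anti_bot_marker"
    if any(marker in lowered for marker in JS_REQUIRED_MARKERS):
        return "js_required"
    return None


# A 200 response that is really a challenge interstitial still escalates.
print(needs_escalation(200, "<title>Just a moment...</title>"))
print(needs_escalation(200, "<article>Real content</article>"))
```

The point of checking the body as well as the status is the silent-failure case: a 200 that carries a challenge page should not be treated as success.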

Batches amortize the fixed costs too. A real browser takes 3-5 seconds to launch — paid once per batch, not once per URL. Pass 50 URLs and the browser cost spreads across all of them.

How it compares to fixed-method scrapers

Approach	Static HTML	JavaScript-rendered	Bot-defended	Cost on easy pages
Plain HTTP fetch	✓	✗	✗	cheapest
Always-stealth fetcher	✓	✓	✓	overpaying for easy pages
Smart Page Fetcher	✓	✓	✓	basic-tier price

You don't pay tier-2 rates on a tier-1 page and you don't fail on a tier-3 page. One callable surface, one batch input, one set of output formats — regardless of which tier each URL ended up on.

Input

Field	Type	Required	Default	Description
`urls`	array	✓	—	List of URLs to fetch, 1-500 per batch. Each entry is either a plain URL string or an object `{"url": "...", "headers": {...}}`. URLs must start with `http://` or `https://`.
`basic`	enum: `auto` / `true` / `false`		`auto`	Controls the basic HTTP tier. `auto`: included in the escalation chain. `true`: starts here. `false`: skipped.
`js`	enum: `auto` / `true` / `false`		`auto`	Controls the JS render tier.
`stealth`	enum: `auto` / `true` / `false`		`auto`	Controls the stealth tier.
`outputs`	array of strings		`["html", "markdown"]`	Any combination of `html`, `cleaned_html`, `text`, `markdown`, `links`, `media`, `headings`, `tables`, `json_ld`, `og`, `meta`, `a11y`, `screenshot`.
`runtime_budget_ms`	integer (30000–3600000)		`270000`	Total wall-clock budget for the whole batch. Unprocessed URLs come back as `deferred` (zero charge). The default keeps synchronous callers under Apify's 5-minute sync API limit with headroom.
`country`	string (ISO-2)		—	Optional country code (e.g. `US`, `GB`, `DE`) forwarded to the stealth tier's residential proxy. Ignored by basic and JS tiers.

The convenience input url: "<single-url>" is accepted silently as syntactic sugar for urls: ["<single-url>"].

Per-URL request headers

When a urls entry uses the object form, the headers map sets request headers for that URL only; they don't leak to other URLs in the same batch. The headers are applied the same way on all three tiers (raw HTTP, browser render, stealth backend).

Allowed header names (case-insensitive):

Accept
Accept-Language
Accept-Encoding
User-Agent
Referer
Content-Type

Anything else — Cookie, Authorization, Proxy-Authorization, any X-* header, Origin, etc. — is rejected at input validation time. The Actor is a general-purpose unauthenticated fetcher; allowing credential or session headers would turn it into an authenticated-session proxy on demand. Use a purpose-built Actor for authenticated scraping.
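If you build inputs programmatically, you can run the same allowlist check client-side before submitting. A minimal sketch, assuming the six names listed above are the complete allowlist (`rejected_headers` is a hypothetical helper):

```python
# Allowlist from the section above, lower-cased for case-insensitive matching.
ALLOWED_HEADERS = {
    "accept",
    "accept-language",
    "accept-encoding",
    "user-agent",
    "referer",
    "content-type",
}


def rejected_headers(headers):
    """Return the header names that fall outside the allowlist."""
    return [name for name in headers if name.strip().lower() not in ALLOWED_HEADERS]


print(rejected_headers({"ACCEPT": "application/json"}))       # []
print(rejected_headers({"Cookie": "a=b", "X-Api-Key": "k"}))  # ['Cookie', 'X-Api-Key']
```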

Example: route Reddit-style listings through the JSON variant by setting Accept:

{
  "urls": [
    "https://example.com",
    {
      "url": "https://old.reddit.com/r/programming/.json",
      "headers": { "Accept": "application/json" }
    }
  ]
}

Output

One dataset record per URL, in input order.

Success record (a tier returned usable content):

{
  "url": "https://example.com",
  "status": "success",
  "realized_tier": "basic",
  "attempted_tiers": ["basic"],
  "final_url": "https://example.com",
  "response_status": 200,
  "outputs": {
    "html": "https://api.apify.com/v2/key-value-stores/.../records/html-0.html",
    "markdown": "This domain is for use in documentation examples..."
  }
}

Failure record (every allowed tier was tried and errored):

{
  "url": "https://example.com",
  "status": "failed",
  "attempted_tiers": ["basic", "js", "stealth"],
  "tier_errors": {
    "basic": "http_404",
    "js": "navigation_failed",
    "stealth": "upstream_unsolved"
  }
}

Deferred record (runtime budget exhausted before the URL was attempted):

{
  "url": "https://example.com",
  "status": "deferred",
  "reason": "runtime_budget_exhausted",
  "attempted_tiers": []
}

Only success records trigger a fetch charge. failed and deferred are zero-charge.

The run's OUTPUT.json (visible in the run summary) is a small batch-level object:

{
  "batch_size": 50,
  "by_tier": { "basic": 41, "js": 7, "stealth": 1 },
  "failed": 1,
  "deferred": 0,
  "duration_ms": 28430,
  "runtime_budget_exhausted": false
}
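The batch-level counts can be recomputed from the dataset records, which is useful as a cross-check when you process results from the dataset rather than from the run summary. A sketch using only the record fields shown above (`summarize` is a hypothetical name; duration and budget fields are not derivable from the records and are left out):

```python
from collections import Counter


def summarize(records):
    """Rebuild the per-tier and per-status counts from dataset records."""
    statuses = Counter(r["status"] for r in records)
    # Only success records carry a realized_tier.
    by_tier = Counter(r["realized_tier"] for r in records if r["status"] == "success")
    return {
        "batch_size": len(records),
        "by_tier": dict(by_tier),
        "failed": statuses["failed"],
        "deferred": statuses["deferred"],
    }


records = [
    {"url": "https://example.com", "status": "success", "realized_tier": "basic"},
    {"url": "https://example.org", "status": "success", "realized_tier": "js"},
    {"url": "https://example.net", "status": "failed"},
]
print(summarize(records))
```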

Example

{
  "urls": [
    "https://example.com",
    "https://news.ycombinator.com",
    "https://www.python.org/"
  ],
  "outputs": ["markdown", "links", "og"]
}

Via the API:

curl -X POST "https://api.apify.com/v2/acts/shelvick~smart-page-fetcher/run-sync-get-dataset-items?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"], "outputs": ["markdown"]}'

run-sync-get-dataset-items blocks until the run finishes and returns the dataset records directly. Use the async run endpoint for batches over ~50 URLs (Apify's sync endpoint has a 5-minute cap).
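The choice between the two endpoints can be made from the batch size. The sketch below only builds the request URL. The 50-URL threshold is the rough guideline from the paragraph above rather than a hard limit, and `run_endpoint` is a hypothetical helper.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2/acts"
SYNC_URL_LIMIT = 50  # rough guideline from the text above, not a hard limit


def run_endpoint(actor, token, url_count):
    """Pick the sync or async run endpoint for a batch of the given size."""
    # The API path uses "~" where the Actor name uses "/".
    actor_id = actor.replace("/", "~")
    path = "run-sync-get-dataset-items" if url_count <= SYNC_URL_LIMIT else "runs"
    return f"{API_BASE}/{actor_id}/{path}?{urlencode({'token': token})}"


print(run_endpoint("shelvick/smart-page-fetcher", "YOUR_TOKEN", 3))
print(run_endpoint("shelvick/smart-page-fetcher", "YOUR_TOKEN", 400))
```

With the async endpoint the response describes the run rather than returning records, so the dataset has to be read once the run finishes.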

Calling from an AI agent

The Actor is designed for agent discovery and invocation.

Apify MCP server (mcp.apify.com): the Actor surfaces as a callable tool. The input schema is self-documenting, so an LLM can construct correct calls from the tool description without external context. Pay per call via the Actor's pay-per-event model — works with x402 and Skyfire agentic-payment rails.

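Apify API client for Python (apify-client):

import os

from apify_client import ApifyClient

# Read the API token from the environment rather than hard-coding it.
API_TOKEN = os.environ["APIFY_TOKEN"]

client = ApifyClient(token=API_TOKEN)
# call() starts the run and blocks until it finishes.
run = client.actor("shelvick/smart-page-fetcher").call(
    run_input={"urls": ["https://example.com"], "outputs": ["markdown"]}
)
# One dataset record per input URL.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], item["status"], item.get("outputs", {}).get("markdown"))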

REST API: /run-sync-get-dataset-items for batches that fit under 5 minutes, the async /runs endpoint for larger batches.

Pricing

Pay-per-event, billed only on success. Each URL is charged once at the tier that actually produced its content — basic, js, or stealth. failed and deferred URLs are free. A single Actor-start event is amortized across the whole batch. Higher tiers cost more because they involve more infrastructure (a real browser at tier 2, a real browser plus residential proxy at tier 3); on a typical mixed batch most URLs land on tier 1 and the effective per-URL cost is dominated by that floor.

See the Pricing tab on this Store page for the current per-event rates and any active subscriber discounts.
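As a worked example of the billing model, the sketch below estimates a batch charge from per-tier success counts. Only the basic rate ($0.43 per 1,000) appears on this page; the js and stealth rates and the start fee used here are made-up placeholders, so substitute the real values from the Pricing tab.

```python
def estimate_cost(by_tier, rate_per_1000, actor_start_fee=0.0):
    """Estimate the USD charge for one batch from per-tier success counts."""
    # Failed and deferred URLs never appear in by_tier, so they add nothing.
    per_url = sum(count * rate_per_1000[tier] / 1000 for tier, count in by_tier.items())
    return round(per_url + actor_start_fee, 6)


# PLACEHOLDER rates: only "basic" is taken from the listing.
rates = {"basic": 0.43, "js": 2.00, "stealth": 10.00}
by_tier = {"basic": 41, "js": 7, "stealth": 1}

total = estimate_cost(by_tier, rates)
print(f"batch total: ${total:.5f}")
print(f"per successful URL: ${total / sum(by_tier.values()):.5f}")
```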

Behavior

Failure modes

The run itself is marked FAILED only on input validation problems:

urls empty or missing, with no url either
A URL doesn't start with http:// or https://
A urls object-form entry carries a header name outside the allowlist (see Per-URL request headers above), or an unknown key
All three tier flags set to false (no tier allowed)
screenshot or a11y requested but the tier chain excludes JS and stealth
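Most of these rules can be checked client-side before a run is started. The following is a sketch under the rules listed above, not the Actor's validator: `validate_input` is a hypothetical name, and the header allowlist and unknown-key checks are omitted.

```python
BROWSER_ONLY_OUTPUTS = {"screenshot", "a11y"}


def validate_input(run_input):
    """Return a list of validation problems; an empty list means the input passes."""
    problems = []

    urls = list(run_input.get("urls") or [])
    if not urls and run_input.get("url"):
        urls = [run_input["url"]]  # single-URL convenience form
    if not urls:
        problems.append("urls is empty or missing")

    for entry in urls:
        url = entry.get("url", "") if isinstance(entry, dict) else entry
        if not str(url).startswith(("http://", "https://")):
            problems.append(f"not an http(s) URL: {url!r}")

    flags = {tier: run_input.get(tier, "auto") for tier in ("basic", "js", "stealth")}
    if all(value == "false" for value in flags.values()):
        problems.append("all three tier flags are false")

    wants_browser = BROWSER_ONLY_OUTPUTS & set(run_input.get("outputs", []))
    if wants_browser and flags["js"] == "false" and flags["stealth"] == "false":
        problems.append("screenshot/a11y requested but no browser tier is allowed")

    return problems


print(validate_input({"urls": ["https://example.com"], "outputs": ["markdown"]}))
print(validate_input({"urls": ["ftp://example.com"], "js": "false", "stealth": "false",
                      "outputs": ["screenshot"]}))
```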

Per-URL fetch failures don't fail the run — they land in the dataset as failed records with a tier_errors map. Common reasons:

http_404 / http_410 / http_5xx — target returned a terminal HTTP error
non_html_content_type: <type> — target returned binary, JSON, or PDF; we only handle HTML/XML
navigation_failed — browser couldn't reach the page (DNS, TLS, timeout)
upstream_unsolved — stealth tier couldn't bypass the target's bot defenses
upstream_policy_block — stealth backend refused the URL by policy (target on its denylist); not retried
upstream_rate_limited — backend hit a rate limit; retried internally before giving up

Performance expectations

Latency depends on the realized tier:

Tier 1: 100–500 ms per URL, fired in parallel up to 50 at a time
Tier 2: 3–10 s per URL, ~3 in parallel sharing one browser (3-5 s of browser startup amortized across the batch)
Tier 3: 15–90 s per URL, ~5 in parallel; the long tail is bot-defense challenge solving

A typical 50-URL mixed batch finishes in 30-90 seconds. Pure tier-1 batches finish in 2-5 seconds. The Actor's runtime budget defaults to 4 minutes 30 seconds (270000 ms) — raise it up to 60 minutes for very large batches.
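These figures can be combined into a back-of-the-envelope estimate for choosing `runtime_budget_ms`. The sketch rests on loud assumptions: tiers run one after another, each tier processes URLs in full waves at the stated parallelism, and time spent on failed lower-tier attempts and fixed Actor overhead is ignored, so small batches are underestimated. `estimate_duration` is a hypothetical helper.

```python
import math

# (min seconds, max seconds, parallelism) per tier, from the figures above.
TIER_PROFILE = {
    "basic": (0.1, 0.5, 50),
    "js": (3.0, 10.0, 3),
    "stealth": (15.0, 90.0, 5),
}
BROWSER_STARTUP_S = (3.0, 5.0)  # paid once per batch when the JS tier is used


def estimate_duration(by_tier):
    """Return a rough (low, high) wall-clock estimate in seconds."""
    low = high = 0.0
    for tier, count in by_tier.items():
        if count <= 0:
            continue
        min_s, max_s, parallel = TIER_PROFILE[tier]
        waves = math.ceil(count / parallel)  # full rounds of parallel fetches
        low += waves * min_s
        high += waves * max_s
    if by_tier.get("js", 0) > 0:
        low += BROWSER_STARTUP_S[0]
        high += BROWSER_STARTUP_S[1]
    return low, high


low, high = estimate_duration({"basic": 41, "js": 7, "stealth": 1})
print(f"roughly {low:.0f} to {high:.0f} seconds")
```

Treat the upper bound as the number to compare against the runtime budget; anything beyond it comes back as deferred.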

Telemetry: to improve coverage and reliability, this Actor reports anonymous usage metrics and diagnostic events to the developer — run outcome counts, the sites queried, and, only when something goes wrong, the relevant input fields. No account identifiers are collected, and telemetry never affects a run.

FAQ

What if I know all my URLs need JavaScript? Set basic: "false". The chain starts at JS and skips the wasted tier-1 attempt.

What if I know my URLs are heavily defended? Set stealth: "true". The chain starts directly at the stealth tier, which saves the time otherwise spent on two failed lower-tier attempts.

Can I cap costs at JS-tier price and refuse stealth? Yes: stealth: "false". URLs that fail JS will come back as failed records (zero charge) instead of escalating to the expensive tier.

What if my batch is too big to finish in 5 minutes? Raise runtime_budget_ms up to 3600000 (60 minutes) and use Apify's async run endpoint. Anything not finished before the budget runs out returns as deferred and you can retry just those.
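Retrying only the deferred URLs can be sketched as follows. It assumes each record's `url` field matches the input URL string exactly, and `retry_input` is a hypothetical helper rather than anything the Actor provides.

```python
def retry_input(original_input, records):
    """Build a follow-up run input containing only the deferred URLs."""
    deferred = {r["url"] for r in records if r["status"] == "deferred"}
    # Reuse the original entries so object-form URLs keep their headers.
    remaining = [
        entry
        for entry in original_input["urls"]
        if (entry["url"] if isinstance(entry, dict) else entry) in deferred
    ]
    if not remaining:
        return None  # nothing left to retry
    return {**original_input, "urls": remaining}


first_input = {
    "urls": [
        "https://example.com",
        {"url": "https://example.org", "headers": {"Accept": "text/html"}},
    ],
    "outputs": ["markdown"],
}
records = [
    {"url": "https://example.com", "status": "success"},
    {"url": "https://example.org", "status": "deferred",
     "reason": "runtime_budget_exhausted", "attempted_tiers": []},
]
print(retry_input(first_input, records))
```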

Do I get charged if a fetch fails? No. Charges fire only after a success record is pushed to the dataset. Failed and deferred URLs cost you nothing per-URL — the actor-start fee is the only baseline.

Do screenshot and a11y work on tier 1? No — they require a real browser. Requesting them with js: "false" and stealth: "false" is a configuration error and the run will fail at validation time. With auto, the start tier is bumped automatically.

Can I crawl a site with this? Not directly. The Actor takes a list of URLs and fetches them; it does not follow links. Use the links output to feed a second invocation if you want a one-level crawl.
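A one-level crawl along those lines might look like this sketch. It reads the `links` output of successful records, drops fragments and duplicates, optionally keeps only same-host links, and chunks the result to the 500-URL batch limit from the Input table. `next_batches` and its `same_host_only` parameter are hypothetical.

```python
from urllib.parse import urldefrag, urlsplit

MAX_BATCH = 500  # per-batch URL limit from the Input table


def next_batches(records, same_host_only=True):
    """Turn the links outputs of finished records into follow-up URL batches."""
    seen, found = set(), []
    for record in records:
        if record.get("status") != "success":
            continue
        host = urlsplit(record.get("final_url") or record["url"]).netloc
        for link in record.get("outputs", {}).get("links", []):
            url, _fragment = urldefrag(link["url"])  # drop "#section" anchors
            parts = urlsplit(url)
            if parts.scheme not in ("http", "https"):
                continue  # skip mailto:, javascript:, and similar
            if same_host_only and parts.netloc != host:
                continue
            if url not in seen:
                seen.add(url)
                found.append(url)
    return [found[i:i + MAX_BATCH] for i in range(0, len(found), MAX_BATCH)]


records = [{
    "url": "https://example.com",
    "status": "success",
    "final_url": "https://example.com",
    "outputs": {"links": [
        {"url": "https://example.com/a", "text": "A", "title": ""},
        {"url": "https://example.com/a#top", "text": "A again", "title": ""},
        {"url": "https://other.test/b", "text": "B", "title": ""},
    ]},
}]
print(next_batches(records))  # [['https://example.com/a']]
```

Each returned batch can be passed as the `urls` field of a second invocation.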

Is the returned HTML unmodified? Yes — byte-for-byte what the target server returned, with no scripts or wrappers injected by us or by the storage layer. Useful when feeding the HTML to a DOM parser, a diff tool, or an LLM that's particular about its inputs.

What this doesn't do

No authentication. Read-only, unauthenticated. Per-URL headers are limited to content-negotiation and polite-identification headers (Accept, Accept-Language, User-Agent, Referer, etc.); credential or session headers (Cookie, Authorization, X-*) are rejected at input validation. The fetch is anonymous from the target's perspective.
No forms or interactions. This is a fetcher, not a browser-automation tool. Use a dedicated Actor for clicking, scrolling, or form submission.
No automatic pagination. Pass the paginated URLs as a batch yourself.
No PDF or binary content. HTML/XML only. Non-HTML responses come back as failed with non_html_content_type.
No retries on terminal failures. The stealth tier retries internally on transient backend errors, but a 404 or unsolved challenge is final — we don't re-queue it.

For workflows that need any of these, this Actor is the right primitive to build on — call it from your own orchestrator and handle the higher-level loop there. For authenticated sessions and form-based interactions, use a dedicated browser-automation tool. For domain-aware crawling with built-in pagination or polite-crawling rate-limiting, a crawler specialized in multi-page traversal is the better fit. For sites with custom protected APIs, use the platform's own SDK instead.

Design notes: www.scotthelvick.com/tools/smart-page-fetcher
