# Structured Data Extractor — URL to JSON (`shelvick/structured-extractor`) Actor

Extract structured data from a batch of URLs as schema-validated JSON. Send page URLs and a JSON Schema; the Actor scrapes each page (using a stealth browser and residential proxy as needed), runs an LLM to convert the page to JSON matching your schema, and validates the result per URL. Omit the schema for best-effort extraction. Public pages only.

- **URL**: https://apify.com/shelvick/structured-extractor.md
- **Developed by:** [Scott Helvick](https://apify.com/shelvick) (community)
- **Categories:** AI, Developer tools
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

From $51.00 / 1,000 URLs extracted (base)

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.
Because this Actor supports Apify Store discounts, the price decreases on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server that can be used as a website, an API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows (PowerShell)
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Structured Data Extractor

Extract structured data from a batch of URLs as schema-validated JSON. AI agents that scrape pages get raw HTML or markdown back, then burn their own tokens — and risk hallucinating — turning it into the fields they actually wanted. Structured Data Extractor closes that gap in one batch: send a list of URLs and one JSON Schema, and it scrapes them all in a single pass (escalating to a stealth browser and residential proxy when a page is defended), runs an LLM per page to convert each page to JSON matching your schema, validates conformance, and returns one clean structured record per URL — turning each URL into the exact fields you defined.

### What this does

- **Batch, one shared schema** — pass up to 100 URLs and a single `outputSchema`; the same shape is extracted from every page. Because all URLs are fetched in one pass, a stealth-browser/proxy launch is amortized across the batch instead of paid per page.
- **Schema-directed extraction** — each result is constrained to your schema, validated against it, and reported per URL via a `schemaValid` flag. If a page's first attempt doesn't conform, that page is retried once with the validation errors fed back.
- **Best-effort mode** — omit the schema and each page returns sensible inferred JSON.
- **Handles defended pages** — the fetch escalates automatically from plain HTTP to a real browser to a stealth + residential-proxy path. You don't pick a method.
- **Bounded cost and context** — `maxInputTokens` caps how much of each page reaches the model, and each content component is capped independently so one oversized field can't blow the budget or overflow the context window.
- **Time-budgeted** — `maxRuntimeSecs` (default 270s) keeps synchronous callers under the API's 5-minute cap; URLs not reached in time come back `deferred` and uncharged, so you can retry them.
- **One dataset record per URL** — `url`, `status`, `result`, `schemaValid`, `tokensUsed`, `inputTokens`, `error`.

Use cases:

- Extract `{title, price, in_stock}` from a list of bot-defended product pages as typed JSON, ready to insert into a database.
- Normalize a set of listing or article pages into one fixed schema your pipeline expects.
- Turn a crawl frontier (a page of result links) into structured records in one call.
- Structured web scraping into a fixed schema: pull the same fields from many JavaScript-heavy or bot-defended pages as typed JSON.
- Get schema-validated output per page that you can trust downstream instead of free-form model text.

### Why batch + schema-directed extraction matters

The common failure mode: an agent fetches a page, gets tens of thousands of tokens of markdown, and parses it itself. That burns tokens, can overflow the context window, and invites hallucinated values when data is sparse.

A subtler one: pages behind bot detection serve degraded content to suspected automation — different prices, missing inventory, placeholder text. An agent fetching with an ordinary client extracts data that looks correct but isn't.

And a practical one specific to defended pages: spinning up a stealth browser is expensive, so doing it once per URL — one run per page — wastes that setup. Most real extraction work is "the same fields from a list of similar pages," so this Actor takes the whole list at once and fetches it in a single pass, amortizing the browser and proxy launch across the batch.

The design answers all three: pages are fetched through a stealth path so what's extracted is the real page; the fetched content is capped to a per-URL token budget so cost stays bounded and scales with page size rather than spiking unpredictably; and the model output is constrained to your schema and *validated* against it — with a per-page retry that feeds the specific errors back — so you get typed, checked data instead of hopeful prose. When a field genuinely isn't on a page, the model is told to return null rather than invent one.
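
The validate-then-retry flow described above can be illustrated with a minimal sketch. This is not the Actor's implementation: `call_model` is a hypothetical callable standing in for the LLM step, and `validation_errors` checks only required keys and primitive types rather than full JSON Schema.

```python
# Illustrative only: a simplified validate-and-retry-once loop.
PY_TYPES = {"string": str, "boolean": bool, "integer": int,
            "number": (int, float), "object": dict, "array": list}


def matches(value, type_name):
    """Return True if value fits a primitive JSON Schema type name."""
    if type_name == "null":
        return value is None
    if isinstance(value, bool):  # bool is an int subclass in Python
        return type_name == "boolean"
    return isinstance(value, PY_TYPES.get(type_name, object))


def validation_errors(result, schema):
    """Collect human-readable errors for missing or mistyped fields."""
    errors = [f"missing required field: {key}"
              for key in schema.get("required", []) if key not in result]
    for key, spec in schema.get("properties", {}).items():
        if key in result and "type" in spec:
            allowed = spec["type"] if isinstance(spec["type"], list) else [spec["type"]]
            if not any(matches(result[key], t) for t in allowed):
                errors.append(f"{key}: expected {allowed}")
    return errors


def extract_with_retry(page_text, schema, call_model):
    """Extract once; on validation failure, retry once with the errors fed back."""
    result = call_model(page_text, schema, feedback=None)
    errors = validation_errors(result, schema)
    if errors:
        result = call_model(page_text, schema, feedback=errors)
        errors = validation_errors(result, schema)
    return {"result": result, "schemaValid": not errors}
```

The returned dictionary mirrors the `result` and `schemaValid` fields of the output record; a second failure still returns the best-effort result with `schemaValid` set to `False`.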

### How it compares to alternatives

| Approach | Stealth fetch | Structured to your schema | Conformance validated | Batch fetch amortization |
|---|---|---|---|---|
| Raw stealth fetcher | Yes | No — raw HTML/markdown | No | Depends |
| Model call on your own fetched HTML | No — you fetch | Yes | Usually not | No |
| Browser automation + hand-written selectors | Yes | Yes — you script it | Manual | You build it |
| **Structured Data Extractor** | Yes | Yes — JSON Schema in | Yes, per-page retry | Yes — one fetch per batch |

The raw-fetch and own-LLM approaches each solve half the problem; hand-written selectors solve both but cost ongoing maintenance. This Actor is the intersection — stealth fetch, schema-constrained extraction, conformance check — applied across a batch so the expensive fetch setup is paid once.

### Input

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | array | yes | -- | The pages to extract from (1–100). One shared `outputSchema` applies to all, so pass pages of the same kind. Fetched in one pass; stealth escalation is automatic and amortized across the batch. Public, unauthenticated pages only. |
| `outputSchema` | object | -- | -- | JSON Schema for the output shape, applied to **every** URL. Each result is validated; `schemaValid` reports per-URL conformance. Omit for best-effort extraction. Mark expected-but-optional fields nullable so the model returns null rather than guessing. |
| `maxInputTokens` | integer | -- | `32000` | Upper bound on fetched content fed to the model **per URL**. Bounds per-URL cost and prevents context overflow. Range 2000–200000. |
| `maxRuntimeSecs` | integer | -- | `270` | Soft wall-clock budget for the whole run. The default keeps synchronous/x402 callers under the 300s cap; URLs not reached by the deadline return `deferred` (uncharged). Range 60–600. |
| `proxyGeo` | string | -- | -- | ISO 3166-1 alpha-2 country code (e.g. `US`, `DE`) for residential routing. Leave empty for default routing; stealth escalation still applies. |
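
Because out-of-range input fails the whole run, it can help to check the limits from the table on the client side first. The helper below is a hypothetical sketch (`build_input` is not part of the Actor or the Apify client); it encodes only the ranges stated in the table, and it assumes `proxyGeo` is passed in upper case as in the examples.

```python
import re


def build_input(urls, output_schema=None, max_input_tokens=32000,
                max_runtime_secs=270, proxy_geo=None):
    """Build an input object, enforcing the ranges listed in the Input table."""
    if not 1 <= len(urls) <= 100:
        raise ValueError("urls must contain 1-100 entries")
    if not 2000 <= max_input_tokens <= 200000:
        raise ValueError("maxInputTokens must be in 2000-200000")
    if not 60 <= max_runtime_secs <= 600:
        raise ValueError("maxRuntimeSecs must be in 60-600")
    # Assumption: two upper-case letters, matching the `US` / `DE` examples.
    if proxy_geo and not re.fullmatch(r"[A-Z]{2}", proxy_geo):
        raise ValueError("proxyGeo must be an ISO 3166-1 alpha-2 code")

    run_input = {
        "urls": list(urls),
        "maxInputTokens": max_input_tokens,
        "maxRuntimeSecs": max_runtime_secs,
    }
    # Optional fields are omitted rather than sent empty.
    if output_schema is not None:
        run_input["outputSchema"] = output_schema
    if proxy_geo:
        run_input["proxyGeo"] = proxy_geo
    return run_input
```

The returned dictionary can be passed as `run_input` to the Python client or serialized as the JSON body of a REST call.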

### Output

One dataset record per input URL.

Completed:

```json
{
  "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
  "status": "completed",
  "result": { "title": "A Light in the Attic", "price": "£51.77", "in_stock": true },
  "schemaValid": true,
  "tokensUsed": 4200,
  "inputTokens": 3400,
  "error": null
}
````

Failed (couldn't fetch or extract) and deferred (time budget exhausted) — neither is charged:

```json
{ "url": "https://example.com/blocked", "status": "failed", "result": {}, "schemaValid": true, "tokensUsed": 0, "inputTokens": 0, "error": "empty-content: failed" }
{ "url": "https://example.com/late", "status": "deferred", "result": {}, "schemaValid": true, "tokensUsed": 0, "inputTokens": 0, "error": "runtime-budget-exhausted" }
```

| Field | Type | Description |
|---|---|---|
| `url` | string | The input URL this record corresponds to. One record per input URL. |
| `status` | string | `completed` (extracted, charged), `failed` (couldn't fetch or extract; not charged), or `deferred` (time budget hit before this URL; not charged, retry it). |
| `result` | object | Extracted data. Conforms to `outputSchema` when provided (subject to `schemaValid`); best-effort otherwise. |
| `schemaValid` | boolean | Whether `result` validated against `outputSchema`. Always `true` when no schema was supplied. `false` means the model couldn't conform even after a retry. |
| `tokensUsed` | integer | LLM tokens consumed for this URL (input + output), across the extraction and any retry. |
| `inputTokens` | integer | Page-content tokens fed to the model for this URL (input only), across the extraction and any retry. This is the figure the per-1,000-token charge meters, so it explains the variable part of this URL's cost. |
| `error` | string | Reason when `status` is not `completed`; `null` on success. |
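
Downstream code usually needs to branch on `status` and `schemaValid` together. The following sketch groups records using only the fields documented above; the function name and the group labels are illustrative choices, not part of the Actor.

```python
def partition_records(records):
    """Group dataset records by outcome using the documented fields."""
    groups = {"valid": [], "invalid": [], "failed": [], "deferred": []}
    for rec in records:
        if rec["status"] == "completed":
            # Completed records may still have failed schema validation.
            key = "valid" if rec["schemaValid"] else "invalid"
        else:
            key = rec["status"]  # "failed" or "deferred"
        groups.setdefault(key, []).append(rec)
    return groups
```

A typical use is to insert `valid` records into a database, log `invalid` and `failed` ones for inspection, and re-submit the URLs in `deferred`.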

### Example

```json
{
  "urls": [
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html"
  ],
  "outputSchema": {
    "type": "object",
    "properties": { "title": { "type": "string" }, "price": { "type": "string" } },
    "required": ["title", "price"]
  }
}
```

```bash
curl -X POST "https://api.apify.com/v2/acts/shelvick~structured-extractor/run-sync-get-dataset-items?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls":["https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html","https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html"],"outputSchema":{"type":"object","properties":{"title":{"type":"string"},"price":{"type":"string"}},"required":["title","price"]}}'
```

### Calling from an AI agent

#### Apify MCP server

The Actor is available as a callable tool on `mcp.apify.com`. The input schema is self-documenting: an LLM can construct a correct call from the tool description and field names alone. Pay per call via x402 USDC on Base or Skyfire managed tokens. Note the 300s synchronous cap: keep `maxRuntimeSecs` at its default for sync/x402 calls and let the tail of a large batch come back as `deferred`, or use the async path for big batches.

#### Apify API client (Python)

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_TOKEN")
run = client.actor("shelvick/structured-extractor").call(
    run_input={
        "urls": [
            "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
            "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
        ],
        "outputSchema": {
            "type": "object",
            "properties": {"title": {"type": "string"}, "price": {"type": "string"}},
            "required": ["title", "price"],
        },
    }
)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], item["status"], item["schemaValid"], item["result"])
```

#### REST API

```
POST https://api.apify.com/v2/acts/shelvick~structured-extractor/run-sync-get-dataset-items?token=YOUR_TOKEN
```

For large batches, start asynchronously and poll:

```
POST https://api.apify.com/v2/acts/shelvick~structured-extractor/runs?token=YOUR_TOKEN
GET  https://api.apify.com/v2/actor-runs/{runId}?token=YOUR_TOKEN
GET  https://api.apify.com/v2/actor-runs/{runId}/dataset/items?token=YOUR_TOKEN
```
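
A start-and-poll loop over these endpoints might look like the sketch below, which uses only the standard library. It assumes the run status is read from `GET /v2/actor-runs/{runId}` and that the token is sent as a Bearer header instead of the `token` query parameter; `run_async`, `http_json`, and the 5-second poll interval are illustrative choices.

```python
import json
import time
import urllib.request

API = "https://api.apify.com/v2"
TERMINAL = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}


def http_json(method, url, token, body=None):
    """Send one JSON request with a Bearer token and decode the JSON response."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(url, data=data, method=method, headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_async(run_input, token, request=http_json, poll_secs=5, sleep=time.sleep):
    """Start a run, poll until it reaches a terminal status, return dataset items."""
    run = request("POST", f"{API}/acts/shelvick~structured-extractor/runs",
                  token, run_input)["data"]
    while run["status"] not in TERMINAL:
        sleep(poll_secs)
        run = request("GET", f"{API}/actor-runs/{run['id']}", token)["data"]
    if run["status"] != "SUCCEEDED":
        raise RuntimeError(f"run {run['id']} ended with status {run['status']}")
    return request("GET", f"{API}/actor-runs/{run['id']}/dataset/items", token)
```

The `request` and `sleep` parameters exist so the loop can be exercised without network access; in normal use, call `run_async(run_input, "YOUR_TOKEN")` and iterate over the returned records.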

### Pricing

Pay-per-event, billed only on success, and **metered to page size** so you pay for what each page actually costs to process. A successfully extracted URL carries two charges, both fired after that URL's record is pushed:

- a flat **per-URL base** — covers the page fetch, extraction setup, schema validation, and any per-page retry; and
- a **per-1,000-input-token** charge that scales with how much page content the model actually read (reported as `inputTokens` on each record).

Small pages cost less and large pages cost proportionally more, instead of every page paying one worst-case flat rate. `failed` and `deferred` URLs trigger neither charge, so a batch only ever costs you the URLs it actually extracted. Cap the variable part with `maxInputTokens`, and cap a whole run with `maxTotalChargeUsd`.
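
The two charges can be summed from the dataset records after a run. In the sketch below both rates are placeholders to be read from the Pricing tab; this document lists a base rate from $51.00 per 1,000 URLs but does not state the per-token rate. The sketch also assumes the token charge is pro-rated rather than rounded up to whole 1,000-token units.

```python
def estimate_cost(records, base_per_url, per_1k_input_tokens):
    """Estimate the charge for a batch from its dataset records.

    Only `completed` records count; `failed` and `deferred` are uncharged.
    Both rates are caller-supplied assumptions, not values from the Actor.
    """
    completed = [r for r in records if r["status"] == "completed"]
    base = len(completed) * base_per_url
    variable = sum(r["inputTokens"] for r in completed) / 1000 * per_1k_input_tokens
    return base + variable
```

For example, two completed pages totalling 5,000 input tokens at a base of $0.051 per URL and a hypothetical $0.01 per 1,000 tokens come to 2 × 0.051 + 5 × 0.01 = $0.152.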

See the **Pricing** tab on this Store page for the current per-event rates and any active subscriber discounts.

### Behavior

**Run-level failures (rare):** invalid input fails the run before any work — empty `urls`, more than 100 URLs, or `maxInputTokens`/`maxRuntimeSecs` out of range. Nothing is charged.

**Per-URL outcomes:**

- `completed` — extracted. Check `schemaValid` for conformance to your schema.
- `failed` — the page couldn't be fetched (defended beyond stealth, blank, login wall) or the model returned unparseable output; `error` says which.
- `deferred` — the run's `maxRuntimeSecs` budget was exhausted before this URL was processed. Retry it (uncharged).
- `schemaValid: false` — extraction completed but didn't validate even after a retry; the best-effort result is still returned.

**Performance expectations:**

- One fetch pass covers the whole batch; a stealth-browser/proxy launch is shared across URLs rather than repeated.
- Cooperative pages: a few seconds each, fetched concurrently. Bot-defended pages add stealth latency.
- Extraction runs per URL; total time scales with batch size. At default `maxRuntimeSecs` (270s) a stealth batch of roughly 20 URLs completes; larger batches return the tail as `deferred`.
- Larger `maxInputTokens` increases per-URL latency and cost on big pages.
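
For synchronous callers, one way to avoid a `deferred` tail is to split a long URL list into smaller runs. The sketch below uses a default chunk size of 20, taken from the rough stealth-batch figure above, and rejects sizes above the 100-URL input limit; `chunk_urls` is a hypothetical helper, and the right size depends on how defended the target pages are.

```python
def chunk_urls(urls, size=20):
    """Split a URL list into consecutive batches of at most `size` URLs."""
    if not 1 <= size <= 100:
        raise ValueError("size must be between 1 and 100 (the per-run URL limit)")
    return [urls[i:i + size] for i in range(0, len(urls), size)]
```

Each chunk can then be submitted as its own run with the same `outputSchema`.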

### FAQ

**Can the URLs have different output shapes?**
No — one `outputSchema` applies to the whole batch. Pass pages of the same kind (all products, all articles). For a different shape, use a separate run.

**Am I charged for URLs that fail or get deferred?**
No. Charges fire only for `completed` URLs. `failed` and `deferred` URLs are free, so a batch costs only what it actually extracted.

**What's `deferred` and how do I handle it?**
The run hit its `maxRuntimeSecs` budget before reaching that URL. It's uncharged and retry-friendly — re-submit the deferred URLs, or raise `maxRuntimeSecs` (and use the async API) for bigger batches in one run.
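
Re-submitting deferred URLs can be automated with a small loop. In this sketch `run_batch` is a hypothetical callable that takes a list of URLs and returns the dataset records for one run (for example, a thin wrapper around the Python client); `max_rounds` is an arbitrary safety limit.

```python
def extract_with_retries(urls, run_batch, max_rounds=3):
    """Run a batch, then re-run only the URLs that came back `deferred`."""
    final = {}
    pending = list(urls)
    for _ in range(max_rounds):
        if not pending:
            break
        records = run_batch(pending)
        for rec in records:
            final[rec["url"]] = rec  # a later round overwrites a deferred record
        pending = [r["url"] for r in records if r["status"] == "deferred"]
    # Return records in the original input order.
    return [final[u] for u in urls if u in final]
```

URLs still deferred after `max_rounds` keep their `deferred` record, so the caller can decide whether to raise `maxRuntimeSecs` or switch to the async API.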

**How do I keep cost down on very large pages?**
Lower `maxInputTokens`. It caps how much of each page reaches the model — the dominant cost — and each content component is capped independently so no single field can blow the budget.

**What does `schemaValid: false` mean?**
The model couldn't produce output conforming to your schema for that URL even after a retry. The result is still returned for inspection; simplify the schema or mark uncertain fields nullable.

### What this doesn't do

- **No authentication.** Public, unauthenticated pages only. It won't log in or submit credentials.
- **No per-URL schemas.** One shared `outputSchema` per run — pages should be the same kind.
- **No page interaction.** It doesn't click, fill forms, or navigate multi-step flows before extracting.
- **No crawling.** It extracts from the URLs you pass; it won't discover or follow links.
- **No CAPTCHA solving / no file parsing.** Interactive-CAPTCHA pages return `failed`; it extracts from web pages, not uploaded PDFs/images.

For raw page content (HTML or markdown) without an extraction step, use a page-fetching tool instead — this Actor adds an LLM extraction cost you don't need if you only want content. For clicking, form-filling, or authenticated sessions before extraction, use a browser-automation tool. For discovering links to extract, run a crawler first and pass its URLs here as a batch.

***

Design notes: [www.scotthelvick.com/tools/structured-extractor](https://www.scotthelvick.com/tools/structured-extractor/)

# Actor input Schema

## `urls` (type: `array`):

The web pages to extract structured data from. One shared output schema is applied to every URL, so pass pages of the same kind (all products, all articles, etc.). All URLs are fetched in one batch, so a stealth-browser/proxy launch is amortized across them instead of paid per page. Defended pages escalate to stealth + residential proxy automatically. Public, unauthenticated pages only. 1-100 URLs per run.

## `outputSchema` (type: `object`):

JSON Schema describing the output shape you want, applied to EVERY URL in the batch. The LLM is constrained to return data matching it; each result is validated and a per-URL `schemaValid` flag reports conformance. When omitted, the Actor returns best-effort JSON inferred from each page. Mark fields you expect but that may be missing as nullable so the LLM fills null rather than hallucinating.
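
In standard JSON Schema, a nullable field is usually written as a type array that includes `"null"`. The sketch below builds such a schema as a Python dictionary; the `nullable` helper is hypothetical, and whether the Actor accepts type arrays in exactly this form is an assumption not confirmed by this document.

```python
def nullable(type_name):
    """Return a property spec that accepts the given type or null."""
    return {"type": [type_name, "null"]}


output_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": nullable("string"),      # may be absent from some pages
        "in_stock": nullable("boolean"),  # may be absent from some pages
    },
    # Keeping nullable fields required asks for an explicit null
    # instead of a silently missing key.
    "required": ["title", "price", "in_stock"],
}
```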

## `maxInputTokens` (type: `integer`):

Upper bound on fetched page content fed to the LLM, per URL, in tokens. Bounds per-URL cost and prevents large pages from overflowing the context window. The fetched bundle (structured metadata plus a markdown rendering) is truncated to fit; each component is capped independently so no single field can consume the whole budget.

## `maxRuntimeSecs` (type: `integer`):

Soft wall-clock budget for the whole run, in seconds (minimum 60). The default 270 keeps synchronous callers under the 300s limit of the sync API and x402. URLs not reached before the deadline are returned with status 'deferred' (never charged), so you can retry them. Asynchronous callers can raise this for larger batches.

## `proxyGeo` (type: `string`):

Optional ISO 3166-1 alpha-2 country code (e.g. US, DE, GB) for residential proxy routing during the fetch. Use when target pages geo-restrict content or block datacenter IPs. Leave empty for default routing; stealth escalation still applies automatically.

## Actor input object example

```json
{
  "urls": [
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html"
  ],
  "outputSchema": {
    "type": "object",
    "properties": {
      "title": {
        "type": "string"
      },
      "price": {
        "type": "string"
      },
      "in_stock": {
        "type": "boolean"
      }
    },
    "required": [
      "title",
      "price"
    ]
  },
  "maxInputTokens": 32000,
  "maxRuntimeSecs": 270
}
```

# Actor output Schema

## `results` (type: `string`):

Dataset items for this run (one extraction record per input URL).

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "urls": [
        "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
        "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html"
    ],
    "outputSchema": {
        "type": "object",
        "properties": {
            "title": {
                "type": "string"
            },
            "price": {
                "type": "string"
            },
            "in_stock": {
                "type": "boolean"
            }
        },
        "required": [
            "title",
            "price"
        ]
    },
    "maxInputTokens": 32000,
    "maxRuntimeSecs": 270,
    "proxyGeo": ""
};

// Run the Actor and wait for it to finish
const run = await client.actor("shelvick/structured-extractor").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "urls": [
        "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
        "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
    ],
    "outputSchema": {
        "type": "object",
        "properties": {
            "title": { "type": "string" },
            "price": { "type": "string" },
            "in_stock": { "type": "boolean" },
        },
        "required": [
            "title",
            "price",
        ],
    },
    "maxInputTokens": 32000,
    "maxRuntimeSecs": 270,
    "proxyGeo": "",
}

# Run the Actor and wait for it to finish
run = client.actor("shelvick/structured-extractor").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "urls": [
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html"
  ],
  "outputSchema": {
    "type": "object",
    "properties": {
      "title": {
        "type": "string"
      },
      "price": {
        "type": "string"
      },
      "in_stock": {
        "type": "boolean"
      }
    },
    "required": [
      "title",
      "price"
    ]
  },
  "maxInputTokens": 32000,
  "maxRuntimeSecs": 270,
  "proxyGeo": ""
}' |
apify call shelvick/structured-extractor --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=shelvick/structured-extractor",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Structured Data Extractor — URL to JSON",
        "description": "Extract structured data from a batch of URLs as schema-validated JSON. Send web pages and a JSON Schema; it scrapes each (stealth + residential proxy as needed), runs an LLM to convert the page to JSON matching your schema, and validates per URL. Omit schema for best-effort. Public pages only.",
        "version": "0.0",
        "x-build-id": "eFps49sRC0EM0Ux8V"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/shelvick~structured-extractor/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-shelvick-structured-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/shelvick~structured-extractor/runs": {
            "post": {
                "operationId": "runs-sync-shelvick-structured-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/shelvick~structured-extractor/run-sync": {
            "post": {
                "operationId": "run-sync-shelvick-structured-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "urls"
                ],
                "properties": {
                    "urls": {
                        "title": "URLs",
                        "type": "array",
                        "description": "The web pages to extract structured data from. One shared output schema is applied to every URL, so pass pages of the same kind (all products, all articles, etc.). All URLs are fetched in one batch, so a stealth-browser/proxy launch is amortized across them instead of paid per page. Defended pages escalate to stealth + residential proxy automatically. Public, unauthenticated pages only. 1-100 URLs per run.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "outputSchema": {
                        "title": "Output JSON Schema",
                        "type": "object",
                        "description": "JSON Schema describing the output shape you want, applied to EVERY URL in the batch. The LLM is constrained to return data matching it; each result is validated and a per-URL `schemaValid` flag reports conformance. When omitted, the Actor returns best-effort JSON inferred from each page. Mark fields you expect but that may be missing as nullable so the LLM fills null rather than hallucinating."
                    },
                    "maxInputTokens": {
                        "title": "Max input tokens (per URL)",
                        "minimum": 2000,
                        "maximum": 200000,
                        "type": "integer",
                        "description": "Upper bound on fetched page content fed to the LLM, per URL, in tokens. Bounds per-URL cost and prevents large pages from overflowing the context window. The fetched bundle (structured metadata plus a markdown rendering) is truncated to fit; each component is capped independently so no single field can consume the whole budget.",
                        "default": 32000
                    },
                    "maxRuntimeSecs": {
                        "title": "Max runtime (seconds)",
                        "minimum": 60,
                        "maximum": 600,
                        "type": "integer",
                        "description": "Soft wall-clock budget for the whole run, in seconds (minimum 60). The default 270 keeps synchronous callers under the 300s limit of the sync API and x402. URLs not reached before the deadline are returned with status 'deferred' (never charged), so you can retry them. Asynchronous callers can raise this for larger batches.",
                        "default": 270
                    },
                    "proxyGeo": {
                        "title": "Proxy country",
                        "type": "string",
                        "description": "Optional ISO 3166-1 alpha-2 country code (e.g. US, DE, GB) for residential proxy routing during the fetch. Use when target pages geo-restrict content or block datacenter IPs. Leave empty for default routing; stealth escalation still applies automatically."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
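
The optional input fields above carry explicit bounds, and the schema description recommends marking possibly-missing fields as nullable. The following is a minimal Python sketch of both ideas, not the Actor's own validation logic. The helper names `check_options` and `make_nullable` are hypothetical. The uppercase check on `proxyGeo` is an assumption based on the examples (US, DE, GB), and expressing nullability as a `"null"` entry in the JSON Schema `type` list is standard JSON Schema that this Actor is only assumed to accept.

```python
from __future__ import annotations

import re
from typing import Any

# Bounds copied from the input schema above.
MAX_INPUT_TOKENS_RANGE = (2000, 200000)
MAX_RUNTIME_SECS_RANGE = (60, 600)


def check_options(options: dict[str, Any]) -> list[str]:
    """Return a list of problems found in the optional input fields."""
    problems: list[str] = []
    for name, (low, high) in (
        ("maxInputTokens", MAX_INPUT_TOKENS_RANGE),
        ("maxRuntimeSecs", MAX_RUNTIME_SECS_RANGE),
    ):
        if name not in options:
            continue  # omitted fields fall back to the Actor's defaults
        value = options[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"{name} must be an integer")
        elif not low <= value <= high:
            problems.append(f"{name} must be between {low} and {high}")
    geo = options.get("proxyGeo")
    if geo:  # an empty or missing value means default routing
        # Assumption: codes are written in uppercase, as in the examples.
        if not isinstance(geo, str) or not re.fullmatch(r"[A-Z]{2}", geo):
            problems.append("proxyGeo must be a two-letter uppercase country code")
    return problems


def make_nullable(prop: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a JSON Schema property that also accepts null."""
    types = prop.get("type")
    if types is None:
        return dict(prop)  # no type constraint, so null is already allowed
    type_list = [types] if isinstance(types, str) else list(types)
    if "null" not in type_list:
        type_list.append("null")
    return {**prop, "type": type_list}
```

Running `check_options` before starting a run catches out-of-range values locally instead of failing on input validation, and `make_nullable` can be applied to each property that a page may legitimately lack.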

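A run response shaped like `runsResponseSchema` can be reduced to the few fields a caller usually needs. The sketch below is illustrative and works on a plain dictionary, so it makes no API calls. The helper names `parse_timestamp` and `summarize_run` and the keys of the returned summary are hypothetical, and the set of terminal statuses reflects the Apify platform's run lifecycle rather than anything stated in the schema itself.

```python
from __future__ import annotations

from datetime import datetime
from typing import Any, Optional

# Run statuses after which the run no longer changes.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "TIMED-OUT", "ABORTED"}


def parse_timestamp(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp such as 2025-01-08T00:00:00.000Z."""
    if not value:
        return None
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize_run(response: dict[str, Any]) -> dict[str, Any]:
    """Pick the commonly needed fields out of a run response."""
    run = response.get("data", {})
    started = parse_timestamp(run.get("startedAt"))
    finished = parse_timestamp(run.get("finishedAt"))
    # Duration is only known once both timestamps are present.
    duration = (finished - started).total_seconds() if started and finished else None
    return {
        "runId": run.get("id"),
        "status": run.get("status"),
        "finished": run.get("status") in TERMINAL_STATUSES,
        "datasetId": run.get("defaultDatasetId"),
        "durationSecs": duration,
        "usageTotalUsd": run.get("usageTotalUsd"),
    }
```

The `datasetId` value is the identifier of the dataset that holds the per-URL results, so a caller would typically wait until `finished` is true before reading from it.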