# Wayback Machine CDX Bulk Extractor (`automation-lab/wayback-machine-cdx-extractor`) Actor

Bulk extract archived snapshot metadata from the Wayback Machine CDX API. Get every crawled URL, timestamp, HTTP status code, MIME type, and content digest for any domain or URL pattern. Export to JSON, CSV, or Excel.

- **URL**: https://apify.com/automation-lab/wayback-machine-cdx-extractor.md
- **Developed by:** [Stas Persiianenko](https://apify.com/automation-lab) (community)
- **Categories:** SEO tools, Developer tools
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

Pay per event

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.
Because this Actor supports Apify Store discounts, the price decreases as your subscription plan tier increases.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Wayback Machine CDX Bulk Extractor

### What does it do?

**Wayback Machine CDX Bulk Extractor** uses the Internet Archive's CDX (Capture Index) API to extract complete snapshot metadata for any domain, URL, or wildcard pattern. For every archived page the Wayback Machine has ever crawled, you get the timestamp, HTTP status code, MIME type, content digest, file size, and a direct replay link — all exported to a structured dataset in seconds.

Unlike manually browsing https://web.archive.org/, this Actor programmatically paginates through millions of CDX records, applies server-side and client-side filters, and exports clean, structured data at scale.

---

### Who is it for?

👩‍💻 **SEO professionals & digital marketers** — audit historical URL structures, find old redirects, identify pages that returned 404 errors over time, and recover lost link equity.

🔍 **Web archivists & researchers** — build comprehensive inventories of how a website evolved, what pages existed at what timestamps, and which content versions were captured.

🛡️ **Security analysts** — discover exposed endpoints, track historical subdomain activity, or detect when sensitive paths were briefly indexed.

📊 **Data journalists & OSINT investigators** — reconstruct a site's history, verify when specific pages first appeared, or find crawl evidence of content that has since been removed.

🧑‍💻 **Developers & QA engineers** — validate archive coverage for migration projects, check historical status code patterns, or build link-checking tools with historical context.

---

### Why use it?

- ✅ **No scraping restrictions** — the CDX API is public, free, and built for bulk access
- ✅ **Handles millions of records** — automatic pagination via resumeKey with no manual intervention
- ✅ **Flexible filtering** — narrow by date range, HTTP status codes, MIME types, or collapse duplicates by URL/content
- ✅ **Zero proxy cost** — Internet Archive's CDX API requires no proxies, so every run is extremely cheap
- ✅ **Full Wayback Machine replay URLs** — each record includes a direct link to view the archived snapshot
- ✅ **Domain-wide coverage** — a single input query can retrieve snapshots for all subdomains

---

### What data does it extract?

Each snapshot record in the output dataset contains:

| Field | Type | Description |
|-------|------|-------------|
| `urlKey` | string | Canonical URL key in SURT format (e.g., `com,example)/path`) |
| `timestamp` | string | Capture timestamp in YYYYMMDDHHmmss format |
| `originalUrl` | string | The original URL as crawled |
| `mimeType` | string | MIME type of the captured content (e.g., `text/html`, `application/pdf`) |
| `statusCode` | string | HTTP status code at capture time (e.g., `200`, `301`, `404`) |
| `digest` | string | Base32-encoded SHA-1 content digest, useful for deduplication |
| `length` | number | Compressed size of the stored WARC record in bytes |
| `waybackUrl` | string | Full Wayback Machine replay URL (when enabled) |

---

### How much does it cost to extract Wayback Machine snapshots?

This Actor uses **pay-per-event (PPE) pricing**: you only pay for the snapshots you extract.

- **Start event**: $0.005 (one-time per run)
- **Per snapshot**: $0.000046 (FREE tier) — effectively **$0.046 per 1,000 snapshots**

**Example costs:**
| Snapshot count | Estimated cost |
|----------------|---------------|
| 1,000 | ~$0.051 |
| 10,000 | ~$0.465 |
| 100,000 | ~$4.605 |
| 1,000,000 | ~$46.005 |

Because the CDX API is public and no proxy is used, this Actor has near-zero infrastructure cost. Most of what you pay goes directly toward supporting the service. Higher subscription tiers (BRONZE through DIAMOND) get significant per-snapshot discounts.
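
The example costs follow directly from the two event prices. A minimal sketch of the arithmetic, assuming FREE-tier pricing and that each extracted snapshot is billed exactly once (the name `estimate_cost` is illustrative and not part of the Actor):

```python
START_EVENT_USD = 0.005       # one-time charge per run, from the price list
PER_SNAPSHOT_USD = 0.000046   # FREE-tier price per extracted snapshot


def estimate_cost(snapshot_count: int) -> float:
    """Return the estimated run cost in USD for a number of snapshots."""
    if snapshot_count < 0:
        raise ValueError("snapshot_count must be non-negative")
    return round(START_EVENT_USD + snapshot_count * PER_SNAPSHOT_USD, 6)


for count in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{count:>9,} snapshots -> ~${estimate_cost(count):.3f}")
```

Paid tiers use lower per-snapshot prices that are not listed here, so the constant would need adjusting for them.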

---

### How to use it

#### Step 1 — Enter your target URL

Type a domain, exact URL, or wildcard pattern in the **URL or domain** field:

- `example.com` — all pages on example.com (use with `domain` matchType)
- `https://example.com/blog/` — all blog pages (use with `prefix` matchType)
- `*.example.com/*` — all subdomains (use with `domain` matchType)

#### Step 2 — Choose a match type

| Match Type | What it returns |
|------------|----------------|
| `exact` | Only the exact URL specified |
| `prefix` | All URLs that start with the given URL |
| `host` | All URLs on the same hostname |
| `domain` | All URLs on the host AND all of its subdomains |

#### Step 3 — Set limits and filters (optional)

- **Max snapshots** — cap the total records extracted (0 = unlimited)
- **From/To date** — narrow to a specific time window (YYYYMMDD format)
- **Filter status codes** — include only specific HTTP codes (e.g., `[200, 301]`)
- **Exclude status codes** — remove specific HTTP codes (e.g., `[404, 500]`)
- **Filter MIME types** — include only specific content types (e.g., `["text/html"]`)
- **Collapse** — deduplicate by URL key, content digest, year, month, or day

#### Step 4 — Run and export

Click **Run** and wait for extraction to complete. Export results as JSON, CSV, XML, or Excel directly from the dataset tab.

---

### Input schema

```json
{
    "url": "example.com",
    "matchType": "domain",
    "maxSnapshots": 1000,
    "fromDate": "20200101",
    "toDate": "20231231",
    "filterStatusCodes": [200],
    "excludeStatusCodes": [404],
    "filterMimeTypes": ["text/html"],
    "pageSize": 10000,
    "collapse": "urlkey",
    "outputWaybackUrl": true
}
````

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `url` | string | required | URL, domain, or wildcard to query |
| `matchType` | enum | `domain` | URL matching strategy |
| `maxSnapshots` | integer | 1000 | Max records (0 = unlimited) |
| `fromDate` | string | — | Start date YYYYMMDD |
| `toDate` | string | — | End date YYYYMMDD |
| `filterStatusCodes` | array | `[]` | Include only these HTTP codes |
| `excludeStatusCodes` | array | `[]` | Exclude these HTTP codes |
| `filterMimeTypes` | array | `[]` | Include only these MIME types |
| `pageSize` | integer | 10000 | Records per CDX API page |
| `collapse` | enum | — | Deduplication strategy |
| `outputWaybackUrl` | boolean | `true` | Include Wayback Machine replay URL |

***

### Output

Sample output record:

```json
{
    "urlKey": "com,example)/",
    "timestamp": "20230415120000",
    "originalUrl": "https://example.com/",
    "mimeType": "text/html",
    "statusCode": "200",
    "digest": "JI6OR3QR4CI526JD6TMMNZNV4QPMPQCH",
    "length": 1248,
    "waybackUrl": "https://web.archive.org/web/20230415120000/https://example.com/"
}
```
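
A record like this can be post-processed with the standard library alone. The sketch below parses the `timestamp` field and rebuilds the replay link from the `https://web.archive.org/web/{timestamp}/{url}` template given in the input schema. The helper names are illustrative, and treating the timestamp as UTC is an assumption.

```python
from datetime import datetime, timezone

record = {
    "timestamp": "20230415120000",
    "originalUrl": "https://example.com/",
    "statusCode": "200",
    "length": 1248,
}


def capture_time(rec: dict) -> datetime:
    """Parse the 14-digit capture timestamp (assumed to be UTC)."""
    parsed = datetime.strptime(rec["timestamp"], "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


def replay_url(rec: dict) -> str:
    """Rebuild the replay link from the {timestamp}/{url} template."""
    return f"https://web.archive.org/web/{rec['timestamp']}/{rec['originalUrl']}"


print(capture_time(record).isoformat())
print(replay_url(record))
```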

***

### Tips & tricks

💡 **Use `collapse=urlkey` to find all unique URLs** — this returns only the first capture per URL, giving you a clean list of unique pages the Wayback Machine ever visited.

💡 **Use `collapse=digest` to find unique content versions** — skip duplicate captures that archived the same byte-identical content.

💡 **Use `matchType=domain` for full subdomain coverage** — this is the broadest option and will include `www.example.com`, `blog.example.com`, etc.

💡 **Use date filters for historical analysis** — narrow to a specific year to audit what a site looked like in that period.

💡 **Use `filterStatusCodes: [200]` for successful captures only** — remove redirects, errors, and crawl artefacts to focus on pages that were archived with a 200 response.

💡 **CDX API notes:**

- The CDX API returns the `warc/revisit` MIME type for revisit records, where the crawler found content identical to an earlier capture and stored only a reference to it. Use `filterMimeTypes: ["text/html"]` to exclude these.
- Status code `-` in CDX output marks such a revisit record (no HTTP status was stored for it).
- Large domains (e.g., major news sites) can have tens of millions of snapshots — set a reasonable `maxSnapshots` to avoid very long runs.
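
If a dataset was exported without these filters, revisit records can also be dropped afterwards. A small sketch, assuming records shaped like the sample output (`is_revisit` is an illustrative helper, not an Actor feature):

```python
records = [
    {"originalUrl": "https://example.com/", "mimeType": "text/html", "statusCode": "200"},
    {"originalUrl": "https://example.com/", "mimeType": "warc/revisit", "statusCode": "-"},
    {"originalUrl": "https://example.com/a.pdf", "mimeType": "application/pdf", "statusCode": "200"},
]


def is_revisit(rec: dict) -> bool:
    """True when the record only references an earlier, identical capture."""
    return rec.get("mimeType") == "warc/revisit" or rec.get("statusCode") == "-"


full_captures = [rec for rec in records if not is_revisit(rec)]
print(len(full_captures), "of", len(records), "records are full captures")
```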

***

### Integrations

#### Export to Google Sheets

After the run, click **Export → Google Sheets** in the dataset view. Use the data to build URL history timelines, pivot tables by status code over time, or visualize crawl density by year.
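
The same pivot can be prototyped locally before building a spreadsheet. A minimal sketch that counts captures per year and status code, assuming records with the `timestamp` and `statusCode` fields from the output section (the function name is illustrative):

```python
from collections import Counter

records = [
    {"timestamp": "20200101000000", "statusCode": "200"},
    {"timestamp": "20200615000000", "statusCode": "200"},
    {"timestamp": "20200701000000", "statusCode": "404"},
    {"timestamp": "20210101000000", "statusCode": "200"},
]


def captures_by_year_and_status(recs: list) -> Counter:
    """Count captures keyed by (year, status code)."""
    return Counter((rec["timestamp"][:4], rec["statusCode"]) for rec in recs)


for (year, status), count in sorted(captures_by_year_and_status(records).items()):
    print(year, status, count)
```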

#### Combine with SEO tools

Export the snapshot list as CSV and import into Screaming Frog or Ahrefs to cross-reference current URL statuses against historical captures — a powerful way to identify redirect chains and 404 link equity leaks.

#### Archive monitoring workflow

Schedule this Actor to run weekly on a specific domain and compare new captures to a previous run. Any new `statusCode=404` entries indicate recently broken pages. Connect to Google Sheets or a webhook to get automated alerts.
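
The comparison step could look like the sketch below. It assumes both runs were exported as lists of records and identifies a capture by its `urlKey` and `timestamp` pair; `new_404s` is an illustrative helper, and how the two datasets are loaded is left out.

```python
def new_404s(previous: list, current: list) -> list:
    """Return 404 captures in `current` that were absent from `previous`."""
    seen = {(rec["urlKey"], rec["timestamp"]) for rec in previous}
    return [
        rec
        for rec in current
        if rec["statusCode"] == "404" and (rec["urlKey"], rec["timestamp"]) not in seen
    ]


last_week = [
    {"urlKey": "com,example)/", "timestamp": "20240101000000", "statusCode": "200"},
    {"urlKey": "com,example)/old", "timestamp": "20240102000000", "statusCode": "404"},
]
this_week = last_week + [
    {"urlKey": "com,example)/pricing", "timestamp": "20240108000000", "statusCode": "404"},
    {"urlKey": "com,example)/", "timestamp": "20240108000000", "statusCode": "200"},
]

for rec in new_404s(last_week, this_week):
    print("newly broken:", rec["urlKey"], rec["timestamp"])
```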

#### Automated redirects audit

Run with `filterStatusCodes: [301, 302]` and export the list of all historical redirects on a domain. Cross-reference with your current redirect rules to find redirect chains or outdated configurations.

***

### API usage

#### Node.js

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });

const run = await client.actor('automation-lab/wayback-machine-cdx-extractor').call({
    url: 'example.com',
    matchType: 'domain',
    maxSnapshots: 5000,
    filterStatusCodes: [200],
    collapse: 'urlkey',
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Extracted ${items.length} unique URLs`);
```

#### Python

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_TOKEN")

run = client.actor("automation-lab/wayback-machine-cdx-extractor").call(run_input={
    "url": "example.com",
    "matchType": "domain",
    "maxSnapshots": 5000,
    "filterStatusCodes": [200],
    "collapse": "urlkey",
})

items = client.dataset(run["defaultDatasetId"]).list_items().items
print(f"Extracted {len(items)} unique URLs")
```

#### cURL

```bash
curl -X POST 'https://api.apify.com/v2/acts/automation-lab~wayback-machine-cdx-extractor/runs' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_TOKEN' \
  -d '{
    "url": "example.com",
    "matchType": "domain",
    "maxSnapshots": 1000
  }'
```

***

### Use with AI agents via MCP

Wayback Machine CDX Extractor is available as a tool for AI assistants that support the [Model Context Protocol (MCP)](https://docs.apify.com/platform/integrations/mcp).

#### Setup for Claude Code

```bash
claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/wayback-machine-cdx-extractor"
```

#### Setup for Claude Desktop, Cursor, or VS Code

Add this to your MCP config file:

```json
{
    "mcpServers": {
        "apify": {
            "url": "https://mcp.apify.com?tools=automation-lab/wayback-machine-cdx-extractor"
        }
    }
}
```

Your AI assistant will use OAuth to authenticate with your Apify account on first use.

#### Example prompts

Once connected, try asking your AI assistant:

- "Extract all snapshots of example.com from 2020 with status code 200"
- "Get the unique URLs archived for blog.example.com using collapse by urlkey"
- "Find all 404 pages archived for example.com in the last 5 years"

Learn more in the [Apify MCP documentation](https://docs.apify.com/platform/integrations/mcp).

***

### Is it legal to use the Wayback Machine CDX API?

Yes. The Internet Archive CDX API is a **public API** explicitly provided by the Internet Archive for programmatic access to its index data. It is documented and freely accessible at https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server.

This Actor does not bypass any authentication or rate limiting mechanisms. It accesses only the public CDX search endpoint, which is designed for exactly this type of bulk query. The Internet Archive actively encourages researchers, archivists, and developers to use their APIs.
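
For orientation, a query against the public CDX search endpoint is an ordinary GET request with query parameters. The sketch below only builds such a URL and sends nothing. The parameter names (`url`, `matchType`, `from`, `to`, `filter`, `collapse`, `limit`, `showResumeKey`, `output`) come from the CDX server documentation linked above; how this Actor maps its own input onto them is not documented here, so `build_cdx_url` is a hypothetical illustration and MIME-type filters are omitted.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, match_type="domain", from_date=None, to_date=None,
                  include_status=(), exclude_status=(), collapse=None, limit=None):
    """Build a CDX search URL; repeated `filter` parameters are combined with AND."""
    params = [("url", url), ("matchType", match_type),
              ("output", "json"), ("showResumeKey", "true")]
    if from_date:
        params.append(("from", from_date))
    if to_date:
        params.append(("to", to_date))
    if include_status:
        codes = "|".join(str(code) for code in include_status)
        params.append(("filter", f"statuscode:({codes})"))
    if exclude_status:
        codes = "|".join(str(code) for code in exclude_status)
        params.append(("filter", f"!statuscode:({codes})"))  # "!" negates the match
    if collapse:
        params.append(("collapse", collapse))
    if limit:
        params.append(("limit", str(limit)))
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


query_url = build_cdx_url("example.com", from_date="20200101",
                          include_status=[200, 301], collapse="urlkey", limit=1000)
print(query_url)
print(parse_qs(urlsplit(query_url).query)["filter"])
```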

***

### FAQ

#### How many snapshots can I extract?

There is no hard limit — set `maxSnapshots` to `0` for unlimited extraction. Major domains like news sites or social networks may have tens of millions of snapshots. The Actor paginates automatically using the CDX resumeKey mechanism, so large extractions need no manual intervention.
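
Resume-key pagination follows a simple loop: request a page, emit its rows, then repeat with the key the server handed back until no key is returned. The sketch below shows that control flow with an injected `fetch_page` callable and in-memory fake pages. It is not the Actor's actual code, and the names `iterate_snapshots` and `fake_fetch` are hypothetical.

```python
from typing import Callable, Iterator, List, Optional, Tuple

Page = Tuple[List[dict], Optional[str]]  # (rows, next resume key or None)


def iterate_snapshots(fetch_page: Callable[[Optional[str]], Page],
                      max_snapshots: int = 0) -> Iterator[dict]:
    """Yield rows page by page; stop at the cap (0 = unlimited) or the last page."""
    resume_key = None
    yielded = 0
    while True:
        rows, resume_key = fetch_page(resume_key)
        for row in rows:
            yield row
            yielded += 1
            if max_snapshots and yielded >= max_snapshots:
                return
        if resume_key is None:
            return


# Fake three-page result set standing in for real CDX responses.
_PAGES = {
    None: ([{"n": 1}, {"n": 2}], "k1"),
    "k1": ([{"n": 3}, {"n": 4}], "k2"),
    "k2": ([{"n": 5}], None),
}


def fake_fetch(resume_key: Optional[str]) -> Page:
    return _PAGES[resume_key]


print([row["n"] for row in iterate_snapshots(fake_fetch)])
print([row["n"] for row in iterate_snapshots(fake_fetch, max_snapshots=3)])
```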

#### Why do some records show status code `-`?

The CDX API uses `-` for "revisit" records, where the Wayback Machine re-crawled a page but only stored a reference to a previous capture (because the content was identical). These are real crawl events but don't have a traditional HTTP response code. Filter them out with `excludeStatusCodes: [-1]` or use `filterStatusCodes: [200]` to get only real successful captures.

#### Why are some MIME types `warc/revisit`?

Same as above — revisit records use `warc/revisit` as their MIME type. Use `filterMimeTypes: ["text/html"]` to exclude these if you only want full content captures.

#### The API returned 503 errors during my run. What happened?

The Internet Archive CDX API occasionally returns 503 errors under load. This Actor automatically retries up to 3 times with exponential backoff before failing. If you consistently get 503s, try reducing `pageSize` from 10000 to 1000.
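
Retrying with exponential backoff means doubling the wait after each failed attempt. A generic sketch of the pattern, assuming a 1-second base delay (the Actor's real delays are not documented here; `ServiceUnavailable` and `with_retries` are illustrative names):

```python
import time


class ServiceUnavailable(Exception):
    """Stand-in for an HTTP 503 response."""


def with_retries(request, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call request(); on a 503, wait base_delay * 2**attempt and try again."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except ServiceUnavailable:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error
            sleep(base_delay * 2 ** attempt)


# Demo: fail twice, then succeed; record delays instead of sleeping.
calls = {"count": 0}
delays = []


def flaky_request():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise ServiceUnavailable("503")
    return "ok"


print(with_retries(flaky_request, sleep=delays.append), delays)
```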

#### How do I get all unique URLs (not every snapshot)?

Use `collapse: "urlkey"` — this returns only the first capture per unique URL, giving you a clean inventory of every URL the Wayback Machine ever crawled on your domain.

#### Can I use wildcard patterns?

Yes. Enter patterns like `*.example.com/*` as the URL with `matchType: "domain"` to match all subdomains and paths.

***

### Related Actors

- [Broken Link Checker](https://apify.com/automation-lab/broken-link-checker) — find broken links on live websites
- [Canonical URL Checker](https://apify.com/automation-lab/canonical-url-checker) — validate canonical tags and redirect chains
- [AAAA Record Checker](https://apify.com/automation-lab/aaaa-record-checker) — bulk DNS lookup for IPv6 records

# Actor input Schema

## `url` (type: `string`):

The URL, domain, or wildcard pattern to query. Examples: `example.com`, `https://example.com/blog/*`, `*.example.com/*`. For full domain coverage, use the `domain` matchType.

## `matchType` (type: `string`):

Controls URL matching: 'exact' matches only the exact URL, 'prefix' matches all URLs starting with the given URL, 'host' matches all URLs on the same host, 'domain' matches the host and all subdomains.
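
The four modes can be approximated locally to see which URLs each one would cover. This is a simplified illustration only: real matching happens on the CDX server against canonicalized URL keys, while the hypothetical `matches` helper below compares plain hostnames and paths and expects full URLs including a scheme.

```python
from urllib.parse import urlsplit


def matches(candidate: str, target: str, match_type: str) -> bool:
    """Approximate the documented match types using host and path only."""
    cand, targ = urlsplit(candidate), urlsplit(target)
    cand_host, targ_host = cand.hostname or "", targ.hostname or ""
    cand_path, targ_path = cand.path or "/", targ.path or "/"
    if match_type == "exact":
        return (cand_host, cand_path) == (targ_host, targ_path)
    if match_type == "prefix":
        return cand_host == targ_host and cand_path.startswith(targ_path)
    if match_type == "host":
        return cand_host == targ_host
    if match_type == "domain":
        return cand_host == targ_host or cand_host.endswith("." + targ_host)
    raise ValueError(f"unknown match type: {match_type}")


url = "https://blog.example.com/2023/post"
for mode in ("exact", "prefix", "host", "domain"):
    print(mode, matches(url, "https://example.com/", mode))
```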

## `maxSnapshots` (type: `integer`):

Maximum number of snapshots to extract. Set to 0 for unlimited (extract all available snapshots). Default: 1000.

## `fromDate` (type: `string`):

Start date for filtering snapshots (inclusive). Format: YYYYMMDD or YYYYMMDDHHMMSS. Leave empty for no start filter.

## `toDate` (type: `string`):

End date for filtering snapshots (inclusive). Format: YYYYMMDD or YYYYMMDDHHMMSS. Leave empty for no end filter.

## `filterStatusCodes` (type: `array`):

Only include snapshots with these HTTP status codes. Example: \[200, 301]. Leave empty to include all status codes.

## `excludeStatusCodes` (type: `array`):

Exclude snapshots with these HTTP status codes. Example: \[404, 500]. Applied after filterStatusCodes.

## `filterMimeTypes` (type: `array`):

Only include snapshots with these MIME types. Example: \['text/html', 'application/pdf']. Leave empty to include all MIME types.

## `pageSize` (type: `integer`):

Number of records per CDX API request. Higher values reduce API calls but may time out for dense domains. Default: 10000.

## `collapse` (type: `string`):

Collapse consecutive records with the same value for a given field. 'urlkey' deduplicates by unique URL, 'digest' deduplicates by identical content. Leave empty to get all snapshots.
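
Because collapsing applies to consecutive records, a value that reappears later in the results is kept again. The sketch below mimics that behaviour on a local list to make the difference visible. It is an illustration of the semantics, not the server's implementation, and the `collapse` helper is hypothetical.

```python
from itertools import groupby


def collapse(records: list, spec: str) -> list:
    """Keep the first record of each run of adjacent records with the same key.

    `spec` is a field name with an optional prefix length, e.g. "timestamp:4".
    """
    field, _, width = spec.partition(":")
    name = {"urlkey": "urlKey"}.get(field, field)  # CDX name -> dataset field name

    def key(rec: dict) -> str:
        value = rec[name]
        return value[: int(width)] if width else value

    return [next(group) for _, group in groupby(records, key=key)]


records = [
    {"urlKey": "com,example)/", "timestamp": "20200101000000", "digest": "AAA"},
    {"urlKey": "com,example)/", "timestamp": "20200601000000", "digest": "BBB"},
    {"urlKey": "com,example)/", "timestamp": "20210101000000", "digest": "AAA"},
]

print(len(collapse(records, "urlkey")))       # one URL overall
print(len(collapse(records, "timestamp:4")))  # one capture per year
print(len(collapse(records, "digest")))       # AAA reappears non-adjacently, so it is kept twice
```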

## `outputWaybackUrl` (type: `boolean`):

Add a waybackUrl field with the full Wayback Machine replay URL for each snapshot (https://web.archive.org/web/{timestamp}/{url}).

## Actor input object example

```json
{
  "url": "example.com",
  "matchType": "domain",
  "maxSnapshots": 1000,
  "filterStatusCodes": [],
  "excludeStatusCodes": [],
  "filterMimeTypes": [],
  "pageSize": 10000,
  "collapse": "",
  "outputWaybackUrl": true
}
```
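
Before starting a run, an input object can be checked against the constraints published in the OpenAPI specification below (required `url`, the date pattern, the `pageSize` range, and the enum values). A rough sketch of such a pre-flight check; `validate_input` is a hypothetical helper and the platform performs its own validation regardless.

```python
import re

DATE_PATTERN = re.compile(r"(\d{8}(\d{6})?)?")  # YYYYMMDD, YYYYMMDDHHMMSS, or empty
MATCH_TYPES = {"exact", "prefix", "host", "domain"}
COLLAPSE_VALUES = {"", "urlkey", "digest", "timestamp:4", "timestamp:6", "timestamp:8"}


def validate_input(run_input: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    if not run_input.get("url"):
        errors.append("url is required")
    if run_input.get("matchType", "domain") not in MATCH_TYPES:
        errors.append("matchType must be one of exact, prefix, host, domain")
    if run_input.get("maxSnapshots", 1000) < 0:
        errors.append("maxSnapshots must be >= 0")
    for field in ("fromDate", "toDate"):
        if not DATE_PATTERN.fullmatch(run_input.get(field, "")):
            errors.append(f"{field} must be YYYYMMDD or YYYYMMDDHHMMSS")
    if not 100 <= run_input.get("pageSize", 10000) <= 150000:
        errors.append("pageSize must be between 100 and 150000")
    if run_input.get("collapse", "") not in COLLAPSE_VALUES:
        errors.append("collapse is not a supported value")
    return errors


print(validate_input({"url": "example.com", "matchType": "domain"}))
print(validate_input({"url": "example.com", "fromDate": "2020-01-01", "pageSize": 50}))
```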

# Actor output Schema

## `overview` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "url": "example.com",
    "maxSnapshots": 1000
};

// Run the Actor and wait for it to finish
const run = await client.actor("automation-lab/wayback-machine-cdx-extractor").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "url": "example.com",
    "maxSnapshots": 1000,
}

# Run the Actor and wait for it to finish
run = client.actor("automation-lab/wayback-machine-cdx-extractor").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "url": "example.com",
  "maxSnapshots": 1000
}' |
apify call automation-lab/wayback-machine-cdx-extractor --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=automation-lab/wayback-machine-cdx-extractor",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Wayback Machine CDX Bulk Extractor",
        "description": "Bulk extract archived snapshot metadata from the Wayback Machine CDX API. Get every crawled URL, timestamp, HTTP status code, MIME type, and content digest for any domain or URL pattern. Export to JSON, CSV, or Excel.",
        "version": "0.1",
        "x-build-id": "WKXcwoq64otakx5Sd"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/automation-lab~wayback-machine-cdx-extractor/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-automation-lab-wayback-machine-cdx-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/automation-lab~wayback-machine-cdx-extractor/runs": {
            "post": {
                "operationId": "runs-sync-automation-lab-wayback-machine-cdx-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/automation-lab~wayback-machine-cdx-extractor/run-sync": {
            "post": {
                "operationId": "run-sync-automation-lab-wayback-machine-cdx-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "url"
                ],
                "properties": {
                    "url": {
                        "title": "URL or domain",
                        "type": "string",
                        "description": "The URL, domain, or wildcard pattern to query. Examples: 'example.com', 'https://example.com/blog/*', '*.example.com/*'. For full domain coverage use domain matchType."
                    },
                    "matchType": {
                        "title": "Match type",
                        "enum": [
                            "exact",
                            "prefix",
                            "host",
                            "domain"
                        ],
                        "type": "string",
                        "description": "Controls URL matching: 'exact' matches only the exact URL, 'prefix' matches all URLs starting with the given URL, 'host' matches all URLs on the same host, 'domain' matches the host and all subdomains.",
                        "default": "domain"
                    },
                    "maxSnapshots": {
                        "title": "Max snapshots",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Maximum number of snapshots to extract. Set to 0 for unlimited (extract all available snapshots). Default: 1000.",
                        "default": 1000
                    },
                    "fromDate": {
                        "title": "From date (YYYYMMDD)",
                        "pattern": "^(\\d{8}(\\d{6})?)?$",
                        "type": "string",
                        "description": "Start date for filtering snapshots (inclusive). Format: YYYYMMDD or YYYYMMDDHHMMSS. Leave empty for no start filter."
                    },
                    "toDate": {
                        "title": "To date (YYYYMMDD)",
                        "pattern": "^(\\d{8}(\\d{6})?)?$",
                        "type": "string",
                        "description": "End date for filtering snapshots (inclusive). Format: YYYYMMDD or YYYYMMDDHHMMSS. Leave empty for no end filter."
                    },
                    "filterStatusCodes": {
                        "title": "Filter by status codes",
                        "type": "array",
                        "description": "Only include snapshots with these HTTP status codes. Example: [200, 301]. Leave empty to include all status codes.",
                        "default": []
                    },
                    "excludeStatusCodes": {
                        "title": "Exclude status codes",
                        "type": "array",
                        "description": "Exclude snapshots with these HTTP status codes. Example: [404, 500]. Applied after filterStatusCodes.",
                        "default": []
                    },
                    "filterMimeTypes": {
                        "title": "Filter by MIME types",
                        "type": "array",
                        "description": "Only include snapshots with these MIME types. Example: ['text/html', 'application/pdf']. Leave empty to include all MIME types.",
                        "default": []
                    },
                    "pageSize": {
                        "title": "Page size",
                        "minimum": 100,
                        "maximum": 150000,
                        "type": "integer",
                        "description": "Number of records per CDX API request. Higher values reduce API calls but may time out for dense domains. Default: 10000.",
                        "default": 10000
                    },
                    "collapse": {
                        "title": "Collapse duplicates",
                        "enum": [
                            "",
                            "urlkey",
                            "digest",
                            "timestamp:4",
                            "timestamp:6",
                            "timestamp:8"
                        ],
                        "type": "string",
                        "description": "Collapse consecutive records with the same value for a given field. 'urlkey' deduplicates by unique URL, 'digest' deduplicates by identical content. Leave empty to get all snapshots.",
                        "default": ""
                    },
                    "outputWaybackUrl": {
                        "title": "Include Wayback Machine URL",
                        "type": "boolean",
                        "description": "Add a waybackUrl field with the full Wayback Machine replay URL for each snapshot (https://web.archive.org/web/{timestamp}/{url}).",
                        "default": true
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
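
The `collapse` and `outputWaybackUrl` options described in the input schema above can be illustrated with a short local sketch. It is not the Actor's implementation: the helper names (`collapse_key`, `collapse_records`, `wayback_url`) and the sample records are hypothetical, and the record field names (`urlkey`, `timestamp`, `url`, `digest`) are assumed to mirror the CDX fields. It only shows how collapsing adjacent records and building a replay URL from the documented pattern `https://web.archive.org/web/{timestamp}/{url}` might look.

```python
from typing import Iterable, Iterator


def collapse_key(record: dict, collapse: str) -> str:
    """Return the value adjacent records are compared on.

    'timestamp:4' means: compare only the first 4 characters of 'timestamp'.
    """
    field, _, length = collapse.partition(":")
    value = str(record[field])
    return value[: int(length)] if length else value


def collapse_records(records: Iterable[dict], collapse: str = "") -> Iterator[dict]:
    """Keep the first record of each run of adjacent records with an equal key."""
    if not collapse:
        # An empty setting keeps every snapshot.
        yield from records
        return
    previous = None
    for record in records:
        key = collapse_key(record, collapse)
        if key != previous:
            yield record
        previous = key


def wayback_url(timestamp: str, url: str) -> str:
    """Build a replay URL following the pattern given in the schema description."""
    return f"https://web.archive.org/web/{timestamp}/{url}"


# Hypothetical sample records, for illustration only.
snapshots = [
    {"urlkey": "com,example)/", "timestamp": "20230105120000",
     "url": "https://example.com/", "digest": "AAA"},
    {"urlkey": "com,example)/", "timestamp": "20230720080000",
     "url": "https://example.com/", "digest": "AAA"},
    {"urlkey": "com,example)/", "timestamp": "20240301000000",
     "url": "https://example.com/", "digest": "BBB"},
]

# One snapshot per year: the two 2023 records collapse into the first one.
yearly = list(collapse_records(snapshots, "timestamp:4"))
for snap in yearly:
    print(wayback_url(snap["timestamp"], snap["url"]))
```

Because only adjacent records are compared, the result depends on the order in which records arrive. Identical digests separated by a different snapshot would both be kept.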
