# Deduplicate, Merge & Transform Datasets (`datacach/deduplicate-datasets`) Actor

Merge multiple datasets, deduplicate items by a combination of fields, and apply custom transforms — powered by Polars.

- **URL**: https://apify.com/datacach/deduplicate-datasets.md
- **Developed by:** [DataCach](https://apify.com/datacach) (community)
- **Categories:** Automation, Developer tools
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

from $0.01 / 1,000 processed items

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

**Merge, deduplicate, and transform Apify datasets in one fast pass.** The **Merge, Dedup & Transform Datasets** Actor lets you **combine multiple datasets into one**, **remove duplicate items** by any combination of fields, and optionally **reshape your data with Python** — then save the clean result back to a [dataset](https://docs.apify.com/platform/storage/dataset) or key-value store. It's powered by **[Polars](https://pola.rs/)**, so it **deduplicates millions of records** quickly while keeping memory usage low.

**No code is required for basic merging and deduplication** — just paste your dataset IDs, choose the field(s) that make an item unique, and click **Run**. New here? Leave the input empty to try it instantly on **built-in sample data**. Whether you run parallel scrapers that return overlapping results, append to the same dataset every day, or just need a single tidy export, this **dataset deduplication tool** turns a messy pile of records into **unique, ready-to-use data**.

### What does the Merge, Dedup & Transform Datasets Actor do?

This Actor is a **data-processing utility** (not a web scraper) for the [Apify](https://apify.com) platform. In a single run it can:

- 🗂️ **Merge datasets** — concatenate any number of Apify datasets into one stream, in the order you list them.
- 🧹 **Deduplicate items** — keep only the first unique occurrence based on the **fields you choose** (dedupe by `url`, by `id`, by a `name` + `id` combination, or any fields you like).
- 🔎 **Find only new items** — feed yesterday's dataset as a filter so you output only records you **haven't seen before**.
- ✏️ **Transform records** — filter, enrich, or restructure items with custom **Python** functions before and/or after deduplication.
- 📤 **Output flexibly** — return the **unique items**, the **removed duplicates**, or just the **duplicate counts**, to a dataset or key-value store.
- ⚡ **Scale to 10M+ items** — stream data in batches with near-constant memory for very large jobs.

Because it runs on **Apify**, you also get everything the platform adds on top: **schedule** recurring deduplication, trigger runs through the **[Apify API](https://docs.apify.com/api/v2)** or **integrations** (Make, Zapier, n8n, webhooks), chain this Actor right after your scrapers, monitor runs, and download the clean data in **JSON, CSV, Excel, or HTML**.

### Why use this dataset deduplication tool?

Duplicate and scattered records are among the most common headaches in web scraping and data collection. This Actor handles both without you having to write a single loop:

- **Clean up scraper output** — remove duplicate products, listings, profiles, or URLs that the same crawl returned more than once.
- **Merge results from parallel runs** — consolidate dozens of scraper runs (or every run of an Actor/Task) into **one deduplicated dataset**.
- **Build a "new items only" feed** — diff today's data against yesterday's so downstream steps only process fresh records.
- **Deduplicate before export or import** — hand clean, unique data to your database, spreadsheet, CRM, or BI tool.
- **Audit duplicates fast** — get unique/duplicate counts without exporting anything.

It's built for both **no-code users** (point-and-click merging and deduplication) and **developers** (optional Python transforms and fine-grained performance controls).

### How to deduplicate and merge datasets (step-by-step)

1. **Add your Dataset IDs.** Paste one or more Apify dataset IDs into the **Dataset IDs** field — or point the Actor at an **Actor/Task ID** to automatically pull all of its run datasets.
2. **Choose your deduplication fields.** List the field(s) that make an item unique — e.g. `["url"]` or `["name", "id"]`. Leave it empty to **merge without deduplicating**.
3. **Pick what to output.** Keep the default **Unique items**, or switch to **Duplicate items** or **Nothing** (count only).
4. **(Optional) Transform your data.** Add a Python expression to filter or reshape items before/after deduplication.
5. **Click Run.** When the run finishes, open the **Output** tab to preview and download your data in **JSON, CSV, Excel, or HTML**.

> 💡 **Want to try it first?** Run the Actor with an empty input. It processes a small built-in sample dataset so you can see exactly how merging and deduplication work before pointing it at your own data.

### Input

Configure everything in the **Input** tab — no code required for the basics. The two most important fields are **Dataset IDs** and **Deduplication fields**; everything else is optional with sensible defaults. Click the **Input** tab for the full list of options and tooltips.

A minimal input looks like this:

```json
{
  "datasetIds": ["dHZ7Xy9aBc...", "p9Kq2A4nMe..."],
  "fields": ["name", "id"],
  "output": "unique-items",
  "appendDatasetIds": true
}
````

Useful optional inputs include **Pre/Post-deduplication transform (Python)**, **Actor or Task ID** (with **Only runs newer/older than** date filters), **Dataset IDs of filter items** (to output only new records), **Output destination** (dataset or key-value store), **Deduplication mode**, and **limits & performance** controls (`offset`, `limit`, `fieldsToLoad`, `parallelLoads`, `parallelPushes`, `uploadBatchSize`, `batchSizeLoad`).

### Output

Output is **pass-through**: the unique (or duplicate) items keep **all the fields** from your source datasets. You can download the dataset produced by this Actor in various formats such as **JSON, HTML, CSV, or Excel**, or fetch it via the API. If you choose **key-value store** output, the result is written to the `OUTPUT` record instead.

For input fields `["name", "id"]`, the Actor keeps the **first** record for every unique `name` + `id` and drops the rest:

```json
[
  { "name": "Adidas Shoes", "id": 12345, "price": 100, "__datasetId__": "dHZ7Xy9aBc..." },
  { "name": "Nike Air",     "id": 1,     "price": 50,  "__datasetId__": "dHZ7Xy9aBc..." },
  { "name": "Puma Suede",   "id": 2,     "price": 30,  "__datasetId__": "p9Kq2A4nMe..." }
]
```

#### What data does this Actor output?

| Field | Type | Description |
|-------|------|-------------|
| *(your fields)* | any | Every field from your source datasets is preserved unchanged |
| `__datasetId__` | string | ID of the source dataset each item came from — added only when **Append dataset IDs** is enabled |

### How dataset deduplication works under the hood

For every item, the Actor builds a **deduplication key** by JSON-stringifying each of your chosen fields and concatenating them — so `["name", "id"]` with `name = "Adidas Shoes"` and `id = 12345` becomes the key `"Adidas Shoes"12345`. Objects and arrays are **deep-compared** (object keys are sorted), so field order never causes false mismatches. The Actor keeps the **first** item it sees for each key and treats later matches as duplicates. With **Treat null fields as unique** enabled, items whose key field is `null` or missing are always kept.
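
The key construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Actor's actual code: `dedup_key` and `keep_first` are hypothetical names, and `json.dumps` with `sort_keys=True` stands in for the JSON-stringify-with-sorted-keys step.

```python
import json


def dedup_key(item, fields):
    """Build a key by JSON-stringifying each chosen field and concatenating."""
    # sort_keys=True makes nested objects compare equal regardless of key order.
    # A missing field is read as None and serializes to "null".
    return "".join(
        json.dumps(item.get(field), sort_keys=True, separators=(",", ":"))
        for field in fields
    )


def keep_first(items, fields):
    """Keep the first item seen for each key; later matches are duplicates."""
    seen = set()
    unique = []
    for item in items:
        key = dedup_key(item, fields)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


# Hypothetical sample items, for illustration only.
items = [
    {"name": "Adidas Shoes", "id": 12345, "price": 100},
    {"name": "Adidas Shoes", "id": 12345, "price": 90},  # same key as above
    {"name": "Nike Air", "id": 1, "price": 50},
]
print(dedup_key(items[0], ["name", "id"]))
print(keep_first(items, ["name", "id"]))
```

The first `print` shows the concatenated key for the `Adidas Shoes` item; the second shows that the item with `price` 90 is dropped because an earlier item already produced the same key.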

Two deduplication modes let you trade speed for memory:

- **Dedup after load** (default) — loads everything, then deduplicates. Fastest for typical datasets.
- **Dedup as loading** — deduplicates batch-by-batch with near-constant memory, ideal for **10M+ item** jobs.
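
The difference between the two modes can be sketched as follows. The function names are hypothetical and the Actor's Polars-based implementation will differ; the sketch only illustrates that batch-by-batch deduplication needs to remember keys, not whole items.

```python
import json


def key_of(item, fields):
    # Simplified key: JSON-stringified field values, concatenated.
    return "".join(json.dumps(item.get(f), sort_keys=True) for f in fields)


def dedup_as_loading(batches, fields):
    """Yield unique items batch by batch; only the set of keys is retained."""
    seen = set()
    for batch in batches:
        for item in batch:
            key = key_of(item, fields)
            if key not in seen:
                seen.add(key)
                yield item


def dedup_after_load(batches, fields):
    """Load every batch into one list first, then deduplicate in a single pass."""
    all_items = [item for batch in batches for item in batch]
    return list(dedup_as_loading([all_items], fields))


# Hypothetical batches, for illustration only.
batches = [
    [{"url": "a"}, {"url": "b"}],
    [{"url": "b"}, {"url": "c"}],
]
streamed = list(dedup_as_loading(batches, ["url"]))
loaded = dedup_after_load(batches, ["url"])
print(streamed == loaded)
```

Both paths return the same unique items in the same order; they differ only in how much data is held in memory at once.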

### How much does it cost to deduplicate datasets?

According to the Pricing section above, this Actor is paid per event (from $0.01 per 1,000 processed items) rather than by **Apify platform usage**, so cost scales with **how many items you process**. Because the heavy lifting runs on **Polars** with parallel loading and pushing, throughput is high and most deduplication jobs should finish faster than an item-by-item loop would. To estimate your cost, run it once on a representative dataset, check what the run was charged, then scale roughly linearly. New Apify accounts include **free monthly usage credits**, which are enough to test and run small-to-medium jobs at no cost.

### Tips for faster, cheaper deduplication

- **Load only what you need.** Set **Fields to load** to just your deduplication fields when you only need counts or a subset — less data to transfer and process.
- **Use `dedup-as-loading` for huge datasets.** It streams in batches and keeps memory near-constant for **10M+ item** jobs.
- **Tune concurrency.** Raise **Parallel loads** / **Parallel pushes** for faster runs on large datasets, or lower them to be gentle on resources.
- **Audit before exporting.** Use **Output → Nothing** to get unique/duplicate counts without writing a full dataset.

### Related Actors

- Pair this Actor with any **[Apify Store scraper](https://apify.com/store)** to automatically deduplicate and merge its results as a final clean-up step.

### FAQ

#### Can I deduplicate by more than one field?

Yes. List several fields (e.g. `["name", "id"]`) and an item is considered a duplicate only when **all** of those field values match a previous item.

#### How do I keep only items I haven't seen before?

Put your previous/known datasets in **Dataset IDs of filter items**. Their keys seed the "already seen" set, so matching records in your new datasets are dropped and only **new unique items** are output. The filter items themselves are never added to the output.
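
A minimal sketch of the "already seen" seeding described above, assuming a simplified key and a hypothetical helper name (`new_items_only`); the Actor's internals may differ.

```python
import json


def new_items_only(new_items, filter_items, fields):
    """Return items from new_items whose key is absent from filter_items."""

    def key_of(item):
        return tuple(json.dumps(item.get(f), sort_keys=True) for f in fields)

    # Filter items only seed the set of known keys; they are never output.
    seen = {key_of(item) for item in filter_items}
    fresh = []
    for item in new_items:
        key = key_of(item)
        if key not in seen:
            seen.add(key)
            fresh.append(item)
    return fresh


# Hypothetical data: yesterday's dataset acts as the filter.
yesterday = [{"url": "a"}, {"url": "b"}]
today = [{"url": "b"}, {"url": "c"}, {"url": "c"}]
print(new_items_only(today, yesterday, ["url"]))
```

Only the record with `url` equal to `c` survives, and it appears once, because the second copy is an ordinary duplicate within the new data.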

#### Can I merge datasets without removing duplicates?

Yes. Leave the **Deduplication fields** empty and the Actor will simply **merge** all your datasets into one, in the order you list them.

#### Does it work with millions of items?

Yes. Switch **Deduplication mode** to `dedup-as-loading` to stream items in batches with near-constant memory, suitable for **10M+ records**.

#### Can I transform the data while deduplicating?

Yes. Add a **Pre-deduplication** and/or **Post-deduplication transform** — short Python that receives the `items` list (plus your optional **Custom input data**) and returns a new list. Use it to filter, rename fields, or enrich records.

#### Can I run it on a schedule or via the API?

Absolutely. Use Apify **Schedules** to run it automatically, or call it with the **[Apify API](https://docs.apify.com/api/v2)**, CLI, or an integration (Make, Zapier, n8n). It's ideal as a final "clean-up" step after your scraping Actors.

#### What happens to fields that aren't deduplication keys?

They're preserved exactly as-is. Deduplication only uses your chosen fields to decide uniqueness; the full record is passed through unchanged.

### Support, feedback, and disclaimers

Found a bug or have a feature request? Open an issue on the Actor's **Issues** tab — feedback is welcome and helps improve the tool. Need a custom data-processing workflow? Reach out for a tailored solution.

This Actor processes data **you already own or are authorized to use** in your Apify account; it does **not** scrape third-party websites. Always make sure your use of the underlying data complies with the relevant terms of service and data-protection regulations such as the GDPR.

# Actor input Schema

## `datasetIds` (type: `array`):

One or more Apify dataset IDs (e.g. <code>dHZ7Xy9aBc...</code>) or dataset names to merge and deduplicate. Items are concatenated in the order listed, which decides which duplicate is kept — the first occurrence wins. Leave this empty (and <b>Actor or Task ID</b> empty too) to run on built-in sample data for a quick first test.

## `fields` (type: `array`):

The field(s) whose combined value makes an item unique — e.g. <code>\["url"]</code> or <code>\["name", "id"]</code>. Each value is JSON-stringified and concatenated into a single key, and the first item seen for each key is kept. Leave empty to merge datasets without removing any duplicates.

## `output` (type: `string`):

Choose what the run returns: the <b>unique</b> items, only the <b>duplicate</b> items that were removed, or <b>nothing</b> (just the unique/duplicate counts — handy for a quick duplicate audit).
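
As a rough sketch of the three `output` values: the enum strings below come from the input schema, while `select_output` and the shape of the counts dictionary are assumptions made for illustration.

```python
import json


def split_items(items, fields):
    """Partition items into first occurrences and later duplicates."""
    seen, unique, duplicates = set(), [], []
    for item in items:
        key = tuple(json.dumps(item.get(f), sort_keys=True) for f in fields)
        if key in seen:
            duplicates.append(item)
        else:
            seen.add(key)
            unique.append(item)
    return unique, duplicates


def select_output(items, fields, output="unique-items"):
    """Return (items_to_write, counts) for the chosen output setting."""
    unique, duplicates = split_items(items, fields)
    counts = {"unique": len(unique), "duplicates": len(duplicates)}
    if output == "unique-items":
        return unique, counts
    if output == "duplicate-items":
        return duplicates, counts
    if output == "nothing":
        return [], counts  # audit only: counts without any items
    raise ValueError(f"unknown output setting: {output!r}")


# Hypothetical sample items.
items = [{"url": "a"}, {"url": "a"}, {"url": "b"}]
for setting in ("unique-items", "duplicate-items", "nothing"):
    print(setting, select_output(items, ["url"], setting))
```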

## `mode` (type: `string`):

<b>Dedup after load</b> loads everything into memory, then deduplicates — fastest for typical datasets. <b>Dedup as loading</b> deduplicates batch-by-batch, keeping memory near-constant for very large (10M+) datasets.

## `nullAsUnique` (type: `boolean`):

When on, items whose deduplication field(s) are null or missing are always kept as unique and never removed. When off, a null/missing value is treated like any other value when matching duplicates.
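
The effect of this switch can be sketched as below. One assumption is made loudly here: an item counts as "null" when any of its deduplication fields is null or missing. The description does not say whether the Actor requires any or all fields to be null, so treat this as an illustration only.

```python
import json


def dedup(items, fields, null_as_unique=False):
    """Keep first occurrences; optionally always keep items with null key fields."""
    seen, unique = set(), []
    for item in items:
        values = [item.get(f) for f in fields]  # a missing field reads as None
        # Assumption: one null or missing field is enough to bypass matching.
        if null_as_unique and any(v is None for v in values):
            unique.append(item)
            continue
        key = tuple(json.dumps(v, sort_keys=True) for v in values)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


# Hypothetical sample items: two lack a usable "id".
items = [{"id": None, "n": 1}, {"n": 2}, {"id": 1, "n": 3}, {"id": 1, "n": 4}]
print(len(dedup(items, ["id"], null_as_unique=True)))
print(len(dedup(items, ["id"], null_as_unique=False)))
```

With the switch on, both items without an `id` are kept; with it off, they share the same key and only the first one survives.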

## `actorOrTaskId` (type: `string`):

Optionally pull source datasets from every run of an Actor or Task. Enter an ID or full name (e.g. <code>apify/web-scraper</code>); the default dataset of each successful run is merged together with anything in <b>Dataset IDs</b>.

## `onlyRunsNewerThan` (type: `string`):

When using <b>Actor or Task ID</b>, only include runs that finished on or after this date. Accepts a date (<code>2024-01-31</code>) or a full ISO datetime.

## `onlyRunsOlderThan` (type: `string`):

When using <b>Actor or Task ID</b>, only include runs that finished on or before this date. Accepts a date (<code>2024-01-31</code>) or a full ISO datetime.
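
To check a date value before submitting it, the `pattern` published for these two fields in the OpenAPI specification can be reused locally. The helper name `is_valid_run_date` is hypothetical; the regular expression is copied from the schema and checks syntax only, not calendar validity.

```python
import re

# Pattern taken from the onlyRunsNewerThan / onlyRunsOlderThan schema entries.
RUN_DATE_PATTERN = re.compile(
    r"^(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])"
    r"(T[0-2]\d:[0-5]\d(:[0-5]\d)?(\.\d+)?(Z|[+-]\d{2}:\d{2})?)?$"
)


def is_valid_run_date(value):
    """Return True when value is a date or ISO datetime the schema accepts."""
    return RUN_DATE_PATTERN.fullmatch(value) is not None


for sample in ("2024-01-31", "2024-01-31T12:30:00Z", "2024-13-01", "31-01-2024"):
    print(sample, is_valid_run_date(sample))
```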

## `datasetIdsOfFilterItems` (type: `array`):

Datasets used only to pre-seed the 'already seen' keys. Items in <b>Dataset IDs</b> that match these keys are dropped as duplicates, but the filter items themselves are never output — perfect for outputting only records you haven't seen before.

## `preDedupTransformFunction` (type: `string`):

Optional Python that receives <code>items</code> (a list of dicts) plus <code>input\_data</code>/<code>customInputData</code> and returns a new list of dicts. Runs on each loaded batch <b>before</b> deduplication, so you can filter or add items. Example: <code>\[i for i in items if i.get('price')]</code>.

## `postDedupTransformFunction` (type: `string`):

Optional Python that receives <code>items</code> (a list of dicts) and returns a new list of dicts. Runs <b>after</b> deduplication, just before the result is written — ideal for trimming or renaming fields.

## `customInputData` (type: `object`):

Arbitrary JSON passed straight to your transform functions as <code>input\_data</code> / <code>customInputData</code>. Use it to parameterize the transforms without editing their code.
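
To try a transform locally before pasting it into the input, something like the sketch below may help. It assumes the transform is a single Python expression evaluated with `items` and `input_data` in scope, as the example expression in the pre-deduplication field suggests; how the Actor actually executes the code is not documented here, and `run_transform` is a hypothetical helper.

```python
def run_transform(expression, items, custom_input_data=None):
    """Evaluate a transform expression against items and custom input data."""
    # Names are passed as globals so they stay visible inside comprehensions.
    scope = {"items": items, "input_data": custom_input_data or {}}
    result = eval(expression, scope)  # only run expressions you trust
    if not isinstance(result, list):
        raise TypeError("a transform is expected to return a list of dicts")
    return result


# Hypothetical sample items.
items = [
    {"name": "Adidas Shoes", "price": 100},
    {"name": "Puma Suede", "price": 30},
    {"name": "No price"},
]

# The example expression from the pre-deduplication field: drop items without a price.
print(run_transform("[i for i in items if i.get('price')]", items))

# The same idea, parameterized through custom input data instead of a hard-coded value.
expression = "[i for i in items if i.get('price', 0) >= input_data['min_price']]"
print(run_transform(expression, items, {"min_price": 50}))
```

Keeping thresholds such as `min_price` in the custom input data means the expression itself can stay unchanged between runs.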

## `outputTo` (type: `string`):

Where to store the result: a <b>dataset</b> (downloadable as JSON, CSV, Excel, or HTML) or the run's <b>key-value store</b> under the <code>OUTPUT</code> key.

## `outputDatasetId` (type: `string`):

Dataset ID or name to push the output to. If a name doesn't exist yet, a named dataset is created. Leave empty to use this run's default dataset.

## `appendDatasetIds` (type: `boolean`):

When on, every output item gets a <code>\_\_datasetId\_\_</code> field recording which source dataset it came from. When off, items are passed through unchanged.
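
A minimal sketch of merging in listed order while tagging each item with its source, using the `__datasetId__` field name from the Output section. The `merge_datasets` helper and the short dataset IDs are hypothetical.

```python
def merge_datasets(datasets, append_dataset_ids=False):
    """Concatenate (dataset_id, items) pairs in the order they are listed."""
    merged = []
    for dataset_id, items in datasets:
        for item in items:
            if append_dataset_ids:
                # Copy the item so the source record is left untouched.
                item = {**item, "__datasetId__": dataset_id}
            merged.append(item)
    return merged


# Hypothetical datasets with made-up IDs.
datasets = [
    ("ds-first", [{"name": "Nike Air", "id": 1}]),
    ("ds-second", [{"name": "Puma Suede", "id": 2}]),
]
print(merge_datasets(datasets, append_dataset_ids=True))
```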

## `fieldsToLoad` (type: `array`):

Load only these fields from the source datasets to save memory and speed up loading — for example, just your deduplication fields when you only need counts. Leave empty to load every field.

## `offset` (type: `integer`):

Skip this many items from the start of the merged input before processing. Leave empty to start from the first item.

## `limit` (type: `integer`):

Process at most this many items from the merged input. Leave empty to process all of them.
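
How `offset` and `limit` carve a window out of the merged input can be sketched with `itertools.islice`. The assumption here is that `offset` is applied first and `limit` counts from there; `window` is a hypothetical helper name.

```python
from itertools import chain, islice


def window(datasets, offset=None, limit=None):
    """Skip `offset` items of the merged input, then take at most `limit`."""
    merged = chain.from_iterable(datasets)  # datasets in listed order
    start = offset or 0
    stop = None if limit is None else start + limit
    return list(islice(merged, start, stop))


# Hypothetical merged input of five items spread over two datasets.
datasets = [[1, 2, 3], [4, 5]]
print(window(datasets, offset=1, limit=3))
print(window(datasets))
```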

## `batchSizeLoad` (type: `integer`):

How many items to fetch per load request. Larger batches load faster but use more memory.

## `uploadBatchSize` (type: `integer`):

How many items to send per push request when writing the output.

## `parallelLoads` (type: `integer`):

How many load batches to fetch at the same time. Higher values speed up large runs but use more memory and resources.

## `parallelPushes` (type: `integer`):

How many output batches to upload at the same time.
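
The interplay of batch size and parallelism can be sketched with a thread pool. `fetch_batch` stands in for whatever call loads one page of items and is purely hypothetical, as are the helper names; the defaults mirror the values in the input object example.

```python
from concurrent.futures import ThreadPoolExecutor


def batch_ranges(total_items, batch_size):
    """Split total_items into (offset, limit) pairs of at most batch_size."""
    return [
        (offset, min(batch_size, total_items - offset))
        for offset in range(0, total_items, batch_size)
    ]


def load_all(fetch_batch, total_items, batch_size=50000, parallel_loads=10):
    """Fetch all batches with up to parallel_loads requests in flight."""
    ranges = batch_ranges(total_items, batch_size)
    with ThreadPoolExecutor(max_workers=parallel_loads) as pool:
        # pool.map returns results in submission order, so item order is kept.
        batches = pool.map(lambda r: fetch_batch(*r), ranges)
        return [item for batch in batches for item in batch]


# Hypothetical in-memory stand-in for a dataset.
DATA = list(range(12))


def fake_fetch(offset, limit):
    return DATA[offset:offset + limit]


print(batch_ranges(len(DATA), 5))
print(load_all(fake_fetch, len(DATA), batch_size=5, parallel_loads=3) == DATA)
```

Larger batches mean fewer requests but more items held in memory per request, and more parallel loads multiply that footprint, which is the trade-off the descriptions above point to.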

## `verboseLog` (type: `boolean`):

Turn on detailed debug logging to trace exactly what the run is doing. Leave off for normal, quieter logs.

## Actor input object example

```json
{
  "datasetIds": [],
  "fields": [
    "name",
    "id"
  ],
  "output": "unique-items",
  "mode": "dedup-after-load",
  "nullAsUnique": false,
  "outputTo": "dataset",
  "appendDatasetIds": false,
  "batchSizeLoad": 50000,
  "uploadBatchSize": 500,
  "parallelLoads": 10,
  "parallelPushes": 5,
  "verboseLog": false
}
```
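
When assembling this input object in code, a small helper can catch typos in the enum-valued fields before a run is started. The allowed values and defaults below are the ones listed in the input schema; `build_input` itself is a hypothetical convenience, not part of any Apify library.

```python
import json

# Enum values and defaults as listed in the input schema.
ALLOWED = {
    "output": {"unique-items", "duplicate-items", "nothing"},
    "mode": {"dedup-after-load", "dedup-as-loading"},
    "outputTo": {"dataset", "key-value-store"},
}
DEFAULTS = {"output": "unique-items", "mode": "dedup-after-load", "outputTo": "dataset"}


def build_input(dataset_ids, fields, **options):
    """Merge options over the defaults and reject unknown enum values."""
    run_input = {
        "datasetIds": list(dataset_ids),
        "fields": list(fields),
        **DEFAULTS,
        **options,
    }
    for name, allowed in ALLOWED.items():
        if run_input[name] not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}")
    return run_input


run_input = build_input([], ["name", "id"], mode="dedup-as-loading")
print(json.dumps(run_input, indent=2))
```

The resulting dictionary can be passed as the run input in the Python example in the API section.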

# Actor output Schema

## `dataset` (type: `string`):

No description

## `keyValueStore` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "datasetIds": [],
    "fields": [
        "name",
        "id"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("datacach/deduplicate-datasets").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "datasetIds": [],
    "fields": [
        "name",
        "id",
    ],
}

# Run the Actor and wait for it to finish
run = client.actor("datacach/deduplicate-datasets").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "datasetIds": [],
  "fields": [
    "name",
    "id"
  ]
}' |
apify call datacach/deduplicate-datasets --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=datacach/deduplicate-datasets",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Deduplicate, Merge & Transform Datasets",
        "description": "Merge multiple datasets, deduplicate items by a combination of fields, and apply custom transforms — powered by Polars.",
        "version": "0.0",
        "x-build-id": "NlXxVs1sHwBP1ff5H"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/datacach~deduplicate-datasets/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-datacach-deduplicate-datasets",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/datacach~deduplicate-datasets/runs": {
            "post": {
                "operationId": "runs-sync-datacach-deduplicate-datasets",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/datacach~deduplicate-datasets/run-sync": {
            "post": {
                "operationId": "run-sync-datacach-deduplicate-datasets",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "datasetIds": {
                        "title": "Dataset IDs",
                        "type": "array",
                        "description": "One or more Apify dataset IDs (e.g. <code>dHZ7Xy9aBc...</code>) or dataset names to merge and deduplicate. Items are concatenated in the order listed, which decides which duplicate is kept — the first occurrence wins. Leave this empty (and <b>Actor or Task ID</b> empty too) to run on built-in sample data for a quick first test.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "fields": {
                        "title": "Deduplication fields",
                        "type": "array",
                        "description": "The field(s) whose combined value makes an item unique — e.g. <code>[\"url\"]</code> or <code>[\"name\", \"id\"]</code>. Each value is JSON-stringified and concatenated into a single key, and the first item seen for each key is kept. Leave empty to merge datasets without removing any duplicates.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "output": {
                        "title": "Items to output",
                        "enum": [
                            "unique-items",
                            "duplicate-items",
                            "nothing"
                        ],
                        "type": "string",
                        "description": "Choose what the run returns: the <b>unique</b> items, only the <b>duplicate</b> items that were removed, or <b>nothing</b> (just the unique/duplicate counts — handy for a quick duplicate audit).",
                        "default": "unique-items"
                    },
                    "mode": {
                        "title": "Deduplication mode",
                        "enum": [
                            "dedup-after-load",
                            "dedup-as-loading"
                        ],
                        "type": "string",
                        "description": "<b>Dedup after load</b> loads everything into memory, then deduplicates — fastest for typical datasets. <b>Dedup as loading</b> deduplicates batch-by-batch, keeping memory near-constant for very large (10M+) datasets.",
                        "default": "dedup-after-load"
                    },
                    "nullAsUnique": {
                        "title": "Treat null fields as unique",
                        "type": "boolean",
                        "description": "When on, items whose deduplication field(s) are null or missing are always kept as unique and never removed. When off, a null/missing value is treated like any other value when matching duplicates.",
                        "default": false
                    },
                    "actorOrTaskId": {
                        "title": "Actor or Task ID",
                        "type": "string",
                        "description": "Optionally pull source datasets from every run of an Actor or Task. Enter an ID or full name (e.g. <code>apify/web-scraper</code>); the default dataset of each successful run is merged together with anything in <b>Dataset IDs</b>."
                    },
                    "onlyRunsNewerThan": {
                        "title": "Only runs newer than",
                        "pattern": "^(\\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])(T[0-2]\\d:[0-5]\\d(:[0-5]\\d)?(\\.\\d+)?(Z|[+-]\\d{2}:\\d{2})?)?$",
                        "type": "string",
                        "description": "When using <b>Actor or Task ID</b>, only include runs that finished on or after this date. Accepts a date (<code>2024-01-31</code>) or a full ISO datetime."
                    },
                    "onlyRunsOlderThan": {
                        "title": "Only runs older than",
                        "pattern": "^(\\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])(T[0-2]\\d:[0-5]\\d(:[0-5]\\d)?(\\.\\d+)?(Z|[+-]\\d{2}:\\d{2})?)?$",
                        "type": "string",
                        "description": "When using <b>Actor or Task ID</b>, only include runs that finished on or before this date. Accepts a date (<code>2024-01-31</code>) or a full ISO datetime."
                    },
                    "datasetIdsOfFilterItems": {
                        "title": "Dataset IDs of filter items",
                        "type": "array",
                        "description": "Datasets used only to pre-seed the 'already seen' keys. Items in <b>Dataset IDs</b> that match these keys are dropped as duplicates, but the filter items themselves are never output — perfect for outputting only records you haven't seen before.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "preDedupTransformFunction": {
                        "title": "Pre-deduplication transform function",
                        "type": "string",
                        "description": "Optional Python that receives <code>items</code> (a list of dicts) plus <code>input_data</code>/<code>customInputData</code> and returns a new list of dicts. Runs on each loaded batch <b>before</b> deduplication, so you can filter or add items. Example: <code>[i for i in items if i.get('price')]</code>."
                    },
                    "postDedupTransformFunction": {
                        "title": "Post-deduplication transform function",
                        "type": "string",
                        "description": "Optional Python that receives <code>items</code> (a list of dicts) and returns a new list of dicts. Runs <b>after</b> deduplication, just before the result is written — ideal for trimming or renaming fields."
                    },
                    "customInputData": {
                        "title": "Custom input data",
                        "type": "object",
                        "description": "Arbitrary JSON passed straight to your transform functions as <code>input_data</code> / <code>customInputData</code>. Use it to parameterize the transforms without editing their code."
                    },
                    "outputTo": {
                        "title": "Output destination",
                        "enum": [
                            "dataset",
                            "key-value-store"
                        ],
                        "type": "string",
                        "description": "Where to store the result: a <b>dataset</b> (downloadable as JSON, CSV, Excel, or HTML) or the run's <b>key-value store</b> under the <code>OUTPUT</code> key.",
                        "default": "dataset"
                    },
                    "outputDatasetId": {
                        "title": "Output dataset ID",
                        "type": "string",
                        "description": "Dataset ID or name to push the output to. If a name doesn't exist yet, a named dataset is created. Leave empty to use this run's default dataset."
                    },
                    "appendDatasetIds": {
                        "title": "Append dataset IDs",
                        "type": "boolean",
                        "description": "When on, every output item gets a <code>__datasetId__</code> field recording which source dataset it came from. When off, items are passed through unchanged.",
                        "default": false
                    },
                    "fieldsToLoad": {
                        "title": "Fields to load",
                        "type": "array",
                        "description": "Load only these fields from the source datasets to save memory and speed up loading — for example, just your deduplication fields when you only need counts. Leave empty to load every field.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "offset": {
                        "title": "Offset",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Skip this many items from the start of the merged input before processing. Leave empty to start from the first item."
                    },
                    "limit": {
                        "title": "Limit",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Process at most this many items from the merged input. Leave empty to process all of them."
                    },
                    "batchSizeLoad": {
                        "title": "Load batch size",
                        "minimum": 1000,
                        "type": "integer",
                        "description": "How many items to fetch per load request. Larger batches load faster but use more memory.",
                        "default": 50000
                    },
                    "uploadBatchSize": {
                        "title": "Upload batch size",
                        "minimum": 10,
                        "maximum": 1000,
                        "type": "integer",
                        "description": "How many items to send per push request when writing the output.",
                        "default": 500
                    },
                    "parallelLoads": {
                        "title": "Parallel loads",
                        "minimum": 1,
                        "maximum": 100,
                        "type": "integer",
                        "description": "How many load batches to fetch at the same time. Higher values speed up large runs but use more memory and resources.",
                        "default": 10
                    },
                    "parallelPushes": {
                        "title": "Parallel pushes",
                        "minimum": 1,
                        "maximum": 50,
                        "type": "integer",
                        "description": "How many output batches to upload at the same time.",
                        "default": 5
                    },
                    "verboseLog": {
                        "title": "Verbose logging",
                        "type": "boolean",
                        "description": "Turn on detailed debug logging to trace exactly what the run is doing. Leave off for normal, quieter logs.",
                        "default": false
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
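
The numeric bounds, the `outputTo` enum, and the `onlyRunsOlderThan` pattern in the input schema above can be checked locally before a run is started. The following is a minimal sketch using only the Python standard library. The helper name `validate_run_input` and its messages are illustrative assumptions, not part of the Actor or of any Apify client; only the field names, limits, and pattern are copied from the schema.

```python
import re

# (minimum, maximum) pairs copied from the input schema; None means no upper bound.
INTEGER_BOUNDS = {
    "offset": (0, None),
    "limit": (1, None),
    "batchSizeLoad": (1000, None),
    "uploadBatchSize": (10, 1000),
    "parallelLoads": (1, 100),
    "parallelPushes": (1, 50),
}

# Pattern copied from the `onlyRunsOlderThan` property (date or ISO datetime).
DATE_PATTERN = re.compile(
    r"^(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])"
    r"(T[0-2]\d:[0-5]\d(:[0-5]\d)?(\.\d+)?(Z|[+-]\d{2}:\d{2})?)?$"
)

# Allowed values of the `outputTo` enum.
OUTPUT_DESTINATIONS = {"dataset", "key-value-store"}


def validate_run_input(run_input: dict) -> list:
    """Hypothetical helper: return a list of problems, empty if none were found."""
    problems = []
    for field, (low, high) in INTEGER_BOUNDS.items():
        if field not in run_input:
            continue  # optional field, nothing to check
        value = run_input[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"{field}: expected an integer")
        elif value < low:
            problems.append(f"{field}: {value} is below the minimum {low}")
        elif high is not None and value > high:
            problems.append(f"{field}: {value} is above the maximum {high}")

    date = run_input.get("onlyRunsOlderThan")
    if date is not None and not (isinstance(date, str) and DATE_PATTERN.match(date)):
        problems.append("onlyRunsOlderThan: not a date or ISO datetime")

    destination = run_input.get("outputTo", "dataset")  # schema default
    if destination not in OUTPUT_DESTINATIONS:
        problems.append(f"outputTo: {destination!r} is not an allowed value")
    return problems


ok_input = {"uploadBatchSize": 500, "onlyRunsOlderThan": "2024-01-31"}
bad_input = {"uploadBatchSize": 5000, "parallelPushes": 0, "outputTo": "s3"}
print(validate_run_input(ok_input))
print(validate_run_input(bad_input))
```

The platform validates input against the schema as well, so a local check like this only serves to fail early, for example in a CI job that assembles run inputs.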

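The descriptions of `preDedupTransformFunction`, `datasetIdsOfFilterItems`, and `postDedupTransformFunction` imply a pipeline order: transform each loaded batch, drop items whose key was already seen (including keys pre-seeded from the filter datasets), then transform the survivors. The sketch below simulates that order in plain Python so transform logic can be tried out locally. All function names are hypothetical, the way the Actor actually evaluates the transform strings is not specified here, and the keep-first rule in `dedupe_by_fields` is an assumption rather than the Actor's documented behavior.

```python
def pre_dedup_transform(items, input_data):
    """Hypothetical pre-dedup transform: keep items priced at or above a threshold."""
    min_price = input_data.get("minPrice", 0)
    return [i for i in items if i.get("price") and i["price"] >= min_price]


def dedupe_by_fields(items, fields, filter_items=()):
    """Plain-Python stand-in for deduplication (assumes hashable field values).

    Keys from `filter_items` pre-seed the 'already seen' set, so matching
    items are dropped, while the filter items themselves are never output.
    Assumption: the first occurrence of each key is the one that is kept.
    """
    seen = {tuple(f_item.get(f) for f in fields) for f_item in filter_items}
    unique = []
    for item in items:
        key = tuple(item.get(f) for f in fields)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


def post_dedup_transform(items):
    """Hypothetical post-dedup transform: trim and rename fields."""
    return [{"link": i["url"], "price": i["price"]} for i in items]


# Made-up sample data for illustration only.
items = [
    {"url": "a", "price": 10, "title": "first"},
    {"url": "a", "price": 10, "title": "duplicate"},
    {"url": "b", "price": None, "title": "no price"},
    {"url": "c", "price": 3, "title": "too cheap"},
    {"url": "d", "price": 20, "title": "already seen"},
]
filter_items = [{"url": "d"}]          # plays the role of datasetIdsOfFilterItems
custom_input_data = {"minPrice": 5}    # plays the role of customInputData

result = post_dedup_transform(
    dedupe_by_fields(
        pre_dedup_transform(items, custom_input_data),
        fields=["url"],
        filter_items=filter_items,
    )
)
print(result)  # [{'link': 'a', 'price': 10}]
```

Keeping thresholds such as `minPrice` in `customInputData` rather than in the transform code matches the schema's stated purpose for that field: the same transform can be reused across runs with different parameters.
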