LLM Dataset Processor

Pricing

Pay per usage

LLM Dataset Processor

Allows you to process output of other actors or stored dataset with single LLM prompt. It's useful if you need to enrich data, summarize content, extract specific information, or manipulate data in a structured way using AI.

Pricing

Pay per usage

Rating

0.0

(0)

Developer

Dušan Vystrčil

Maintained by Community

Actor stats

Bookmarked

112

Total users

Monthly active users

a month ago

Last modified

Categories

Open source

You can access the LLM Dataset Processor programmatically from your own applications by using the Apify API. You can also choose the language preference from below. To use the Apify API, you’ll need an Apify account and your API token, found in Integrations settings in Apify Console.

Python

JavaScript

CLI

OpenAPI

HTTP

MCP

{
  "openapi": "3.0.1",
  "info": {
    "version": "0.0",
    "x-build-id": "DS2PoXpKaWqF7cTZW"
  },
  "servers": [
    {
      "url": "https://api.apify.com/v2"
    }
  ],
  "paths": {
    "/acts/dusan.vystrcil~llm-dataset-processor/run-sync-get-dataset-items": {
      "post": {
        "operationId": "run-sync-get-dataset-items-dusan.vystrcil-llm-dataset-processor",
        "x-openai-isConsequential": false,
        "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
        "tags": [
          "Run Actor"
        ],
        "requestBody": {
          "required": true,
          "content": {
            "application/json": {
              "schema": {
                "$ref": "#/components/schemas/inputSchema"
              }
            }
          }
        },
        "parameters": [
          {
            "name": "token",
            "in": "query",
            "required": true,
            "schema": {
              "type": "string"
            },
            "description": "Enter your Apify token here"
          }
        ],
        "responses": {
          "200": {
            "description": "OK"
          }
        }
      }
    },
    "/acts/dusan.vystrcil~llm-dataset-processor/runs": {
      "post": {
        "operationId": "runs-sync-dusan.vystrcil-llm-dataset-processor",
        "x-openai-isConsequential": false,
        "summary": "Executes an Actor and returns information about the initiated run in response.",
        "tags": [
          "Run Actor"
        ],
        "requestBody": {
          "required": true,
          "content": {
            "application/json": {
              "schema": {
                "$ref": "#/components/schemas/inputSchema"
              }
            }
          }
        },
        "parameters": [
          {
            "name": "token",
            "in": "query",
            "required": true,
            "schema": {
              "type": "string"
            },
            "description": "Enter your Apify token here"
          }
        ],
        "responses": {
          "200": {
            "description": "OK",
            "content": {
              "application/json": {
                "schema": {
                  "$ref": "#/components/schemas/runsResponseSchema"
                }
              }
            }
          }
        }
      }
    },
    "/acts/dusan.vystrcil~llm-dataset-processor/run-sync": {
      "post": {
        "operationId": "run-sync-dusan.vystrcil-llm-dataset-processor",
        "x-openai-isConsequential": false,
        "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
        "tags": [
          "Run Actor"
        ],
        "requestBody": {
          "required": true,
          "content": {
            "application/json": {
              "schema": {
                "$ref": "#/components/schemas/inputSchema"
              }
            }
          }
        },
        "parameters": [
          {
            "name": "token",
            "in": "query",
            "required": true,
            "schema": {
              "type": "string"
            },
            "description": "Enter your Apify token here"
          }
        ],
        "responses": {
          "200": {
            "description": "OK"
          }
        }
      }
    }
  },
  "components": {
    "schemas": {
      "inputSchema": {
        "type": "object",
        "required": [
          "llmProviderApiKey",
          "prompt",
          "model",
          "temperature",
          "maxTokens"
        ],
        "properties": {
          "inputDatasetId": {
            "title": "Input Dataset ID",
            "type": "string",
            "description": "The ID of the dataset to process."
          },
          "model": {
            "title": "Large Language Model",
            "type": "string",
            "description": "The LLM to use for processing. Each model has different capabilities and pricing. GPT-4o-mini and Claude 3.5 Haiku are recommended for cost-effective processing, while models like Claude 3 Opus or GPT-4o offer higher quality but at a higher cost."
          },
          "llmProviderApiKey": {
            "title": "LLM Provider API Key",
            "type": "string",
            "description": "Your API key for the LLM Provider (e.g., OpenAI)."
          },
          "temperature": {
            "title": "Temperature",
            "type": "string",
            "description": "Sampling temperature for the LLM API (controls randomness). We recommend using a value closer to 0 for exact results. In case of more 'creative' results, we recommend to use a value closer to 1.",
            "default": "0.1"
          },
          "multipleColumns": {
            "title": "Multiple columns in output",
            "type": "boolean",
            "description": "When enabled, instructs the LLM to return responses as JSON objects, creating multiple columns in the output dataset. The columns need to be named and described in the prompt. If disabled, responses are stored in a single `llmresponse` column.",
            "default": false
          },
          "prompt": {
            "title": "Prompt Template",
            "minLength": 1,
            "type": "string",
            "description": "The prompt template to use for processing. You can use ${fieldName} placeholders to reference fields from the input dataset."
          },
          "skipItemIfEmpty": {
            "title": "Skip item if one or more ${fields} are empty",
            "type": "boolean",
            "description": "When enabled, items will be skipped if any ${field} referenced in the prompt is empty, null, undefined, or contains only whitespace. This helps prevent processing incomplete data.",
            "default": true
          },
          "maxTokens": {
            "title": "Max Tokens",
            "type": "integer",
            "description": "Maximum number of tokens in the LLM API response for each item.",
            "default": 300
          },
          "testPrompt": {
            "title": "Test Prompt Mode",
            "type": "boolean",
            "description": "Test mode that processes only a limited number of items (defined by `testItemsCount`). Use this to validate your prompt and configuration before running on the full dataset. We highly recommend enabling this option first to validate your prompt because of ambiguity of the LLM responses.",
            "default": true
          },
          "testItemsCount": {
            "title": "Test Items Count",
            "minimum": 1,
            "type": "integer",
            "description": "Number of items to process when `Test Prompt Mode` is enabled.",
            "default": 3
          },
          "preprocessingFunction": {
            "title": "Preprocessing function",
            "type": "string",
            "description": "Function to transform item before they are put into the promp."
          }
        }
      },
      "runsResponseSchema": {
        "type": "object",
        "properties": {
          "data": {
            "type": "object",
            "properties": {
              "id": {
                "type": "string"
              },
              "actId": {
                "type": "string"
              },
              "userId": {
                "type": "string"
              },
              "startedAt": {
                "type": "string",
                "format": "date-time",
                "example": "2025-01-08T00:00:00.000Z"
              },
              "finishedAt": {
                "type": "string",
                "format": "date-time",
                "example": "2025-01-08T00:00:00.000Z"
              },
              "status": {
                "type": "string",
                "example": "READY"
              },
              "meta": {
                "type": "object",
                "properties": {
                  "origin": {
                    "type": "string",
                    "example": "API"
                  },
                  "userAgent": {
                    "type": "string"
                  }
                }
              },
              "stats": {
                "type": "object",
                "properties": {
                  "inputBodyLen": {
                    "type": "integer",
                    "example": 2000
                  },
                  "rebootCount": {
                    "type": "integer",
                    "example": 0
                  },
                  "restartCount": {
                    "type": "integer",
                    "example": 0
                  },
                  "resurrectCount": {
                    "type": "integer",
                    "example": 0
                  },
                  "computeUnits": {
                    "type": "integer",
                    "example": 0
                  }
                }
              },
              "options": {
                "type": "object",
                "properties": {
                  "build": {
                    "type": "string",
                    "example": "latest"
                  },
                  "timeoutSecs": {
                    "type": "integer",
                    "example": 300
                  },
                  "memoryMbytes": {
                    "type": "integer",
                    "example": 1024
                  },
                  "diskMbytes": {
                    "type": "integer",
                    "example": 2048
                  }
                }
              },
              "buildId": {
                "type": "string"
              },
              "defaultKeyValueStoreId": {
                "type": "string"
              },
              "defaultDatasetId": {
                "type": "string"
              },
              "defaultRequestQueueId": {
                "type": "string"
              },
              "buildNumber": {
                "type": "string",
                "example": "1.0.0"
              },
              "containerUrl": {
                "type": "string"
              },
              "usage": {
                "type": "object",
                "properties": {
                  "ACTOR_COMPUTE_UNITS": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATASET_READS": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATASET_WRITES": {
                    "type": "integer",
                    "example": 0
                  },
                  "KEY_VALUE_STORE_READS": {
                    "type": "integer",
                    "example": 0
                  },
                  "KEY_VALUE_STORE_WRITES": {
                    "type": "integer",
                    "example": 1
                  },
                  "KEY_VALUE_STORE_LISTS": {
                    "type": "integer",
                    "example": 0
                  },
                  "REQUEST_QUEUE_READS": {
                    "type": "integer",
                    "example": 0
                  },
                  "REQUEST_QUEUE_WRITES": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATA_TRANSFER_INTERNAL_GBYTES": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATA_TRANSFER_EXTERNAL_GBYTES": {
                    "type": "integer",
                    "example": 0
                  },
                  "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                    "type": "integer",
                    "example": 0
                  },
                  "PROXY_SERPS": {
                    "type": "integer",
                    "example": 0
                  }
                }
              },
              "usageTotalUsd": {
                "type": "number",
                "example": 0.00005
              },
              "usageUsd": {
                "type": "object",
                "properties": {
                  "ACTOR_COMPUTE_UNITS": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATASET_READS": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATASET_WRITES": {
                    "type": "integer",
                    "example": 0
                  },
                  "KEY_VALUE_STORE_READS": {
                    "type": "integer",
                    "example": 0
                  },
                  "KEY_VALUE_STORE_WRITES": {
                    "type": "number",
                    "example": 0.00005
                  },
                  "KEY_VALUE_STORE_LISTS": {
                    "type": "integer",
                    "example": 0
                  },
                  "REQUEST_QUEUE_READS": {
                    "type": "integer",
                    "example": 0
                  },
                  "REQUEST_QUEUE_WRITES": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATA_TRANSFER_INTERNAL_GBYTES": {
                    "type": "integer",
                    "example": 0
                  },
                  "DATA_TRANSFER_EXTERNAL_GBYTES": {
                    "type": "integer",
                    "example": 0
                  },
                  "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                    "type": "integer",
                    "example": 0
                  },
                  "PROXY_SERPS": {
                    "type": "integer",
                    "example": 0
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

LLM Dataset Processor OpenAPI definition

OpenAPI is a standard for designing and describing RESTful APIs, allowing developers to define API structure, endpoints, and data formats in a machine-readable way. It simplifies API development, integration, and documentation.

OpenAPI is effective when used with AI agents and GPTs by standardizing how these systems interact with various APIs, for reliable integrations and efficient communication.

By defining machine-readable API specifications, OpenAPI allows AI models like GPTs to understand and use varied data sources, improving accuracy. This accelerates development, reduces errors, and provides context-aware responses, making OpenAPI a core component for AI applications.

You can download the OpenAPI definitions for LLM Dataset Processor from the options below:

OpenAPI.json

If you’d like to learn more about how OpenAPI powers GPTs, read our blog post.

You can also check out our other API clients:

LLM Dataset Processor API in Python

LLM Dataset Processor API in JavaScript

LLM Dataset Processor API through CLI

LLM Dataset Processor API

Output to Dataset

njoylab/apify-output-to-dataset

Merges outputs from multiple actors into a single dataset. Execute actors in series or parallel, combine data from datasets, key-value stores, webhooks, and export the final output in various formats.

njoylab

5.0

Merge, Dedup & Transform Datasets

lukaskrivka/dedup-datasets

The ultimate dataset processor. Extremely fast merging, deduplications & transformations all in a single run.

Lukáš Křivka

4.7K

5.0

Sort Dataset Items

lukaskrivka/sort-dataset-items

Add this actor as a webhook to your scraper to sort the dataset by index field

Lukáš Křivka

🔥 Power Data Transformer

wiseek/power-data-transformer

🔥 Unlock your scraped data—clean, merge, split, deduplicate, filter, standardize, validate, enrich and sync—using built-in transformations and powerful SQL pipelines for ETL/ELT workflows. Seamlessly integrate processed datasets with automation platforms like n8n, Make.com, and Zapier.

wiseek

Dataset Toolbox

cyberfly/dataset-toolbox

Perform common actions on datasets - merge, unify, validate, transform, order fields etc.

Vasek Codey Vlcek

Contact Details Merge & Deduplicate

lukaskrivka/contact-details-merge-deduplicate

Merge and deduplicate all contacts extracted by Contact Details Scraper. Works with multiple datasets. One row per domain.

Lukáš Křivka

115

Data Cleaning Actor

tonic_teak/data-cleaning-actor

This actor automatically cleans, analyzes, and summarizes spreadsheet data. It handles different file types (CSV, XLSX), fixes missing values, detects outliers, generates charts, computes correlations, and returns a cleaned dataset along with downloadable files and visual insights.

Mitchell Wanjiru

Scraped Data Cleaner & Converter (No-Code CSV/JSON Tool) - PPE

m3web/scraped-data-cleaner-ppe

Clean and organize scraped .json or .csv data — no coding required. Remove duplicates, empty rows, unwanted columns, and sort by any field. Cleaned results are stored in Apify's Key-Value Store. Perfect for marketers, researchers, and no-code workflows.

M3Web

Find Sitemap from url

eesti/find-sitemap-from-url

A powerful [Apify Actor] that finds sitemap URLs for any website. This Actor helps you discover XML sitemaps by checking common locations, robots.txt files, and analyzing HTML content for sitemap links.

ando

162

1.0

Linkedin Post Scraper (No Cookie)

datadoping/linkedin-post-scraper

For just $1.2 per 1,000 posts, scrape LinkedIn post and get complete post information, including author information, post content, engagement data, media and tags etc. Note: If you're on free tier you can only scrape 100 posts in total.