Dataset Schema SuperActor

An automated Apify Actor that streamlines the creation of dataset schemas for Apify Actors. This SuperActor takes you through a complete workflow from input generation to GitHub pull request creation.

Overview

The Dataset Schema SuperActor generates, enhances, validates, and deploys dataset schemas for Apify Actors. It automates the entire lifecycle of dataset schema creation, ensuring consistency and quality across your Actor projects.

Five Core Steps

The Actor operates in five sequential steps, each of which can be enabled or skipped based on your needs. Each step is toggled by a boolean flag in the Actor input:
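
For example, a full run enables all five flags (the same names used in the Usage Examples below):

{
  "generateInputs": true,
  "generateSchema": true,
  "enhanceSchema": true,
  "validateSchema": true,
  "createPR": true
}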

📝 Step 1: Generate Test Inputs

This step uses Claude Sonnet 4 to generate four types of test inputs for your Actor:

  • Minimal Input: Basic input with only essential parameters
  • Normal Input: Realistic input with common optional parameters
  • Maximal Input: Comprehensive input utilizing all available parameters
  • Edge Input: Input designed to test error handling while still producing a dataset

The generated inputs are validated against the Actor to ensure they work correctly before proceeding.
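
For instance, for a hypothetical scraper that takes startUrls and maxItems (the same illustrative fields used in the Usage Examples below), the four generated inputs might look like this; the wrapper keys are only for side-by-side comparison:

{
  "minimalInput": { "startUrls": [{ "url": "https://example.com" }] },
  "normalInput": { "startUrls": [{ "url": "https://example.com" }], "maxItems": 50 },
  "maximalInput": { "startUrls": [{ "url": "https://example.com" }], "maxItems": 500, "proxyConfiguration": { "useApifyProxy": true } },
  "edgeInput": { "startUrls": [{ "url": "https://example.com/nonexistent-page" }], "maxItems": 1 }
}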

When to use:

  • You don't have existing test inputs
  • You want to generate comprehensive test coverage automatically
  • You're testing a new Actor

📊 Step 2: Generate Initial Schema

This step creates a base dataset schema. You can source it in one of three ways:

Option A: Generate Inputs First (Step 1)

  • Enable Step 1 to auto-generate test inputs
  • The Actor runs with these generated inputs
  • Schema is extracted from the output datasets

Option B: Provide Your Own Inputs

  • Set generateInputs: false
  • Provide existingMinimalInput, existingNormalInput, existingMaximalInput, existingEdgeInput
  • The Actor runs with your provided inputs
  • Schema is extracted from the output datasets

Option C: Use Real Production Datasets

  • Set generateInputs: false and useRealDatasetIds: true
  • The Actor queries Redash to find recent datasets for your Actor
  • Samples data from real production runs
  • Schema is generated from actual user data (no Actor runs needed)

The schema generator Actor analyzes the structure of items in the datasets and creates a JSON Schema definition.
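
For instance, if the sampled items look like this hypothetical product record:

{
  "title": "Blue T-Shirt",
  "price": 19.99,
  "inStock": true
}

the generator derives property definitions along these lines (see the full schema format under Technical Details):

{
  "title": { "type": "string" },
  "price": { "type": "number" },
  "inStock": { "type": "boolean" }
}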

When to use:

  • Option A: You want everything automated
  • Option B: You have proven test inputs that work
  • Option C: You want the most realistic schema based on actual usage patterns

✨ Step 3: Schema Enhancement

This step uses Claude Sonnet 4 to improve the initial schema by:

  • Adding clear field descriptions
  • Generating realistic, anonymized examples
  • Ensuring proper field types and formats
  • Creating dataset views (if enabled)
  • Making all fields nullable by default

Important: All fields start as nullable. After the Actor generates the enhanced schema, developers can review it and choose which fields should:

  • Not be nullable (required for schema validation)
  • Be added to the required array
  • Have stricter type constraints

The enhancer works with the existing schema structure and only improves the content—it doesn't add, remove, or rename fields.
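
As a sketch with a hypothetical price field, enhancement turns a bare definition such as { "type": "number" } into:

"price": {
  "type": "number",
  "description": "Product price in USD.",
  "nullable": true,
  "example": 19.99
}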

When to use:

  • You want a production-ready schema with documentation
  • You need dataset views for better visualization
  • You want to improve existing schema quality

🔍 Step 4: Schema Validation

This step queries Redash to find datasets from recent Actor runs and validates them against the schema:

  • Fetches datasets from the last N days (configurable via daysBack)
  • Samples data from each dataset
  • Validates data structure against the schema
  • Reports validation success rate and any errors

Validation ensures the schema accurately represents real Actor output data. It requires a 100% success rate before proceeding to PR creation.
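
Because every step can be toggled, a validation-only run against an already-finished schema should also be possible; a sketch, using only parameters that appear elsewhere in this README:

{
  "actorTechnicalName": "the-best-dev-ever/ultimate-scraper",
  "generateInputs": false,
  "generateSchema": false,
  "enhanceSchema": false,
  "existingEnhancedSchema": "{\"actorSpecification\": 1, \"fields\": {...}}",
  "validateSchema": true,
  "daysBack": 7,
  "maximumResults": 20,
  "createPR": false
}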

When to use:

  • You want to verify schema accuracy against real data
  • You need to catch schema issues before deploying
  • You want confidence that the schema works with production data

📝 Step 5: GitHub PR Creation

This step automates the GitHub workflow:

  1. Finds the target Actor's repository (supports monorepo structures)
  2. Locates the correct actor.json file in the Actor's directory
  3. Creates a new branch
  4. Generates dataset_schema.json with field definitions and views
  5. Moves existing views from actor.json to dataset_schema.json (if they exist)
  6. Updates actor.json to reference the schema
  7. Creates a pull request

When to use:

  • You want to deploy the schema to your Actor repository
  • You need an automated PR workflow
  • You're ready to integrate the schema into your codebase

GitHub Token Permissions Needed:

  • repo - Full control of private repositories
  • workflow - Update GitHub Action workflows (if applicable)

💸 Pricing & Costs

You pay for this Actor’s own runtime and for any background Actors it launches. In practice, the end-to-end workflow is very inexpensive:

  • Typical runs: well under $0.10 total.
  • Larger runs (more datasets, longer validation windows): usually below $1.

Actual cost depends on factors like number/size of datasets sampled, whether inputs are generated, LLM usage (Claude Sonnet 4), and how many validation checks you run.

Cost controls

  • Reduce daysBack and maximumResults during validation.
  • Disable steps you don’t need (generateInputs, validateSchema, createPR).

Note: Background Actors are billed separately under your Apify account, and LLM usage (e.g., Claude Sonnet 4) is billed via your configured API key if applicable.


Usage Examples

Full Workflow

Generate everything automatically from scratch:

{
  "actorTechnicalName": "the-best-dev-ever/ultimate-scraper",
  "generateInputs": true,
  "generateSchema": true,
  "enhanceSchema": true,
  "generateViews": true,
  "validateSchema": true,
  "createPR": true,
  "githubLink": "https://github.com/bestdev/actors",
  "githubToken": "ghp_your_token_here"
}

Using Your Own Inputs

Skip input generation and use your own test inputs:

{
  "actorTechnicalName": "the-best-dev-ever/ultimate-scraper",
  "generateInputs": false,
  "generateSchema": true,
  "existingMinimalInput": "{\"startUrls\": [{\"url\": \"https://example.com/page1\"}], \"maxItems\": 3}",
  "existingNormalInput": "{\"startUrls\": [{\"url\": \"https://example.com/page1\"}], \"maxItems\": 50}",
  "existingMaximalInput": "{\"startUrls\": [{\"url\": \"https://example.com/page1\"}], \"maxItems\": 500, \"extendOutputFunction\": \"...\"}",
  "existingEdgeInput": "{\"startUrls\": [{\"url\": \"https://example.com/nonexistent-page-999\"}], \"maxItems\": 1}",
  "enhanceSchema": true,
  "validateSchema": true,
  "createPR": true,
  "githubLink": "https://github.com/bestdev/actors",
  "githubToken": "ghp_your_token_here"
}

Using Real Production Datasets

Generate schema from production data instead of test inputs:

{
  "actorTechnicalName": "the-best-dev-ever/ultimate-scraper",
  "generateInputs": false,
  "generateSchema": true,
  "useRealDatasetIds": true,
  "enhanceSchema": true,
  "validateSchema": true,
  "daysBack": 7,
  "maximumResults": 20,
  "createPR": true,
  "githubLink": "https://github.com/bestdev/actors",
  "githubToken": "ghp_your_token_here"
}

Schema Enhancement Only

Enhance an existing schema without running any Actors:

{
  "actorTechnicalName": "the-best-dev-ever/ultimate-scraper",
  "generateInputs": false,
  "generateSchema": false,
  "enhanceSchema": true,
  "existingEnhancedSchema": "{\"actorSpecification\": 1, \"fields\": {...}}",
  "validateSchema": false,
  "createPR": true,
  "githubLink": "https://github.com/bestdev/actors",
  "githubToken": "ghp_your_token_here"
}

Skip PR Creation

Generate and validate schema without creating a PR:

{
  "actorTechnicalName": "the-best-dev-ever/ultimate-scraper",
  "generateInputs": true,
  "generateSchema": true,
  "enhanceSchema": true,
  "validateSchema": true,
  "createPR": false
}

Workflow Output

The Actor provides detailed progress information for each step:

{
  "success": true,
  "prUrl": "https://github.com/bestdev/actors/pull/123",
  "progress": {
    "inputGeneration": "completed",
    "schemaGeneration": "completed",
    "schemaEnhancement": "completed",
    "schemaValidation": "completed",
    "prCreation": "completed"
  },
  "details": {
    "actorName": "the-best-dev-ever/ultimate-scraper",
    "generatedSchema": {...},
    "validationResults": {...},
    "prInfo": {...}
  }
}

Technical Details

Schema Format

The Actor generates schemas following the Apify Actor specification format:

{
  "actorSpecification": 1,
  "fields": {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
      "fieldName": {
        "type": "string",
        "description": "Field description",
        "nullable": true,
        "example": "example_value"
      }
    },
    "required": []
  },
  "views": {
    "overview": {
      "title": "Overview",
      "transformation": { "fields": ["field1", "field2"] }
    }
  }
}

Dataset Views

Views are generated automatically when generateViews: true:

  • Overview view with all fields in table format
  • Field formatting based on type (image, link, date, number), as in the sketch below
  • Human-readable field labels derived from camelCase names
  • Existing views are moved from actor.json to dataset_schema.json
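
A generated view might therefore look roughly like this (field names are hypothetical; the display block follows the standard Apify dataset schema format):

"views": {
  "overview": {
    "title": "Overview",
    "transformation": { "fields": ["imageUrl", "productUrl", "price"] },
    "display": {
      "component": "table",
      "properties": {
        "imageUrl": { "label": "Image URL", "format": "image" },
        "productUrl": { "label": "Product URL", "format": "link" },
        "price": { "label": "Price", "format": "number" }
      }
    }
  }
}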

GitHub Integration

The Actor:

  1. Searches for Actor-specific actor.json files in monorepo structures
  2. Supports path patterns like actors/[actor-name]/.actor/actor.json
  3. Moves existing views from actor.json to the new dataset_schema.json
  4. Updates actor.json to reference dataset_schema.json via the storages.dataset field (see the example below)
  5. Preserves existing formatting and structure
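
For illustration, the resulting reference in actor.json is a small addition (file abridged, Actor name hypothetical):

{
  "actorSpecification": 1,
  "name": "ultimate-scraper",
  "storages": {
    "dataset": "./dataset_schema.json"
  }
}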

Support

For issues, questions, or contributions:

  • Check the Actor logs for detailed error messages
  • Review each step's output in the progress tracking
  • Open an issue (or better yet a PR) in the repository