Synthetic Dataset Generator
Pricing
Pay per event
Generate realistic synthetic datasets with correlated fields, built-in presets (user profiles, companies, e-commerce products, log events), custom schemas, deterministic seeding, and multiple output formats (JSON, CSV, NDJSON).
Developer
BowTiedRaccoon
What does Synthetic Dataset Generator do?
This actor generates synthetic (fake) data on demand -- no web scraping required. It produces structured datasets using Faker.js for realistic data generation and Copycat for deterministic, reproducible output. Use it to create test data, seed databases, populate staging environments, or benchmark data pipelines.
Key features
- Built-in presets -- One-click generation for user profiles, companies, e-commerce products, and server log events
- Custom schemas -- Define your own field names and types to generate exactly the data shape you need
- Cross-field correlations -- Age tracks with salary, company size correlates with revenue, out-of-stock items have zero quantity
- Deterministic mode -- Set a random seed for identical output across runs (same seed + same config = same data every time)
- Multiple output formats -- JSON (Apify dataset), CSV, or NDJSON
- 16 locales -- Generate names, addresses, and phone numbers in English, German, French, Japanese, Chinese, and more
- Fast and cheap -- No proxy, no browser, no external API calls. Generates thousands of records per second on 256 MB of memory (approximate run times are listed under Performance)
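To illustrate how deterministic seeding and cross-field correlations can work together, here is a minimal Python sketch. It is not the actor's implementation (the actor is built on Faker.js and Copycat); the function names, value ranges, and the salary formula are assumptions made up for the example.

```python
import random


def make_profile(rng: random.Random) -> dict:
    """Build a record whose salary loosely tracks age (illustrative formula)."""
    age = rng.randint(22, 65)
    # Hypothetical correlation: base pay grows with years past 22, plus noise.
    base = 40_000 + (age - 22) * 1_500
    salary = int(base * rng.uniform(0.8, 1.2))
    return {"age": age, "salary": salary}


def make_product(rng: random.Random) -> dict:
    """Build a record where out-of-stock items always have zero quantity."""
    in_stock = rng.random() < 0.9
    quantity = rng.randint(1, 500) if in_stock else 0
    return {"in_stock": in_stock, "quantity": quantity}


def generate(seed: int, count: int) -> list:
    # A dedicated Random instance keeps the stream independent of global state.
    rng = random.Random(seed)
    return [{**make_profile(rng), **make_product(rng)} for _ in range(count)]


first = generate(seed=42, count=3)
second = generate(seed=42, count=3)
print(first == second)  # prints True: same seed and config, same records
```

Deriving every value from one seeded generator is what makes the run repeatable; any call that draws from an unseeded source would break that guarantee.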
Built-in presets
| Preset | Fields | Example use case |
|---|---|---|
| User Profiles | id, name, email, phone, DOB, age, gender, address, job title, company, salary | Test user databases, CRM seed data |
| Companies | company ID, name, industry, employees, revenue, founded year, website, CEO | Business directory mock data |
| E-commerce Products | product ID, name, description, category, price, SKU, stock, rating, reviews | Product catalog testing |
| Log Events | timestamp, level, service, host, request ID, method, path, status code, response time | Log pipeline testing, SIEM demos |
Custom schema
Define any combination of fields with the customSchema input (JSON array):
[
  { "name": "user_id", "type": "uuid" },
  { "name": "username", "type": "name" },
  { "name": "signup_date", "type": "datetime" },
  { "name": "plan", "type": "enum", "options": { "values": ["free", "pro", "enterprise"], "weights": [0.7, 0.2, 0.1] } },
  { "name": "monthly_spend", "type": "number", "options": { "min": 0, "max": 500 } }
]
Supported field types
string, integer, number, boolean, date, datetime, email, phone, address, name, first_name, last_name, company, url, uuid, city, state, zip, country, job_title, salary, sentence, paragraph, enum
For the enum type, pass options.values (an array of choices) and, optionally, options.weights (one probability weight per choice).
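As a rough picture of how a field definition could be turned into a value, the Python sketch below handles three of the listed types (uuid, number, enum). It is a simplified stand-in, not the actor's code: `generate_field`, `generate_record`, the default number range, and the two-decimal rounding are assumptions for illustration.

```python
import random
import uuid


def generate_field(field: dict, rng: random.Random):
    """Produce one value for a field definition (subset of types only)."""
    kind = field["type"]
    options = field.get("options", {})
    if kind == "uuid":
        # Derive the UUID from the seeded generator so it is reproducible.
        return str(uuid.UUID(int=rng.getrandbits(128), version=4))
    if kind == "number":
        # Assumed defaults of 0 and 1 when min/max are omitted.
        low, high = options.get("min", 0), options.get("max", 1)
        return round(rng.uniform(low, high), 2)
    if kind == "enum":
        values = options["values"]
        weights = options.get("weights")  # None means a uniform choice
        if weights is not None and len(weights) != len(values):
            raise ValueError("weights must have one entry per value")
        return rng.choices(values, weights=weights, k=1)[0]
    raise NotImplementedError(f"type {kind!r} is not covered by this sketch")


def generate_record(schema: list, rng: random.Random) -> dict:
    return {field["name"]: generate_field(field, rng) for field in schema}


schema = [
    {"name": "user_id", "type": "uuid"},
    {"name": "plan", "type": "enum",
     "options": {"values": ["free", "pro", "enterprise"], "weights": [0.7, 0.2, 0.1]}},
    {"name": "monthly_spend", "type": "number", "options": {"min": 0, "max": 500}},
]
print(generate_record(schema, random.Random(1)))
```

With the weights shown, roughly 70% of generated records would land on the free plan over a large run.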
Input
| Parameter | Type | Default | Description |
|---|---|---|---|
| preset | string | user_profiles | Built-in preset, or custom to use a custom schema |
| recordCount | integer | 100 | Number of records to generate (1 to 500,000) |
| customSchema | string | | JSON array of field definitions (only used when preset is custom) |
| locale | string | en | Language/region for generated data |
| seed | integer | 0 | Random seed for deterministic output (0 = random) |
| outputFormat | string | json | Output format: json, csv, or ndjson |
| enableCorrelations | boolean | true | Apply cross-field correlations for realistic data |
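The following Python sketch assembles an input object from the parameters above and checks the documented limits before a run is started. The helper name `build_custom_input` is hypothetical, and serializing customSchema with `json.dumps` is an assumption based on the table listing that parameter as a string.

```python
import json

ALLOWED_FORMATS = {"json", "csv", "ndjson"}


def build_custom_input(schema, record_count=100, seed=0,
                       output_format="json", locale="en",
                       enable_correlations=True):
    """Assemble an input object using the parameter names from the table."""
    if not 1 <= record_count <= 500_000:
        raise ValueError("recordCount must be between 1 and 500,000")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")
    return {
        "preset": "custom",
        "recordCount": record_count,
        # Listed as a string in the table, so the array is serialized here.
        "customSchema": json.dumps(schema),
        "locale": locale,
        "seed": seed,  # 0 means a random, non-reproducible run
        "outputFormat": output_format,
        "enableCorrelations": enable_correlations,
    }


run_input = build_custom_input(
    [{"name": "user_id", "type": "uuid"}],
    record_count=1000,
    seed=42,
    output_format="ndjson",
)
print(json.dumps(run_input, indent=2))
```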
Output
JSON format (default)
Records are pushed to the Apify dataset. Each record is a flat JSON object matching the selected preset or custom schema.
CSV / NDJSON format
The file is saved to the key-value store under the key OUTPUT. A summary record is also pushed to the dataset with download instructions.
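For readers unfamiliar with the two file formats, this short Python sketch shows how the same flat records look as NDJSON (one JSON object per line) and as CSV (a header row, then one row per record). It uses only the standard library and is not taken from the actor; the helper names and sample records are made up.

```python
import csv
import io
import json


def to_ndjson(records: list) -> str:
    """Serialize each record as one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


def to_csv(records: list) -> str:
    """Serialize flat records as CSV, taking the header from the first record."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = [
    {"id": 1, "name": "Lydia", "plan": "free"},
    {"id": 2, "name": "Taryn", "plan": "pro"},
]
print(to_ndjson(records))
print(to_csv(records))
```

NDJSON can be streamed and parsed line by line, which suits log pipelines; CSV is usually the easier choice for spreadsheets and bulk database imports.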
Example output (user_profiles preset)
{
  "id": "a1abcd86-4c43-4300-864f-066b9f5e43eb",
  "first_name": "Lydia",
  "last_name": "MacGyver",
  "email": "Taryn8@gmail.com",
  "phone": "(444) 909-7300",
  "date_of_birth": "2005-01-20",
  "age": 21,
  "gender": "Female",
  "address": "8239 Johnston Shore",
  "city": "West Chanelleburgh",
  "state": "Utah",
  "zip": "00084",
  "country": "Holy See (Vatican City State)",
  "job_title": "Principal Applications Developer",
  "company": "Nienow-Gibson, Bruen and Mayer",
  "salary": 109866,
  "created_at": "2025-09-23T21:04:26.698Z"
}
Cost
This actor uses Pay Per Event pricing:
- $0.10 per actor start
- $0.0001 per data record generated
Example: Generating 10,000 user profiles costs $0.10 (start) + $1.00 (records) = $1.10 total.
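The same arithmetic can be written as a small Python helper for estimating a run before starting it. The two prices come from the list above; the function name is hypothetical, and `Decimal` is used so that fractions of a cent are not distorted by floating-point rounding.

```python
from decimal import Decimal

START_FEE = Decimal("0.10")         # charged once per actor start
PER_RECORD_FEE = Decimal("0.0001")  # charged per generated record


def estimate_cost(record_count: int) -> Decimal:
    """Return the estimated cost in USD for one run."""
    if record_count < 0:
        raise ValueError("record_count must not be negative")
    return START_FEE + PER_RECORD_FEE * record_count


print(estimate_cost(10_000))  # 1.1000, i.e. $1.10
```

At these rates, the maximum run of 500,000 records would come to $50.10.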
Performance
| Records | Approximate time | Memory |
|---|---|---|
| 100 | < 1 second | 256 MB |
| 1,000 | ~1 second | 256 MB |
| 10,000 | ~5 seconds | 256 MB |
| 100,000 | ~30 seconds | 256 MB |
Use cases
- Database seeding -- Populate development and staging databases with realistic test data
- API testing -- Generate request/response payloads for load testing and integration tests
- Data pipeline validation -- Feed synthetic data through ETL pipelines to verify transformations
- UI prototyping -- Fill dashboards and reports with realistic-looking data
- Machine learning -- Generate training data for models that need structured tabular input
- Demo environments -- Create convincing demo data without using real customer information
Need more features?
If you need additional field types, presets, or output formats, file an issue or get in touch. We actively maintain this actor and welcome feature requests.