🔥 Power Data Transformer

Pricing: Pay per event

Developed by wiseek. Maintained by Community.

Streamline data integration and automation. Clean, merge, deduplicate, enrich, and transform your data using flexible multi-step pipelines and the full power of SQL

🚀 Why Use This Actor

Power Data Transformer is a unified data transformation and automation tool that handles everything from simple cleaning tasks to building an entire data lake around your scraped data on the Apify platform. It streamlines post-scraping data processing by:

  • Cleaning and standardizing raw data
  • Merging multiple data sources
  • Enriching data with additional information
  • Supporting incremental updates
  • Managing data deduplication
  • Enabling reference data mapping
  • Supporting multiple data sources and targets
  • Running multi-step pipelines and SQL-based transformations for complex data processing
  • More built-in transformations coming soon
  • Ready for Actor integration, so it plugs easily into existing actors

The tool is optimized for performance and scalability, ensuring that your scraped data is clean, consistent, and ready for immediate use.


💡 Quick Start: Your First Transformation

Here’s a minimal example to run your first transformation.

1. 🔼 Select a Source Dataset: Choose a dataset in the Source Datasets field. It will be referenced as $0 in your transformations.

2. 🔄 Define a Transformation: Enter a simple SQL query to select the first 10 records:

SELECT * FROM {{$0}} LIMIT 10

Note: In SQL queries, references must be wrapped in {{...}}.

3. 🔽 Define an Output: To save the result to the default dataset for this run, use:

dataset:

After running the actor, you can find the transformed data in the output dataset. For more complex scenarios, refer to the Example Use Cases as templates.
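Putting it together, the three input fields for this minimal run look like this (a sketch; the source dataset is whichever one you selected in step 1):

Source Datasets: (your selected dataset, referenced as $0)

Transformations:
SELECT * FROM {{$0}} LIMIT 10

Outputs:
dataset: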


βš™οΈ How it works: understanding pipeline

Power Data Transformer is a powerful and flexible tool. This means that while it can handle very complex tasks, it might take a little time to get used to all its features. Once you're familiar with the core concepts, you'll be able to build sophisticated data pipelines with ease.

A pipeline is defined by a series of steps, processed as a Directed Acyclic Graph (DAG), consisting of three main parts: Sources, Transformations, and Outputs. You can copy and paste the following examples into the input UI to get started.

1. Data Referencing: The Core of Your Pipeline

In Power Data Transformer, you use special symbols $n and #n (where n is a positive integer) to refer to your data sources and the results of your transformation steps. The key is how you use them:

General Rules

  • $n - Referencing a Source:

    • $0 always refers to the main dataset(s) selected in the Datasets UI field.
    • $1, $2, etc., refer to the data sources defined in the Sources field, in the order they are listed (1-indexed).
  • #n - Referencing a Transformation Result:

    • #1, #2, etc., refer to the result of the corresponding transformation step in the pipeline (1-indexed).

How to Use References

  1. In SQL Queries: References must be wrapped in double curly braces {{...}}. This tells the actor to substitute the reference with the actual data table. Bare references like $1 or #2 will be treated as plain text.

    • Correct: SELECT * FROM {{$0}}
    • Incorrect: SELECT * FROM $0
  2. In built-in transformations (dedup, ref, etc.): The {{...}} wrapper is optional for the from and ref_table parameters. Both of the following are valid and do the same thing:

    • dedup: from='#1', ...
    • dedup: from='{{#1}}', ...

Shorthand and Default Behaviors

To make your pipelines cleaner and easier to write, Power Data Transformer includes some convenient shorthands:

  • $: When used alone, $ is a shorthand that automatically resolves to the appropriate main data source:
    • If datasets are selected in the Datasets field, $ refers to $0.
    • If no datasets are selected, $ refers to $1 (the first source in Additional Sources).
  • #: When used alone, # refers to the result of the immediately preceding step. This is very useful for chaining transformations together.
  • Omitting from: In built-in transformation functions (like dedup, select, etc.), the from parameter is optional. If you omit it, it automatically defaults to from='#', using the result of the previous step as its input.
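As a quick sketch of these shorthands (the email column is an assumption), the following two-step pipeline filters the main source using $ and then deduplicates the result without naming it, because from defaults to #:

SELECT * FROM {{$}} WHERE email IS NOT NULL
dedup: key_fields='email'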

Example Illustrating the Rules:

Consider a pipeline where we want to find all recently active users and enrich their profiles with their country name.

  • Datasets:
    • A dataset of user profiles (referenced as $0).
  • Sources:
    1. A dataset of recent user activity, containing user_id (referenced as $1).
    2. A key-value store with country codes and names (referenced as $2).
  • Transformations:
    1. ref_in: from='$0', ref_table='$1', key_fields='id', ref_fields='user_id'
    2. ref: ref_table='$2', key_fields='country_code', fields='country_name'

Explanation:

  • Step 1 takes the main user profiles ($0) and filters them, keeping only the users who are also present in the recent activity dataset ($1). The result of this step is implicitly passed to the next one.
  • Step 2 takes the active users from the previous step (using the default from='#') and joins them with the country codes table ($2) to add a country_name column.
  • #1 could be used in a later step to refer back to the result of the first transformation (the filtered, active users).
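Entered into the input fields, this pipeline would look roughly as follows (a sketch; the user-profile dataset is selected in the Datasets field as $0, and the source names are placeholders):

Sources:
dataset: name='my-org/recent-user-activity'
kvstore: name='reference-data', key='country-codes', format='json'

Transformations:
ref_in: from='$0', ref_table='$1', key_fields='id', ref_fields='user_id'
ref: ref_table='$2', key_fields='country_code', fields='country_name'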

🔼 2. Sources: Where Your Data Comes From

Sources define the data you want to process. You can pull data from various locations, including Apify datasets, Key-Value Stores, cloud storage, or public URLs.

A single source can include multiple data files by using filters or wildcards (such as * or ?). All files included in one source must have the same schema type, and the system will automatically merge them into a unified table for processing.

  • Supported Sources: dataset, kvstore, http, s3, gcs, r2, and meta.

  • Format: Each source can specify a format (csv, json, or parquet) to ensure the data is read correctly.

    • If the source name or URI already includes a file extension indicating the format, the format parameter can be omitted.

Examples:

Load all datasets from the latest 1-day run of an actor and assign them to $1:

dataset: actor_name='apify/website-content-crawler', last='1 day'

Load a specific key-value store record in JSON format and assign it to $2:

kvstore: name='reference-data', key='country-codes', format='json'
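A single source line can also pull in several records at once with a wildcard key (a sketch; the store name is an assumption):

kvstore: name='reference-data', key='country-*', format='json'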

For detailed parameters of each source type, see the Input Reference section.

🔄 3. Transformations: How You Process Your Data

Transformations are the core of your pipeline, where you clean, reshape, and enrich your data using SQL queries or our built-in functions (dedup, select, ref, ref_in).

This actor uses DuckDB as its SQL engine, a powerful and efficient database designed for analytical queries. This means you can leverage a rich set of SQL features and an extensive library of built-in functions to handle everything from simple data cleaning to complex aggregations and transformations.

For detailed syntax and usage, refer to the official DuckDB SQL Documentation.
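For example, standard DuckDB string functions and aggregations can be combined in a single step (a sketch; the email column is an assumption):

SELECT LOWER(SPLIT(email, '@')[2]) AS domain, COUNT(*) AS user_count
FROM {{$0}}
GROUP BY domain
ORDER BY user_count DESC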

Example of a multi-step transformation:

# Step 1: Clean data from the main source ($0) and create a new 'domain' column.
SELECT id, TRIM(name) as name, email, SPLIT(email, '@')[2] as domain
FROM {{$0}}
WHERE email IS NOT NULL AND name != ''
# Step 2: Deduplicate the result of the previous step based on email.
# 'from' is omitted and defaults to the result of Step 1.
dedup: key_fields='email', order_by='scrapedAt desc'
# Step 3: Join the deduplicated data with a country reference table from the first source ($1).
# 'from' is omitted and defaults to '#'.
ref: ref_table='$1', key_fields='country_code', ref_fields='code', fields='country_name'

For detailed syntax, refer to the DuckDB documentation and our Input Reference.

🔽 4. Outputs: Where Your Data Goes

Outputs define the destination for your processed data. You can save the result of any transformation step to an Apify dataset, Key-Value Store, or cloud storage.

  • Supported Outputs: dataset, kvstore, s3, gcs, r2.
  • Write Modes: You can overwrite files, append data, or perform upsert operations for incremental updates.

Examples:

Save the final result to a new dataset named cleaned-data:

dataset: name='cleaned-data'

Save the result of the first transformation step (#1) to a key-value store:

kvstore: from='#1', name='intermediate-results', key='step1-output', format='parquet'

Upsert the final result into a table in an S3 bucket:

s3: uri='s3://my-bucket/processed/products.parquet', write_mode='upsert', key_fields='productId'
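Keep a dated snapshot of an intermediate step by combining from with the {{current_date}} placeholder (a sketch; the store name is an assumption):

kvstore: from='#2', name='daily-exports', key='products-{{current_date}}', format='parquet'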

For detailed parameters of each output type, see the Input Reference section.




💰 Usage Costs and Performance Tips

The cost of running this actor is typically just a few cents per run for simple transformations on small data. Actual costs depend on memory allocation, the number of transformation steps, and the amount of data processed (read and written). More complex pipelines or processing large volumes of data may result in higher costs. Always review your configuration and monitor usage to optimize for both performance and cost.

Pricing Breakdown

Charged Event | Description | Price (USD)
--- | --- | ---
Actor Start | Fee for starting an actor, charged per 1 GB of allocated memory. | $0.005
Transformation Step | Fee for each transformation step, charged per 1 GB of allocated memory. | $0.01
Dataset Write | Fee for writing to a dataset, charged per 100 records. | $0.001
Key-Value Store Write | Fee for writing to a key-value store, charged per 1 MB. | $0.0002
Cloud Storage Write | Fee for writing to cloud storage (S3, GCS, R2), charged per 1 MB. | $0.001

Notes:

  • Prices shown are for the free-tier plan. Other plans may offer additional discounts.
  • Reading from a dataset (charged by Apify) is approximately $0.0004 per 1,000 records on the free-tier plan.
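As a rough illustration at the free-tier prices above (a hypothetical run): 1 GB of memory, one transformation step, and 1,000 records written to a dataset comes to about $0.005 + $0.01 + 10 × $0.001 = $0.025, plus roughly $0.0004 for reading 1,000 input records.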

Memory Configuration Advice

Memory is the primary driver of cost and performance for this actor. The amount of memory required depends directly on the volume and complexity of the data you are processing. Here are some tips for optimizing your memory settings:

  • Start Small and Observe: For a new transformation, start with a lower memory setting (e.g., 1024 MB). After the run completes, check the Max memory value in the run logs.
  • Set an Appropriate Buffer: As a general rule, set the memory for your next run to be at least 1.5x to 2x the observed maximum memory usage from a similar, successful run. This provides a safe buffer for fluctuations in data size.
  • Handling OOM Errors: If your actor run fails with an Out Of Memory (OOM) error, you will need to increase the memory allocation for the next run.
  • Filter Early: Reduce the amount of data processed in later, more complex steps by filtering out unneeded records as early as possible in your transformation pipeline.
  • Be Mindful of Complex Operations: Operations like JOINs on large datasets or GROUP BY on columns with many unique values are memory-intensive. Ensure you have allocated sufficient memory when performing these transformations.
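For example, making the very first step a narrow SQL filter (a sketch; the column names are assumptions) keeps the data reaching later joins and aggregations small:

SELECT id, email, country_code
FROM {{$0}}
WHERE email IS NOT NULL AND scrapedAt >= '2024-01-01'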

Storage File Format Recommendation

When saving data to Key-Value Store or S3, it is recommended to use the parquet format whenever possible. Parquet is a modern, columnar storage format that significantly reduces storage size and improves read performance, making it a best practice in modern data stacks.
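For example, switching a key-value store output to Parquet is just a matter of setting the format (a sketch; the store name and key are assumptions):

kvstore: name='processed-data', key='output', format='parquet'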


💬 Feedback

We’re always working to improve the performance of our Actors. If you have technical feedback, feature requests, or have found a bug in Power Data Transformer, please create an issue on the Actor’s Issues tab in the Apify Console.

For any questions, suggestions, or support needs, feel free to email us at info@wiseek.io.

📚 Input Reference

This section provides a detailed list of all supported types and parameters for each configuration stage.

🔼 Sources

Define where your data comes from in the Sources field. Each line should adhere to the following format:

<Data Source type>: <param1>='value1', <param2>='value2' ...

Example

kvstore: id='Eorbqkh71bsuARzkq', key='google-map-*', format='json'
http: url='https://example.com/data.json', format='json'
s3: uri='s3://bucket-name/data.csv', format='csv'

Supported Source Types & Parameters

  1. dataset

    • Parameters:
      Parameter | Description
      --- | ---
      id | Dataset ID
      name | Dataset name (e.g., username/dataset-name)
      actor_id | Actor ID to fetch dataset from
      actor_name | Actor name to fetch dataset from
      task_id | Task ID to fetch dataset from
      task_name | Task name to fetch dataset from
      start_date | Start date for filtering (format: YYYY-MM-DD or YYYY-MM-DD HH:MM:SS)
      end_date | End date for filtering (format: YYYY-MM-DD or YYYY-MM-DD HH:MM:SS)
      last | Time range for filtering (e.g., '5 days', '3 hours')
    • Note:
      • id/name, actor_id/actor_name, and task_id/task_name are mutually exclusive.
      • last and start_date/end_date are mutually exclusive.
  2. kvstore (or key-value store)

    • Parameters:
      Parameter | Description
      --- | ---
      id | Key-value store ID
      name | Key-value store name
      key | Key of the record to fetch; wildcards * and ? are supported.
      format | csv, json, or parquet.
      ... | Other parameters like actor_id, last, etc., are similar to dataset.
  3. http

    • Parameters:
      Parameter | Description
      --- | ---
      url | URL of the data source.
      format | csv, json, or parquet. Optional if the URL implies the format.
  4. s3, gcs, r2

    • Parameters:
      Parameter | Description
      --- | ---
      uri | URI of the data in cloud storage.
      format | csv, json, or parquet. Optional if the URI implies the format.
    • Note: Requires credentials to be set in the Advanced Settings section.
  5. meta

    • Description: Access internal metadata about your Apify resources, such as a list of all your datasets, actor runs, or tasks. This is useful for dynamic pipelines that need to operate on other resources.
    • Example: meta: table='datasets'
    • Parameters:
      Parameter | Description
      --- | ---
      table | The type of metadata to query. Supported values are: datasets, runs, tasks, key-value stores, actors.
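For example, you could register run metadata as an additional source and inspect it with a plain SQL step (a sketch; SELECT * is used because the metadata columns are not documented here):

Sources:
meta: table='runs'

Transformations:
SELECT * FROM {{$1}} LIMIT 100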

🔄 Transformations

Define the steps to process your data. Each line is a step in the pipeline.

Supported Transformation Types

  1. SQL Query

    • Description: A standard SQL query. Remember to wrap references like {{$0}} or {{#1}} in curly braces.
    • Example: SELECT id, name, LOWER(email) as email FROM {{$0}} WHERE name IS NOT NULL
  2. Built-in Transformations

    • Description: Pre-defined functions to simplify common tasks.

    • Syntax: <function_name>: from='<reference>', param1='value1', ...

    • dedup: Removes duplicate records.

      • Example: dedup: key_fields='email', order_by='createdAt desc'
      • Parameters:
        Parameter | Description
        --- | ---
        from | Optional. Source table to process (e.g., $0, #1). Defaults to the previous step (#).
        key_fields | Comma-separated fields to identify duplicates.
        order_by | Field(s) to sort by to decide which record to keep.
        reverse | true or false. Set to true for descending order.
    • select: Selects, renames, excludes, or adds columns.

      • Example: select: include='id,name', append='domain=SPLIT(email, ''@'')[2]'
      • Parameters:
        Parameter | Description
        --- | ---
        from | Optional. Source table to process. Defaults to the previous step (#). Both from='...' and from='{{...}}' are accepted.
        include | Comma-separated fields to include.
        exclude | Comma-separated fields to exclude.
        replace | Fields to replace (e.g., age=age*2).
        append | Fields to append (e.g., domain=SPLIT(email, '@')[2]).
    • ref: Joins with a reference table to add columns (LEFT JOIN).

      • Example: ref: ref_table='{{$1}}', key_fields='country_code', ref_fields='code', fields='country_name'
      • Parameters:
        Parameter | Description
        --- | ---
        from | Optional. The main table. Defaults to the previous step (#).
        ref_table | The reference table to join with. Both ref_table='...' and ref_table='{{...}}' are accepted.
        key_fields | Key(s) from the source table for joining.
        ref_fields | Key(s) from the reference table for joining.
        fields | Comma-separated fields to add from the reference table.
        filter | true or false. If true, acts like an INNER JOIN.
    • ref_in: Filters rows based on a reference table (SEMI/ANTI JOIN).

      • Example: ref_in: ref_table='{{$1}}', key_fields='id', ref_fields='user_id', anti=true
      • Parameters:
        Parameter | Description
        --- | ---
        from | Optional. The table to filter. Defaults to the previous step (#).
        ref_table | The reference table to check against. Both ref_table='...' and ref_table='{{...}}' are accepted.
        key_fields | Key(s) from the source table.
        ref_fields | Key(s) from the reference table.
        anti | true or false. If true, keeps only rows that do not have a match.

🔽 Outputs

Define where to save your transformed data. Each line should adhere to the following format:

<Output type>: <param1>='value1', <param2>='value2' ...

Example

dataset:
dataset: name='cleaned-emails.json'
kvstore: name='processed-data', key='output', format='json'
s3: uri='s3://bucket-name/output.csv', format='csv'

Supported Output Types & Parameters

  1. dataset

    • Parameters:
      Parameter | Description
      --- | ---
      from | Optional. The transformation step to save (e.g., #2). Defaults to the last step. Both from='...' and from='{{...}}' are accepted.
      id / name | The ID or name of the destination dataset. Defaults to the actor's default dataset.
  2. kvstore

    • Parameters:
      Parameter | Description
      --- | ---
      from | Optional. The transformation step to save. Both from='...' and from='{{...}}' are accepted.
      id / name | The ID or name of the destination store.
      key | The key for the record. Placeholders {{current_date}} and {{current_time}} are supported.
      format | csv, json, or parquet.
      write_mode | overwrite (default), append, incremental, or upsert.
      key_fields | Required for incremental or upsert modes.
  3. s3, gcs, r2

    • Parameters:
      Parameter | Description
      --- | ---
      from | Optional. The transformation step to save. Both from='...' and from='{{...}}' are accepted.
      uri | The URI for the cloud storage destination. Placeholders are supported.
      format | csv, json, or parquet.
      write_mode | overwrite (default), append, incremental, or upsert.
      key_fields | Required for incremental or upsert modes.
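For example, an incremental key-value store output that upserts on a record ID might look like this (a sketch; the store name, key, and key field are assumptions):

kvstore: name='user-profiles', key='all-users', format='parquet', write_mode='upsert', key_fields='id'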

Advanced Settings

  • Cloud Storage Credentials: Provide your credentials (Key ID, Secret, etc.) if you are using cloud storage.
  • Timezone: Set the timezone for date-related operations (e.g., {{current_date}}). Default is UTC.
  • Debug: Enable to get more detailed logs.