
🔥 Power Data Transformer
Streamline data integration and automation. Clean, merge, deduplicate, enrich, and transform your data using flexible multi-step pipelines and the full power of SQL.
Table of Contents
- 📖 Why Use This Actor
- 💡 Quick Start: Your First Transformation
- ⚙️ How It Works: Understanding the Pipeline
- 💰 Usage Costs and Performance Tips
- 💬 Feedback
- 📚 Input Reference
📖 Why Use This Actor
Power Data Transformer is a unified data transformation and automation tool that handles everything from simple cleaning tasks to building an entire data lake around your scraped data on the Apify platform. It streamlines post-scraping data processing by:
- Cleaning and standardizing raw data
- Merging multiple data sources
- Enriching data with additional information
- Supporting incremental updates
- Managing data deduplication
- Enabling reference data mapping
- Reading from and writing to multiple data sources and targets
- Running multi-step pipelines with SQL-based transformations for complex processing
- Expanding its set of built-in transformations (more coming soon)
- Offering ready-made Actor integration so it can be plugged into existing Actor workflows
The tool is optimized for performance and scalability, ensuring that your scraped data is clean, consistent, and ready for immediate use.
💡 Quick Start: Your First Transformation
Here's a minimal example to run your first transformation.
1. 💼 Select a Source Dataset
   Choose a dataset in the Source Datasets field. It will be referenced as `$0` in your transformations.
2. 🔄 Define a Transformation
   Enter a simple SQL query to select the first 10 records:
   `SELECT * FROM {{$0}} LIMIT 10`
   Note: In SQL queries, references must be wrapped in `{{...}}`.
3. 💽 Define an Output
   To save the result to the default dataset for this run, use:
   `dataset:`
After running the actor, you can find the transformed data in the output dataset. For more complex scenarios, refer to the Example Use Cases as templates.
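Putting the three fields together, here is a slightly fuller sketch (the `email` column and the `cleaned-sample` dataset name are illustrative assumptions, not required values). The Transformations field chains a SQL filter with a built-in `dedup` step, and the Outputs field writes the final result to a named dataset:

Transformations:
```
SELECT * FROM {{$0}} WHERE email IS NOT NULL
dedup: key_fields='email'
```

Outputs:
```
dataset: name='cleaned-sample'
```

Because `from` is omitted in the `dedup` step, it automatically consumes the result of the SQL step before it.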
⚙️ How It Works: Understanding the Pipeline
Power Data Transformer is a powerful and flexible tool. This means that while it can handle very complex tasks, it might take a little time to get used to all its features. Once you're familiar with the core concepts, you'll be able to build sophisticated data pipelines with ease.
A pipeline is defined by a series of steps, processed as a Directed Acyclic Graph (DAG), consisting of three main parts: Sources, Transformations, and Outputs. You can copy and paste the following examples into the input UI to get started.
1. Data Referencing: The Core of Your Pipeline
In Power Data Transformer, you use the special symbols `$n` and `#n` (where n is an index number) to refer to your data sources and to the results of your transformation steps. The key is how you use them:
General Rules
- `$n` (source reference): `$0` always refers to the main dataset(s) selected in the Datasets UI field. `$1`, `$2`, etc., refer to the data sources defined in the Sources field, in the order they are listed (1-indexed).
- `#n` (transformation result reference): `#1`, `#2`, etc., refer to the result of the corresponding transformation step in the pipeline (1-indexed).
How to Use References
- In SQL queries: References must be wrapped in double curly braces `{{...}}`. This tells the actor to substitute the reference with the actual data table. Bare references like `$1` or `#2` will be treated as plain text.
  - Correct: `SELECT * FROM {{$0}}`
  - Incorrect: `SELECT * FROM $0`
- In built-in transformations (`dedup`, `ref`, etc.): The `{{...}}` wrapper is optional for the `from` and `ref_table` parameters. Both of the following are valid and do the same thing:
  - `dedup: from='#1', ...`
  - `dedup: from='{{#1}}', ...`
Shorthand and Default Behaviors
To make your pipelines cleaner and easier to write, Power Data Transformer includes some convenient shorthands:
- `$`: When used alone, `$` is a shorthand that automatically resolves to the appropriate main data source:
  - If datasets are selected in the Datasets field, `$` refers to `$0`.
  - If no datasets are selected, `$` refers to `$1` (the first source in Additional Sources).
- `#`: When used alone, `#` refers to the result of the immediately preceding step. This is very useful for chaining transformations together.
- Omitting `from`: In built-in transformation functions (like `dedup`, `select`, etc.), the `from` parameter is optional. If you omit it, it automatically defaults to `from='#'`, using the result of the previous step as its input.
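A minimal sketch of these shorthands (the `email` and `internal_notes` columns are illustrative assumptions): the first step reads from `$`, the main data source, and the second step omits `from`, so it consumes the result of the first step:

```
dedup: from='$', key_fields='email'
select: exclude='internal_notes'
```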
Example Illustrating the Rules:
Consider a pipeline where we want to find all recently active users and enrich their profiles with their country name.
- Datasets:
  - A dataset of user profiles (referenced as `$0`).
- Sources:
  - A dataset of recent user activity, containing `user_id` (referenced as `$1`).
  - A key-value store with country codes and names (referenced as `$2`).
- Transformations:
  - `ref_in: from='$0', ref_table='$1', key_fields='id', ref_fields='user_id'`
  - `ref: ref_table='$2', key_fields='country_code', fields='country_name'`
Explanation:
- Step 1 takes the main user profiles (`$0`) and filters them, keeping only the users who are also present in the recent activity dataset (`$1`). The result of this step is implicitly passed to the next one.
- Step 2 takes the active users from the previous step (using the default `from='#'`) and joins them with the country codes table (`$2`) to add a `country_name` column.
- `#1` could be used in a later step to refer back to the result of the first transformation (the filtered, active users).
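For comparison, here is the same two-step pipeline written with fully explicit references; it behaves identically, because an omitted `from` simply defaults to `#`, the result of the previous step:

```
ref_in: from='$0', ref_table='$1', key_fields='id', ref_fields='user_id'
ref: from='#1', ref_table='$2', key_fields='country_code', fields='country_name'
```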
💼 2. Sources: Where Your Data Comes From
Sources define the data you want to process. You can pull data from various locations, including Apify datasets, Key-Value Stores, cloud storage, or public URLs.
A single source can include multiple data files by using filters or wildcards (such as `*` or `?`). All files included in one source must have the same schema type, and the system will automatically merge them into a unified table for processing.
- Supported sources: `dataset`, `kvstore`, `http`, `s3`, `gcs`, `r2`.
- Format: Each source can specify a `format` (`csv`, `json`, or `parquet`) to ensure the data is read correctly.
  - If the source name or URI already includes a file extension indicating the format, the `format` parameter can be omitted.
Examples:
Load all datasets produced by runs of an Actor within the last day and assign them to `$1`:
`dataset: actor_name='apify/website-content-crawler', last='1 day'`
Load a specific key-value store record in JSON format and assign it to `$2`:
`kvstore: name='reference-data', key='country-codes', format='json'`
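Because wildcards are supported in the `key` parameter, a single source line can also load several related key-value store records at once and merge them into one table (the store and key names here are illustrative):

```
kvstore: name='reference-data', key='country-*', format='json'
```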
For detailed parameters of each source type, see the Input Reference section.
🔄 3. Transformations: How You Process Your Data
Transformations are the core of your pipeline, where you clean, reshape, and enrich your data using SQL queries or the built-in functions (`dedup`, `select`, `ref`, `ref_in`).
This actor uses DuckDB as its SQL engine, a powerful and efficient database designed for analytical queries. This means you can leverage a rich set of SQL features and an extensive library of built-in functions to handle everything from simple data cleaning to complex aggregations and transformations.
For detailed syntax and usage, refer to the official DuckDB SQL Documentation.
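For example, a single SQL step can already handle grouping and aggregation; in this sketch the `domain` and `scrapedAt` columns are assumptions about your source data:

```sql
SELECT domain, COUNT(*) AS pages, MAX(scrapedAt) AS last_seen
FROM {{$0}}
GROUP BY domain
ORDER BY pages DESC
```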
Example of a multi-step transformation:
    # Step 1: Clean data from the main source ($0) and create a new 'domain' column.
    SELECT id, TRIM(name) as name, email, SPLIT(email, '@')[2] as domain
    FROM {{$0}}
    WHERE email IS NOT NULL AND name != ''
    # Step 2: Deduplicate the result of the previous step based on email.
    # 'from' is omitted and defaults to the result of Step 1.
    dedup: key_fields='email', order_by='scrapedAt desc'
    # Step 3: Join the deduplicated data with a country reference table from the first source ($1).
    # 'from' is omitted and defaults to '#'.
    ref: ref_table='$1', key_fields='country_code', ref_fields='code', fields='country_name'
For detailed syntax, refer to the DuckDB documentation and our Input Reference.
💽 4. Outputs: Where Your Data Goes
Outputs define the destination for your processed data. You can save the result of any transformation step to an Apify dataset, Key-Value Store, or cloud storage.
- Supported outputs: `dataset`, `kvstore`, `s3`, `gcs`, `r2`.
- Write modes: You can `overwrite` files, `append` data, or perform `upsert` operations for incremental updates.
Examples:
Save the final result to a new dataset named `cleaned-data`:
dataset: name='cleaned-data'
Save the result of the first transformation step (`#1`) to a key-value store:
kvstore: from='#1', name='intermediate-results', key='step1-output', format='parquet'
Upsert the final result into a table in an S3 bucket:
s3: uri='s3://my-bucket/processed/products.parquet', write_mode='upsert', key_fields='productId'
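Placeholders can also be used in cloud storage URIs, for example to write a dated export (the bucket, path, and `gs://` scheme are illustrative):
gcs: uri='gs://my-bucket/exports/products-{{current_date}}.parquet', write_mode='overwrite'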
For detailed parameters of each output type, see the Input Reference section.
💰 Usage Costs and Performance Tips
The cost of running this actor is typically just a few cents per run for simple transformations on small data. Actual costs depend on memory allocation, the number of transformation steps, and the amount of data processed (read and written). More complex pipelines or processing large volumes of data may result in higher costs. Always review your configuration and monitor usage to optimize for both performance and cost.
Pricing Breakdown
| Charged Event | Description | Price (USD) |
|---|---|---|
| Actor Start | Fee for starting an Actor, charged per 1 GB of allocated memory. | $0.005 |
| Transformation Step | Fee for each transformation step, charged per 1 GB of allocated memory. | $0.01 |
| Dataset Write | Fee for writing to a dataset, charged per 100 records. | $0.001 |
| Key-Value Store Write | Fee for writing to a key-value store, charged per 1 MB. | $0.0002 |
| Cloud Storage Write | Fee for writing to cloud storage (S3, GCS, R2), charged per 1 MB. | $0.001 |
Notes:
- Prices shown are for the free-tier plan. Other plans may offer additional discounts.
- Reading from a dataset (charged by Apify) is approximately $0.0004 per 1,000 records on the free-tier plan.
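As a rough illustration using the free-tier prices above (the run parameters are assumptions): a run with 2 GB of allocated memory, 3 transformation steps, and 5,000 records written to a dataset would cost about 2 × $0.005 (Actor start) + 3 × 2 × $0.01 (transformation steps) + 50 × $0.001 (dataset writes) ≈ $0.12, plus any dataset read charges noted above.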
Memory Configuration Advice
Memory is the primary driver of cost and performance for this actor. The amount of memory required depends directly on the volume and complexity of the data you are processing. Here are some tips for optimizing your memory settings:
- Start Small and Observe: For a new transformation, start with a lower memory setting (e.g., 1024 MB). After the run completes, check the Max memory value in the run logs.
- Set an Appropriate Buffer: As a general rule, set the memory for your next run to be at least 1.5x to 2x the observed maximum memory usage from a similar, successful run. This provides a safe buffer for fluctuations in data size.
- Handling OOM Errors: If your actor run fails with an Out Of Memory (OOM) error, you will need to increase the memory allocation for the next run.
- Filter Early: Reduce the amount of data processed in later, more complex steps by filtering out unneeded records as early as possible in your transformation pipeline (see the sketch after this list).
- Be Mindful of Complex Operations: Operations like `JOIN`s on large datasets or `GROUP BY` on columns with many unique values are memory-intensive. Ensure you have allocated sufficient memory when performing these transformations.
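As a minimal sketch of the "Filter Early" tip above (the `status` and `scrapedAt` columns are illustrative assumptions), a cheap first step like this keeps later joins and aggregations small:

```sql
SELECT * FROM {{$0}} WHERE status = 'active' AND scrapedAt >= '2024-01-01'
```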
Storage File Format Recommendation
When saving data to Key-Value Store or S3, it is recommended to use the parquet format whenever possible. Parquet is a modern, columnar storage format that significantly reduces storage size and improves read performance, making it a best practice in modern data stacks.
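For example (the store name and key are illustrative), an Outputs line that writes the final result to a key-value store as Parquet:

```
kvstore: name='processed-data', key='output-{{current_date}}', format='parquet'
```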
💬 Feedback
We're always working to improve the performance of our Actors. If you have technical feedback or feature requests, or you have found a bug in Power Data Transformer, please create an issue on the Actor's Issues tab in the Apify Console.
For any questions, suggestions, or support needs, feel free to email us at info@wiseek.io.
📚 Input Reference
This section provides a detailed list of all supported types and parameters for each configuration stage.
💼 Sources
Define where your data comes from in the Sources field. Each line should adhere to the following format:
<Data Source type>: <param1>='value1', <param2>='value2' ...
Example
kvstore: id='Eorbqkh71bsuARzkq', key='google-map-*', format='json'
http: url='https://example.com/data.json', format='json'
s3: uri='s3://bucket-name/data.csv', format='csv'
Supported Source Types & Parameters
- `dataset`

  Parameters:

  | Parameter | Description |
  |---|---|
  | `id` | Dataset ID |
  | `name` | Dataset name (e.g., `username/dataset-name`) |
  | `actor_id` | Actor ID to fetch datasets from |
  | `actor_name` | Actor name to fetch datasets from |
  | `task_id` | Task ID to fetch datasets from |
  | `task_name` | Task name to fetch datasets from |
  | `start_date` | Start date for filtering (format: `YYYY-MM-DD` or `YYYY-MM-DD HH:MM:SS`) |
  | `end_date` | End date for filtering (format: `YYYY-MM-DD` or `YYYY-MM-DD HH:MM:SS`) |
  | `last` | Time range for filtering (e.g., '5 days', '3 hours') |

  Note: `id`/`name`, `actor_id`/`actor_name`, and `task_id`/`task_name` are mutually exclusive. `last` and `start_date`/`end_date` are mutually exclusive. A date-range example is sketched after this list of source types.
- `kvstore`

  Parameters:

  | Parameter | Description |
  |---|---|
  | `id` | Key-value store ID |
  | `name` | Key-value store name |
  | `key` | Key of the record to fetch; wildcards `*` and `?` are supported. |
  | `format` | `csv`, `json`, or `parquet`. |

  Other parameters (`actor_id`, `last`, etc.) are similar to `dataset`.
- `http`

  Parameters:

  | Parameter | Description |
  |---|---|
  | `url` | URL of the data source. |
  | `format` | `csv`, `json`, or `parquet`. Optional if the URL implies the format. |
- `s3`, `gcs`, `r2`

  Parameters:

  | Parameter | Description |
  |---|---|
  | `uri` | URI of the data in cloud storage. |
  | `format` | `csv`, `json`, or `parquet`. Optional if the URI implies the format. |

  Note: Requires credentials to be set in the Advanced Settings section.
- `meta`

  Description: Access internal metadata about your Apify resources, such as a list of all your datasets, Actor runs, or tasks. This is useful for dynamic pipelines that need to operate on other resources.

  Example: `meta: table='datasets'`

  Parameters:

  | Parameter | Description |
  |---|---|
  | `table` | The type of metadata to query. Supported values are: `datasets`, `runs`, `tasks`, `kvstores`, `actors`. |
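For instance, a `dataset` source restricted to a date range might look like this (the task name is an illustrative placeholder):

```
dataset: task_name='my-daily-scrape', start_date='2024-01-01', end_date='2024-01-31'
```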
🔄 Transformations
Define the steps to process your data. Each line is a step in the pipeline.
Supported Transformation Types
- SQL Query
  - Description: A standard SQL query. Remember to wrap references like `{{$0}}` or `{{#1}}` in curly braces.
  - Example: `SELECT id, name, LOWER(email) as email FROM {{$0}} WHERE name IS NOT NULL`
- Built-in Transformations
  - Description: Pre-defined functions to simplify common tasks.
  - Syntax: `<function_name>: from='<reference>', param1='value1', ...`
  - `dedup`: Removes duplicate records.
    - Example: `dedup: key_fields='email', order_by='createdAt desc'`
    - Parameters:

      | Parameter | Description |
      |---|---|
      | `from` | Optional. Source table to process (e.g., `$0`, `#1`). Defaults to the previous step (`#`). |
      | `key_fields` | Comma-separated fields to identify duplicates. |
      | `order_by` | Field(s) to sort by to decide which record to keep. |
      | `reverse` | `true` or `false`. Set to `true` for descending order. |
  - `select`: Selects, renames, excludes, or adds columns.
    - Example: `select: include='id,name', append='domain=SPLIT(email, ''@'')[2]'`
    - Parameters:

      | Parameter | Description |
      |---|---|
      | `from` | Optional. Source table to process. Defaults to the previous step (`#`). Both `from='...'` and `from='{{...}}'` are accepted. |
      | `include` | Comma-separated fields to include. |
      | `exclude` | Comma-separated fields to exclude. |
      | `replace` | Fields to replace (e.g., `age=age*2`). |
      | `append` | Fields to append (e.g., `domain=SPLIT(email, '@')[2]`). |
  - `ref`: Joins with a reference table to add columns (LEFT JOIN).
    - Example: `ref: ref_table='{{$1}}', key_fields='country_code', ref_fields='code', fields='country_name'`
    - Parameters:

      | Parameter | Description |
      |---|---|
      | `from` | Optional. The main table. Defaults to the previous step (`#`). |
      | `ref_table` | The reference table to join with. Both `ref_table='...'` and `ref_table='{{...}}'` are accepted. |
      | `key_fields` | Key(s) from the source table for joining. |
      | `ref_fields` | Key(s) from the reference table for joining. |
      | `fields` | Comma-separated fields to add from the reference table. |
      | `filter` | `true` or `false`. If `true`, acts like an INNER JOIN. |
  - `ref_in`: Filters rows based on a reference table (SEMI/ANTI JOIN).
    - Example: `ref_in: ref_table='{{$1}}', key_fields='id', ref_fields='user_id', anti=true`
    - Parameters:

      | Parameter | Description |
      |---|---|
      | `from` | Optional. The table to filter. Defaults to the previous step (`#`). |
      | `ref_table` | The reference table to check against. Both `ref_table='...'` and `ref_table='{{...}}'` are accepted. |
      | `key_fields` | Key(s) from the source table. |
      | `ref_fields` | Key(s) from the reference table. |
      | `anti` | `true` or `false`. If `true`, keeps only rows that do not have a match. |
💽 Outputs
Define where to save your transformed data. Each line should adhere to the following format:
<Output type>: <param1>='value1', <param2>='value2' ...
Example
dataset:
dataset: name='cleaned-emails.json'
kvstore: name='processed-data', key='output', format='json'
s3: uri='s3://bucket-name/output.csv', format='csv'
Supported Output Types & Parameters
- `dataset`

  Parameters:

  | Parameter | Description |
  |---|---|
  | `from` | Optional. The transformation step to save (e.g., `#2`). Defaults to the last step. Both `from='...'` and `from='{{...}}'` are accepted. |
  | `id` / `name` | The ID or name of the destination dataset. Defaults to the Actor's default dataset. |
- `kvstore`

  Parameters:

  | Parameter | Description |
  |---|---|
  | `from` | Optional. The transformation step to save. Both `from='...'` and `from='{{...}}'` are accepted. |
  | `id` / `name` | The ID or name of the destination store. |
  | `key` | The key for the record. Placeholders `{{current_date}}` and `{{current_time}}` are supported. |
  | `format` | `csv`, `json`, or `parquet`. |
  | `write_mode` | `overwrite` (default), `append`, `incremental`, or `upsert`. |
  | `key_fields` | Required for `incremental` or `upsert` modes. |
- `s3`, `gcs`, `r2`

  Parameters:

  | Parameter | Description |
  |---|---|
  | `from` | Optional. The transformation step to save. Both `from='...'` and `from='{{...}}'` are accepted. |
  | `uri` | The URI for the cloud storage destination. Placeholders are supported. |
  | `format` | `csv`, `json`, or `parquet`. |
  | `write_mode` | `overwrite` (default), `append`, `incremental`, or `upsert`. |
  | `key_fields` | Required for `incremental` or `upsert` modes. |
Advanced Settings
- Cloud Storage Credentials: Provide your credentials (`Key ID`, `Secret`, etc.) if you are using cloud storage.
- Timezone: Set the timezone for date-related operations (e.g., `{{current_date}}`). Default is `UTC`.
- Debug: Enable to get more detailed logs.