🔥 Power Data Transformer

Pricing: Pay per event

Developed by wiseek. Maintained by Community.

Streamline data integration and automation. Clean, merge, deduplicate, enrich, and transform your data using flexible multi-step pipelines and the full power of SQL

🚀 Why Use This Actor

Power Data Transformer is a unified data transformation and automation tool that handles everything from simple cleaning tasks to building an entire data lake around your scraped data on the Apify platform. It streamlines post-scraping data processing by:

  • Cleaning and standardizing raw data
  • Merging multiple data sources
  • Enriching data with additional information
  • Supporting incremental updates
  • Managing data deduplication
  • Enabling reference data mapping
  • Supporting multiple data sources and targets
  • Running multi-step pipelines and SQL-based transformations for complex data processing
  • More built-in transformations coming soon
  • Ready for Actor integration, so it plugs easily into existing actors

The tool is optimized for performance and scalability, ensuring that your scraped data is clean, consistent, and ready for immediate use.


💡 Quick Start: Your First Transformation

Here’s a minimal example to run your first transformation.

1. 🔼 Select a Source Dataset: Choose a dataset in the Source Datasets field. It will be referenced as $0 in your transformations.

2. 🔄 Define a Transformation: Enter a simple SQL query to select the first 10 records:

SELECT * FROM {{$0}} LIMIT 10

Note: In SQL queries, references must be wrapped in {{...}}.

3. 🔽 Define an Output: To save the result to the default dataset for this run, use:

dataset:

After running the actor, you can find the transformed data in the output dataset. For more complex scenarios, refer to the Example Use Cases as templates.
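Putting it together, the three input fields for this minimal run look like this (a sketch; the source dataset is whichever one you selected in step 1):

Source Datasets: (your selected dataset, referenced as $0)

Transformations:
SELECT * FROM {{$0}} LIMIT 10

Outputs:
dataset: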


βš™οΈ How it works: understanding pipeline

Power Data Transformer is a powerful and flexible tool. This means that while it can handle very complex tasks, it might take a little time to get used to all its features. Once you're familiar with the core concepts, you'll be able to build sophisticated data pipelines with ease.

A pipeline is defined by a series of steps, processed as a Directed Acyclic Graph (DAG), consisting of three main parts: Sources, Transformations, and Outputs. You can copy and paste the following examples into the input UI to get started.

1. Data Referencing: The Core of Your Pipeline

In Power Data Transformer, you use special symbols $n and #n (where n is a positive integer) to refer to your data sources and the results of your transformation steps. The key is how you use them:

General Rules

  • $n - Referencing a Source:

    • $0 always refers to the main dataset(s) selected in the Datasets UI field.
    • $1, $2, etc., refer to the data sources defined in the Sources field, in the order they are listed (1-indexed).
  • #n - Referencing a Transformation Result:

    • #1, #2, etc., refer to the result of the corresponding transformation step in the pipeline (1-indexed).

How to Use References

  1. In SQL Queries: References must be wrapped in double curly braces {{...}}. This tells the actor to substitute the reference with the actual data table. Bare references like $1 or #2 will be treated as plain text.

    • Correct: SELECT * FROM {{$0}}
    • Incorrect: SELECT * FROM $0
  2. In built-in transformations (dedup, ref, etc.): The {{...}} wrapper is optional for the from and ref_table parameters. Both of the following are valid and do the same thing:

    • dedup: from='#1', ...
    • dedup: from='{{#1}}', ...

Shorthand and Default Behaviors

To make your pipelines cleaner and easier to write, Power Data Transformer includes some convenient shorthands:

  • $: When used alone, $ is a shorthand that automatically resolves to the appropriate main data source:
    • If datasets are selected in the Datasets field, $ refers to $0.
    • If no datasets are selected, $ refers to $1 (the first source in Additional Sources).
  • #: When used alone, # refers to the result of the immediately preceding step. This is very useful for chaining transformations together.
  • Omitting from: In built-in transformation functions (like dedup, select, etc.), the from parameter is optional. If you omit it, it automatically defaults to from='#', using the result of the previous step as its input.
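As a quick sketch of these shorthands (the email column is an assumption), the following two-step pipeline filters the main source using $ and then deduplicates the result without naming it, because from defaults to #:

SELECT * FROM {{$}} WHERE email IS NOT NULL
dedup: key_fields='email'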

Example Illustrating the Rules:

Consider a pipeline where we want to find all recently active users and enrich their profiles with their country name.

  • Datasets:
    • A dataset of user profiles (referenced as $0).
  • Sources:
    1. A dataset of recent user activity, containing user_id (referenced as $1).
    2. A key-value store with country codes and names (referenced as $2).
  • Transformations:
    1. ref_in: from='$0', ref_table='$1', key_fields='id', ref_fields='user_id'
    2. ref: ref_table='$2', key_fields='country_code', fields='country_name'

Explanation:

  • Step 1 takes the main user profiles ($0) and filters them, keeping only the users who are also present in the recent activity dataset ($1). The result of this step is implicitly passed to the next one.
  • Step 2 takes the active users from the previous step (using the default from='#') and joins them with the country codes table ($2) to add a country_name column.
  • #1 could be used in a later step to refer back to the result of the first transformation (the filtered, active users).
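Entered into the input fields, this pipeline would look roughly as follows (a sketch; the user-profile dataset is selected in the Datasets field as $0, and the source names are placeholders):

Sources:
dataset: name='my-org/recent-user-activity'
kvstore: name='reference-data', key='country-codes', format='json'

Transformations:
ref_in: from='$0', ref_table='$1', key_fields='id', ref_fields='user_id'
ref: ref_table='$2', key_fields='country_code', fields='country_name'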

🔼 2. Sources: Where Your Data Comes From

Sources define the data you want to process. You can pull data from various locations, including Apify datasets, Key-Value Stores, cloud storage, or public URLs.

A single source can include multiple data files by using filters or wildcards (such as * or ?). All files included in one source must have the same schema type, and the system will automatically merge them into a unified table for processing.

  • Supported Sources: dataset, kvstore, http, s3, gcs, r2, and meta.

  • Format: Each source can specify a format (csv, json, or parquet) to ensure the data is read correctly.

    • If the source name or URI already includes a file extension indicating the format, the format parameter can be omitted.

Examples:

Load all datasets from the latest 1-day run of an actor and assign them to $1:

dataset: actor_name='apify/website-content-crawler', last='1 day'

Load a specific key-value store record in JSON format and assign it to $2:

kvstore: name='reference-data', key='country-codes', format='json'
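A single source line can also pull in several records at once with a wildcard key (a sketch; the store name is an assumption):

kvstore: name='reference-data', key='country-*', format='json'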

For detailed parameters of each source type, see the Input Reference section.

🔄 3. Transformations: How You Process Your Data

Transformations are the core of your pipeline, where you clean, reshape, and enrich your data using SQL queries or our built-in functions (dedup, select, ref, ref_in).

This actor uses DuckDB as its SQL engine, a powerful and efficient database designed for analytical queries. This means you can leverage a rich set of SQL features and an extensive library of built-in functions to handle everything from simple data cleaning to complex aggregations and transformations.

For detailed syntax and usage, refer to the official DuckDB SQL Documentation.
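For example, standard DuckDB string functions and aggregations can be combined in a single step (a sketch; the email column is an assumption):

SELECT LOWER(SPLIT(email, '@')[2]) AS domain, COUNT(*) AS user_count
FROM {{$0}}
GROUP BY domain
ORDER BY user_count DESC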

Example of a multi-step transformation:

# Step 1: Clean data from the main source ($0) and create a new 'domain' column.
SELECT id, TRIM(name) as name, email, SPLIT(email, '@')[2] as domain
FROM {{$0}}
WHERE email IS NOT NULL AND name != ''
# Step 2: Deduplicate the result of the previous step based on email.
# 'from' is omitted and defaults to the result of Step 1.
dedup: key_fields='email', order_by='scrapedAt desc'
# Step 3: Join the deduplicated data with a country reference table from the first source ($1).
# 'from' is omitted and defaults to '#'.
ref: ref_table='$1', key_fields='country_code', ref_fields='code', fields='country_name'

For detailed syntax, refer to the DuckDB documentation and our Input Reference.

🔽 4. Outputs: Where Your Data Goes

Outputs define the destination for your processed data. You can save the result of any transformation step to an Apify dataset, Key-Value Store, or cloud storage.

  • Supported Outputs: dataset, kvstore, s3, gcs, r2.
  • Write Modes: You can overwrite files, append data, or perform upsert operations for incremental updates.

Examples:

Save the final result to a new dataset named cleaned-data:

dataset: name='cleaned-data'

Save the result of the first transformation step (#1) to a key-value store:

kvstore: from='#1', name='intermediate-results', key='step1-output', format='parquet'

Upsert the final result into a table in an S3 bucket:

s3: uri='s3://my-bucket/processed/products.parquet', write_mode='upsert', key_fields='productId'
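Keep a dated snapshot of an intermediate step by combining from with the {{current_date}} placeholder (a sketch; the store name is an assumption):

kvstore: from='#2', name='daily-exports', key='products-{{current_date}}', format='parquet'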

For detailed parameters of each output type, see the Input Reference section.




💰 Usage Costs and Performance Tips

The cost of running this actor is typically just a few cents per run for simple transformations on small data. Actual costs depend on memory allocation, the number of transformation steps, and the amount of data processed (read and written). More complex pipelines or processing large volumes of data may result in higher costs. Always review your configuration and monitor usage to optimize for both performance and cost.

Pricing Breakdown

Charged Event | Description | Price (USD)
--- | --- | ---
Actor Start | Fee for starting an actor, charged per 1 GB of allocated memory. | $0.005
Transformation Step | Fee for each transformation step, charged per 1 GB of allocated memory. | $0.01
Dataset Write | Fee for writing to a dataset, charged per 100 records. | $0.001
Key-Value Store Write | Fee for writing to a key-value store, charged per 1 MB. | $0.0002
Cloud Storage Write | Fee for writing to cloud storage (S3, GCS, R2), charged per 1 MB. | $0.001

Notes:

  • Prices shown are for the free-tier plan. Other plans may offer additional discounts.
  • Reading from a dataset (charged by Apify) is approximately $0.0004 per 1,000 records on the free-tier plan.
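As a rough illustration at the free-tier prices above (a hypothetical run): 1 GB of memory, one transformation step, and 1,000 records written to a dataset comes to about $0.005 + $0.01 + 10 × $0.001 = $0.025, plus roughly $0.0004 for reading 1,000 input records.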

Memory Configuration Advice

Memory is the primary driver of cost and performance for this actor. The amount of memory required depends directly on the volume and complexity of the data you are processing. Here are some tips for optimizing your memory settings:

  • Start Small and Observe: For a new transformation, start with a lower memory setting (e.g., 1024 MB). After the run completes, check the Max memory value in the run logs.
  • Set an Appropriate Buffer: As a general rule, set the memory for your next run to be at least 1.5x to 2x the observed maximum memory usage from a similar, successful run. This provides a safe buffer for fluctuations in data size.
  • Handling OOM Errors: If your actor run fails with an Out Of Memory (OOM) error, you will need to increase the memory allocation for the next run.
  • Filter Early: Reduce the amount of data processed in later, more complex steps by filtering out unneeded records as early as possible in your transformation pipeline.
  • Be Mindful of Complex Operations: Operations like JOINs on large datasets or GROUP BY on columns with many unique values are memory-intensive. Ensure you have allocated sufficient memory when performing these transformations.
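For example, making the very first step a narrow SQL filter (a sketch; the column names are assumptions) keeps the data reaching later joins and aggregations small:

SELECT id, email, country_code
FROM {{$0}}
WHERE email IS NOT NULL AND scrapedAt >= '2024-01-01'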

Storage File Format Recommendation

When saving data to Key-Value Store or S3, it is recommended to use the parquet format whenever possible. Parquet is a modern, columnar storage format that significantly reduces storage size and improves read performance, making it a best practice in modern data stacks.
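For example, switching a key-value store output to Parquet is just a matter of setting the format (a sketch; the store name and key are assumptions):

kvstore: name='processed-data', key='output', format='parquet'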


💬 Feedback

We’re always working to improve the performance of our Actors. If you have technical feedback, feature requests, or have found a bug in Power Data Transformer, please create an issue on the Actor’s Issues tab in the Apify Console.

For any questions, suggestions, or support needs, feel free to email us at info@wiseek.io.

📚 Input Reference

This section provides a detailed list of all supported types and parameters for each configuration stage.

🔼 Sources

Define where your data comes from in the Sources field. Each line should adhere to the following format:

<Data Source type>: <param1>='value1', <param2>='value2' ...

Example

kvstore: id='Eorbqkh71bsuARzkq', key='google-map-*', format='json'
http: url='https://example.com/data.json', format='json'
s3: uri='s3://bucket-name/data.csv', format='csv'

Supported Source Types & Parameters

  1. dataset

    • Parameters:
      Parameter | Description
      --- | ---
      id | Dataset ID
      name | Dataset name (e.g., username/dataset-name)
      actor_id | Actor ID to fetch dataset from
      actor_name | Actor name to fetch dataset from
      task_id | Task ID to fetch dataset from
      task_name | Task name to fetch dataset from
      start_date | Start date for filtering (format: YYYY-MM-DD or YYYY-MM-DD HH:MM:SS)
      end_date | End date for filtering (format: YYYY-MM-DD or YYYY-MM-DD HH:MM:SS)
      last | Time range for filtering (e.g., '5 days', '3 hours')
    • Note:
      • id/name, actor_id/actor_name, and task_id/task_name are mutually exclusive.
      • last and start_date/end_date are mutually exclusive.
  2. kvstore (or key-value store)

    • Parameters:
      Parameter | Description
      --- | ---
      id | Key-value store ID
      name | Key-value store name
      key | Key of the record to fetch; wildcards * and ? are supported.
      format | csv, json, or parquet.
      ... | Other parameters like actor_id, last, etc., are similar to dataset.
  3. http

    • Parameters:
      Parameter | Description
      --- | ---
      url | URL of the data source.
      format | csv, json, or parquet. Optional if the URL implies the format.
  4. s3, gcs, r2

    • Parameters:
      Parameter | Description
      --- | ---
      uri | URI of the data in cloud storage.
      format | csv, json, or parquet. Optional if the URI implies the format.
    • Note: Requires credentials to be set in the Advanced Settings section.
  5. meta

    • Description: Access internal metadata about your Apify resources, such as a list of all your datasets, actor runs, or tasks. This is useful for dynamic pipelines that need to operate on other resources.
    • Example: meta: table='datasets'
    • Parameters:
      Parameter | Description
      --- | ---
      table | The type of metadata to query. Supported values are: datasets, runs, tasks, key-value stores, actors.
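For example, you could register run metadata as an additional source and inspect it with a plain SQL step (a sketch; SELECT * is used because the metadata columns are not documented here):

Sources:
meta: table='runs'

Transformations:
SELECT * FROM {{$1}} LIMIT 100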

🔄 Transformations

Define the steps to process your data. Each line is a step in the pipeline.

Supported Transformation Types

  1. SQL Query

    • Description: A standard SQL query. Remember to wrap references like {{$0}} or {{#1}} in curly braces.
    • Example: SELECT id, name, LOWER(email) as email FROM {{$0}} WHERE name IS NOT NULL
  2. Built-in Transformations

    • Description: Pre-defined functions to simplify common tasks.

    • Syntax: <function_name>: from='<reference>', param1='value1', ...

    • dedup: Removes duplicate records.

      • Example: dedup: key_fields='email', order_by='createdAt desc'
      • Parameters:
        Parameter | Description
        --- | ---
        from | Optional. Source table to process (e.g., $0, #1). Defaults to the previous step (#).
        key_fields | Comma-separated fields to identify duplicates.
        order_by | Field(s) to sort by to decide which record to keep.
        reverse | true or false. Set to true for descending order.
    • select: Selects, renames, excludes, or adds columns.

      • Example: select: include='id,name', append='domain=SPLIT(email, ''@'')[2]'
      • Parameters:
        Parameter | Description
        --- | ---
        from | Optional. Source table to process. Defaults to the previous step (#). Both from='...' and from='{{...}}' are accepted.
        include | Comma-separated fields to include.
        exclude | Comma-separated fields to exclude.
        replace | Fields to replace (e.g., age=age*2).
        append | Fields to append (e.g., domain=SPLIT(email, '@')[2]).
    • ref: Joins with a reference table to add columns (LEFT JOIN).

      • Example: ref: ref_table='{{$1}}', key_fields='country_code', ref_fields='code', fields='country_name'
      • Parameters:
        Parameter | Description
        --- | ---
        from | Optional. The main table. Defaults to the previous step (#).
        ref_table | The reference table to join with. Both ref_table='...' and ref_table='{{...}}' are accepted.
        key_fields | Key(s) from the source table for joining.
        ref_fields | Key(s) from the reference table for joining.
        fields | Comma-separated fields to add from the reference table.
        filter | true or false. If true, acts like an INNER JOIN.
    • ref_in: Filters rows based on a reference table (SEMI/ANTI JOIN).

      • Example: ref_in: ref_table='{{$1}}', key_fields='id', ref_fields='user_id', anti=true
      • Parameters:
        Parameter | Description
        --- | ---
        from | Optional. The table to filter. Defaults to the previous step (#).
        ref_table | The reference table to check against. Both ref_table='...' and ref_table='{{...}}' are accepted.
        key_fields | Key(s) from the source table.
        ref_fields | Key(s) from the reference table.
        anti | true or false. If true, keeps only rows that do not have a match.

🔽 Outputs

Define where to save your transformed data. Each line should adhere to the following format:

<Output type>: <param1>='value1', <param2>='value2' ...

Example

dataset:
dataset: name='cleaned-emails.json'
kvstore: name='processed-data', key='output', format='json'
s3: uri='s3://bucket-name/output.csv', format='csv'

Supported Output Types & Parameters

  1. dataset

    • Parameters:
      Parameter | Description
      --- | ---
      from | Optional. The transformation step to save (e.g., #2). Defaults to the last step. Both from='...' and from='{{...}}' are accepted.
      id / name | The ID or name of the destination dataset. Defaults to the actor's default dataset.
  2. kvstore

    • Parameters:
      Parameter | Description
      --- | ---
      from | Optional. The transformation step to save. Both from='...' and from='{{...}}' are accepted.
      id / name | The ID or name of the destination store.
      key | The key for the record. Placeholders {{current_date}} and {{current_time}} are supported.
      format | csv, json, or parquet.
      write_mode | overwrite (default), append, incremental, or upsert.
      key_fields | Required for incremental or upsert modes.
  3. s3, gcs, r2

    • Parameters:
      Parameter | Description
      --- | ---
      from | Optional. The transformation step to save. Both from='...' and from='{{...}}' are accepted.
      uri | The URI for the cloud storage destination. Placeholders are supported.
      format | csv, json, or parquet.
      write_mode | overwrite (default), append, incremental, or upsert.
      key_fields | Required for incremental or upsert modes.
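For example, an incremental key-value store output that upserts on a record ID might look like this (a sketch; the store name, key, and key field are assumptions):

kvstore: name='user-profiles', key='all-users', format='parquet', write_mode='upsert', key_fields='id'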

Advanced Settings

  • Cloud Storage Credentials: Provide your credentials (Key ID, Secret, etc.) if you are using cloud storage.
  • Timezone: Set the timezone for date-related operations (e.g., {{current_date}}). Default is UTC.
  • Debug: Enable to get more detailed logs.