Polars AI Data Transformer

Pricing

Pay per event

Transform datasets using natural language. Upload CSV/Excel/JSON, describe your transformation in plain English, get results + reusable Python code. Powered by AI.

Developer

Salesmart Srl

Maintained by Community

AI Data Transformer

Transform any dataset using plain English. No coding required.

Describe what you want in natural language, get transformed data + reusable Python code.


Table of Contents

  1. Quick Start
  2. Getting Your Apify API Token
  3. Pricing
  4. Choosing the Right Mode
  5. 4 Operating Modes
  6. Complete Input Options
  7. How It Works
  8. Writing Effective Prompts
  9. API Examples
  10. Output Format

Quick Start

Option 1: With file URL

curl -X POST "https://api.apify.com/v2/acts/salesmart-srl~polars-ai-data-transformer/runs?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "datasetUrls": ["https://example.com/data.csv"],
    "prompt": "Group by country and sum sales"
  }'

Option 2: Direct JSON data (no file needed!)

curl -X POST "https://api.apify.com/v2/acts/salesmart-srl~polars-ai-data-transformer/run-sync-get-dataset-items?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "inputData": [
      {"product": "iPhone", "price": 999, "qty": 10},
      {"product": "iPad", "price": 799, "qty": 5}
    ],
    "prompt": "Calculate total value (price * qty) for each product"
  }'

Response includes output_data with all transformed rows directly in JSON.

That's it. No LLM API key needed for Basic mode.


Getting Your Apify API Token

To use this Actor via API, you need an Apify API token.

Step 1: Create Apify Account

Go to apify.com and sign up (free).

Step 2: Get Your API Token

  1. Log in to Apify Console
  2. Click your profile icon (top right)
  3. Go to Settings → Integrations
  4. Copy your Personal API Token

Your token looks like: apify_api_xxxxxxxxxxxxxxxxxxxxx

Step 3: Use the Token

Option A: Query parameter

https://api.apify.com/v2/acts/salesmart-srl~polars-ai-data-transformer/runs?token=YOUR_TOKEN

Option B: Authorization header

Authorization: Bearer YOUR_TOKEN
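
The header option can be sketched in Python with only the standard library. This is a minimal illustration, not part of any Apify SDK: the helper name `build_run_request` is made up here, and the request is built but not sent. Header auth keeps the token out of the URL, which is useful because URLs tend to end up in logs.

```python
import json
import urllib.request

ACTOR_RUNS_URL = (
    "https://api.apify.com/v2/acts/"
    "salesmart-srl~polars-ai-data-transformer/runs"
)


def build_run_request(token: str, run_input: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that authenticates via header."""
    return urllib.request.Request(
        ACTOR_RUNS_URL,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # Option B: no token in the URL
        },
        method="POST",
    )


request = build_run_request(
    "YOUR_TOKEN",
    {
        "datasetUrls": ["https://example.com/data.csv"],
        "prompt": "Group by country and sum sales",
    },
)
# Passing `request` to urllib.request.urlopen() would start the run.
```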

Free Tier

Apify offers $5 of free credits monthly. Basic transformations cost ~$0.0015 each, so the free credits cover roughly 3,300 transformations per month ($5 / $0.0015 ≈ 3,333).


Pricing

Pay-per-event pricing. You only pay when a transformation runs successfully.

| Mode | Apify Fee | LLM Cost | Total Cost |
| --- | --- | --- | --- |
| Basic | $0.0015 | Included | $0.0015 |
| Premium | $0.20 | Included | $0.20 |
| BYOK | $0.001 | Your API | $0.001 + API |
| BYOK Premium | $0.001 | Your API | $0.001 + API |

Volume Discounts

| Tier | BYOK | Basic | Premium |
| --- | --- | --- | --- |
| No Discount | $0.001 | $0.0015 | $0.20 |
| Bronze | $0.0007 | $0.00117 | $0.167 |
| Silver | $0.0004 | $0.00083 | $0.133 |
| Gold | $0.0001 | $0.0005 | $0.10 |
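
To budget a batch of runs, the fee tables above can be turned into a small lookup. This is only a sketch: the tier and mode keys (`"none"`, `"byok"`, and so on) are labels chosen for this example, BYOK Premium is left out because the discount table does not list it, and LLM costs for BYOK runs are not included.

```python
from decimal import Decimal

# Per-run Apify fee in USD, copied from the pricing tables above.
FEES = {
    "none":   {"byok": "0.001",  "basic": "0.0015",  "premium": "0.20"},
    "bronze": {"byok": "0.0007", "basic": "0.00117", "premium": "0.167"},
    "silver": {"byok": "0.0004", "basic": "0.00083", "premium": "0.133"},
    "gold":   {"byok": "0.0001", "basic": "0.0005",  "premium": "0.10"},
}


def apify_fee(mode: str, runs: int, tier: str = "none") -> Decimal:
    """Total Apify fee for `runs` successful transformations."""
    return Decimal(FEES[tier][mode]) * runs


def runs_within_budget(budget_usd: str, mode: str, tier: str = "none") -> int:
    """How many whole runs fit into a budget at the given mode and tier."""
    return int(Decimal(budget_usd) / Decimal(FEES[tier][mode]))


monthly_basic_cost = apify_fee("basic", 1000)     # 1,000 Basic runs, no discount
free_basic_runs = runs_within_budget("5", "basic")  # runs covered by $5 of credits
```

Decimal is used instead of float so that fractions of a cent do not pick up rounding noise.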

Choosing the Right Mode

Decision Tree

Do you need Google Search grounding or complex reasoning?
├─ NO → Do you have your own LLM API key?
│   ├─ NO → Use BASIC ($0.0015)
│   │        Simple, fast, no setup
│   │
│   └─ YES → Use BYOK ($0.001 + your API)
│            Lowest cost if you have free API credits
└─ YES → Do you have a Google API key?
    ├─ NO → Use PREMIUM ($0.20)
    │        All features included, no setup
    └─ YES → Use BYOK PREMIUM ($0.001 + your API)
             Same features as Premium, use your credits

Mode Comparison

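| Feature | Basic | Premium | BYOK | BYOK Premium |
| --- | --- | --- | --- | --- |
| Simple aggregations | Yes | Yes | Yes | Yes |
| Filtering & sorting | Yes | Yes | Yes | Yes |
| Data cleaning | Yes | Yes | Yes | Yes |
| E-commerce migrations | Limited | Best | Limited | Best |
| Google Search grounding | No | Yes | No | Yes |
| Extended thinking/reasoning | No | Yes | No | Yes |
| RAG memory (learns over time) | Yes | Yes | Yes | Yes |
| Requires LLM API key | No | No | Yes | Yes |
| Cost | $0.0015 | $0.20 | $0.001 + API | $0.001 + API |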

Premium vs BYOK Premium: What's the Difference?

Nothing, except billing.

Both modes use:

  • Gemini 2.5 Pro with extended thinking
  • Google Search grounding
  • RAG memory system

The only difference:

  • Premium: We pay Google, you pay us $0.20
  • BYOK Premium: You pay Google directly, you pay us $0.001

When to use BYOK Premium:

  • You have Google Cloud credits
  • You have an enterprise Google agreement
  • You want to track API usage in your own Google console

4 Operating Modes

Mode 1: Basic (Hosted)

Use when: Simple transformations, high volume, budget-conscious

{
  "datasetUrls": ["https://example.com/data.csv"],
  "prompt": "Group by country and sum sales, show top 10"
}

What you get:

  • Gemini 2.5 Flash-Lite (fast, efficient)
  • No API key required
  • $0.0015 per transformation

Good for:

  • Aggregations (sum, count, average)
  • Filtering and sorting
  • Basic calculations
  • Data reformatting

Not ideal for:

  • "Convert to Shopify format" (doesn't know Shopify schema)
  • Complex multi-step reasoning

Mode 2: Premium (Hosted)

Use when: Complex transformations, e-commerce migrations, need accuracy

{
  "datasetUrls": ["https://example.com/magento-products.csv"],
  "prompt": "Transform to Shopify product import format",
  "useAdvancedFeatures": true
}

What you get:

  • Gemini 2.5 Pro (most capable model)
  • Extended thinking (reasons through complex problems)
  • Google Search grounding (knows external formats)
  • RAG memory (improves over time)
  • No API key required
  • $0.20 per transformation

Good for:

  • E-commerce platform migrations (Magento→Shopify, etc.)
  • Format conversions (to Stripe, Mailchimp, etc.)
  • Complex multi-step transformations
  • Tasks requiring external knowledge

Why it costs more:

  • Uses Gemini 2.5 Pro (~$1.25/1M input tokens; output tokens cost more)
  • Google Search queries (~$35/1K queries)
  • We bundle these costs into a flat $0.20 fee

Mode 3: BYOK (Bring Your Own Key)

Use when: You have LLM API credits, want lowest cost

{
  "datasetUrls": ["https://example.com/data.csv"],
  "prompt": "Filter active users and calculate totals",
  "llmProvider": "groq",
  "groqApiKey": "gsk_..."
}

What you get:

  • Your choice of LLM provider
  • RAG memory (improves over time)
  • $0.001 Apify fee + your API costs

Supported providers:

| Provider | Model | API Cost | Get Key |
| --- | --- | --- | --- |
| Groq | Llama 3.3 70B | FREE tier | console.groq.com |
| Google | Gemini 2.0 Flash | ~$0.10/1M tokens | aistudio.google.com |
| OpenAI | GPT-4o | ~$5/1M tokens | platform.openai.com |
| Anthropic | Claude Sonnet 4 | ~$3/1M tokens | console.anthropic.com |

Recommended: Groq (FREE)

Groq offers a generous free tier. Combined with our $0.001 fee, you can run thousands of transformations for almost nothing.


Mode 4: BYOK Premium

Use when: You have Google API credits AND need Premium features

{
  "datasetUrls": ["https://example.com/products.csv"],
  "prompt": "Convert to Shopify product CSV format",
  "llmProvider": "google",
  "googleApiKey": "AIza...",
  "useAdvancedFeatures": true
}

What you get:

  • Same as Premium: Gemini Pro + Google Search + RAG
  • Uses YOUR Google API key
  • $0.001 Apify fee + your Google API costs

Your Google API costs:

  • Gemini 2.5 Pro: ~$1.25/1M input, ~$5/1M output tokens
  • Google Search grounding: ~$35 per 1,000 queries

Why use this instead of Premium?

  • You have Google Cloud credits to use up
  • Your company has a Google enterprise agreement
  • You want API usage in your own Google console
  • You're doing very high volume and want direct billing

Complete Input Options

Required

| Field | Type | Description |
| --- | --- | --- |
| prompt | string | Natural language description of transformation |

Data Sources (at least one required)

| Field | Type | Description |
| --- | --- | --- |
| inputData | array | Direct JSON data - no file hosting needed! |
| datasetUrls | string[] | URLs to data files (CSV, Excel, JSON, Parquet) |
| uploadedFiles | file[] | Direct file uploads via Apify Console |
| apifyDatasetId | string | ID of existing Apify dataset |

Recommended: inputData for API integrations - single call with data in, results out.
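
Before sending a request, the two rules above (a prompt is required, plus at least one data source) can be checked client-side. The following is a minimal sketch; `validate_run_input` is a hypothetical helper, and the Actor's own validation may be stricter or report errors differently.

```python
DATA_SOURCE_FIELDS = ("inputData", "datasetUrls", "uploadedFiles", "apifyDatasetId")


def validate_run_input(run_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the input looks usable."""
    problems = []

    prompt = run_input.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        problems.append("prompt is required and must be a non-empty string")

    # Empty lists and empty strings count as "not provided".
    if not any(run_input.get(field) for field in DATA_SOURCE_FIELDS):
        problems.append("provide at least one of: " + ", ".join(DATA_SOURCE_FIELDS))

    return problems


ok = validate_run_input(
    {"inputData": [{"price": 999, "qty": 10}], "prompt": "Calculate price * qty"}
)
missing_source = validate_run_input({"prompt": "Group by country"})
```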

Mode Selection

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| useAdvancedFeatures | boolean | false | Enable Premium features (reasoning + grounding) |
| llmProvider | string | - | BYOK provider: groq, google, openai, anthropic |
| groqApiKey | string | - | Your Groq API key |
| googleApiKey | string | - | Your Google API key |
| openaiApiKey | string | - | Your OpenAI API key |
| anthropicApiKey | string | - | Your Anthropic API key |

Output Options

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| outputFormat | string | csv | Output format: csv, json, parquet, xlsx |
| includeGeneratedCode | boolean | true | Include Python code in output |
| maxRetries | number | 3 | Max code generation retry attempts |

Mode Selection Logic

IF llmProvider is set AND corresponding API key is provided:
    IF useAdvancedFeatures is true AND llmProvider is "google":
        BYOK PREMIUM (your Google key + Premium features)
    ELSE:
        BYOK (your key, basic features)
ELSE:
    IF useAdvancedFeatures is true:
        PREMIUM (hosted, $0.20)
    ELSE:
        BASIC (hosted, $0.0015)
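
The selection logic above translates almost line for line into Python. This sketch is useful for predicting which mode (and price) a given input will trigger; `resolve_mode` is an illustrative name, and the Actor's internal implementation may differ in details such as how empty keys are treated.

```python
# Which input field holds the key for each BYOK provider (from the table above).
API_KEY_FIELDS = {
    "groq": "groqApiKey",
    "google": "googleApiKey",
    "openai": "openaiApiKey",
    "anthropic": "anthropicApiKey",
}


def resolve_mode(run_input: dict) -> str:
    """Mirror the documented mode selection logic for a run input."""
    provider = run_input.get("llmProvider")
    advanced = bool(run_input.get("useAdvancedFeatures", False))

    key_field = API_KEY_FIELDS.get(provider)
    has_key = bool(key_field and run_input.get(key_field))

    if has_key:
        if advanced and provider == "google":
            return "BYOK PREMIUM"
        return "BYOK"
    return "PREMIUM" if advanced else "BASIC"


mode = resolve_mode(
    {"llmProvider": "google", "googleApiKey": "AIza...", "useAdvancedFeatures": True}
)
```

Note that, read literally, the logic sends a provider without a matching key to a hosted mode, and a non-Google provider with `useAdvancedFeatures` set to plain BYOK.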

How It Works

Processing Pipeline

1. INPUT VALIDATION
   ├─ Parse prompt and options
   ├─ Detect mode (Basic/Premium/BYOK)
   └─ Validate data URLs
2. DATA LOADING
   ├─ Load from inputData (direct JSON - zero I/O!)
   ├─ Or fetch from URLs (CSV, Excel, JSON, Parquet)
   ├─ Auto-detect format and encoding
   ├─ Extract schema (column names, types, sample values)
   └─ Handle multiple sources (auto-merge)
3. RAG SEARCH
   ├─ Search Pinecone for similar past transformations
   ├─ If found (>85% similarity), include as context
   └─ Helps LLM generate better code
4. CODE GENERATION
   ├─ Send prompt + schema + RAG context to LLM
   ├─ LLM generates Polars transformation code
   └─ Validate code structure
5. EXECUTION
   ├─ Execute code in sandboxed environment
   ├─ Validate output (no empty results, correct types)
   └─ Retry if errors (up to maxRetries)
6. OUTPUT
   ├─ Export transformed data (CSV/JSON/Parquet/Excel)
   ├─ Save generated code
   └─ Return metadata (rows, timing, etc.)
7. LEARNING
   ├─ Save successful transformation to Pinecone
   └─ Future similar requests benefit from this
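
Steps 4 and 5 form a generate, execute, retry loop. The sketch below shows the general shape of such a loop under stated assumptions: `generate` and `execute` are hypothetical stand-ins for the LLM call and the sandbox, `max_retries` is treated as the total number of attempts, and feeding the previous error back into the next generation is a guess at how retries might work rather than documented behavior.

```python
from typing import Callable


def generate_with_retries(
    generate: Callable[[str, str], str],
    execute: Callable[[str], list],
    prompt: str,
    max_retries: int = 3,
) -> tuple[list, str, int]:
    """Return (rows, code, attempts), or raise once all attempts have failed."""
    feedback = ""
    for attempt in range(1, max_retries + 1):
        code = generate(prompt, feedback)
        try:
            rows = execute(code)
            if not rows:  # step 5: empty results count as a failure
                raise ValueError("transformation produced no rows")
            return rows, code, attempt
        except Exception as exc:
            feedback = f"Previous attempt failed: {exc}"
    raise RuntimeError(f"no valid result after {max_retries} attempts: {feedback}")


# Stand-ins for the LLM and the sandbox, for illustration only.
def fake_generate(prompt: str, feedback: str) -> str:
    return "good code" if feedback else "bad code"


def fake_execute(code: str) -> list:
    if code == "bad code":
        raise SyntaxError("invalid syntax")
    return [{"region": "EU", "revenue": 10}]


rows, code, attempts = generate_with_retries(
    fake_generate, fake_execute, "Sum revenue by region"
)
```

In this toy run the first attempt fails and the second succeeds, which is what the `attempts` field in the output metadata would reflect.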

RAG Memory System

The system learns from every successful transformation:

  1. Before generation: Searches for similar prompts in Pinecone
  2. If found: Includes similar code as context for better results
  3. After success: Saves the new transformation
  4. Over time: Accuracy improves as memory grows

Current memory: 22+ successful transformations and growing.
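
The ">85% similarity" lookup can be pictured with a toy in-memory version. This is not the Actor's Pinecone code: the two-dimensional vectors, the `find_context` helper, and the use of cosine similarity as the metric are all assumptions made for illustration.

```python
import math

SIMILARITY_THRESHOLD = 0.85  # the ">85% similarity" cutoff from the pipeline


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_context(query_vec: list, memory: list):
    """Return the stored code of the best match above the threshold, else None."""
    best = max(
        memory,
        key=lambda item: cosine_similarity(query_vec, item["vector"]),
        default=None,
    )
    if best is not None and cosine_similarity(query_vec, best["vector"]) > SIMILARITY_THRESHOLD:
        return best["code"]
    return None


# Toy memory: embeddings of two past prompts and the code that worked for them.
memory = [
    {"vector": [1.0, 0.0], "code": "df.group_by('region').agg(...)"},
    {"vector": [0.0, 1.0], "code": "df.filter(...)"},
]

close_match = find_context([0.9, 0.1], memory)  # near the first entry
no_match = find_context([0.5, 0.5], memory)     # about 0.71 to both, below the cutoff
```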


Writing Effective Prompts

Structure

[ACTION] + [COLUMNS] + [CONDITIONS] + [OUTPUT]
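
If prompts are assembled programmatically, the structure above maps onto a small helper. This is just one possible way to join the four parts; `build_prompt` is a made-up name, and the Actor accepts any free-form text.

```python
def build_prompt(action: str, columns: str, conditions: str = "", output: str = "") -> str:
    """Join the non-empty parts of [ACTION] + [COLUMNS] + [CONDITIONS] + [OUTPUT]."""
    parts = [action, columns, conditions, output]
    return ", ".join(part.strip() for part in parts if part and part.strip())


prompt = build_prompt(
    action="Group by 'region' column",
    columns="sum 'revenue'",
    output="sort descending, top 10",
)
```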

Examples by Complexity

Simple (use Basic):

Group by 'region' column, sum 'revenue', sort descending, top 10

Medium (use Basic or Premium):

Filter rows where status is 'active' and created_at > 2024-01-01,
calculate total and average order_value per customer

Complex (use Premium):

Convert Magento 2 product export to Shopify CSV format:
- sku -> Handle (lowercase, replace spaces with dashes)
- name -> Title
- description -> Body (HTML)
- price -> Variant Price
- qty -> Variant Inventory Qty
- product_online -> Published (1=true, 0=false)
Only include simple products (exclude configurable/bundle)
Add Vendor column with value "Imported from Magento"
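
To check the Actor's output against what the prompt asks for, it can help to spell out the mapping on a single row. The sketch below is plain Python written for this README, not the Polars code the Actor generates. The `product_type` column name is an assumption about the Magento export, since the prompt does not name the column that distinguishes simple products.

```python
from typing import Optional


def magento_row_to_shopify(row: dict) -> Optional[dict]:
    """Apply the mapping from the prompt above to one row; None means 'exclude'."""
    if row.get("product_type") != "simple":  # assumed column name
        return None
    return {
        "Handle": row["sku"].lower().replace(" ", "-"),
        "Title": row["name"],
        "Body (HTML)": row["description"],
        "Variant Price": row["price"],
        "Variant Inventory Qty": row["qty"],
        "Published": str(row["product_online"]) == "1",  # 1=true, 0=false
        "Vendor": "Imported from Magento",
    }


sample = {
    "sku": "Blue Shirt 01",
    "name": "Blue Shirt",
    "description": "<p>Cotton shirt</p>",
    "price": 29.99,
    "qty": 12,
    "product_online": 1,
    "product_type": "simple",
}
converted = magento_row_to_shopify(sample)
```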

Tips

| Do | Don't |
| --- | --- |
| Name specific columns | Say "transform the data" |
| Specify output format | Assume system knows your schema |
| Use Premium for migrations | Use Basic for Shopify/Stripe formats |
| Break complex tasks into steps | Write 500-word prompts |

API Examples

Direct JSON Input (Recommended)

Single call with data in, results out. No file hosting needed.

curl -X POST "https://api.apify.com/v2/acts/salesmart-srl~polars-ai-data-transformer/run-sync-get-dataset-items?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "inputData": [
      {"product": "iPhone", "price": 999, "quantity": 10},
      {"product": "iPad", "price": 799, "quantity": 5},
      {"product": "MacBook", "price": 1999, "quantity": 3}
    ],
    "prompt": "Calculate total_value = price * quantity, sort by total_value descending",
    "outputFormat": "json"
  }'

Response includes output_data array with all transformed rows.

Basic Mode (with URL)

curl -X POST "https://api.apify.com/v2/acts/salesmart-srl~polars-ai-data-transformer/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "datasetUrls": ["https://example.com/sales.csv"],
    "prompt": "Group by region, sum revenue, sort descending"
  }'

Premium Mode

curl -X POST "https://api.apify.com/v2/acts/salesmart-srl~polars-ai-data-transformer/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "datasetUrls": ["https://example.com/magento-products.csv"],
    "prompt": "Transform to Shopify product import CSV format",
    "useAdvancedFeatures": true
  }'

BYOK Mode (Groq - FREE)

curl -X POST "https://api.apify.com/v2/acts/salesmart-srl~polars-ai-data-transformer/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "datasetUrls": ["https://example.com/data.csv"],
    "prompt": "Calculate monthly trends",
    "llmProvider": "groq",
    "groqApiKey": "gsk_xxxxx"
  }'

BYOK Premium Mode

curl -X POST "https://api.apify.com/v2/acts/salesmart-srl~polars-ai-data-transformer/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "datasetUrls": ["https://example.com/products.csv"],
    "prompt": "Convert to Shopify format with all required columns",
    "llmProvider": "google",
    "googleApiKey": "AIza_xxxxx",
    "useAdvancedFeatures": true
  }'

Python SDK

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Direct JSON input (recommended)
run = client.actor("salesmart-srl/polars-ai-data-transformer").call(
    run_input={
        "inputData": [
            {"product": "iPhone", "price": 999, "qty": 10},
            {"product": "iPad", "price": 799, "qty": 5},
        ],
        "prompt": "Calculate total = price * qty, sort descending",
    }
)

# Get results from dataset (includes output_data)
dataset = client.dataset(run["defaultDatasetId"])
items = list(dataset.iterate_items())
result = items[0]
print(f"Status: {result['status']}")
print(f"Output rows: {result['output_rows']}")
print(f"Transformed data: {result['output_data']}")  # Full data!
print(f"Generated code: {result['generated_code']}")

# With file URL
run = client.actor("salesmart-srl/polars-ai-data-transformer").call(
    run_input={
        "datasetUrls": ["https://example.com/data.csv"],
        "prompt": "Group by category and sum sales",
    }
)

# Premium transformation
run = client.actor("salesmart-srl/polars-ai-data-transformer").call(
    run_input={
        "datasetUrls": ["https://example.com/products.csv"],
        "prompt": "Convert to Shopify product CSV",
        "useAdvancedFeatures": True,
    }
)

Output Format

Response Structure

{
  "status": "success",
  "input_sources_count": 1,
  "input_rows_total": 1000,
  "input_columns": ["sku", "name", "price", "qty"],
  "output_rows": 50,
  "output_columns": ["Handle", "Title", "Variant Price"],
  "output_file": "transformed_data.csv",
  "execution_time_ms": 1234,
  "generation_info": {
    "provider": "google_pro",
    "tokens_used": 4500,
    "generation_time_ms": 890,
    "attempts": 1
  },
  "generated_code": "import polars as pl\n\nresult = ...",
  "output_preview": [
    {"Handle": "product-1", "Title": "Product One", "Variant Price": 29.99}
  ],
  "output_data": [
    {"Handle": "product-1", "Title": "Product One", "Variant Price": 29.99},
    {"Handle": "product-2", "Title": "Product Two", "Variant Price": 49.99}
  ],
  "warnings": [],
  "errors": []
}

Output Fields

| Field | Description |
| --- | --- |
| output_preview | First 10 rows (always present) |
| output_data | Full transformed data (if < 10MB) - use this for API integrations! |
| output_file | Filename in Key-Value Store (for large files) |
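
Since output_data is only present for smaller results, client code may want a fallback to output_file. The helper below is a hedged sketch: `extract_rows` is a hypothetical name, and it assumes that a result over the size limit either omits output_data or sets it to null, and that any status other than "success" should be treated as a failure.

```python
def extract_rows(result: dict) -> tuple:
    """Return ("inline", rows) when full data is embedded, else ("file", filename)."""
    if result.get("status") != "success":
        raise RuntimeError(f"transformation failed: {result.get('errors')}")
    if result.get("output_data") is not None:
        return "inline", result["output_data"]
    # Large result: the caller fetches this file from the run's Key-Value Store.
    return "file", result.get("output_file")


small = extract_rows(
    {"status": "success", "output_data": [{"Handle": "product-1"}], "output_file": "transformed_data.csv"}
)
large = extract_rows({"status": "success", "output_file": "transformed_data.csv"})
```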

Generated Code

Every transformation returns reusable Python code:

import polars as pl

# Load your data
df = pl.read_csv("your_data.csv")

# Generated transformation (copy this!)
result = (
    df.lazy()
    .filter(pl.col("status") == "active")
    .group_by("region")
    .agg(
        pl.col("revenue").sum().alias("total_revenue"),
        pl.col("orders").count().alias("order_count"),
    )
    .sort("total_revenue", descending=True)
    .head(10)
    .collect()
)

# Save
result.write_csv("output.csv")

Performance

  • Handles millions of rows efficiently
  • Typical transformation: 1-3 seconds
  • Uses Polars (Rust-based, 10-100x faster than Pandas)
  • Lazy evaluation for memory efficiency
  • Parallel processing for multi-file inputs

Privacy and Security

  • Encrypted: API keys encrypted with AES-256
  • Isolated: Data processed in isolated containers
  • No retention: Data deleted after run completion
  • No training: Your data is never used to train models
  • BYOK: Full control over your LLM API keys


Changelog

v0.4 (December 2024)

  • NEW: inputData - Pass data directly as JSON, no file hosting needed
  • NEW: output_data - Full transformed data in response (if < 10MB)
  • Single API call: data in, results out
  • Perfect for API integrations and automation

v0.3 (December 2024)

  • Migrated to google-genai SDK
  • ThinkingConfig for extended reasoning
  • Improved Google Search grounding
  • Code cleanup and optimization

v0.2 (December 2024)

  • 4-tier pricing: Basic, Premium, BYOK, BYOK Premium
  • Premium: Gemini Pro + Google Search + RAG
  • RAG system with Pinecone
  • Multi-file support

v0.1 (December 2024)

  • Initial release
  • Multi-provider LLM support
  • CSV, Excel, JSON, Parquet I/O