Pricing

Pay per event

Polars AI Data Transformer

Transform datasets using natural language. Upload CSV/Excel/JSON, describe your transformation in plain English, get results + reusable Python code. Powered by AI.

Developer

Salesmart Srl

Last modified: 2 months ago

AI Data Transformer

Transform any dataset using plain English. No coding required.

Describe what you want in natural language and get the transformed data plus reusable Python code.

Table of Contents

Quick Start
Getting Your Apify API Token
Pricing
Choosing the Right Mode
4 Operating Modes
Complete Input Options
How It Works
Writing Effective Prompts
API Examples
Output Format

Quick Start

Option 1: With file URL

curl -X POST "https://api.apify.com/v2/acts/salesmart-srl~polars-ai-data-transformer/runs?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "datasetUrls": ["https://example.com/data.csv"],
    "prompt": "Group by country and sum sales"
  }'

Option 2: Direct JSON data (no file needed!)

curl -X POST "https://api.apify.com/v2/acts/salesmart-srl~polars-ai-data-transformer/run-sync-get-dataset-items?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "inputData": [
      {"product": "iPhone", "price": 999, "qty": 10},
      {"product": "iPad", "price": 799, "qty": 5}
    ],
    "prompt": "Calculate total value (price * qty) for each product"
  }'

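Option 2 calls the run-sync-get-dataset-items endpoint, which waits for the run to finish, so the response includes output_data with all transformed rows directly in JSON. Option 1 calls the runs endpoint, which starts the run asynchronously and returns the run details instead.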

That's it. No LLM API key needed for Basic mode.

Getting Your Apify API Token

To use this Actor via API, you need an Apify API token.

Step 1: Create Apify Account

Go to apify.com and sign up (free).

Step 2: Get Your API Token

1. Log in to Apify Console
2. Click your profile icon (top right)
3. Go to Settings → Integrations
4. Copy your Personal API Token

Your token looks like: apify_api_xxxxxxxxxxxxxxxxxxxxx

Step 3: Use the Token

Option A: Query parameter

https://api.apify.com/v2/acts/salesmart-srl~polars-ai-data-transformer/runs?token=YOUR_TOKEN

Option B: Authorization header

Authorization: Bearer YOUR_TOKEN
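
As a minimal sketch of Option B in Python, the snippet below builds (but does not send) a request that carries the token in the Authorization header, using only the standard library. The helper name build_run_request is hypothetical; the endpoint is the one shown in Option A without the token query parameter.

```python
import json
import urllib.request

ACTOR_RUNS_URL = (
    "https://api.apify.com/v2/acts/"
    "salesmart-srl~polars-ai-data-transformer/runs"
)


def build_run_request(token: str, run_input: dict) -> urllib.request.Request:
    """Build a POST request that authenticates via the Authorization header.

    Hypothetical helper for illustration. Nothing is sent until the
    request is passed to urllib.request.urlopen().
    """
    return urllib.request.Request(
        ACTOR_RUNS_URL,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_run_request(
    "YOUR_TOKEN",
    {
        "datasetUrls": ["https://example.com/data.csv"],
        "prompt": "Group by country and sum sales",
    },
)
print(request.get_method(), request.full_url)
```

Sending the token in a header keeps it out of the URL, which tends to end up in logs and shell history.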

Free Tier

Apify offers $5 in free credits each month. Basic transformations cost ~$0.0015 each, so the credits cover roughly 3,300 Basic transformations per month ($5 / $0.0015 ≈ 3,333).

Pricing

Pay-per-event pricing. You only pay when a transformation runs successfully.

Mode	Apify Fee	LLM Cost	Total Cost
Basic	$0.0015	Included	$0.0015
Premium	$0.20	Included	$0.20
BYOK	$0.001	Your API	$0.001 + API
BYOK Premium	$0.001	Your API	$0.001 + API

Volume Discounts

Tier	BYOK	Basic	Premium
No Discount	$0.001	$0.0015	$0.20
Bronze	$0.0007	$0.00117	$0.167
Silver	$0.0004	$0.00083	$0.133
Gold	$0.0001	$0.0005	$0.10
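
The tables above can be turned into a small cost estimator. This is a sketch only: the prices are copied from the two tables, the tier key "none" stands for the "No Discount" row, and the function names are hypothetical. It assumes BYOK Premium is billed at the BYOK rate, as the first table suggests, and it leaves out your own LLM API costs.

```python
from decimal import Decimal

# Per-event Apify fees (USD) copied from the Pricing and Volume Discounts tables.
PRICE_PER_RUN = {
    "none": {"byok": Decimal("0.001"), "basic": Decimal("0.0015"), "premium": Decimal("0.20")},
    "bronze": {"byok": Decimal("0.0007"), "basic": Decimal("0.00117"), "premium": Decimal("0.167")},
    "silver": {"byok": Decimal("0.0004"), "basic": Decimal("0.00083"), "premium": Decimal("0.133")},
    "gold": {"byok": Decimal("0.0001"), "basic": Decimal("0.0005"), "premium": Decimal("0.10")},
}


def apify_fee(mode: str, runs: int, tier: str = "none") -> Decimal:
    """Apify fee for `runs` successful transformations (BYOK API costs excluded)."""
    return PRICE_PER_RUN[tier][mode] * runs


def runs_covered(credit: str, mode: str, tier: str = "none") -> int:
    """Number of whole runs that a credit amount (in USD) pays for."""
    return int(Decimal(credit) / PRICE_PER_RUN[tier][mode])


print(apify_fee("basic", 10_000))         # 15.0000
print(apify_fee("premium", 100, "gold"))  # 10.00
print(runs_covered("5", "basic"))         # 3333
```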

Choosing the Right Mode

Decision Tree

Do you need Google Search grounding or complex reasoning?
│
├─ NO → Do you have your own LLM API key?
│       │
│       ├─ NO → Use BASIC ($0.0015)
│       │       Simple, fast, no setup
│       │
│       └─ YES → Use BYOK ($0.001 + your API)
│                Lowest cost if you have free API credits
│
└─ YES → Do you have a Google API key?
         │
         ├─ NO → Use PREMIUM ($0.20)
         │       All features included, no setup
         │
         └─ YES → Use BYOK PREMIUM ($0.001 + your API)
                  Same features as Premium, use your credits

Mode Comparison

Feature	Basic	Premium	BYOK	BYOK Premium
Simple aggregations	Yes	Yes	Yes	Yes
Filtering & sorting	Yes	Yes	Yes	Yes
Data cleaning	Yes	Yes	Yes	Yes
E-commerce migrations	Limited	Best	Limited	Best
Google Search grounding	No	Yes	No	Yes
Extended thinking/reasoning	No	Yes	No	Yes
RAG memory (learns over time)	Yes	Yes	Yes	Yes
Requires LLM API key	No	No	Yes	Yes
Cost	$0.0015	$0.20	$0.001+API	$0.001+API

Premium vs BYOK Premium: What's the Difference?

Nothing, except billing.

Both modes use:

Gemini 2.5 Pro with extended thinking
Google Search grounding
RAG memory system

The only difference:

Premium: We pay Google, you pay us $0.20
BYOK Premium: You pay Google directly, you pay us $0.001

When to use BYOK Premium:

You have Google Cloud credits
You have an enterprise Google agreement
You want to track API usage in your own Google console

4 Operating Modes

Mode 1: Basic (Hosted)

Use when: Simple transformations, high volume, budget-conscious

{
  "datasetUrls": ["https://example.com/data.csv"],
  "prompt": "Group by country and sum sales, show top 10"
}

What you get:

Gemini 2.5 Flash-Lite (fast, efficient)
No API key required
$0.0015 per transformation

Good for:

Aggregations (sum, count, average)
Filtering and sorting
Basic calculations
Data reformatting

Not ideal for:

"Convert to Shopify format" (doesn't know Shopify schema)
Complex multi-step reasoning

Mode 2: Premium (Hosted)

Use when: Complex transformations, e-commerce migrations, need accuracy

{
  "datasetUrls": ["https://example.com/magento-products.csv"],
  "prompt": "Transform to Shopify product import format",
  "useAdvancedFeatures": true
}

What you get:

Gemini 2.5 Pro (most capable model)
Extended thinking (reasons through complex problems)
Google Search grounding (knows external formats)
RAG memory (improves over time)
No API key required
$0.20 per transformation

Good for:

E-commerce platform migrations (Magento→Shopify, etc.)
Format conversions (to Stripe, Mailchimp, etc.)
Complex multi-step transformations
Tasks requiring external knowledge

Why it costs more:

Uses Gemini 2.5 Pro (~$1.25/1M input tokens)
Google Search queries (~$35/1K queries)
We bundle these costs into a flat $0.20 fee

Mode 3: BYOK (Bring Your Own Key)

Use when: You have LLM API credits, want lowest cost

{
  "datasetUrls": ["https://example.com/data.csv"],
  "prompt": "Filter active users and calculate totals",
  "llmProvider": "groq",
  "groqApiKey": "gsk_..."
}

What you get:

Your choice of LLM provider
RAG memory (improves over time)
$0.001 Apify fee + your API costs

Supported providers:

Provider	Model	API Cost	Get Key
Groq	Llama 3.3 70B	FREE tier	console.groq.com
Google	Gemini 2.0 Flash	~$0.10/1M tokens	aistudio.google.com
OpenAI	GPT-4o	~$5/1M tokens	platform.openai.com
Anthropic	Claude Sonnet 4	~$3/1M tokens	console.anthropic.com

Recommended: Groq (FREE)

Groq offers a generous free tier. Combined with our $0.001 fee, you can run thousands of transformations for almost nothing.

Mode 4: BYOK Premium

Use when: You have Google API credits AND need Premium features

{
  "datasetUrls": ["https://example.com/products.csv"],
  "prompt": "Convert to Shopify product CSV format",
  "llmProvider": "google",
  "googleApiKey": "AIza...",
  "useAdvancedFeatures": true
}

What you get:

Same as Premium: Gemini Pro + Google Search + RAG
Uses YOUR Google API key
$0.001 Apify fee + your Google API costs

Your Google API costs:

Gemini 2.5 Pro: ~$1.25/1M input, ~$5/1M output tokens
Google Search grounding: ~$35 per 1,000 queries

Why use this instead of Premium?

You have Google Cloud credits to use up
Your company has a Google enterprise agreement
You want API usage in your own Google console
You're doing very high volume and want direct billing
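
To compare BYOK Premium with the flat $0.20 Premium fee, you can estimate the cost of a single run from the approximate rates quoted in this section. The sketch below is arithmetic only: the function name and the token and query counts in the example are hypothetical, and Google's actual rates may differ from the figures above.

```python
from decimal import Decimal

# Approximate rates quoted in this section (USD).
INPUT_PER_TOKEN = Decimal("1.25") / 1_000_000
OUTPUT_PER_TOKEN = Decimal("5") / 1_000_000
SEARCH_PER_QUERY = Decimal("35") / 1_000
APIFY_FEE = Decimal("0.001")
PREMIUM_FLAT_FEE = Decimal("0.20")


def byok_premium_cost(input_tokens: int, output_tokens: int, search_queries: int) -> Decimal:
    """Estimated total for one BYOK Premium run: Apify fee plus Google API usage."""
    return (
        APIFY_FEE
        + INPUT_PER_TOKEN * input_tokens
        + OUTPUT_PER_TOKEN * output_tokens
        + SEARCH_PER_QUERY * search_queries
    )


# Hypothetical run: 4,000 input tokens, 500 output tokens, 2 search queries.
estimate = byok_premium_cost(4_000, 500, 2)
print(estimate)
print("cheaper than Premium:", estimate < PREMIUM_FLAT_FEE)
```

With these example numbers most of the cost comes from the search queries, so the comparison depends heavily on how often grounding is triggered.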

Complete Input Options

Required

Field	Type	Description
`prompt`	string	Natural language description of transformation

Data Sources (at least one required)

Field	Type	Description
`inputData`	array	Direct JSON data - no file hosting needed!
`datasetUrls`	string[]	URLs to data files (CSV, Excel, JSON, Parquet)
`uploadedFiles`	file[]	Direct file uploads via Apify Console
`apifyDatasetId`	string	ID of existing Apify dataset

Recommended: inputData for API integrations - single call with data in, results out.

Mode Selection

Field	Type	Default	Description
`useAdvancedFeatures`	boolean	`false`	Enable Premium features (reasoning + grounding)
`llmProvider`	string	-	BYOK provider: `groq`, `google`, `openai`, `anthropic`
`groqApiKey`	string	-	Your Groq API key
`googleApiKey`	string	-	Your Google API key
`openaiApiKey`	string	-	Your OpenAI API key
`anthropicApiKey`	string	-	Your Anthropic API key

Output Options

Field	Type	Default	Description
`outputFormat`	string	`csv`	Output format: `csv`, `json`, `parquet`, `xlsx`
`includeGeneratedCode`	boolean	`true`	Include Python code in output
`maxRetries`	number	`3`	Max code generation retry attempts
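
Before calling the API it can help to check an input object against the tables above. The following is a client-side sketch, not the Actor's own validation: check_run_input is a hypothetical helper, and it only enforces the rules stated here (a prompt, at least one data source, a listed output format).

```python
DATA_SOURCE_FIELDS = ("inputData", "datasetUrls", "uploadedFiles", "apifyDatasetId")
OUTPUT_FORMATS = {"csv", "json", "parquet", "xlsx"}


def check_run_input(run_input: dict) -> list:
    """Return a list of problems; an empty list means the input looks valid."""
    problems = []
    if not str(run_input.get("prompt", "")).strip():
        problems.append("prompt is required")
    if not any(run_input.get(field) for field in DATA_SOURCE_FIELDS):
        problems.append("at least one data source is required")
    output_format = run_input.get("outputFormat", "csv")  # documented default
    if output_format not in OUTPUT_FORMATS:
        problems.append(f"unsupported outputFormat: {output_format}")
    return problems


print(check_run_input({"prompt": "Sum sales", "inputData": [{"sales": 1}]}))
print(check_run_input({"prompt": "", "outputFormat": "xml"}))
```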

Mode Selection Logic

IF llmProvider is set AND corresponding API key is provided:
    IF useAdvancedFeatures is true AND llmProvider is "google":
        → BYOK PREMIUM (your Google key + Premium features)
    ELSE:
        → BYOK (your key, basic features)
ELSE:
    IF useAdvancedFeatures is true:
        → PREMIUM (hosted, $0.20)
    ELSE:
        → BASIC (hosted, $0.0015)
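
The same logic, written as a Python sketch you can use to predict which mode (and price) an input will trigger. The function name select_mode and the KEY_FIELD mapping are hypothetical; the branches follow the pseudocode above.

```python
# Maps each BYOK provider to the input field that holds its API key.
KEY_FIELD = {
    "groq": "groqApiKey",
    "google": "googleApiKey",
    "openai": "openaiApiKey",
    "anthropic": "anthropicApiKey",
}


def select_mode(run_input: dict) -> str:
    """Return the mode implied by an input object, following the logic above."""
    provider = run_input.get("llmProvider")
    advanced = bool(run_input.get("useAdvancedFeatures", False))
    key_field = KEY_FIELD.get(provider)
    has_key = bool(key_field and run_input.get(key_field))

    if has_key:
        if advanced and provider == "google":
            return "BYOK PREMIUM"
        return "BYOK"
    return "PREMIUM" if advanced else "BASIC"


print(select_mode({"prompt": "..."}))
print(select_mode({"llmProvider": "groq", "groqApiKey": "gsk_...", "useAdvancedFeatures": True}))
```

Two consequences of this logic are worth noting: useAdvancedFeatures has no effect with a non-Google BYOK provider, and setting llmProvider without the matching key falls back to a hosted mode.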

How It Works

Processing Pipeline

1. INPUT VALIDATION
   ├─ Parse prompt and options
   ├─ Detect mode (Basic/Premium/BYOK/BYOK Premium)
   └─ Validate data URLs

2. DATA LOADING
   ├─ Load from inputData (direct JSON - zero I/O!)
   ├─ Or fetch from URLs (CSV, Excel, JSON, Parquet)
   ├─ Auto-detect format and encoding
   ├─ Extract schema (column names, types, sample values)
   └─ Handle multiple sources (auto-merge)

3. RAG SEARCH
   ├─ Search Pinecone for similar past transformations
   ├─ If found (>85% similarity), include as context
   └─ Helps LLM generate better code

4. CODE GENERATION
   ├─ Send prompt + schema + RAG context to LLM
   ├─ LLM generates Polars transformation code
   └─ Validate code structure

5. EXECUTION
   ├─ Execute code in sandboxed environment
   ├─ Validate output (no empty results, correct types)
   └─ Retry if errors (up to maxRetries)

6. OUTPUT
   ├─ Export transformed data (CSV/JSON/Parquet/Excel)
   ├─ Save generated code
   └─ Return metadata (rows, timing, etc.)

7. LEARNING
   ├─ Save successful transformation to Pinecone
   └─ Future similar requests benefit from this
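
Steps 4 and 5 amount to a generate, execute, validate loop. The sketch below shows the general shape with stand-in functions; it is not the Actor's implementation. It assumes maxRetries counts total attempts and that the previous error is passed back to the generator, neither of which is confirmed above. All names are hypothetical.

```python
from typing import Callable, Optional


def run_with_retries(
    generate: Callable[[str, Optional[str]], str],
    execute: Callable[[str], list],
    prompt: str,
    max_retries: int = 3,
) -> tuple:
    """Generate code, execute it, and retry on failure. Returns (rows, attempts)."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        code = generate(prompt, last_error)
        try:
            rows = execute(code)
            if not rows:  # output validation: reject empty results
                raise ValueError("transformation returned no rows")
            return rows, attempt
        except Exception as exc:  # any failure triggers another attempt
            last_error = str(exc)
    raise RuntimeError(f"failed after {max_retries} attempts: {last_error}")


# Stand-ins for the LLM call and the sandbox, for demonstration only.
def fake_generate(prompt: str, previous_error: Optional[str]) -> str:
    return "bad" if previous_error is None else "good"


def fake_execute(code: str) -> list:
    if code == "bad":
        raise ValueError("column not found")
    return [{"region": "EU", "total": 10}]


rows, attempts = run_with_retries(fake_generate, fake_execute, "Sum revenue by region")
print(attempts, rows)
```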

RAG Memory System

The system learns from every successful transformation:

Before generation: Searches for similar prompts in Pinecone
If found: Includes similar code as context for better results
After success: Saves the new transformation
Over time: Accuracy improves as memory grows

Current memory: 22+ successful transformations and growing.
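
The lookup step can be illustrated with an in-memory stand-in for Pinecone. This sketch assumes cosine similarity over embedding vectors and uses tiny hand-made vectors in place of real embeddings; only the 85% threshold comes from the pipeline description. The names are hypothetical.

```python
import math

SIMILARITY_THRESHOLD = 0.85  # from the pipeline description (>85% similarity)


def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_context(query_vector: list, memory: list):
    """Return the stored code of the closest past transformation, or None."""
    best = max(
        memory,
        key=lambda item: cosine_similarity(query_vector, item["vector"]),
        default=None,
    )
    if best and cosine_similarity(query_vector, best["vector"]) > SIMILARITY_THRESHOLD:
        return best["code"]
    return None


# Toy memory: two past transformations with made-up 2-d "embeddings".
memory = [
    {"vector": [0.9, 0.1], "code": "df.group_by('region').agg(pl.col('revenue').sum())"},
    {"vector": [0.0, 1.0], "code": "df.filter(pl.col('status') == 'active')"},
]

print(find_context([1.0, 0.0], memory))  # close match: stored code is reused as context
print(find_context([0.5, 0.5], memory))  # no match above the threshold
```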

Writing Effective Prompts

Structure

[ACTION] + [COLUMNS] + [CONDITIONS] + [OUTPUT]
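
If you generate prompts programmatically, the structure above maps onto a small helper. This is only a sketch; build_prompt is a hypothetical function, not part of the Actor.

```python
def build_prompt(action: str, columns: str, conditions: str = "", output: str = "") -> str:
    """Join the four parts of the structure into one prompt, skipping empty parts."""
    parts = [f"{action} {columns}".strip(), conditions.strip(), output.strip()]
    return ", ".join(part for part in parts if part)


prompt = build_prompt(
    action="Group by 'region' and sum",
    columns="'revenue'",
    output="sort descending, top 10",
)
print(prompt)  # Group by 'region' and sum 'revenue', sort descending, top 10
```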

Examples by Complexity

Simple (use Basic):

Group by 'region' column, sum 'revenue', sort descending, top 10

Medium (use Basic or Premium):

Filter rows where status is 'active' and created_at > 2024-01-01,
calculate total and average order_value per customer

Complex (use Premium):

Convert Magento 2 product export to Shopify CSV format:
- sku -> Handle (lowercase, replace spaces with dashes)
- name -> Title
- description -> Body (HTML)
- price -> Variant Price
- qty -> Variant Inventory Qty
- product_online -> Published (1=true, 0=false)
Only include simple products (exclude configurable/bundle)
Add Vendor column with value "Imported from Magento"
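
A prompt this specific is effectively a specification. The plain-Python sketch below spells out the same mapping, which can be handy for spot-checking the Actor's output; it is not the code the Actor generates (that uses Polars). The product_type column used to identify simple products is an assumption about the export, and the function name is hypothetical.

```python
def magento_to_shopify(rows: list) -> list:
    """Apply the column mapping from the prompt above to a list of row dicts."""
    converted = []
    for row in rows:
        # Assumption: the export has a product_type column marking simple products.
        if row.get("product_type") != "simple":
            continue
        converted.append({
            "Handle": row["sku"].strip().lower().replace(" ", "-"),
            "Title": row["name"],
            "Body (HTML)": row["description"],
            "Variant Price": row["price"],
            "Variant Inventory Qty": row["qty"],
            "Published": "true" if str(row["product_online"]) == "1" else "false",
            "Vendor": "Imported from Magento",
        })
    return converted


sample = [
    {"sku": "Blue Shirt M", "name": "Blue Shirt", "description": "<p>Cotton</p>",
     "price": "19.90", "qty": "12", "product_online": "1", "product_type": "simple"},
    {"sku": "BUNDLE 1", "name": "Starter Bundle", "description": "",
     "price": "49.00", "qty": "3", "product_online": "0", "product_type": "bundle"},
]

for product in magento_to_shopify(sample):
    print(product["Handle"], product["Published"])
```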

Tips

Do	Don't
Name specific columns	Say "transform the data"
Specify output format	Assume system knows your schema
Use Premium for migrations	Use Basic for Shopify/Stripe formats
Break complex tasks into steps	Write 500-word prompts

API Examples

Direct JSON Input (Recommended for API)

Single call with data in, results out. No file hosting needed.

curl -X POST "https://api.apify.com/v2/acts/salesmart-srl~polars-ai-data-transformer/run-sync-get-dataset-items?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "inputData": [
      {"product": "iPhone", "price": 999, "quantity": 10},
      {"product": "iPad", "price": 799, "quantity": 5},
      {"product": "MacBook", "price": 1999, "quantity": 3}
    ],
    "prompt": "Calculate total_value = price * quantity, sort by total_value descending",
    "outputFormat": "json"
  }'

Response includes output_data array with all transformed rows.

Basic Mode (with URL)

curl -X POST "https://api.apify.com/v2/acts/salesmart-srl~polars-ai-data-transformer/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "datasetUrls": ["https://example.com/sales.csv"],
    "prompt": "Group by region, sum revenue, sort descending"
  }'

Premium Mode

curl -X POST "https://api.apify.com/v2/acts/salesmart-srl~polars-ai-data-transformer/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "datasetUrls": ["https://example.com/magento-products.csv"],
    "prompt": "Transform to Shopify product import CSV format",
    "useAdvancedFeatures": true
  }'

BYOK Mode (Groq - FREE)

curl -X POST "https://api.apify.com/v2/acts/salesmart-srl~polars-ai-data-transformer/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "datasetUrls": ["https://example.com/data.csv"],
    "prompt": "Calculate monthly trends",
    "llmProvider": "groq",
    "groqApiKey": "gsk_xxxxx"
  }'

BYOK Premium Mode

curl -X POST "https://api.apify.com/v2/acts/salesmart-srl~polars-ai-data-transformer/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "datasetUrls": ["https://example.com/products.csv"],
    "prompt": "Convert to Shopify format with all required columns",
    "llmProvider": "google",
    "googleApiKey": "AIza_xxxxx",
    "useAdvancedFeatures": true
  }'

Python SDK

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Direct JSON input (recommended)
run = client.actor("salesmart-srl/polars-ai-data-transformer").call(
    run_input={
        "inputData": [
            {"product": "iPhone", "price": 999, "qty": 10},
            {"product": "iPad", "price": 799, "qty": 5},
        ],
        "prompt": "Calculate total = price * qty, sort descending",
    }
)

# Get results from dataset (includes output_data)
dataset = client.dataset(run["defaultDatasetId"])
items = list(dataset.iterate_items())
result = items[0]

print(f"Status: {result['status']}")
print(f"Output rows: {result['output_rows']}")
print(f"Transformed data: {result['output_data']}")  # Full data!
print(f"Generated code: {result['generated_code']}")

# With file URL
run = client.actor("salesmart-srl/polars-ai-data-transformer").call(
    run_input={
        "datasetUrls": ["https://example.com/data.csv"],
        "prompt": "Group by category and sum sales",
    }
)

# Premium transformation
run = client.actor("salesmart-srl/polars-ai-data-transformer").call(
    run_input={
        "datasetUrls": ["https://example.com/products.csv"],
        "prompt": "Convert to Shopify product CSV",
        "useAdvancedFeatures": True,
    }
)

Output Format

Response Structure

{
  "status": "success",
  "input_sources_count": 1,
  "input_rows_total": 1000,
  "input_columns": ["sku", "name", "price", "qty"],
  "output_rows": 50,
  "output_columns": ["Handle", "Title", "Variant Price"],
  "output_file": "transformed_data.csv",
  "execution_time_ms": 1234,
  "generation_info": {
    "provider": "google_pro",
    "tokens_used": 4500,
    "generation_time_ms": 890,
    "attempts": 1
  },
  "generated_code": "import polars as pl\n\nresult = ...",
  "output_preview": [
    {"Handle": "product-1", "Title": "Product One", "Variant Price": 29.99}
  ],
  "output_data": [
    {"Handle": "product-1", "Title": "Product One", "Variant Price": 29.99},
    {"Handle": "product-2", "Title": "Product Two", "Variant Price": 49.99}
  ],
  "warnings": [],
  "errors": []
}

Output Fields

Field	Description
`output_preview`	First 10 rows (always present)
`output_data`	Full transformed data (if < 10MB) - use this for API integrations!
`output_file`	Filename in Key-Value Store (for large files)
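
A client can use these fields to decide where to read the rows from. The sketch below assumes that output_data is absent or null when the result is too large, which the table implies but does not state outright; extract_rows is a hypothetical helper.

```python
def extract_rows(result: dict) -> list:
    """Return the transformed rows from one result item, or raise if unavailable."""
    if result.get("status") != "success":
        raise RuntimeError(f"transformation failed: {result.get('errors')}")
    rows = result.get("output_data")
    if rows is None:
        # Assumed behavior for large outputs: only the file reference is present.
        raise LookupError(
            f"output_data missing; fetch {result.get('output_file')!r} "
            "from the Key-Value Store instead"
        )
    return rows


item = {
    "status": "success",
    "output_file": "transformed_data.csv",
    "output_data": [{"Handle": "product-1", "Title": "Product One"}],
}
print(extract_rows(item))
```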

Generated Code

Every transformation returns reusable Python code:

import polars as pl

# Load your data
df = pl.read_csv("your_data.csv")

# Generated transformation (copy this!)
result = (
    df.lazy()
    .filter(pl.col("status") == "active")
    .group_by("region")
    .agg(
        pl.col("revenue").sum().alias("total_revenue"),
        pl.col("orders").count().alias("order_count")
    )
    .sort("total_revenue", descending=True)
    .head(10)
    .collect()
)

# Save
result.write_csv("output.csv")

Performance

Handles millions of rows efficiently
Typical transformation: 1-3 seconds
Uses Polars (Rust-based; up to 10-100x faster than pandas, depending on the workload)
Lazy evaluation for memory efficiency
Parallel processing for multi-file inputs

Privacy and Security

Encrypted: API keys encrypted with AES-256
Isolated: Data processed in isolated containers
No retention: Data deleted after run completion
No training: Your data is never used to train models
BYOK: Full control over your LLM API keys

Support

Issues: GitHub Issues
Actor page: Apify Store

Changelog

v0.4 (December 2024)

NEW: inputData - Pass data directly as JSON, no file hosting needed
NEW: output_data - Full transformed data in response (if < 10MB)
Single API call: data in, results out
Perfect for API integrations and automation

v0.3 (December 2024)

Migrated to google-genai SDK
ThinkingConfig for extended reasoning
Improved Google Search grounding
Code cleanup and optimization

v0.2 (December 2024)

4-tier pricing: Basic, Premium, BYOK, BYOK Premium
Premium: Gemini Pro + Google Search + RAG
RAG system with Pinecone
Multi-file support

v0.1 (December 2024)

Initial release
Multi-provider LLM support
CSV, Excel, JSON, Parquet I/O
