Developer: Jamshaid Arif
Pricing: from $10.00 / 1,000 results
🧹 Data Cleaning & Transformation Toolkit — Apify Actor
A powerful, multi-mode Apify actor that transforms messy, unstructured data into clean, structured JSON — ready for APIs, databases, or downstream processing.
🎯 What It Does
| Mode | Input | Output |
|---|---|---|
| Messy Text | Inconsistent text with mixed delimiters | Clean JSON records |
| Excel / CSV | .xlsx or .csv file URL | API-ready JSON with metadata |
| HTML Scrape | Raw HTML or live URLs | Structured dataset (tables, elements, links) |
| Key-Value | .env, .ini, logs, YAML-like text | Parsed JSON object or records |
| URL Fetch | Any webpage URL | Auto-extracted structured data |
🚀 Quick Start Examples
1. Messy Text → JSON
```json
{
  "mode": "messy_text",
  "inputText": "Name: John Doe | Age: 29 | City: New York\nName: Jane Smith | Age: 32 | City: LA",
  "textParseStrategy": "auto",
  "outputFormat": "records"
}
```
Output:
```json
[
  {"name": "John Doe", "age": 29, "city": "New York"},
  {"name": "Jane Smith", "age": 32, "city": "LA"}
]
```
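The core idea of the messy-text mode can be sketched in a few lines of Python (illustrative only, not the actor's actual implementation): split the input into lines, split fields on the `|` delimiter, split key from value on `:`, normalize keys, and cast numeric strings.

```python
def parse_messy_text(text: str) -> list[dict]:
    """Illustrative sketch of 'Name: X | Age: 29' parsing (not the actor's code)."""
    records = []
    for line in text.splitlines():
        record = {}
        for field in line.split("|"):
            if ":" not in field:
                continue
            key, _, value = field.partition(":")
            key = key.strip().lower()  # normalize field names
            value = value.strip()
            # cast purely numeric strings to int, keep everything else as text
            record[key] = int(value) if value.isdigit() else value
        if record:
            records.append(record)
    return records

rows = parse_messy_text(
    "Name: John Doe | Age: 29 | City: New York\n"
    "Name: Jane Smith | Age: 32 | City: LA"
)
```

The real mode adds delimiter auto-detection and richer type casting, but the shape of the output matches this sketch.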
2. Excel / CSV → API-Ready JSON
```json
{
  "mode": "excel_csv",
  "fileUrl": "https://example.com/data/sales_report.xlsx",
  "sheetName": "Q1",
  "skipEmptyRows": true,
  "outputFormat": "wrapped"
}
```
Output:
```json
{
  "meta": {
    "source": "sales_report.xlsx",
    "sheet": "Q1",
    "total_records": 150,
    "columns": ["id", "product", "revenue"],
    "generated_at": "2026-04-01T12:00:00"
  },
  "data": [
    {"id": 1, "product": "Widget A", "revenue": 9999.50}
  ]
}
```
3. Scrape a Website
```json
{
  "mode": "html_scrape",
  "urls": [{"url": "https://books.toscrape.com/"}],
  "htmlExtractMode": "elements",
  "cssSelector": "article.product_pod",
  "fieldMap": "{\"title\": \"h3 a\", \"price\": \".price_color\"}"
}
```
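The `cssSelector`/`fieldMap` pattern works like this: the selector picks each repeating element, and the field map applies a sub-selector per output key inside it. A minimal sketch with BeautifulSoup (an assumed dependency here; the HTML is invented for illustration):

```python
from bs4 import BeautifulSoup  # assumed dependency, for illustration only

# Tiny stand-in for a scraped page (not real books.toscrape.com markup)
html = """
<article class="product_pod"><h3><a>Book One</a></h3><p class="price_color">£10.00</p></article>
<article class="product_pod"><h3><a>Book Two</a></h3><p class="price_color">£12.50</p></article>
"""
field_map = {"title": "h3 a", "price": ".price_color"}

soup = BeautifulSoup(html, "html.parser")
records = []
for element in soup.select("article.product_pod"):  # cssSelector: the repeating unit
    record = {}
    for key, selector in field_map.items():         # fieldMap: output key -> sub-selector
        match = element.select_one(selector)
        record[key] = match.get_text(strip=True) if match else None
    records.append(record)
```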
4. Parse Config Files
```json
{
  "mode": "key_value",
  "inputText": "[database]\nhost = localhost\nport = 5432\n\n[cache]\ndriver = redis\nttl = 3600",
  "kvFormat": "auto",
  "outputFormat": "flat"
}
```
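For INI-style input like the example above, the parse step is close to what Python's `configparser` does out of the box. A sketch of flattening sections into a single object (the section-dotted key naming is an assumption for illustration, not necessarily the actor's exact scheme):

```python
import configparser

ini_text = "[database]\nhost = localhost\nport = 5432\n\n[cache]\ndriver = redis\nttl = 3600"

parser = configparser.ConfigParser()
parser.read_string(ini_text)

# Flatten sections into one dict, casting numeric strings to int.
# Key naming ("section.key") is an illustrative assumption.
flat = {}
for section in parser.sections():
    for key, value in parser.items(section):
        flat[f"{section}.{key}"] = int(value) if value.isdigit() else value
```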
5. Auto-Extract from URL
```json
{
  "mode": "url_fetch",
  "urls": [{"url": "https://en.wikipedia.org/wiki/Web_scraping"}],
  "outputFormat": "records"
}
```
⚙️ Input Schema Reference
Core Settings
| Field | Type | Default | Description |
|---|---|---|---|
| `mode` | enum | `messy_text` | Transformation mode |
| `inputText` | string | (sample data) | Raw text input |
| `fileUrl` | string | `""` | URL to download a file from |
| `urls` | array | `[{"url": "https://books.toscrape.com/"}]` | URLs to scrape |
| `outputFormat` | enum | `records` | Output structure: `records`, `wrapped`, or `flat` |
Text Mode Options
| Field | Type | Default | Description |
|---|---|---|---|
| `textParseStrategy` | enum | `auto` | `auto`, `delimited`, `key_value`, or `block` |
Key-Value Mode Options
| Field | Type | Default | Description |
|---|---|---|---|
| `kvFormat` | enum | `auto` | `auto`, `env`, `ini`, `log`, or `yaml` |
HTML Mode Options
| Field | Type | Default | Description |
|---|---|---|---|
| `htmlExtractMode` | enum | `auto` | `tables`, `elements`, `links`, `text`, or `auto` |
| `cssSelector` | string | `article.product_pod` | CSS selector for repeating elements |
| `fieldMap` | JSON string | (book fields) | Maps output keys to CSS selectors |
Excel Mode Options
| Field | Type | Default | Description |
|---|---|---|---|
| `sheetName` | string | `""` | Specific sheet (empty = all) |
| `skipEmptyRows` | boolean | `true` | Remove blank rows |
| `forwardFillColumns` | string | `""` | Comma-separated columns to forward-fill |
| `pageSize` | integer | `0` | Records per page (0 = no pagination) |
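Forward-filling is the standard fix for merged cells, which export as a value in the first row followed by blanks. A sketch of what `forwardFillColumns` does, using pandas (illustrative data; whether the actor uses pandas internally is an assumption):

```python
import pandas as pd

# Rows as they typically export from a sheet with a merged "region" cell:
# the value appears once, then the covered rows come out empty.
df = pd.DataFrame({
    "region": ["North", None, None, "South"],
    "sales":  [100, 120, 90, 200],
})

# forwardFillColumns="region" would repeat the last seen value down the column
df["region"] = df["region"].ffill()
```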
Network Options
| Field | Type | Default | Description |
|---|---|---|---|
| `proxyConfiguration` | object | `{"useApifyProxy": true}` | Proxy settings |
| `maxRequestRetries` | integer | `3` | Max retries for HTTP requests |
📤 Output Formats
`records` (default)
Each extracted record becomes its own row in the Apify dataset. Best for large datasets and downstream processing.
`wrapped`
A single dataset entry with `meta` (source info, column names, timestamps) and `data` (array of records). Best for API responses.
`flat`
Outputs the parsed object directly. Ideal for config-file parsing where you want a single JSON object.
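The difference between the three formats can be sketched in Python (a minimal illustration of the shapes described above; the exact `meta` fields beyond those listed are assumptions):

```python
from datetime import datetime, timezone

records = [{"id": 1, "product": "Widget A"}, {"id": 2, "product": "Widget B"}]

# records: each dict is pushed as its own dataset row
dataset_rows = records

# wrapped: one entry carrying metadata alongside the data array
wrapped = {
    "meta": {
        "total_records": len(records),
        "columns": sorted(records[0].keys()),
        "generated_at": datetime.now(timezone.utc).isoformat(),
    },
    "data": records,
}

# flat: the parsed object itself, e.g. the result of config parsing
flat = {"database.host": "localhost", "database.port": 5432}
```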
🧠 Smart Features
- **Auto-Detection**: Every mode has an `auto` strategy that detects the input format
- **Type Casting**: Strings like `"42"`, `"true"`, `"null"` are automatically cast to native types
- **Key Normalization**: All field names are converted to `snake_case`
- **Merged Cell Handling**: Forward-fill support for Excel files with merged cells
- **Pagination**: Built-in `page`/`pageSize` support for large Excel datasets
- **Metadata**: Wrapped output includes source file, column names, timestamps, and record counts
- **Error Resilience**: Failed URLs are logged with `_error` fields instead of crashing
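Type casting and key normalization are simple enough to sketch directly (an illustrative approximation, not the actor's exact rules):

```python
import re

def cast_value(value: str):
    """Cast common string literals to native types (illustrative sketch)."""
    lowered = value.strip().lower()
    if lowered == "null":
        return None
    if lowered in ("true", "false"):
        return lowered == "true"
    try:
        return int(value)
    except ValueError:
        pass
    try:
        return float(value)
    except ValueError:
        return value  # not a literal: keep the original string

def to_snake_case(name: str) -> str:
    """Normalize 'Total Revenue' or 'totalRevenue' to 'total_revenue'."""
    name = re.sub(r"[\s\-]+", "_", name.strip())          # spaces/hyphens -> underscores
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)   # split camelCase boundaries
    return name.lower()
```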
📂 Project Structure
```
apify-actor/
├── .actor/
│   ├── actor.json          # Actor configuration
│   └── input_schema.json   # Input schema with defaults
├── main.py                 # Actor entry point
├── data_transformer.py     # Core transformation engine
├── requirements.txt        # Python dependencies
├── Dockerfile              # Container build instructions
└── README.md               # This file
```
🏗️ Local Development
```bash
# Install dependencies
pip install -r requirements.txt

# Run locally with Apify CLI
apify run --input='{"mode": "messy_text", "inputText": "Name: Alice | Age: 30"}'
```
📜 License
ISC