Pricing: from $10.00 / 1,000 results
Developer: Jamshaid Arif (Maintained by Community)
Last modified: 7 days ago

🧹 Data Cleaning & Transformation Toolkit — Apify Actor

A powerful, multi-mode Apify actor that transforms messy, unstructured data into clean, structured JSON — ready for APIs, databases, or downstream processing.


🎯 What It Does

| Mode | Input | Output |
|---|---|---|
| Messy Text | Inconsistent text with mixed delimiters | Clean JSON records |
| Excel / CSV | `.xlsx` or `.csv` file URL | API-ready JSON with metadata |
| HTML Scrape | Raw HTML or live URLs | Structured dataset (tables, elements, links) |
| Key-Value | `.env`, `.ini`, logs, YAML-like text | Parsed JSON object or records |
| URL Fetch | Any webpage URL | Auto-extracted structured data |

🚀 Quick Start Examples

1. Messy Text → JSON

```json
{
  "mode": "messy_text",
  "inputText": "Name: John Doe | Age: 29 | City: New York\nName: Jane Smith | Age: 32 | City: LA",
  "textParseStrategy": "auto",
  "outputFormat": "records"
}
```

Output:

```json
[
  {"name": "John Doe", "age": 29, "city": "New York"},
  {"name": "Jane Smith", "age": 32, "city": "LA"}
]
```
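The pipe-delimited case above can be sketched in a few lines of plain Python. This is a simplified illustration only; the actor's `auto` strategy presumably also detects other delimiters and applies fuller type casting than shown here:

```python
def parse_messy_line(line: str, delimiter: str = "|") -> dict:
    """Split a line like 'Name: John Doe | Age: 29' into a typed record."""
    record = {}
    for part in line.split(delimiter):
        key, _, value = part.partition(":")
        if not key.strip():
            continue
        value = value.strip()
        # Cast purely numeric strings to integers, leave the rest as text
        record[key.strip().lower()] = int(value) if value.isdigit() else value
    return record

text = "Name: John Doe | Age: 29 | City: New York\nName: Jane Smith | Age: 32 | City: LA"
records = [parse_messy_line(line) for line in text.splitlines()]
# records[0] == {"name": "John Doe", "age": 29, "city": "New York"}
```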

2. Excel / CSV → API-Ready JSON

```json
{
  "mode": "excel_csv",
  "fileUrl": "https://example.com/data/sales_report.xlsx",
  "sheetName": "Q1",
  "skipEmptyRows": true,
  "outputFormat": "wrapped"
}
```

Output:

```json
{
  "meta": {
    "source": "sales_report.xlsx",
    "sheet": "Q1",
    "total_records": 150,
    "columns": ["id", "product", "revenue"],
    "generated_at": "2026-04-01T12:00:00"
  },
  "data": [
    {"id": 1, "product": "Widget A", "revenue": 9999.50}
  ]
}
```

3. Scrape a Website

```json
{
  "mode": "html_scrape",
  "urls": [{"url": "https://books.toscrape.com/"}],
  "htmlExtractMode": "elements",
  "cssSelector": "article.product_pod",
  "fieldMap": "{\"title\": \"h3 a\", \"price\": \".price_color\"}"
}
```
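Conceptually, a `fieldMap` pairs each output key with a CSS selector scoped to the repeating element. A minimal sketch of that idea, assuming BeautifulSoup as the HTML parser (the actor's actual extraction code is not shown in this README):

```python
import json

from bs4 import BeautifulSoup

html = """
<article class="product_pod"><h3><a>A Light in the Attic</a></h3>
<p class="price_color">£51.77</p></article>
"""

field_map = json.loads('{"title": "h3 a", "price": ".price_color"}')
soup = BeautifulSoup(html, "html.parser")

records = []
for element in soup.select("article.product_pod"):
    # Look up each mapped selector inside the repeating element
    record = {
        key: (node.get_text(strip=True) if (node := element.select_one(sel)) else None)
        for key, sel in field_map.items()
    }
    records.append(record)
# records == [{"title": "A Light in the Attic", "price": "£51.77"}]
```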

4. Parse Config Files

```json
{
  "mode": "key_value",
  "inputText": "[database]\nhost = localhost\nport = 5432\n\n[cache]\ndriver = redis\nttl = 3600",
  "kvFormat": "auto",
  "outputFormat": "flat"
}
```
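For the INI input above, the `flat` output collapses sections into a single object. A hedged sketch using the standard-library `configparser` (the actor's internals may differ, and the `section_key` flattening scheme here is an assumption):

```python
import configparser

ini_text = "[database]\nhost = localhost\nport = 5432\n\n[cache]\ndriver = redis\nttl = 3600"

parser = configparser.ConfigParser()
parser.read_string(ini_text)

flat = {}
for section in parser.sections():
    for key, value in parser.items(section):
        # Flatten as section_key and cast integer strings to native ints
        flat[f"{section}_{key}"] = int(value) if value.isdigit() else value
# flat == {"database_host": "localhost", "database_port": 5432,
#          "cache_driver": "redis", "cache_ttl": 3600}
```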

5. Auto-Extract from URL

```json
{
  "mode": "url_fetch",
  "urls": [{"url": "https://en.wikipedia.org/wiki/Web_scraping"}],
  "outputFormat": "records"
}
```

⚙️ Input Schema Reference

Core Settings

| Field | Type | Default | Description |
|---|---|---|---|
| `mode` | enum | `messy_text` | Transformation mode |
| `inputText` | string | (sample data) | Raw text input |
| `fileUrl` | string | `""` | URL to download a file from |
| `urls` | array | `[{url: "https://books.toscrape.com/"}]` | URLs to scrape |
| `outputFormat` | enum | `records` | Output structure: `records`, `wrapped`, or `flat` |

Text Mode Options

| Field | Type | Default | Description |
|---|---|---|---|
| `textParseStrategy` | enum | `auto` | `auto`, `delimited`, `key_value`, or `block` |

Key-Value Mode Options

| Field | Type | Default | Description |
|---|---|---|---|
| `kvFormat` | enum | `auto` | `auto`, `env`, `ini`, `log`, or `yaml` |

HTML Mode Options

| Field | Type | Default | Description |
|---|---|---|---|
| `htmlExtractMode` | enum | `auto` | `tables`, `elements`, `links`, `text`, or `auto` |
| `cssSelector` | string | `article.product_pod` | CSS selector for repeating elements |
| `fieldMap` | JSON string | (book fields) | Maps output keys to CSS selectors |

Excel Mode Options

| Field | Type | Default | Description |
|---|---|---|---|
| `sheetName` | string | `""` | Specific sheet (empty = all) |
| `skipEmptyRows` | boolean | `true` | Remove blank rows |
| `forwardFillColumns` | string | `""` | Comma-separated columns to forward-fill |
| `pageSize` | integer | `0` | Records per page (0 = no pagination) |
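Forward-filling matters because merged Excel cells are read back as blanks below the first row. The effect of `forwardFillColumns` can be illustrated with pandas (an assumption; this README does not state which library the actor uses for spreadsheet handling):

```python
import pandas as pd

# Merged cells in the "region" column come through as missing values
df = pd.DataFrame({
    "region": ["North", None, "South", None],
    "revenue": [100, 200, 300, 400],
})

# Forward-fill only the named column, as forwardFillColumns would
df["region"] = df["region"].ffill()
# df["region"].tolist() == ["North", "North", "South", "South"]
```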

Network Options

| Field | Type | Default | Description |
|---|---|---|---|
| `proxyConfiguration` | object | `{useApifyProxy: true}` | Proxy settings |
| `maxRequestRetries` | integer | `3` | Max retries for HTTP requests |

📤 Output Formats

records (default)

Each extracted record becomes its own row in the Apify dataset. Best for large datasets and downstream processing.

wrapped

A single dataset entry with meta (source info, column names, timestamps) and data (array of records). Best for API responses.

flat

Outputs the parsed object directly. Ideal for config file parsing where you want a single JSON object.
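The difference between the formats is easiest to see on one dataset. A small illustration, reusing the field names from the wrapped example earlier in this README (the exact `meta` keys the actor emits are taken from that example):

```python
from datetime import datetime, timezone

records = [{"id": 1, "product": "Widget A", "revenue": 9999.50}]

# "records": each dict would be pushed as its own dataset row.
# "wrapped": one entry bundling metadata with the data array:
wrapped = {
    "meta": {
        "source": "sales_report.xlsx",
        "total_records": len(records),
        "columns": list(records[0].keys()),
        "generated_at": datetime.now(timezone.utc).isoformat(),
    },
    "data": records,
}
# "flat" would emit a single parsed object directly (e.g. a config dict).
```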


🧠 Smart Features

  • Auto-Detection: Every mode has an auto strategy that detects the input format
  • Type Casting: Strings like "42", "true", "null" are automatically cast to native types
  • Key Normalization: All field names are converted to snake_case
  • Merged Cell Handling: Forward-fill support for Excel files with merged cells
  • Pagination: Built-in page/pageSize support for large Excel datasets
  • Metadata: Wrapped output includes source file, column names, timestamps, and record counts
  • Error Resilience: Failed URLs are logged with _error fields instead of crashing
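The type-casting and key-normalization behaviors described above can be sketched as two small helpers. These are illustrative stand-ins (`normalize_key` and `cast_value` are hypothetical names, not the actor's actual functions):

```python
import re

def normalize_key(key: str) -> str:
    """Convert a header like 'Total Revenue (USD)' to snake_case."""
    key = re.sub(r"[^0-9a-zA-Z]+", "_", key.strip())
    return re.sub(r"_+", "_", key).strip("_").lower()

def cast_value(value: str):
    """Cast string literals like '42', 'true', 'null' to native types."""
    lowered = value.strip().lower()
    if lowered == "null":
        return None
    if lowered in ("true", "false"):
        return lowered == "true"
    try:
        return int(value)
    except ValueError:
        pass
    try:
        return float(value)
    except ValueError:
        return value  # not a literal; keep the original string

# normalize_key("Total Revenue (USD)") == "total_revenue_usd"
# cast_value("42") == 42, cast_value("true") is True, cast_value("null") is None
```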

📂 Project Structure

```
apify-actor/
├── .actor/
│   ├── actor.json         # Actor configuration
│   └── input_schema.json  # Input schema with defaults
├── main.py                # Actor entry point
├── data_transformer.py    # Core transformation engine
├── requirements.txt       # Python dependencies
├── Dockerfile             # Container build instructions
└── README.md              # This file
```

🏗️ Local Development

```shell
# Install dependencies
pip install -r requirements.txt

# Run locally with the Apify CLI
apify run --input='{"mode": "messy_text", "inputText": "Name: Alice | Age: 30"}'
```

📜 License

ISC