Pricing: from $10.00 / 1,000 results
Developer: Jamshaid Arif (Maintained by Community)
Last modified: 7 days ago

🧹 Data Cleaning & Transformation Toolkit — Apify Actor

A powerful, multi-mode Apify actor that transforms messy, unstructured data into clean, structured JSON — ready for APIs, databases, or downstream processing.


🎯 What It Does

| Mode | Input | Output |
|---|---|---|
| Messy Text | Inconsistent text with mixed delimiters | Clean JSON records |
| Excel / CSV | `.xlsx` or `.csv` file URL | API-ready JSON with metadata |
| HTML Scrape | Raw HTML or live URLs | Structured dataset (tables, elements, links) |
| Key-Value | `.env`, `.ini`, logs, YAML-like text | Parsed JSON object or records |
| URL Fetch | Any webpage URL | Auto-extracted structured data |

🚀 Quick Start Examples

1. Messy Text → JSON

```json
{
  "mode": "messy_text",
  "inputText": "Name: John Doe | Age: 29 | City: New York\nName: Jane Smith | Age: 32 | City: LA",
  "textParseStrategy": "auto",
  "outputFormat": "records"
}
```

Output:

```json
[
  {"name": "John Doe", "age": 29, "city": "New York"},
  {"name": "Jane Smith", "age": 32, "city": "LA"}
]
```
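The pipe-delimited case above can be sketched in a few lines of plain Python. This is a simplified illustration only; the actor's `auto` strategy presumably also detects other delimiters and applies fuller type casting than shown here:

```python
def parse_messy_line(line: str, delimiter: str = "|") -> dict:
    """Split a line like 'Name: John Doe | Age: 29' into a typed record."""
    record = {}
    for part in line.split(delimiter):
        key, _, value = part.partition(":")
        if not key.strip():
            continue
        value = value.strip()
        # Cast purely numeric strings to integers, leave the rest as text
        record[key.strip().lower()] = int(value) if value.isdigit() else value
    return record

text = "Name: John Doe | Age: 29 | City: New York\nName: Jane Smith | Age: 32 | City: LA"
records = [parse_messy_line(line) for line in text.splitlines()]
# records[0] == {"name": "John Doe", "age": 29, "city": "New York"}
```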

2. Excel / CSV → API-Ready JSON

```json
{
  "mode": "excel_csv",
  "fileUrl": "https://example.com/data/sales_report.xlsx",
  "sheetName": "Q1",
  "skipEmptyRows": true,
  "outputFormat": "wrapped"
}
```

Output:

```json
{
  "meta": {
    "source": "sales_report.xlsx",
    "sheet": "Q1",
    "total_records": 150,
    "columns": ["id", "product", "revenue"],
    "generated_at": "2026-04-01T12:00:00"
  },
  "data": [
    {"id": 1, "product": "Widget A", "revenue": 9999.50}
  ]
}
```

3. Scrape a Website

```json
{
  "mode": "html_scrape",
  "urls": [{"url": "https://books.toscrape.com/"}],
  "htmlExtractMode": "elements",
  "cssSelector": "article.product_pod",
  "fieldMap": "{\"title\": \"h3 a\", \"price\": \".price_color\"}"
}
```
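Conceptually, a `fieldMap` pairs each output key with a CSS selector scoped to the repeating element. A minimal sketch of that idea, assuming BeautifulSoup as the HTML parser (the actor's actual extraction code is not shown in this README):

```python
import json

from bs4 import BeautifulSoup

html = """
<article class="product_pod"><h3><a>A Light in the Attic</a></h3>
<p class="price_color">£51.77</p></article>
"""

field_map = json.loads('{"title": "h3 a", "price": ".price_color"}')
soup = BeautifulSoup(html, "html.parser")

records = []
for element in soup.select("article.product_pod"):
    # Look up each mapped selector inside the repeating element
    record = {
        key: (node.get_text(strip=True) if (node := element.select_one(sel)) else None)
        for key, sel in field_map.items()
    }
    records.append(record)
# records == [{"title": "A Light in the Attic", "price": "£51.77"}]
```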

4. Parse Config Files

```json
{
  "mode": "key_value",
  "inputText": "[database]\nhost = localhost\nport = 5432\n\n[cache]\ndriver = redis\nttl = 3600",
  "kvFormat": "auto",
  "outputFormat": "flat"
}
```
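For the INI input above, the `flat` output collapses sections into a single object. A hedged sketch using the standard-library `configparser` (the actor's internals may differ, and the `section_key` flattening scheme here is an assumption):

```python
import configparser

ini_text = "[database]\nhost = localhost\nport = 5432\n\n[cache]\ndriver = redis\nttl = 3600"

parser = configparser.ConfigParser()
parser.read_string(ini_text)

flat = {}
for section in parser.sections():
    for key, value in parser.items(section):
        # Flatten as section_key and cast integer strings to native ints
        flat[f"{section}_{key}"] = int(value) if value.isdigit() else value
# flat == {"database_host": "localhost", "database_port": 5432,
#          "cache_driver": "redis", "cache_ttl": 3600}
```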

5. Auto-Extract from URL

```json
{
  "mode": "url_fetch",
  "urls": [{"url": "https://en.wikipedia.org/wiki/Web_scraping"}],
  "outputFormat": "records"
}
```

⚙️ Input Schema Reference

Core Settings

| Field | Type | Default | Description |
|---|---|---|---|
| `mode` | enum | `messy_text` | Transformation mode |
| `inputText` | string | (sample data) | Raw text input |
| `fileUrl` | string | `""` | URL to download a file from |
| `urls` | array | `[{url: "https://books.toscrape.com/"}]` | URLs to scrape |
| `outputFormat` | enum | `records` | Output structure: `records`, `wrapped`, or `flat` |

Text Mode Options

| Field | Type | Default | Description |
|---|---|---|---|
| `textParseStrategy` | enum | `auto` | `auto`, `delimited`, `key_value`, or `block` |

Key-Value Mode Options

| Field | Type | Default | Description |
|---|---|---|---|
| `kvFormat` | enum | `auto` | `auto`, `env`, `ini`, `log`, or `yaml` |

HTML Mode Options

| Field | Type | Default | Description |
|---|---|---|---|
| `htmlExtractMode` | enum | `auto` | `tables`, `elements`, `links`, `text`, or `auto` |
| `cssSelector` | string | `article.product_pod` | CSS selector for repeating elements |
| `fieldMap` | JSON string | (book fields) | Maps output keys to CSS selectors |

Excel Mode Options

| Field | Type | Default | Description |
|---|---|---|---|
| `sheetName` | string | `""` | Specific sheet (empty = all) |
| `skipEmptyRows` | boolean | `true` | Remove blank rows |
| `forwardFillColumns` | string | `""` | Comma-separated columns to forward-fill |
| `pageSize` | integer | `0` | Records per page (0 = no pagination) |
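Forward-filling matters because merged Excel cells are read back as blanks below the first row. The effect of `forwardFillColumns` can be illustrated with pandas (an assumption; this README does not state which library the actor uses for spreadsheet handling):

```python
import pandas as pd

# Merged cells in the "region" column come through as missing values
df = pd.DataFrame({
    "region": ["North", None, "South", None],
    "revenue": [100, 200, 300, 400],
})

# Forward-fill only the named column, as forwardFillColumns would
df["region"] = df["region"].ffill()
# df["region"].tolist() == ["North", "North", "South", "South"]
```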

Network Options

| Field | Type | Default | Description |
|---|---|---|---|
| `proxyConfiguration` | object | `{useApifyProxy: true}` | Proxy settings |
| `maxRequestRetries` | integer | `3` | Max retries for HTTP requests |

📤 Output Formats

records (default)

Each extracted record becomes its own row in the Apify dataset. Best for large datasets and downstream processing.

wrapped

A single dataset entry with meta (source info, column names, timestamps) and data (array of records). Best for API responses.

flat

Outputs the parsed object directly. Ideal for config file parsing where you want a single JSON object.
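The difference between the formats is easiest to see on one dataset. A small illustration, reusing the field names from the wrapped example earlier in this README (the exact `meta` keys the actor emits are taken from that example):

```python
from datetime import datetime, timezone

records = [{"id": 1, "product": "Widget A", "revenue": 9999.50}]

# "records": each dict would be pushed as its own dataset row.
# "wrapped": one entry bundling metadata with the data array:
wrapped = {
    "meta": {
        "source": "sales_report.xlsx",
        "total_records": len(records),
        "columns": list(records[0].keys()),
        "generated_at": datetime.now(timezone.utc).isoformat(),
    },
    "data": records,
}
# "flat" would emit a single parsed object directly (e.g. a config dict).
```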


🧠 Smart Features

  • Auto-Detection: Every mode has an auto strategy that detects the input format
  • Type Casting: Strings like "42", "true", "null" are automatically cast to native types
  • Key Normalization: All field names are converted to snake_case
  • Merged Cell Handling: Forward-fill support for Excel files with merged cells
  • Pagination: Built-in page/pageSize support for large Excel datasets
  • Metadata: Wrapped output includes source file, column names, timestamps, and record counts
  • Error Resilience: Failed URLs are logged with _error fields instead of crashing
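The type-casting and key-normalization behaviors described above can be sketched as two small helpers. These are illustrative stand-ins (`normalize_key` and `cast_value` are hypothetical names, not the actor's actual functions):

```python
import re

def normalize_key(key: str) -> str:
    """Convert a header like 'Total Revenue (USD)' to snake_case."""
    key = re.sub(r"[^0-9a-zA-Z]+", "_", key.strip())
    return re.sub(r"_+", "_", key).strip("_").lower()

def cast_value(value: str):
    """Cast string literals like '42', 'true', 'null' to native types."""
    lowered = value.strip().lower()
    if lowered == "null":
        return None
    if lowered in ("true", "false"):
        return lowered == "true"
    try:
        return int(value)
    except ValueError:
        pass
    try:
        return float(value)
    except ValueError:
        return value  # not a literal; keep the original string

# normalize_key("Total Revenue (USD)") == "total_revenue_usd"
# cast_value("42") == 42, cast_value("true") is True, cast_value("null") is None
```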

📂 Project Structure

```
apify-actor/
├── .actor/
│   ├── actor.json         # Actor configuration
│   └── input_schema.json  # Input schema with defaults
├── main.py                # Actor entry point
├── data_transformer.py    # Core transformation engine
├── requirements.txt       # Python dependencies
├── Dockerfile             # Container build instructions
└── README.md              # This file
```

🏗️ Local Development

```shell
# Install dependencies
pip install -r requirements.txt

# Run locally with the Apify CLI
apify run --input='{"mode": "messy_text", "inputText": "Name: Alice | Age: 30"}'
```

📜 License

ISC