# Lunar Data Cleaner (`liuyu.digitaltwin/lunar-data-cleaner`) Actor

Clean CSV, Excel, or JSON files with audit trails, PII masking, and budget control. Remove duplicates, fix missing values, standardize dates/numbers, and get quality reports.

- **URL**: https://apify.com/liuyu.digitaltwin/lunar-data-cleaner.md
- **Developed by:** [Yu Liu](https://apify.com/liuyu.digitaltwin) (community)
- **Categories:** Automation, Developer tools, Lead generation
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

From $0.03 / 1,000 rows cleaned

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event
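
As a rough illustration of what the per-row price means for a run, the sketch below estimates the charge for a given row count. It assumes the listed starting price of $0.03 per 1,000 rows applies uniformly and that partial thousands are billed proportionally; the helper name `estimate_cost_usd` is hypothetical, and actual event pricing may differ.

```python
from decimal import Decimal

# Assumption: the "from $0.03 / 1,000 rows" starting price applies to every row.
PRICE_PER_1000_ROWS_USD = Decimal("0.03")


def estimate_cost_usd(rows: int) -> Decimal:
    """Estimate the charge for cleaning `rows` rows (hypothetical helper)."""
    if rows < 0:
        raise ValueError("rows must be non-negative")
    return PRICE_PER_1000_ROWS_USD * Decimal(rows) / Decimal(1000)


# 100,000 rows at the assumed starting price
print(estimate_cost_usd(100_000))
```

`Decimal` is used here so that small currency amounts are not distorted by binary floating-point rounding.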

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## General Data Cleaner

Automatically clean CSV, Excel, and JSON files with full audit trails, PII masking, and budget control. Remove duplicates, fix missing values, standardize dates and numbers, and get quality reports — all in one step.

### Why use this Actor?

- Save time – Stop manually fixing spreadsheets. Let the cleaner handle missing values, duplicates, outliers, and format inconsistencies.
- Audit ready – Every change is logged. You get task-level and rule-level CSV reports, plus an HTML quality report with ISO 8000 scores.
- Privacy safe – Enable PII masking to automatically redact SSNs, credit card numbers, and emails (irreversible).
- Cost control – Set your budget limit; the Actor stops when it is reached (hard cap $10.00).
- Preview mode – Test with the first N rows before cleaning your full dataset.

### Input Parameters (JSON)

Provide input as a JSON object. Example:
```json
{
  "sourceData": "https://example.com/my-data.csv",
  "delimiter": "auto",
  "encoding": "auto",
  "auditLevel": "rule",
  "maxChargeUsd": 5.0,
  "previewMode": false,
  "previewRows": 100,
  "enablePiiMasking": false,
  "piiMaskingRules": "",
  "cellErrorPolicy": "skip_cell",
  "outputFormat": "csv",
  "locale": "US"
}
````

#### Parameter reference

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| sourceData | string | Yes | – | URL, local path, or Apify Dataset ID (e.g., dataset-username/dataset-name) of the input file. Supports CSV, Excel (.xlsx, .xls), JSON. |
| delimiter | string | No | auto | Column delimiter: auto (detect), ",", ";", "\t", "\|". |
| encoding | string | No | auto | File encoding: auto (detect), "utf-8", "iso-8859-1", etc. |
| auditLevel | string | No | rule | Audit granularity: "task" (only run summary) or "rule" (per cleaning rule). |
| maxChargeUsd | number | No | 5.0 | Budget limit in USD (1 to 10). The run stops at min(this value, system hard cap of $10.00). |
| previewMode | boolean | No | false | If true, process only the first previewRows rows. |
| previewRows | integer | No | 100 | Number of rows to process when previewMode=true (max 10,000). |
| enablePiiMasking | boolean | No | false | If true, mask SSN, credit cards, emails (irreversible). |
| piiMaskingRules | string | No | "" | Comma-separated column names or regex patterns for additional masking (e.g., "phone,custom\_id"). |
| cellErrorPolicy | string | No | skip\_cell | How to handle cell conversion errors: "skip\_cell" (keep original, continue) or "stop\_rule" (fail the rule). |
| outputFormat | string | No | csv | Output format: "csv", "excel", or "json". |
| locale | string | No | US | Date and number format: "US" (MM/DD/YYYY, 1,234.56) or "EU" (DD/MM/YYYY, 1.234,56). |
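
To make the table concrete, here is a minimal sketch that merges overrides into the documented defaults and rejects values outside the documented choices before a run is started. The helper `build_input` and the constant names are hypothetical and not part of the Actor; the lower bound of 1 for `previewRows` is taken from the OpenAPI specification further down.

```python
import json

# Defaults and allowed values copied from the parameter table above.
DEFAULTS = {
    "delimiter": "auto",
    "encoding": "auto",
    "auditLevel": "rule",
    "maxChargeUsd": 5.0,
    "previewMode": False,
    "previewRows": 100,
    "enablePiiMasking": False,
    "piiMaskingRules": "",
    "cellErrorPolicy": "skip_cell",
    "outputFormat": "csv",
    "locale": "US",
}
ALLOWED = {
    "delimiter": {"auto", ",", ";", "\t", "|"},
    "auditLevel": {"task", "rule"},
    "cellErrorPolicy": {"skip_cell", "stop_rule"},
    "outputFormat": {"csv", "excel", "json"},
    "locale": {"US", "EU"},
}


def build_input(source_data: str, **overrides) -> dict:
    """Merge overrides into the documented defaults and check them (hypothetical helper)."""
    if not source_data:
        raise ValueError("sourceData is required")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    actor_input = {"sourceData": source_data, **DEFAULTS, **overrides}
    for name, allowed in ALLOWED.items():
        if actor_input[name] not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}")
    if not 1 <= actor_input["previewRows"] <= 10_000:
        raise ValueError("previewRows must be between 1 and 10,000")
    return actor_input


# Example: a semicolon-delimited file with European date and number formats.
eu_input = build_input("https://example.com/my-data.csv", locale="EU", delimiter=";")
print(json.dumps(eu_input, indent=2))
```

Checking the input locally is optional; it simply surfaces typos in parameter names or values before a paid run is started.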

### Output Files

After a successful run, you will find the following files:

| File | Location | Description |
|------|----------|-------------|
| Cleaned data | Apify Dataset | The cleaned dataset in your chosen format (CSV/Excel/JSON). |
| Task audit | Key-Value Store → audit\_task\_{{run.id}}.csv | One row per run: session\_id, timestamps, exit reason, budget used, preview mode flag. |
| Rule audit | Key-Value Store → audit\_rules\_{{run.id}}.csv | One row per cleaning rule: rule\_id, affected rows, execution time, status (only when auditLevel=rule). |
| Skipped rows | Key-Value Store → audit\_skipped\_{{run.id}}.csv | Rows that could not be parsed (e.g., encoding errors, column mismatches). |
| Quality report (HTML) | Key-Value Store → quality\_report\_{{run.id}}.html | Human‑readable report with ISO 8000 dimension scores (completeness, accuracy, consistency, format). |
| Quality report (JSON) | Key-Value Store → quality\_report\_{{run.id}}.json | Same data in JSON format. |
| Error report | Key-Value Store → errors\_{{run.id}}.json | Detailed error information (if any). |
| Debug log | Key-Value Store → debug\_log\_{{run.id}}.txt | Last 1000 log lines (saved only if an error occurs). |

> Tip: The first record in the Dataset is an OUTPUT\_SUMMARY that lists all the above keys. You can also access the Key-Value Store directly via the Apify Console.
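
If you prefer to compute the key names yourself, the following sketch derives them from a run ID. It assumes the `{{run.id}}` placeholder in the table is replaced verbatim by the run ID; the helper `output_keys` is hypothetical, and the OUTPUT_SUMMARY record remains the authoritative list.

```python
def output_keys(run_id: str, audit_level: str = "rule") -> dict:
    """Return the expected Key-Value Store keys for one run (hypothetical helper)."""
    keys = {
        "task_audit": f"audit_task_{run_id}.csv",
        "skipped_rows": f"audit_skipped_{run_id}.csv",
        "quality_report_html": f"quality_report_{run_id}.html",
        "quality_report_json": f"quality_report_{run_id}.json",
        "error_report": f"errors_{run_id}.json",  # present only if errors occurred
        "debug_log": f"debug_log_{run_id}.txt",   # saved only if an error occurs
    }
    if audit_level == "rule":
        # The rule audit is written only when auditLevel=rule.
        keys["rule_audit"] = f"audit_rules_{run_id}.csv"
    return keys


for name, key in output_keys("RUN_ID_PLACEHOLDER").items():
    print(f"{name}: {key}")
```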

### Usage Examples

#### 1. Basic cleaning (run with defaults)

The input is prefilled with an example CSV as sourceData, so you can run the Actor with its defaults. It cleans the example file and outputs the results.

#### 2. Clean your own file from a URL

Set sourceData to the URL of your CSV/Excel/JSON file.

#### 3. Preview mode (test before full run)

Set previewMode to true and previewRows to e.g. 100.

#### 4. Enable PII masking

Set enablePiiMasking to true and optionally piiMaskingRules.
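
To give a feel for what irreversible masking does to a cell, here is an illustrative sketch using regular expressions. The patterns, the `[REDACTED]` replacement text, and the helper `mask_pii` are assumptions made for this example; the Actor's actual detection rules and output format are not documented here and may differ.

```python
import re

# Illustrative patterns only; not the Actor's real rules.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "credit_card": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def mask_pii(text: str, replacement: str = "[REDACTED]") -> str:
    """Replace every match of every pattern; the original value is not recoverable."""
    for pattern in PII_PATTERNS.values():
        text = pattern.sub(replacement, text)
    return text


print(mask_pii("Contact jane.doe@example.com, SSN 123-45-6789"))
```

Because the matched text is overwritten rather than encrypted, keep an unmasked copy of the source file if you may need the original values later.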

#### 5. European locale (EU)

Set locale to "EU" and delimiter to ";" if needed.
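
The difference between the two locales can be sketched as follows. The date and number formats come from the `locale` parameter description; the parsing functions themselves are hypothetical and only illustrate how the same text is read differently, not how the Actor is implemented.

```python
from datetime import date, datetime

# Formats taken from the locale parameter description.
LOCALE_FORMATS = {
    "US": {"date": "%m/%d/%Y", "thousands": ",", "decimal": "."},
    "EU": {"date": "%d/%m/%Y", "thousands": ".", "decimal": ","},
}


def parse_number(text: str, locale: str = "US") -> float:
    """Strip the thousands separator, then normalize the decimal mark."""
    fmt = LOCALE_FORMATS[locale]
    return float(text.replace(fmt["thousands"], "").replace(fmt["decimal"], "."))


def parse_date(text: str, locale: str = "US") -> date:
    """Parse a slash-separated date using the locale's day/month order."""
    return datetime.strptime(text, LOCALE_FORMATS[locale]["date"]).date()


# The same string is a different date depending on the locale.
print(parse_date("03/04/2024", "US"), parse_date("03/04/2024", "EU"))
print(parse_number("1,234.56", "US"), parse_number("1.234,56", "EU"))
```

Ambiguous dates such as 03/04/2024 are the main reason to set `locale` explicitly rather than rely on the default.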

### How to Use (Apify Console)

1. Go to the Actor page.
2. In the Input tab, switch to JSON mode (or use the form).
3. Paste your JSON configuration (see examples above).
4. Click Start.
5. Download the cleaned dataset from the Dataset tab, and audit/quality reports from the Key-Value Store.

### Bugs, fixes, updates, and changelog

This product is under active development. If you encounter any issues, have feature requests, or would like to provide feedback, please open an issue on our [GitHub repository](https://github.com/yuliu-digitaltwin/apify-lunar-data-cleaner/issues).

### Support

Email: liuyu.digitaltwin@outlook.com

Please include your session\_id (found in the Actor run log or task audit CSV) when reporting issues.

# Actor input Schema

## `sourceData` (type: `string`):

URL or local path to CSV/Excel/JSON file

## `delimiter` (type: `string`):

Column delimiter (auto-detect or manual)

## `encoding` (type: `string`):

File encoding (auto-detect or manual, e.g., utf-8, iso-8859-1)

## `auditLevel` (type: `string`):

Detailed audit logs: task only or rule-level

## `maxChargeUsd` (type: `number`):

Budget limit (soft cap, system hard cap is $10.00)
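
The parameter reference describes the stop point as the smaller of `maxChargeUsd` and the system hard cap. A minimal worked example, with the hypothetical helper `effective_budget_usd`:

```python
SYSTEM_HARD_CAP_USD = 10.0  # hard cap stated in the parameter description


def effective_budget_usd(max_charge_usd: float = 5.0) -> float:
    """Return the budget at which a run would stop (hypothetical helper)."""
    return min(max_charge_usd, SYSTEM_HARD_CAP_USD)


print(effective_budget_usd())      # default soft cap
print(effective_budget_usd(25.0))  # a larger request is limited by the hard cap
```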

## `previewMode` (type: `boolean`):

Process only first N rows to test

## `previewRows` (type: `integer`):

Number of rows in preview mode

## `enablePiiMasking` (type: `boolean`):

Mask SSN, credit cards, emails (irreversible)

## `piiMaskingRules` (type: `string`):

Comma-separated column names or regex patterns (e.g., phone,ssn)

## `cellErrorPolicy` (type: `string`):

How to handle cell conversion errors
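
The two policies from the parameter reference, "skip\_cell" (keep original, continue) and "stop\_rule" (fail the rule), can be illustrated with a small sketch. The function `apply_conversion` is hypothetical and only models the documented behavior; it is not the Actor's code.

```python
from typing import Any, Callable, List


def apply_conversion(
    cells: List[Any],
    convert: Callable[[Any], Any],
    policy: str = "skip_cell",
) -> List[Any]:
    """Convert each cell, handling failures according to the policy (illustrative only)."""
    if policy not in ("skip_cell", "stop_rule"):
        raise ValueError(f"unknown policy: {policy}")
    result = []
    for cell in cells:
        try:
            result.append(convert(cell))
        except (ValueError, TypeError) as exc:
            if policy == "stop_rule":
                # Fail the whole rule on the first bad cell.
                raise RuntimeError(f"rule failed on cell {cell!r}") from exc
            # skip_cell: keep the original value and continue.
            result.append(cell)
    return result


# With skip_cell, the unparseable "n/a" is left unchanged.
print(apply_conversion(["1.5", "n/a", "3"], float))
```

With `stop_rule`, the same input would raise on "n/a" instead of returning a partially converted column.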

## `outputFormat` (type: `string`):

Format of the cleaned output

## `locale` (type: `string`):

Date and number format (US or European)

## Actor input object example

```json
{
  "sourceData": "https://raw.githubusercontent.com/yuliu-digitaltwin/apify-lunar-data-cleaner/main/example-data.csv",
  "delimiter": "auto",
  "encoding": "auto",
  "auditLevel": "rule",
  "maxChargeUsd": 5,
  "previewMode": false,
  "previewRows": 100,
  "enablePiiMasking": false,
  "cellErrorPolicy": "skip_cell",
  "outputFormat": "csv",
  "locale": "US"
}
```

# Actor output Schema

## `cleaned_data` (type: `string`):

The main dataset after cleaning (CSV/Excel/JSON).

## `task_audit` (type: `string`):

Task-level audit CSV.

## `rule_audit` (type: `string`):

Rule-level audit CSV (if auditLevel=rule).

## `skipped_rows` (type: `string`):

Rows skipped due to parsing errors.

## `quality_report_html` (type: `string`):

Quality report in HTML format.

## `quality_report_json` (type: `string`):

Quality report in JSON format.

## `error_report` (type: `string`):

Error report (if any).

## `debug_log` (type: `string`):

Debug log (if error occurred).

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

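// Prepare Actor input (sourceData is required; this URL is the schema's example CSV)
const input = {
    sourceData: 'https://raw.githubusercontent.com/yuliu-digitaltwin/apify-lunar-data-cleaner/main/example-data.csv',
    outputFormat: 'csv',
    locale: 'US',
};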

// Run the Actor and wait for it to finish
const run = await client.actor("liuyu.digitaltwin/lunar-data-cleaner").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

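# Prepare the Actor input (sourceData is required; this URL is the schema's example CSV)
run_input = {
    "sourceData": "https://raw.githubusercontent.com/yuliu-digitaltwin/apify-lunar-data-cleaner/main/example-data.csv",
    "outputFormat": "csv",
    "locale": "US",
}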

# Run the Actor and wait for it to finish
run = client.actor("liuyu.digitaltwin/lunar-data-cleaner").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
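echo '{"sourceData": "https://raw.githubusercontent.com/yuliu-digitaltwin/apify-lunar-data-cleaner/main/example-data.csv"}' |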
apify call liuyu.digitaltwin/lunar-data-cleaner --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=liuyu.digitaltwin/lunar-data-cleaner",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Lunar Data Cleaner",
        "description": "Clean CSV, Excel, or JSON files with audit trails, PII masking, and budget control. Remove duplicates, fix missing values, standardize dates/numbers, and get quality reports.",
        "version": "0.0",
        "x-build-id": "Eh2pF5M6BNCqrZGUJ"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/liuyu.digitaltwin~lunar-data-cleaner/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-liuyu.digitaltwin-lunar-data-cleaner",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/liuyu.digitaltwin~lunar-data-cleaner/runs": {
            "post": {
                "operationId": "runs-sync-liuyu.digitaltwin-lunar-data-cleaner",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/liuyu.digitaltwin~lunar-data-cleaner/run-sync": {
            "post": {
                "operationId": "run-sync-liuyu.digitaltwin-lunar-data-cleaner",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "sourceData"
                ],
                "properties": {
                    "sourceData": {
                        "title": "Source Data",
                        "type": "string",
                        "description": "URL or local path to CSV/Excel/JSON file",
                        "default": "https://raw.githubusercontent.com/yuliu-digitaltwin/apify-lunar-data-cleaner/main/example-data.csv"
                    },
                    "delimiter": {
                        "title": "Delimiter",
                        "enum": [
                            "auto",
                            ",",
                            ";",
                            "\t",
                            "|"
                        ],
                        "type": "string",
                        "description": "Column delimiter (auto-detect or manual)",
                        "default": "auto"
                    },
                    "encoding": {
                        "title": "Encoding",
                        "type": "string",
                        "description": "File encoding (auto-detect or manual, e.g., utf-8, iso-8859-1)",
                        "default": "auto"
                    },
                    "auditLevel": {
                        "title": "Audit Level",
                        "enum": [
                            "task",
                            "rule"
                        ],
                        "type": "string",
                        "description": "Detailed audit logs: task only or rule-level",
                        "default": "rule"
                    },
                    "maxChargeUsd": {
                        "title": "Max Charge (USD)",
                        "minimum": 1,
                        "maximum": 10,
                        "type": "number",
                        "description": "Budget limit (soft cap, system hard cap is $10.00)",
                        "default": 5
                    },
                    "previewMode": {
                        "title": "Preview Mode",
                        "type": "boolean",
                        "description": "Process only first N rows to test",
                        "default": false
                    },
                    "previewRows": {
                        "title": "Preview Rows",
                        "minimum": 1,
                        "maximum": 10000,
                        "type": "integer",
                        "description": "Number of rows in preview mode",
                        "default": 100
                    },
                    "enablePiiMasking": {
                        "title": "Enable PII Masking",
                        "type": "boolean",
                        "description": "Mask SSN, credit cards, emails (irreversible)",
                        "default": false
                    },
                    "piiMaskingRules": {
                        "title": "Custom PII Rules",
                        "type": "string",
                        "description": "Comma-separated column names or regex patterns (e.g., phone,ssn)"
                    },
                    "cellErrorPolicy": {
                        "title": "Cell Error Policy",
                        "enum": [
                            "skip_cell",
                            "stop_rule"
                        ],
                        "type": "string",
                        "description": "How to handle cell conversion errors",
                        "default": "skip_cell"
                    },
                    "outputFormat": {
                        "title": "Output Format",
                        "enum": [
                            "csv",
                            "excel",
                            "json"
                        ],
                        "type": "string",
                        "description": "Format of the cleaned output",
                        "default": "csv"
                    },
                    "locale": {
                        "title": "Locale",
                        "enum": [
                            "US",
                            "EU"
                        ],
                        "type": "string",
                        "description": "Date and number format (US or European)",
                        "default": "US"
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
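
For projects that call the REST API directly, the sketch below builds the POST request for the `run-sync-get-dataset-items` path from the specification above, using only the Python standard library. The helper `build_run_request` is hypothetical; the request is constructed but not sent, so you can inspect it before spending any budget.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/liuyu.digitaltwin~lunar-data-cleaner/run-sync-get-dataset-items"


def build_run_request(token: str, actor_input: dict) -> Request:
    """Build (but do not send) the POST request described in the spec (hypothetical helper)."""
    # The spec passes the token as a required query parameter.
    url = f"{API_BASE}{ACTOR_PATH}?{urlencode({'token': token})}"
    body = json.dumps(actor_input).encode("utf-8")
    return Request(url, data=body, headers={"Content-Type": "application/json"}, method="POST")


request = build_run_request(
    "<YOUR_API_TOKEN>",
    {"sourceData": "https://example.com/my-data.csv"},
)
print(request.get_method(), request.full_url)
# To send it, pass the request to urllib.request.urlopen and read the response body.
```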
