Data Cleaner
Pricing
from $1.49 / 1,000 items cleaned
Developer
ParseBird
Clean messy data — remove nulls, normalize case, trim whitespace, format phone numbers and emails, extract domains, convert types, and more. Works with Apify datasets or direct JSON input. The first general-purpose data cleaner on Apify.
Clean messy scraped data in one step: trim whitespace, normalize casing, format phone numbers to E.164, lowercase emails, extract domains from URLs, convert strings to numbers, remove null rows, and deduplicate.
ParseBird Infra Suite: utility tools for data pipelines
- HTTP Request: send API calls from the cloud
- Data Deduplicator: merge and deduplicate datasets by any field
- Data Cleaner: this Actor
Copy to your AI assistant
Copy this block into ChatGPT, Claude, Cursor, or any LLM to start using this actor.
parsebird/data-cleaner on Apify. Call: ApifyClient("TOKEN").actor("parsebird/data-cleaner").call(run_input={...}), then client.dataset(run["defaultDatasetId"]).list_items().items for cleaned results. Key inputs: datasetId (string, Apify dataset ID), jsonData (array of objects, direct JSON input), operations (array of {field, action, options} — required), outputDatasetId (string, optional), maxItems (integer, default 1000000). Actions: trim_whitespace, normalize_case (options: {case: "lower"|"upper"|"title"}), format_email, format_phone (options: {countryCode: "US"}), extract_domain, to_number, to_date, fill_nulls (options: {value: "..."}), remove_nulls, remove_duplicates, replace_value (options: {find, replace}). Full actor spec: fetch build via GET https://api.apify.com/v2/acts/parsebird~data-cleaner (Bearer TOKEN). Get token: https://console.apify.com/account/integrations
What does Data Cleaner do?
This Actor takes messy scraped or imported data and applies a configurable pipeline of cleaning operations. Each operation targets a specific field and transforms its values — trimming whitespace, normalizing case, formatting phone numbers, and more.
Use cases:
- CRM cleanup — normalize names, emails, and phone numbers before import
- Lead list hygiene — remove rows with missing emails, deduplicate by company
- Post-scrape processing — extract domains from URLs, convert price strings to numbers
- Data pipeline prep — standardize data format before analysis or export
Supported operations
| Action | Description | Options | Before | After |
|---|---|---|---|---|
| trim_whitespace | Remove leading/trailing spaces | - | " John Doe " | "John Doe" |
| normalize_case | Convert to lower/upper/title case | {"case": "title"} | "john doe" | "John Doe" |
| format_email | Lowercase and trim emails | - | " JOHN@CO.COM " | "john@co.com" |
| format_phone | Normalize to E.164 format | {"countryCode": "US"} | "(555) 123-4567" | "+15551234567" |
| extract_domain | Extract domain from URL or email | - | "https://www.example.com/page" | "example.com" |
| to_number | Convert string to number | - | "$1,234,567" | 1234567 |
| to_date | Parse date to ISO 8601 | - | "March 15, 2024" | "2024-03-15T00:00:00" |
| fill_nulls | Replace null/empty with default | {"value": "N/A"} | null | "N/A" |
| remove_nulls | Remove rows where field is null/empty | - | (row removed) | - |
| remove_duplicates | Deduplicate by this field | - | (duplicate removed) | - |
| replace_value | Find and replace text | {"find": "Inc.", "replace": "Inc"} | "Acme Inc." | "Acme Inc" |
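
To make the table concrete, here is a minimal Python sketch of how a few of these actions could behave on a single value. The function names mirror the action names, but the bodies are assumptions written only to reproduce the Before/After columns above. They are not the Actor's actual implementation, and real inputs will have edge cases this sketch ignores.

```python
import re
from urllib.parse import urlsplit


def trim_whitespace(value: str) -> str:
    """Remove leading and trailing whitespace."""
    return value.strip()


def normalize_case(value: str, case: str) -> str:
    """Convert to "lower", "upper", or "title" case (mirrors the case option)."""
    converters = {"lower": str.lower, "upper": str.upper, "title": str.title}
    return converters[case](value)


def format_email(value: str) -> str:
    """Trim and lowercase an email address."""
    return value.strip().lower()


def extract_domain(value: str) -> str:
    """Pull the domain out of a URL, a bare host, or an email address."""
    text = value.strip().lower()
    if "@" in text and "://" not in text:
        return text.rsplit("@", 1)[1]  # email: keep the part after "@"
    if "://" not in text:
        text = "//" + text  # let urlsplit treat a bare host as a network location
    host = urlsplit(text).hostname or ""
    return host[4:] if host.startswith("www.") else host


def to_number(value):
    """Strip currency symbols and separators; return the input unchanged if unparseable."""
    if not isinstance(value, str):
        return value
    cleaned = re.sub(r"[^0-9.\-]", "", value)
    try:
        number = float(cleaned)
    except ValueError:
        return value
    return int(number) if number.is_integer() else number
```

Returning the original value when parsing fails is a design choice made here to match the "not a number" row in the example further down; whether the Actor does the same internally is not stated.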
Input parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| datasetId | string | No* | - | Apify dataset ID to clean |
| jsonData | array | No* | - | Direct JSON array of objects to clean |
| operations | array | Yes | - | List of {field, action, options} cleaning operations |
| outputDatasetId | string | No | - | Named output dataset (defaults to run dataset) |
| maxItems | integer | No | 1000000 | Max items to process |
*Provide datasetId, jsonData, or both; at least one of them is required.
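
A quick pre-flight check can catch malformed input before a run is started. The sketch below is a hypothetical client-side helper, not part of the Actor: `validate_run_input` and `KNOWN_ACTIONS` are names invented here, and treating an empty `operations` list as an error is an assumption (the table only says the parameter is required).

```python
# Action names copied from the "Supported operations" table.
KNOWN_ACTIONS = {
    "trim_whitespace", "normalize_case", "format_email", "format_phone",
    "extract_domain", "to_number", "to_date", "fill_nulls",
    "remove_nulls", "remove_duplicates", "replace_value",
}


def validate_run_input(run_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the input looks usable."""
    problems = []
    # At least one data source must be present.
    if not run_input.get("datasetId") and not run_input.get("jsonData"):
        problems.append("provide datasetId, jsonData, or both")
    operations = run_input.get("operations")
    if not isinstance(operations, list) or not operations:
        problems.append("operations must be a non-empty list")
        return problems
    for index, op in enumerate(operations):
        if not isinstance(op, dict) or "field" not in op or "action" not in op:
            problems.append(f"operation {index}: needs 'field' and 'action'")
        elif op["action"] not in KNOWN_ACTIONS:
            problems.append(f"operation {index}: unknown action {op['action']!r}")
    return problems
```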
Operations format
Each operation is a JSON object with:
{
  "field": "email",
  "action": "format_email",
  "options": {}
}
Operations are applied in order. You can chain multiple operations on the same field:
[
  {"field": "name", "action": "trim_whitespace"},
  {"field": "name", "action": "normalize_case", "options": {"case": "title"}},
  {"field": "email", "action": "format_email"},
  {"field": "phone", "action": "format_phone", "options": {"countryCode": "US"}},
  {"field": "website", "action": "extract_domain"},
  {"field": "revenue", "action": "to_number"},
  {"field": "email", "action": "remove_nulls"}
]
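
Because operations run in order, the same set of operations can produce different results depending on how they are sequenced. The following is a minimal local sketch of an ordered pipeline, assuming value-level actions rewrite a field in place while remove_nulls and remove_duplicates filter rows. `apply_operations` and `VALUE_ACTIONS` are hypothetical names, only three value actions are implemented, and none of this is the Actor's own code.

```python
# Hypothetical value-level handlers: each takes (value, options) and returns a new value.
VALUE_ACTIONS = {
    "trim_whitespace": lambda value, options: value.strip(),
    "format_email": lambda value, options: value.strip().lower(),
    "normalize_case": lambda value, options: {
        "lower": str.lower, "upper": str.upper, "title": str.title,
    }[options["case"]](value),
}


def apply_operations(items, operations):
    """Apply each operation to every row, in the order given."""
    rows = [dict(item) for item in items]  # copy so the input is not mutated
    for op in operations:
        field, action = op["field"], op["action"]
        options = op.get("options") or {}
        if action == "remove_nulls":
            rows = [row for row in rows if row.get(field) not in (None, "")]
        elif action == "remove_duplicates":
            seen, unique = set(), []
            for row in rows:
                key = row.get(field)  # assumes hashable field values
                if key not in seen:
                    seen.add(key)
                    unique.append(row)
            rows = unique
        else:
            transform = VALUE_ACTIONS[action]
            for row in rows:
                if isinstance(row.get(field), str):
                    row[field] = transform(row[field], options)
    return rows


items = [{"email": " A@X.COM "}, {"email": "a@x.com"}, {"email": None}]
operations = [
    {"field": "email", "action": "format_email"},
    {"field": "email", "action": "remove_nulls"},
    {"field": "email", "action": "remove_duplicates"},
]
print(apply_operations(items, operations))  # [{'email': 'a@x.com'}]
```

If remove_duplicates ran before format_email in this sketch, the two spellings of the address would still differ at deduplication time and both rows would survive, which is why normalizing first is usually the safer order.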
Before and after example
Input (dirty data)
[
  {"name": " john doe ", "email": " JOHN@EXAMPLE.COM ", "phone": "(555) 123-4567", "website": "https://www.example.com/about", "revenue": "$1,234,567"},
  {"name": "JANE SMITH", "email": "Jane.Smith@Company.IO", "phone": "555.987.6543", "website": "info@company.io", "revenue": "2345678"},
  {"name": "", "email": null, "phone": "1-800-555-0199", "website": "company.io", "revenue": "$99.99"},
  {"name": "bob wilson", "email": "bob@test.com", "phone": "+14155550100", "website": "https://test.com/page?id=1", "revenue": "not a number"}
]
Output (cleaned data)
[
  {"name": "John Doe", "email": "john@example.com", "phone": "+15551234567", "website": "example.com", "revenue": 1234567},
  {"name": "Jane Smith", "email": "jane.smith@company.io", "phone": "+15559876543", "website": "company.io", "revenue": 2345678},
  {"name": "Bob Wilson", "email": "bob@test.com", "phone": "+14155550100", "website": "test.com", "revenue": "not a number"}
]
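Row 3 was removed because its email is null and remove_nulls targets the email field. In the remaining rows, names are title-cased, emails lowercased, phones normalized to E.164, and domains extracted. Revenues are converted to numbers where the value can be parsed; "not a number" in the last row is left unchanged.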
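
The phone values in this example can be reproduced with a very small normalizer. The sketch below is a naive, US-only illustration of E.164 formatting: `format_phone_us` is a hypothetical name, it handles only 10-digit numbers and 11-digit numbers starting with 1, and it says nothing about how the Actor parses phone numbers. For real data a dedicated phone-number library is usually the better choice.

```python
import re


def format_phone_us(value: str) -> str:
    """Naive E.164 normalization for US numbers; other inputs are returned unchanged."""
    digits = re.sub(r"\D", "", value)  # drop spaces, dots, dashes, parentheses, "+"
    if len(digits) == 10:
        return "+1" + digits  # national number without country code
    if len(digits) == 11 and digits.startswith("1"):
        return "+" + digits  # country code already present
    return value


for raw in ["(555) 123-4567", "555.987.6543", "+14155550100"]:
    print(raw, "->", format_phone_us(raw))
```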
How to use via API
Python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Start the Actor and wait for the run to finish.
run = client.actor("parsebird/data-cleaner").call(
    run_input={
        "datasetId": "YOUR_DATASET_ID",
        "operations": [
            {"field": "email", "action": "format_email"},
            {"field": "name", "action": "trim_whitespace"},
            {"field": "name", "action": "normalize_case", "options": {"case": "title"}},
            {"field": "phone", "action": "format_phone", "options": {"countryCode": "US"}},
        ],
    }
)

# Read the cleaned items from the run's default dataset.
items = client.dataset(run["defaultDatasetId"]).list_items().items
print(f"Cleaned items: {len(items)}")
cURL
curl -X POST "https://api.apify.com/v2/acts/parsebird~data-cleaner/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "jsonData": [{"name": " JOHN DOE ", "email": " JOHN@CO.COM "}],
    "operations": [
      {"field": "name", "action": "trim_whitespace"},
      {"field": "name", "action": "normalize_case", "options": {"case": "title"}},
      {"field": "email", "action": "format_email"}
    ]
  }'
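
For environments without curl or the Apify client, the same request can be assembled with the Python standard library. This is a sketch under a few assumptions: `build_run_request` is a hypothetical helper, the token is sent as a Bearer header instead of the `token` query parameter, and the request is only built, not sent.

```python
import json
import urllib.request

RUNS_URL = "https://api.apify.com/v2/acts/parsebird~data-cleaner/runs"


def build_run_request(token: str, run_input: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts a run."""
    return urllib.request.Request(
        RUNS_URL,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_run_request(
    "YOUR_API_TOKEN",
    {
        "jsonData": [{"name": " JOHN DOE ", "email": " JOHN@CO.COM "}],
        "operations": [
            {"field": "name", "action": "trim_whitespace"},
            {"field": "name", "action": "normalize_case", "options": {"case": "title"}},
            {"field": "email", "action": "format_email"},
        ],
    },
)
# To actually start the run, pass the request to urllib.request.urlopen(request).
```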
Output
Cleaned items retain their original structure. Run statistics are also saved to the key-value store under a stats key:
{
  "totalLoaded": 5000,
  "totalCleaned": 4800,
  "operationsApplied": 7,
  "fieldsCleaned": 5,
  "totalChanges": 15200
}
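
A few derived figures can make these stats easier to read at a glance. The sketch below assumes that totalCleaned is the number of rows remaining after cleaning, so that totalLoaded minus totalCleaned is the number of rows dropped; that interpretation is not confirmed by the text, and `summarize_stats` is a hypothetical helper.

```python
def summarize_stats(stats: dict) -> dict:
    """Derive a few ratios from the stats record (interpretation is an assumption)."""
    loaded = stats["totalLoaded"]
    cleaned = stats["totalCleaned"]
    return {
        "rowsDropped": loaded - cleaned,
        "keptRatio": cleaned / loaded if loaded else 0.0,
        "changesPerKeptRow": stats["totalChanges"] / cleaned if cleaned else 0.0,
    }


stats = {
    "totalLoaded": 5000,
    "totalCleaned": 4800,
    "operationsApplied": 7,
    "fieldsCleaned": 5,
    "totalChanges": 15200,
}
summary = summarize_stats(stats)
print(summary["rowsDropped"])  # 200
```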
Pricing
This Actor uses a pay-per-event pricing model.
| Event | Price per event | Price per 1,000 |
|---|---|---|
| items-cleaned | $0.00149 | $1.49 |
Charged per 1,000 items loaded. Platform compute costs are additional.
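
The event charge scales linearly with item count, so a rough estimate is a single multiplication. The sketch below uses the per-event price from the table and `Decimal` to avoid floating-point rounding. `estimate_event_cost` is a hypothetical helper; it assumes one event per item, takes the count of loaded items as the text above states, and does not include platform compute costs.

```python
from decimal import Decimal

PRICE_PER_ITEM = Decimal("0.00149")  # items-cleaned price per event, from the table


def estimate_event_cost(item_count: int) -> Decimal:
    """Estimate pay-per-event charges in USD, excluding platform compute."""
    return PRICE_PER_ITEM * item_count


print(estimate_event_cost(1000) == Decimal("1.49"))  # True
```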