Data Cleaner

Pricing

from $1.49 / 1,000 items cleaned

Clean messy data — remove nulls, normalize case, trim whitespace, format phone numbers and emails, extract domains, convert types, and more. Works with Apify datasets or direct JSON input.

Rating: 0.0 (0)

Developer: ParseBird

Maintained by Community

Actor stats: 0 bookmarked, 2 total users, 1 monthly active user, last modified 3 days ago

The first general-purpose data cleaner on Apify.

Clean messy scraped data in one step — trim whitespace, normalize casing, format phone numbers to E.164, lowercase emails, extract domains from URLs, convert strings to numbers, remove null rows, and deduplicate.

ParseBird Infra Suite: utility tools for data pipelines

  • 🔗 HTTP Request: send API calls from the cloud
  • 📚 Data Deduplicator: merge and deduplicate datasets by any field
  • 🗡 Data Cleaner: you are here

Copy to your AI assistant

Copy this block into ChatGPT, Claude, Cursor, or any LLM to start using this actor.

parsebird/data-cleaner on Apify. Call: ApifyClient("TOKEN").actor("parsebird/data-cleaner").call(run_input={...}), then client.dataset(run["defaultDatasetId"]).list_items().items for cleaned results. Key inputs: datasetId (string, Apify dataset ID), jsonData (array of objects, direct JSON input), operations (array of {field, action, options} — required), outputDatasetId (string, optional), maxItems (integer, default 1000000). Actions: trim_whitespace, normalize_case (options: {case: "lower"|"upper"|"title"}), format_email, format_phone (options: {countryCode: "US"}), extract_domain, to_number, to_date, fill_nulls (options: {value: "..."}), remove_nulls, remove_duplicates, replace_value (options: {find, replace}). Full actor spec: fetch build via GET https://api.apify.com/v2/acts/parsebird~data-cleaner (Bearer TOKEN). Get token: https://console.apify.com/account/integrations

What does Data Cleaner do?

This Actor takes messy scraped or imported data and applies a configurable pipeline of cleaning operations. Each operation targets a specific field and transforms its values — trimming whitespace, normalizing case, formatting phone numbers, and more.

Use cases:

  • CRM cleanup — normalize names, emails, and phone numbers before import
  • Lead list hygiene — remove rows with missing emails, deduplicate by company
  • Post-scrape processing — extract domains from URLs, convert price strings to numbers
  • Data pipeline prep — standardize data format before analysis or export

Supported operations

| Action            | Description                           | Options                            | Before                         | After                 |
|-------------------|---------------------------------------|------------------------------------|--------------------------------|-----------------------|
| trim_whitespace   | Remove leading/trailing spaces        |                                    | " John Doe "                   | "John Doe"            |
| normalize_case    | Convert to lower/upper/title case     | {"case": "title"}                  | "john doe"                     | "John Doe"            |
| format_email      | Lowercase and trim emails             |                                    | " JOHN@CO.COM "                | "john@co.com"         |
| format_phone      | Normalize to E.164 format             | {"countryCode": "US"}              | "(555) 123-4567"               | "+15551234567"        |
| extract_domain    | Extract domain from URL or email      |                                    | "https://www.example.com/page" | "example.com"         |
| to_number         | Convert string to number              |                                    | "$1,234,567"                   | 1234567               |
| to_date           | Parse date to ISO 8601                |                                    | "March 15, 2024"               | "2024-03-15T00:00:00" |
| fill_nulls        | Replace null/empty with default       | {"value": "N/A"}                   | null                           | "N/A"                 |
| remove_nulls      | Remove rows where field is null/empty |                                    |                                | (row removed)         |
| remove_duplicates | Deduplicate by this field             |                                    |                                | (duplicate removed)   |
| replace_value     | Find and replace text                 | {"find": "Inc.", "replace": "Inc"} | "Acme Inc."                    | "Acme Inc"            |
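
To get a feel for what a few of these actions do, here is a minimal local sketch in Python that reproduces the Before/After values from the table. It is not the Actor's implementation: the function bodies are assumptions written to match the documented examples, and the choice to return unparseable values unchanged in to_number follows the before-and-after example later in this README.

```python
import re
from urllib.parse import urlparse


def trim_whitespace(value: str) -> str:
    """Remove leading and trailing whitespace."""
    return value.strip()


def normalize_case(value: str, case: str = "lower") -> str:
    """Convert to lower, upper, or title case."""
    if case == "lower":
        return value.lower()
    if case == "upper":
        return value.upper()
    if case == "title":
        return value.title()
    raise ValueError(f"unsupported case: {case!r}")


def format_email(value: str) -> str:
    """Trim and lowercase an email address."""
    return value.strip().lower()


def extract_domain(value: str) -> str:
    """Return the host of a URL or the domain part of an email (assumed rules)."""
    value = value.strip()
    if "://" not in value and "@" in value:
        host = value.rsplit("@", 1)[1]
    else:
        # urlparse only fills in the host when a scheme or "//" is present
        parsed = urlparse(value if "://" in value else "//" + value)
        host = parsed.hostname or ""
    host = host.lower()
    return host[4:] if host.startswith("www.") else host


def to_number(value: str):
    """Strip currency symbols and separators; return the input if nothing parses."""
    cleaned = re.sub(r"[^\d.\-]", "", value)
    try:
        number = float(cleaned)
    except ValueError:
        return value  # assumption: unparseable values are left unchanged
    return int(number) if number.is_integer() else number


print(extract_domain("https://www.example.com/page"))  # prints example.com
print(to_number("$1,234,567"))  # prints 1234567
```

Running the sketch against your own sample values before a large run can help you decide which actions a field actually needs.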

Input parameters

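| Parameter       | Type    | Required | Default | Description                                          |
|-----------------|---------|----------|---------|------------------------------------------------------|
| datasetId       | string  | No*      |         | Apify dataset ID to clean                            |
| jsonData        | array   | No*      |         | Direct JSON array of objects to clean                |
| operations      | array   | Yes      |         | List of {field, action, options} cleaning operations |
| outputDatasetId | string  | No       |         | Named output dataset (defaults to run dataset)       |
| maxItems        | integer | No       | 1000000 | Max items to process                                 |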

*Provide either datasetId or jsonData (or both).
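
If you build the run input programmatically, a small client-side check can catch mistakes before you start a run. The sketch below is a hypothetical helper (validate_run_input is not part of the Actor or the Apify client). It encodes the rules from the table above, plus one assumption: that an empty operations array is not useful and should be flagged.

```python
# Action names as listed in this README
KNOWN_ACTIONS = {
    "trim_whitespace", "normalize_case", "format_email", "format_phone",
    "extract_domain", "to_number", "to_date", "fill_nulls",
    "remove_nulls", "remove_duplicates", "replace_value",
}


def validate_run_input(run_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the input looks usable."""
    problems = []
    # At least one data source is needed: datasetId or jsonData
    if not run_input.get("datasetId") and not run_input.get("jsonData"):
        problems.append("provide datasetId or jsonData (or both)")
    operations = run_input.get("operations")
    if not isinstance(operations, list) or not operations:
        # Assumption: an empty array is treated as an error here
        problems.append("operations must be a non-empty array")
        operations = []
    for index, op in enumerate(operations):
        if not isinstance(op, dict):
            problems.append(f"operations[{index}] must be an object")
            continue
        if not op.get("field"):
            problems.append(f"operations[{index}] is missing field")
        if op.get("action") not in KNOWN_ACTIONS:
            problems.append(
                f"operations[{index}] has unknown action {op.get('action')!r}"
            )
    return problems


run_input = {
    "jsonData": [{"name": " JOHN DOE "}],
    "operations": [{"field": "name", "action": "trim_whitespace"}],
}
print(validate_run_input(run_input))  # prints []
```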

Operations format

Each operation is a JSON object with:

{
  "field": "email",
  "action": "format_email",
  "options": {}
}

Operations are applied in order. You can chain multiple operations on the same field:

[
  {"field": "name", "action": "trim_whitespace"},
  {"field": "name", "action": "normalize_case", "options": {"case": "title"}},
  {"field": "email", "action": "format_email"},
  {"field": "phone", "action": "format_phone", "options": {"countryCode": "US"}},
  {"field": "website", "action": "extract_domain"},
  {"field": "revenue", "action": "to_number"},
  {"field": "email", "action": "remove_nulls"}
]
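
Because operations run in order, the position of row-level actions such as remove_duplicates matters. The following sketch illustrates this with a tiny local pipeline. It is an illustration only, not the Actor's code: apply_operations and VALUE_ACTIONS are hypothetical names, only a few actions are modeled, and keeping the first occurrence of a duplicate is an assumption.

```python
from typing import Any, Callable

# Hypothetical value-level transforms, keyed by action name
VALUE_ACTIONS: dict[str, Callable[[Any, dict], Any]] = {
    "trim_whitespace": lambda v, opts: v.strip() if isinstance(v, str) else v,
    "format_email": lambda v, opts: v.strip().lower() if isinstance(v, str) else v,
}


def apply_operations(rows: list[dict], operations: list[dict]) -> list[dict]:
    """Apply each operation to every row, in the order given."""
    rows = [dict(row) for row in rows]  # work on copies, leave the input intact
    for op in operations:
        field, action = op["field"], op["action"]
        options = op.get("options") or {}
        if action == "remove_nulls":
            rows = [r for r in rows if r.get(field) not in (None, "")]
        elif action == "remove_duplicates":
            seen = set()
            unique = []
            for r in rows:
                key = r.get(field)  # assumes hashable (scalar) field values
                if key not in seen:
                    seen.add(key)
                    unique.append(r)  # assumption: the first occurrence wins
            rows = unique
        else:
            transform = VALUE_ACTIONS[action]
            for r in rows:
                if field in r:
                    r[field] = transform(r[field], options)
    return rows


rows = [{"email": " A@x.com "}, {"email": "a@x.com"}, {"email": None}]
normalize_first = [
    {"field": "email", "action": "format_email"},
    {"field": "email", "action": "remove_nulls"},
    {"field": "email", "action": "remove_duplicates"},
]
dedupe_first = [
    {"field": "email", "action": "remove_duplicates"},
    {"field": "email", "action": "format_email"},
]
print(len(apply_operations(rows, normalize_first)))  # prints 1
print(len(apply_operations(rows, dedupe_first)))  # prints 3
```

In this sketch, deduplicating before normalizing misses the duplicate because " A@x.com " and "a@x.com" are still different strings at that point, so normalizing first is usually the safer order.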

Before and after example

Input (dirty data)

[
  {"name": " john doe ", "email": " JOHN@EXAMPLE.COM ", "phone": "(555) 123-4567", "website": "https://www.example.com/about", "revenue": "$1,234,567"},
  {"name": "JANE SMITH", "email": "Jane.Smith@Company.IO", "phone": "555.987.6543", "website": "info@company.io", "revenue": "2345678"},
  {"name": "", "email": null, "phone": "1-800-555-0199", "website": "company.io", "revenue": "$99.99"},
  {"name": "bob wilson", "email": "bob@test.com", "phone": "+14155550100", "website": "https://test.com/page?id=1", "revenue": "not a number"}
]

Output (cleaned data)

[
  {"name": "John Doe", "email": "john@example.com", "phone": "+15551234567", "website": "example.com", "revenue": 1234567},
  {"name": "Jane Smith", "email": "jane.smith@company.io", "phone": "+15559876543", "website": "company.io", "revenue": 2345678},
  {"name": "Bob Wilson", "email": "bob@test.com", "phone": "+14155550100", "website": "test.com", "revenue": "not a number"}
]

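Row 3 was removed because its email is null (remove_nulls). In the remaining rows, names are title-cased, emails lowercased, phones formatted as E.164, and domains extracted. Revenues are converted to numbers where they parse; the unparseable value "not a number" in the last row is left unchanged.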
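
The phone values in this example can be reproduced with a very small US-only rule: keep the digits and prefix the country code. The sketch below is an assumption made for illustration (format_phone_us is a hypothetical helper). How the Actor normalizes phone numbers internally is not documented here, and a real implementation would also need to validate numbers and handle other country codes.

```python
import re


def format_phone_us(raw: str) -> str:
    """Format a US phone number as E.164, or return the input unchanged."""
    digits = re.sub(r"\D", "", raw)  # drop everything except digits
    if len(digits) == 10:
        return "+1" + digits  # national number without country code
    if len(digits) == 11 and digits.startswith("1"):
        return "+" + digits  # country code already present
    return raw  # assumption: leave anything else untouched


for raw in ["(555) 123-4567", "555.987.6543", "+14155550100"]:
    print(format_phone_us(raw))
# prints +15551234567, +15559876543, +14155550100 (one per line)
```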

How to use via API

Python

from apify_client import ApifyClient

# Authenticate with your Apify API token
client = ApifyClient("YOUR_API_TOKEN")

# Start the Actor and wait for the run to finish
run = client.actor("parsebird/data-cleaner").call(run_input={
    "datasetId": "YOUR_DATASET_ID",
    "operations": [
        {"field": "email", "action": "format_email"},
        {"field": "name", "action": "trim_whitespace"},
        {"field": "name", "action": "normalize_case", "options": {"case": "title"}},
        {"field": "phone", "action": "format_phone", "options": {"countryCode": "US"}},
    ],
})

# Read the cleaned items from the run's default dataset
items = client.dataset(run["defaultDatasetId"]).list_items().items
print(f"Cleaned items: {len(items)}")

cURL

curl -X POST "https://api.apify.com/v2/acts/parsebird~data-cleaner/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "jsonData": [
      {"name": " JOHN DOE ", "email": " JOHN@CO.COM "}
    ],
    "operations": [
      {"field": "name", "action": "trim_whitespace"},
      {"field": "name", "action": "normalize_case", "options": {"case": "title"}},
      {"field": "email", "action": "format_email"}
    ]
  }'

Output

Cleaned items retain their original structure. A stats key is stored in the key-value store:

{
  "totalLoaded": 5000,
  "totalCleaned": 4800,
  "operationsApplied": 7,
  "fieldsCleaned": 5,
  "totalChanges": 15200
}
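
Once you have the stats object, a couple of derived figures make it easier to judge a run. The sketch below is a hypothetical helper (summarize_stats is not provided by the Actor). It assumes that the difference between totalLoaded and totalCleaned is the number of rows dropped by actions such as remove_nulls and remove_duplicates, which this README does not state explicitly.

```python
def summarize_stats(stats: dict) -> dict:
    """Derive row-removal and change-rate figures from the stats object."""
    loaded = stats["totalLoaded"]
    cleaned = stats["totalCleaned"]
    removed = loaded - cleaned  # assumption: rows dropped during cleaning
    return {
        "rowsRemoved": removed,
        "removalRate": removed / loaded if loaded else 0.0,
        "changesPerCleanedItem": stats["totalChanges"] / cleaned if cleaned else 0.0,
    }


# The sample stats shown above
stats = {
    "totalLoaded": 5000,
    "totalCleaned": 4800,
    "operationsApplied": 7,
    "fieldsCleaned": 5,
    "totalChanges": 15200,
}
summary = summarize_stats(stats)
print(summary["rowsRemoved"])  # prints 200
```

For the sample values this gives 200 removed rows, a removal rate of 4%, and a little over three changes per cleaned item.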

Pricing

This Actor uses a pay-per-event pricing model.

| Event         | Price per event | Price per 1,000 |
|---------------|-----------------|-----------------|
| items-cleaned | $0.00149        | $1.49           |

Charged per 1,000 items loaded. Platform compute costs are additional.
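
For a rough budget, the event price can be multiplied by the number of items you expect to load. The sketch below assumes the charge scales linearly at $0.00149 per item; whether billing is rounded to whole blocks of 1,000 items is not specified here, and platform compute costs are not included. The name estimate_event_cost is hypothetical.

```python
from decimal import Decimal

# Price per items-cleaned event, taken from the pricing table
PRICE_PER_ITEM_USD = Decimal("0.00149")


def estimate_event_cost(items_loaded: int) -> Decimal:
    """Estimate the event charge in USD, rounded to cents (compute excluded)."""
    if items_loaded < 0:
        raise ValueError("items_loaded must be non-negative")
    # Assumption: linear per-item pricing with no block rounding
    return (PRICE_PER_ITEM_USD * items_loaded).quantize(Decimal("0.01"))


print(estimate_event_cost(1000))  # prints 1.49
print(estimate_event_cost(5000))  # prints 7.45
```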