Pricing

from $1.49 / 1,000 items cleaned

Data Cleaner

Clean messy data — remove nulls, normalize case, trim whitespace, format phone numbers and emails, extract domains, convert types, and more. Works with Apify datasets or direct JSON input.

Developer

ParseBird

Clean messy scraped data in one step: trim whitespace, normalize casing, format phone numbers to E.164, lowercase emails, extract domains from URLs, convert strings to numbers, remove null rows, and deduplicate. ParseBird bills it as the first general-purpose data cleaner on Apify.

ParseBird Infra Suite • Utility tools for data pipelines
🔗 HTTP Request: send API calls from the cloud
📚 Data Deduplicator: merge and deduplicate datasets by any field
🗡 Data Cleaner: this Actor

Copy to your AI assistant

Copy this block into ChatGPT, Claude, Cursor, or any LLM to start using this actor.

parsebird/data-cleaner on Apify.
Call: ApifyClient("TOKEN").actor("parsebird/data-cleaner").call(run_input={...}), then client.dataset(run["defaultDatasetId"]).list_items().items for cleaned results.
Key inputs:
datasetId (string, Apify dataset ID)
jsonData (array of objects, direct JSON input)
operations (array of {field, action, options}, required)
outputDatasetId (string, optional)
maxItems (integer, default 1000000)
Actions: trim_whitespace, normalize_case (options: {case: "lower"|"upper"|"title"}), format_email, format_phone (options: {countryCode: "US"}), extract_domain, to_number, to_date, fill_nulls (options: {value: "..."}), remove_nulls, remove_duplicates, replace_value (options: {find, replace}).
Full actor spec: fetch build via GET https://api.apify.com/v2/acts/parsebird~data-cleaner (Bearer TOKEN).
Get token: https://console.apify.com/account/integrations

What does Data Cleaner do?

This Actor takes messy scraped or imported data and applies a configurable pipeline of cleaning operations. Each operation targets a specific field and transforms its values — trimming whitespace, normalizing case, formatting phone numbers, and more.

Use cases:

CRM cleanup — normalize names, emails, and phone numbers before import
Lead list hygiene — remove rows with missing emails, deduplicate by company
Post-scrape processing — extract domains from URLs, convert price strings to numbers
Data pipeline prep — standardize data format before analysis or export

Supported operations

Action	Description	Options	Before	After
`trim_whitespace`	Remove leading/trailing spaces	—	`" John Doe "`	`"John Doe"`
`normalize_case`	Convert to lower/upper/title case	`{"case": "title"}`	`"john doe"`	`"John Doe"`
`format_email`	Lowercase and trim emails	—	`" JOHN@CO.COM "`	`"john@co.com"`
`format_phone`	Normalize to E.164 format	`{"countryCode": "US"}`	`"(555) 123-4567"`	`"+15551234567"`
`extract_domain`	Extract domain from URL or email	—	`"https://www.example.com/page"`	`"example.com"`
`to_number`	Convert string to number	—	`"$1,234,567"`	`1234567`
`to_date`	Parse date to ISO 8601	—	`"March 15, 2024"`	`"2024-03-15T00:00:00"`
`fill_nulls`	Replace null/empty with default	`{"value": "N/A"}`	`null`	`"N/A"`
`remove_nulls`	Remove rows where field is null/empty	—	`null`	(row removed)
`remove_duplicates`	Deduplicate by this field	—	(repeated value)	(duplicate row removed)
`replace_value`	Find and replace text	`{"find": "Inc.", "replace": "Inc"}`	`"Acme Inc."`	`"Acme Inc"`
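To make the table concrete, here is a minimal local sketch of how a few of these actions could behave. It is an illustration written to match the Before/After columns above, not the Actor's actual implementation; the function bodies and edge-case handling (for example, leaving unparsable numbers unchanged) are assumptions. `format_phone` and `to_date` are left out because they usually need a dedicated parsing library.

```python
import re
from urllib.parse import urlparse


def trim_whitespace(value: str) -> str:
    """Remove leading and trailing whitespace."""
    return value.strip()


def normalize_case(value: str, case: str = "lower") -> str:
    """Convert to lower, upper, or title case."""
    if case == "lower":
        return value.lower()
    if case == "upper":
        return value.upper()
    if case == "title":
        return value.title()
    raise ValueError(f"unsupported case option: {case!r}")


def format_email(value: str) -> str:
    """Trim and lowercase an email address."""
    return value.strip().lower()


def extract_domain(value: str) -> str:
    """Return the bare domain from a URL, a bare host, or an email address."""
    value = value.strip()
    if "@" in value and "://" not in value:
        host = value.rsplit("@", 1)[1]
    else:
        # urlparse only fills in the host when a scheme or leading "//" is present.
        parsed = urlparse(value if "://" in value else "//" + value)
        host = parsed.hostname or ""
    host = host.lower()
    return host[4:] if host.startswith("www.") else host


def to_number(value: str):
    """Strip currency symbols and separators; return the input unchanged if unparsable (assumed behavior)."""
    cleaned = re.sub(r"[^\d.\-]", "", value)
    try:
        number = float(cleaned)
    except ValueError:
        return value
    return int(number) if number.is_integer() else number


print(extract_domain("https://www.example.com/page"))
print(to_number("$1,234,567"))
```

Note that Python's `str.title()` capitalizes after every non-letter character, so names with apostrophes may not come out the way a dedicated name formatter would produce them.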

Input parameters

Parameter	Type	Required	Default	Description
`datasetId`	string	No*	—	Apify dataset ID to clean
`jsonData`	array	No*	—	Direct JSON array of objects to clean
`operations`	array	Yes	—	List of `{field, action, options}` cleaning operations
`outputDatasetId`	string	No	—	Named output dataset (defaults to run dataset)
`maxItems`	integer	No	`1000000`	Max items to process

*Provide either datasetId or jsonData (or both).
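The rules in this table can be checked on the client side before a run is started. The helper below is a hypothetical convenience (`build_run_input` is not part of the Actor or the Apify client); it only encodes what the table states: `operations` is required, and at least one of `datasetId` or `jsonData` must be present.

```python
from typing import Any, Optional


def build_run_input(
    operations: list[dict[str, Any]],
    dataset_id: Optional[str] = None,
    json_data: Optional[list[dict[str, Any]]] = None,
    output_dataset_id: Optional[str] = None,
    max_items: Optional[int] = None,
) -> dict[str, Any]:
    """Assemble a run input dict, rejecting combinations the table rules out."""
    if not operations:
        raise ValueError("operations is required and must not be empty")
    if dataset_id is None and json_data is None:
        raise ValueError("provide datasetId, jsonData, or both")
    for op in operations:
        if "field" not in op or "action" not in op:
            raise ValueError(f"operation needs 'field' and 'action': {op!r}")

    run_input: dict[str, Any] = {"operations": operations}
    # Optional parameters are only included when set, so the Actor's defaults apply otherwise.
    if dataset_id is not None:
        run_input["datasetId"] = dataset_id
    if json_data is not None:
        run_input["jsonData"] = json_data
    if output_dataset_id is not None:
        run_input["outputDatasetId"] = output_dataset_id
    if max_items is not None:
        run_input["maxItems"] = max_items
    return run_input


example = build_run_input(
    operations=[{"field": "email", "action": "format_email"}],
    json_data=[{"email": "  JOHN@CO.COM  "}],
)
print(example)
```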

Operations format

Each operation is a JSON object with:

{
    "field": "email",
    "action": "format_email",
    "options": {}
}

Operations are applied in order. You can chain multiple operations on the same field:

[
    {"field": "name", "action": "trim_whitespace"},
    {"field": "name", "action": "normalize_case", "options": {"case": "title"}},
    {"field": "email", "action": "format_email"},
    {"field": "phone", "action": "format_phone", "options": {"countryCode": "US"}},
    {"field": "website", "action": "extract_domain"},
    {"field": "revenue", "action": "to_number"},
    {"field": "email", "action": "remove_nulls"}
]
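Because operations run in order, a later step sees the output of every earlier one. The sketch below mimics that behavior locally for a handful of actions. It is a simplified stand-in, not the Actor's code: `apply_operations` and the `TRANSFORMS` table are hypothetical names, and treating only `None` and `""` as empty is an assumption.

```python
from typing import Any, Callable


def _change_case(value: Any, case: str) -> Any:
    """Apply the requested casing to strings; pass other types through."""
    if not isinstance(value, str):
        return value
    return {"lower": value.lower, "upper": value.upper, "title": value.title}[case]()


# Value-level actions: each takes (value, options) and returns the new value.
TRANSFORMS: dict[str, Callable[[Any, dict], Any]] = {
    "trim_whitespace": lambda v, o: v.strip() if isinstance(v, str) else v,
    "normalize_case": lambda v, o: _change_case(v, o.get("case", "lower")),
    "format_email": lambda v, o: v.strip().lower() if isinstance(v, str) else v,
}


def apply_operations(rows: list[dict], operations: list[dict]) -> list[dict]:
    """Run each operation over all rows, in the order given."""
    rows = [dict(row) for row in rows]  # work on copies, leave the input untouched
    for op in operations:
        field, action = op["field"], op["action"]
        options = op.get("options") or {}
        if action == "remove_nulls":
            # Row-level action: drop rows whose field is missing, None, or "".
            rows = [row for row in rows if row.get(field) not in (None, "")]
        else:
            transform = TRANSFORMS[action]
            for row in rows:
                if field in row:
                    row[field] = transform(row[field], options)
    return rows


dirty = [
    {"name": "  john doe  ", "email": "  JOHN@EXAMPLE.COM  "},
    {"name": "", "email": None},
]
steps = [
    {"field": "name", "action": "trim_whitespace"},
    {"field": "name", "action": "normalize_case", "options": {"case": "title"}},
    {"field": "email", "action": "format_email"},
    {"field": "email", "action": "remove_nulls"},
]
print(apply_operations(dirty, steps))
```

Placing `remove_nulls` last means rows are judged on their cleaned values, which is why the chained example above ends with it.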

Before and after example

Input (dirty data)

[
    {"name": "  john doe  ", "email": "  JOHN@EXAMPLE.COM  ", "phone": "(555) 123-4567", "website": "https://www.example.com/about", "revenue": "$1,234,567"},
    {"name": "JANE SMITH", "email": "Jane.Smith@Company.IO", "phone": "555.987.6543", "website": "info@company.io", "revenue": "2345678"},
    {"name": "", "email": null, "phone": "1-800-555-0199", "website": "company.io", "revenue": "$99.99"},
    {"name": "bob wilson", "email": "bob@test.com", "phone": "+14155550100", "website": "https://test.com/page?id=1", "revenue": "not a number"}
]

Output (cleaned data)

[
    {"name": "John Doe", "email": "john@example.com", "phone": "+15551234567", "website": "example.com", "revenue": 1234567},
    {"name": "Jane Smith", "email": "jane.smith@company.io", "phone": "+15559876543", "website": "company.io", "revenue": 2345678},
    {"name": "Bob Wilson", "email": "bob@test.com", "phone": "+14155550100", "website": "test.com", "revenue": "not a number"}
]

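Row 3 was removed because its email is null and the pipeline ends with remove_nulls on email. In the remaining rows, names are title-cased, emails lowercased, phones formatted as E.164, and domains extracted. Revenues are converted to numbers where the value can be parsed; "not a number" in the last row is left unchanged.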

How to use via API

Python

from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("YOUR_API_TOKEN")

# Start the Actor and wait for the run to finish.
# Operations are applied in the order listed.
run = client.actor("parsebird/data-cleaner").call(run_input={
    "datasetId": "YOUR_DATASET_ID",
    "operations": [
        {"field": "email", "action": "format_email"},
        {"field": "name", "action": "trim_whitespace"},
        {"field": "name", "action": "normalize_case", "options": {"case": "title"}},
        {"field": "phone", "action": "format_phone", "options": {"countryCode": "US"}},
    ],
})

# Read the cleaned items from the run's default dataset.
items = client.dataset(run["defaultDatasetId"]).list_items().items
print(f"Cleaned items: {len(items)}")

cURL

curl -X POST "https://api.apify.com/v2/acts/parsebird~data-cleaner/runs?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "jsonData": [
        {"name": "  JOHN DOE  ", "email": "  JOHN@CO.COM  "}
    ],
    "operations": [
        {"field": "name", "action": "trim_whitespace"},
        {"field": "name", "action": "normalize_case", "options": {"case": "title"}},
        {"field": "email", "action": "format_email"}
    ]
  }'

Output

Cleaned items retain their original structure. Run statistics are written to the key-value store under a stats key:

{
    "totalLoaded": 5000,
    "totalCleaned": 4800,
    "operationsApplied": 7,
    "fieldsCleaned": 5,
    "totalChanges": 15200
}
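The stats record lends itself to a quick sanity check after each run. The sketch below derives a few figures from the sample above. Reading the difference between `totalLoaded` and `totalCleaned` as the number of rows dropped is an assumption, and `summarize_stats` is a hypothetical helper, not something the Actor provides.

```python
def summarize_stats(stats: dict) -> dict:
    """Derive simple ratios from a stats record shaped like the sample above."""
    loaded = stats["totalLoaded"]
    cleaned = stats["totalCleaned"]
    removed = loaded - cleaned  # assumed to be rows dropped by row-removing actions
    return {
        "rowsRemoved": removed,
        "removedShare": removed / loaded if loaded else 0.0,
        "changesPerCleanedItem": stats["totalChanges"] / cleaned if cleaned else 0.0,
    }


sample = {
    "totalLoaded": 5000,
    "totalCleaned": 4800,
    "operationsApplied": 7,
    "fieldsCleaned": 5,
    "totalChanges": 15200,
}
print(summarize_stats(sample))
```

A sudden jump in the removed share between runs is a cheap signal that an upstream scraper started returning empty fields.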

Pricing

This Actor uses a pay-per-event pricing model.

Event	Price per event	Price per 1,000
`items-cleaned`	$0.00149	$1.49

Each item loaded is billed as one `items-cleaned` event, which works out to $1.49 per 1,000 items. Platform compute costs are additional.
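For budgeting, the event price can be turned into a rough estimate. This sketch assumes one `items-cleaned` event per processed item at the listed $0.00149 and deliberately excludes platform compute costs, which depend on the run. `estimate_event_cost` is a hypothetical helper.

```python
from decimal import Decimal

# Listed price for one items-cleaned event, in USD.
PRICE_PER_ITEM = Decimal("0.00149")


def estimate_event_cost(item_count: int) -> Decimal:
    """Estimate pay-per-event charges in USD, assuming one event per item."""
    if item_count < 0:
        raise ValueError("item_count must be non-negative")
    return PRICE_PER_ITEM * item_count


print(estimate_event_cost(1_000))
print(estimate_event_cost(5_000))
```

`Decimal` is used instead of `float` so that small per-item prices do not pick up rounding noise when multiplied by large item counts.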
