CRM Contact Cleanup & Dedupe Prep

Clean supplied URL, email, and address fields for contact records, preserving one row per input with changed-field, review, dedupe-key, and cross-field signals. Does not scrape, find, verify, enrich, geocode, score confidence, choose survivors, or merge contacts.

Developer: Critical Distinction (maintained by Community)

Contact Cleanup & Dedupe Prep for Supplied Records

Prepare contact records you already have for CRM review, lead-list quality checks, matching, or downstream cleanup. This Actor keeps one dataset row per input record, normalizes supplied URL, email, and U.S.-leaning address values, and returns the changed-field evidence, review flags, invalid reasons, warnings, and match keys next to the original row.

Use it when you need a deterministic cleanup pass before human review or downstream matching decisions:

  • Normalize supplied website, email, and address fields without live data sourcing.
  • Keep invalid, partial, and no-actionable rows visible instead of dropping them from the output.
  • Prepare dedupe keys and same-run candidate labels while preserving every input row.
  • Route messy records to review with diagnostics, not with a confidence score or final truth verdict.

It does not scrape or find contacts, verify email deliverability, certify postal addresses, geocode, enrich records, remove duplicates, choose survivors, or merge CRM records.

Details below cover what you get, input parameters, output format, fixture-backed output examples, how to read common rows, limitations, permissions, pricing, and release history.

What You Get

  • Preserves one dataset row per supplied input record.
  • Processes only the supported first-release fields: recordId, url, email, and address.
  • Normalizes URLs into canonical http or https values with stable host/path/query behavior and cleanup of tracking parameters and fragments.
  • Normalizes emails into stable lowercase and canonical mailbox/domain values, including deterministic Gmail alias handling.
  • Normalizes U.S.-leaning address text into display and comparison forms through the shared address-normalization boundary.
  • Emits required recordStatus values: ready, normalized, review_needed, invalid, and no_actionable_input.
  • Emits changedFields, reviewFlags, dedupeKeys, crossFieldSignals, warnings, invalidReasons, and processing details so buyers can see why each row landed where it did.
  • Writes a structured OUTPUT summary with selected controls, row counts, status distribution, diagnostic summaries, first-release limits, and explicit unsupported-capability booleans.

The output is meant to help a buyer decide what to review next. A clean row stays clean, a changed row explains what changed, a messy row keeps the original value next to diagnostics, and a possible same-run match is shown as a candidate label instead of being removed or merged.
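
The URL cleanup described above can be illustrated with a minimal Python sketch. It is not the Actor's implementation: the function name canonicalize_url and the tracking-parameter rule (any key starting with utm_) are assumptions made for illustration, and only the steps named by the documented reason codes are reproduced.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: the Actor's real tracking-parameter list is not documented here.
TRACKING_PREFIXES = ("utm_",)


def canonicalize_url(raw: str) -> str:
    """Hypothetical helper mirroring the documented reason codes."""
    text = raw.strip()                      # trimmed_whitespace
    if "://" not in text:
        text = "https://" + text            # assumed_https
    parts = urlsplit(text)
    host = parts.netloc.lower()
    path = parts.path.rstrip("/")           # removed_trailing_slash
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)  # removed_tracking_parameters
    ]
    query = urlencode(sorted(pairs))        # sorted_query_parameters
    # An empty fragment drops "#team" and similar anchors (removed_fragment).
    return urlunsplit((parts.scheme.lower(), host, path, query, ""))


print(canonicalize_url(" Example.com/about/?utm_source=newsletter&b=2&a=1#team "))
# prints https://example.com/about?a=1&b=2
```

The printed value matches the canonicalUrl shown in the dataset item example. Real rows may differ wherever the Actor applies rules this sketch does not model.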

Operating Boundaries

This README describes the product behavior of a single supplied-record cleanup run. The Actor writes one dataset row per input record and a structured OUTPUT summary; it does not configure recurrence, send alerts, replace workflows, call sibling Actors, change Store pricing, publish or unpublish itself, or mutate legacy Actor registrations.

Cost and billing are described in the Pricing section. This README does not claim a fixed per-record price, Pay Per Event readiness, Store launch completion, live monitoring, workflow ownership, or legacy Actor disposal authority.

Input Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| records | array of objects | required | Supplied contact-like records. Each record may include recordId, url, email, and address. The first-release limit is 1,000 records per run. |
| fieldGroups | array of strings | ["url", "email", "address"] | Selects which supplied field families to process. Allowed values are url, email, and address. This is not a stage, child Actor, scraping, or live-enrichment control. |
| reviewStrictness | string | standard | Controls which deterministic review observations promote a row to review_needed. Allowed values are minimal, standard, and strict. It does not create confidence scoring, verification, survivor choice, or merge authority. |
| dedupeKeyMode | string | keys_only | Controls whether match-prep keys and same-run exact candidate labels are emitted. Allowed values are off, keys_only, and keys_and_candidates. It never removes rows, ranks records, chooses survivors, or merges contacts. |

reviewStrictness changes routing pressure only. It does not hide changed fields, invalid reasons, warnings, dedupe keys, or cross-field signals. dedupeKeyMode changes whether match-prep evidence is shown; it does not change output cardinality or mutate any downstream system.

Example input:

{
  "records": [
    {
      "recordId": "acme-hq",
      "url": " Example.com/about/?utm_source=newsletter&b=2&a=1#team ",
      "email": "Sales+ops@Acme.com",
      "address": " 123 Main St., Suite 200, Austin, Texas 78701-1234 "
    },
    {
      "recordId": "partial-invalid",
      "url": "https://valid.example",
      "email": "not-an-email"
    }
  ],
  "fieldGroups": ["url", "email", "address"],
  "reviewStrictness": "standard",
  "dedupeKeyMode": "keys_only"
}
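
Before starting a run, it can help to check an input object against the documented limits and allowed values. The sketch below is a client-side pre-check only, written under the assumption that the parameter table is complete; check_run_input and chunk_records are hypothetical helper names, and the Actor's own validation may be stricter or report errors differently.

```python
MAX_RECORDS = 1000  # documented first-release limit per run
ALLOWED_FIELD_GROUPS = {"url", "email", "address"}
ALLOWED_STRICTNESS = {"minimal", "standard", "strict"}
ALLOWED_DEDUPE_MODES = {"off", "keys_only", "keys_and_candidates"}


def check_run_input(run_input: dict) -> list[str]:
    """Return human-readable problems; an empty list means nothing was found."""
    problems = []
    records = run_input.get("records")
    if not isinstance(records, list):
        problems.append("records is required and must be an array")
    elif len(records) > MAX_RECORDS:
        problems.append(f"records has {len(records)} items; the limit is {MAX_RECORDS}")

    # Defaults below are the documented defaults for omitted parameters.
    groups = run_input.get("fieldGroups", ["url", "email", "address"])
    unknown = sorted(set(groups) - ALLOWED_FIELD_GROUPS)
    if unknown:
        problems.append(f"unsupported fieldGroups: {unknown}")
    if run_input.get("reviewStrictness", "standard") not in ALLOWED_STRICTNESS:
        problems.append("reviewStrictness must be minimal, standard, or strict")
    if run_input.get("dedupeKeyMode", "keys_only") not in ALLOWED_DEDUPE_MODES:
        problems.append("dedupeKeyMode must be off, keys_only, or keys_and_candidates")
    return problems


def chunk_records(records: list, size: int = MAX_RECORDS) -> list[list]:
    """Split a longer list into run-sized batches."""
    return [records[i:i + size] for i in range(0, len(records), size)]
```

Note that same-run candidate labels only compare rows within one run, so splitting a list into batches also splits the candidate comparison.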

Output Format

This Actor writes two outputs:

  • The default dataset contains one cleanup row per supplied record.
  • The OUTPUT key-value record contains aggregate counts, selected controls, first-release limits, and explicit non-claim booleans.

Use the dataset when reviewing individual records. Use OUTPUT when you need run-level counts, selected controls, diagnostic totals, and a machine-readable reminder of unsupported capabilities.

Each dataset item includes:

{
  "recordId": "acme-hq",
  "inputIndex": 0,
  "recordStatus": "review_needed",
  "input": {
    "recordId": "acme-hq",
    "url": " Example.com/about/?utm_source=newsletter&b=2&a=1#team ",
    "email": "Sales+ops@Acme.com",
    "address": " 123 Main St., Suite 200, Austin, Texas 78701-1234 "
  },
  "normalized": {
    "url": {
      "canonicalUrl": "https://example.com/about?a=1&b=2",
      "scheme": "https",
      "host": "example.com",
      "path": "/about"
    },
    "email": {
      "normalizedEmail": "sales+ops@acme.com",
      "canonicalEmail": "sales+ops@acme.com",
      "domain": "acme.com"
    },
    "address": {
      "normalizedAddress": "123 main street ste 200 austin tx 78701",
      "comparisonAddress": "123 main street austin tx 78701",
      "addressType": "street",
      "state": "tx",
      "postalCode": "78701",
      "postalCodeExtension": "1234",
      "secondaryUnitDesignator": "ste",
      "secondaryUnitIdentifier": "200"
    }
  },
  "changedFields": [
    {
      "fieldGroup": "url",
      "sourceField": "input.url",
      "targetField": "normalized.url.canonicalUrl",
      "originalValue": " Example.com/about/?utm_source=newsletter&b=2&a=1#team ",
      "normalizedValue": "https://example.com/about?a=1&b=2",
      "reasonCodes": [
        "trimmed_whitespace",
        "assumed_https",
        "removed_fragment",
        "removed_tracking_parameters",
        "sorted_query_parameters",
        "removed_trailing_slash"
      ]
    }
  ],
  "reviewFlags": [
    {
      "fieldGroup": "email",
      "sourceField": "derived.email.roleAccount",
      "sourceValue": true,
      "flagCode": "email_role_account",
      "severity": "medium",
      "strictnessThreshold": "standard",
      "message": "Email local part looks like a role or team mailbox."
    }
  ],
  "dedupeKeys": [
    {
      "fieldGroup": "email",
      "keyFamily": "email_domain",
      "keyValue": "acme.com",
      "matchScope": "organization",
      "keyStrength": "context",
      "sourceFields": ["normalized.email.domain"],
      "sourceValues": ["acme.com"],
      "candidate": null
    }
  ],
  "crossFieldSignals": [
    {
      "signalCode": "url_email_domain_mismatch",
      "fieldGroups": ["url", "email"],
      "severity": "medium",
      "sourceFields": ["normalized.url.host", "normalized.email.domain"],
      "sourceValues": ["example.com", "acme.com"],
      "reviewStrictnessThreshold": "standard",
      "statusImpact": "promotes_review_needed",
      "message": "Website host and email domain do not line up under the deterministic domain comparison rule; review before treating them as the same organization context."
    }
  ],
  "warnings": [],
  "invalidReasons": [],
  "processing": {
    "enabledFieldGroups": ["url", "email", "address"],
    "reviewStrictness": "standard",
    "dedupeKeyMode": "keys_only",
    "fieldStates": {
      "url": {
        "enabled": true,
        "inputState": "nonblank",
        "resultState": "usable"
      },
      "email": {
        "enabled": true,
        "inputState": "nonblank",
        "resultState": "usable"
      },
      "address": {
        "enabled": true,
        "inputState": "nonblank",
        "resultState": "usable"
      }
    }
  }
}

The OUTPUT summary includes:

{
  "schemaVersion": "contact-cleanup-output-v1",
  "selectedControls": {
    "fieldGroups": ["url", "email", "address"],
    "reviewStrictness": "standard",
    "dedupeKeyMode": "keys_only"
  },
  "inputCount": 2,
  "emittedCount": 2,
  "emittedEqualsInputCount": true,
  "statusCounts": {
    "ready": 0,
    "normalized": 0,
    "review_needed": 2,
    "invalid": 0,
    "no_actionable_input": 0
  },
  "rowOutcomeCounts": {
    "rowsWithUsableOutput": 2,
    "rowsWithChangedFields": 2,
    "rowsWithReviewFlags": 2,
    "rowsWithDedupeKeys": 2,
    "rowsWithDuplicateCandidates": 0,
    "rowsWithCrossFieldSignals": 1,
    "rowsWithWarnings": 0,
    "rowsWithInvalidReasons": 1
  },
  "firstReleaseLimits": {
    "recordsMaxItems": 1000,
    "supportedInputFields": ["recordId", "url", "email", "address"],
    "supportedFieldGroups": ["url", "email", "address"],
    "defaultFieldGroups": ["url", "email", "address"],
    "reviewStrictnessValues": ["minimal", "standard", "strict"],
    "defaultReviewStrictness": "standard",
    "dedupeKeyModeValues": ["off", "keys_only", "keys_and_candidates"],
    "defaultDedupeKeyMode": "keys_only",
    "cleanupProfileExposed": false,
    "defaultOutputCardinality": "one_row_per_input_record"
  },
  "nonClaimSummary": {
    "emailDeliverabilityVerified": false,
    "inboxExistenceVerified": false,
    "inboxOwnershipVerified": false,
    "missingContactsFound": false,
    "sourceScrapingPerformed": false,
    "externalEnrichmentPerformed": false,
    "websiteReachabilityChecked": false,
    "postalDeliverabilityCertified": false,
    "geocodingPerformed": false,
    "demographicEnrichmentPerformed": false,
    "confidenceScoringPerformed": false,
    "crmMergePerformed": false,
    "automaticSurvivorshipSelected": false,
    "liveFreshnessChecked": false
  },
  "runDurationSeconds": 0.0
}

The real OUTPUT object also includes detailed field-group activity, changed-field, review-flag, dedupe-key, cross-field-signal, warning, and invalid-reason summaries.
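
Because the dataset and the OUTPUT summary describe the same run, a downstream consumer can cross-check them after downloading both. The sketch below assumes the rows and the summary are already loaded as Python objects; reconcile is a hypothetical helper, and it only compares the fields shown in the examples above.

```python
from collections import Counter

STATUSES = ("ready", "normalized", "review_needed", "invalid", "no_actionable_input")


def reconcile(rows: list[dict], summary: dict) -> list[str]:
    """Compare downloaded dataset rows with the OUTPUT summary counts."""
    mismatches = []
    if summary.get("emittedCount") != len(rows):
        mismatches.append(
            f"emittedCount is {summary.get('emittedCount')} but {len(rows)} rows were loaded"
        )
    if summary.get("inputCount") != summary.get("emittedCount"):
        mismatches.append("inputCount and emittedCount differ")

    # Tally recordStatus values from the rows and compare per status.
    counted = Counter(row.get("recordStatus") for row in rows)
    expected_counts = summary.get("statusCounts", {})
    for status in STATUSES:
        expected = expected_counts.get(status, 0)
        if counted.get(status, 0) != expected:
            mismatches.append(
                f"{status}: summary says {expected}, rows show {counted.get(status, 0)}"
            )
    return mismatches
```

An empty result means the loaded rows agree with the summary on cardinality and status distribution. A non-empty result usually points at a partial download or a filtered export rather than at the run itself.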

Row Statuses

| Status | Meaning | How to use it |
| --- | --- | --- |
| ready | Supported supplied values were already usable under the deterministic rules. | Treat as cleanup confirmation, not live verification. |
| normalized | At least one supplied value changed into a usable normalized value without review pressure. | Review changedFields if you need to explain the transformation. |
| review_needed | The row has usable output plus review pressure such as a partial invalid field, role/disposable email observation, duplicate candidate, or cross-field signal. | Review before matching, importing, or trusting the row in another system. |
| invalid | Enabled nonblank supplied values could not be normalized into usable output. | Use invalidReasons as diagnostics; the row is preserved on purpose. |
| no_actionable_input | The row had no enabled nonblank url, email, or address value to process. | Keep or remove it according to your own source-system rules. |
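
One way to read the status definitions is as an ordered decision. The sketch below is only an interpretation of the status meanings in this section, not the Actor's routing code; describe_status and its four boolean parameters are hypothetical names introduced for illustration.

```python
def describe_status(
    has_enabled_nonblank_input: bool,
    has_usable_output: bool,
    has_review_pressure: bool,
    has_changed_fields: bool,
) -> str:
    """Map four row-level observations onto the documented recordStatus values."""
    if not has_enabled_nonblank_input:
        return "no_actionable_input"   # nothing enabled and nonblank to process
    if not has_usable_output:
        return "invalid"               # every supplied value failed normalization
    if has_review_pressure:
        return "review_needed"         # usable output plus something worth a look
    if has_changed_fields:
        return "normalized"            # usable output that differs from the input
    return "ready"                     # usable output, already clean


# A row with one usable field and one invalid field counts as review pressure.
print(describe_status(True, True, True, False))  # prints review_needed
```

Under this reading, the partial-invalid record from the example input lands in review_needed rather than invalid, which agrees with the statusCounts in the OUTPUT example.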

Key Output Fields

| Field | What it tells you | Boundary |
| --- | --- | --- |
| normalized | Canonical URL, canonical email/domain, and U.S.-leaning address display/comparison values when available. | Deterministic cleanup only; no reachability, deliverability, ownership, postal, or geocoding proof. |
| changedFields | Which supported values changed and which reason codes explain the change. | Explains transformations; it is not a correctness score. |
| reviewFlags | Deterministic observations that can make a row worth human review. | Review routing only; not a confidence score or truth verdict. |
| dedupeKeys | URL, email, domain, and address comparison keys, plus optional same-run candidate labels. | Match preparation only; no duplicate removal, ranking, survivor choice, or CRM merge. |
| crossFieldSignals | Deterministic prompts from relationships between supplied fields, such as URL/email domain mismatch. | Review prompts only; not identity, ownership, fraud, or legal proof. |
| warnings | Non-blocking diagnostics such as unknown address shape or disabled field group context. | Keeps uncertainty visible without claiming the row is wrong. |
| invalidReasons | Field-level reasons why supplied nonblank values could not be normalized. | Diagnostics only; invalid output is not a failed run by itself. |
| processing | Controls and per-field input/result states used for the row. | Helps explain routing; not a pricing or support guarantee. |

Deterministic Smoke Behavior

The committed smoke input at .actor/smoke_input.json uses only deterministic .example records. It covers ready, normalized, no-actionable, invalid, mixed valid/invalid, duplicate-candidate, cross-field, review, same-address, and warning-shaped rows without depending on third-party network state.

Fixture-Backed Output Examples

The repo includes detailed output examples in ./docs/output-examples.md for the committed 12-record smoke fixture. That pack traces examples to the smoke input, aggregate contract test, and saved smoke/cost-matrix run summaries.

The fixture summary is:

| Example surface | Count |
| --- | --- |
| Input records | 12 |
| Dataset rows emitted | 12 |
| ready rows | 2 |
| normalized rows | 1 |
| review_needed rows | 7 |
| invalid rows | 1 |
| no_actionable_input rows | 1 |
| Rows with duplicate candidates | 4 |
| Rows with cross-field signals | 3 |
| Rows with warnings | 1 |
| Rows with invalid reasons | 2 |

Use the examples as support guidance for reading output. They show how duplicate candidates, domain mismatch signals, invalid diagnostics, no-actionable rows, and address warnings preserve review evidence without removing rows or claiming verification, enrichment, address authority, confidence scoring, automatic dedupe, or CRM merge.

How To Read Common Rows

Ready or normalized rows: Use the normalized values and changed-field ledger as cleanup evidence. Do not treat the row as proof that a website is reachable, an inbox exists, an address is deliverable, or a contact is current.

Invalid rows: Invalid rows are emitted intentionally when supplied nonblank values cannot be normalized. The row keeps the original input and explains the problem in invalidReasons so you can fix the source record or route it for manual review.

Partial rows: A row can contain usable output for one field group and invalid diagnostics for another. Keep reading the full row before discarding it; the useful field groups remain available.

No-actionable rows: Blank, null, missing, or disabled field groups can produce no_actionable_input. These rows preserve input cardinality. With pay-per-usage billing, the run can still consume platform resources even when a row contains nothing to process.

Duplicate-candidate rows: When dedupeKeyMode is keys_and_candidates, rows can point at same-run peers with the same canonical URL, canonical email, or address comparison key. That is a review queue, not an automatic dedupe result.
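
A consumer can also build its own review queue from the emitted keys by grouping rows that share a key value. The sketch below is one possible approach, assuming each dedupeKeys entry carries keyFamily and keyValue as in the dataset item example; candidate_groups is a hypothetical helper. Only the email_domain family name appears in this README, so the family names for URL, email, and address keys are left for the caller to supply.

```python
from collections import defaultdict


def candidate_groups(rows: list[dict], key_families: set[str]) -> dict:
    """Group recordIds that share a dedupe key in one of the chosen families."""
    groups = defaultdict(list)
    for row in rows:
        for key in row.get("dedupeKeys", []):
            if key.get("keyFamily") in key_families:
                groups[(key["keyFamily"], key["keyValue"])].append(row.get("recordId"))
    # Keep only keys shared by at least two rows; nothing is removed or merged.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


rows = [
    {"recordId": "a", "dedupeKeys": [{"keyFamily": "email_domain", "keyValue": "acme.com"}]},
    {"recordId": "b", "dedupeKeys": [{"keyFamily": "email_domain", "keyValue": "acme.com"}]},
    {"recordId": "c", "dedupeKeys": [{"keyFamily": "email_domain", "keyValue": "other.example"}]},
]
print(candidate_groups(rows, {"email_domain"}))
# prints {('email_domain', 'acme.com'): ['a', 'b']}
```

The email_domain key in the example is marked with keyStrength "context", so a shared domain suggests shared organization context rather than a duplicate contact. Stronger key families are a better basis for an exact-match queue.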

Cross-field signal rows: Signals such as URL/email domain mismatch or same-address context are deterministic prompts from supplied values. They should guide review, not be used as verified identity, ownership, or fraud findings.
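
To see which signals contributed to review routing, a consumer can filter on the statusImpact field shown in the dataset item example. This is a minimal sketch; review_promoting_signals is a hypothetical helper, and "promotes_review_needed" is the only statusImpact value documented in this README.

```python
def review_promoting_signals(rows: list[dict]) -> list[tuple]:
    """List (recordId, signalCode) pairs for signals that promote review_needed."""
    found = []
    for row in rows:
        for signal in row.get("crossFieldSignals", []):
            if signal.get("statusImpact") == "promotes_review_needed":
                found.append((row.get("recordId"), signal.get("signalCode")))
    return found


rows = [
    {
        "recordId": "acme-hq",
        "crossFieldSignals": [
            {"signalCode": "url_email_domain_mismatch", "statusImpact": "promotes_review_needed"}
        ],
    },
    {"recordId": "partial-invalid", "crossFieldSignals": []},
]
print(review_promoting_signals(rows))
# prints [('acme-hq', 'url_email_domain_mismatch')]
```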

Warning rows: Warnings keep uncertain context visible, such as an address shape that could not be confidently parsed into common components. A warning can coexist with usable output.

Example Use Cases

CRM intake review: Normalize supplied website, email, and address values while preserving invalid input and row-level review pressure for human triage.

Lead-list cleanup before matching: Create stable URL, mailbox, domain, and address comparison keys before joining records against another system.

Dedupe preparation without merge authority: Emit deterministic keys and same-run exact candidate labels while keeping every original row and avoiding automatic survivor selection.

Limitations

  • This Actor only processes supplied records. It does not scrape websites, crawl pages, find missing contacts, or fetch external enrichment from providers.
  • URL cleanup does not prove website reachability, safety, ownership, live freshness, redirect equivalence, or page content.
  • Email cleanup does not verify deliverability, inbox existence, inbox ownership, or sender compliance.
  • Address cleanup is U.S.-leaning heuristic normalization and comparison-key preparation. It does not certify postal deliverability, geocode addresses, add demographics, or prove that an address belongs to a contact.
  • Dedupe keys and same-run candidate labels are preparation signals. They do not cluster records, rank matches, choose survivors, remove duplicates, merge CRM records, or mutate downstream systems.
  • recordStatus is deterministic routing evidence, not a confidence score, CRM truth verdict, legal identity claim, pricing signal, or support guarantee.
  • High review_needed counts can reflect messy supplied input or strict review settings. They do not mean the run failed or that the Actor verified those rows as bad.
  • No-actionable rows preserve cardinality. They are useful for audit trails, but they still make pay-per-usage cost harder to estimate from record count alone.
  • Phone, company-name cleanup, CSV/CRM import, arbitrary metadata, fuzzy scoring, live verification, enrichment, and automatic merge controls are outside the first-release contract.

Disclaimer

This Actor performs deterministic cleanup and match-preparation only. Use its output as structured evidence for review, matching, or downstream workflow decisions, not as proof that a contact is current, reachable, deliverable, enriched, owned, merged, or CRM-true.

Permissions

This Actor is designed to run with limited permissions. It writes only to its default dataset and default key-value store. The current runtime does not require access to other Apify storages, account resources, sibling Actors, proxy groups, or third-party network resources.

Pricing

Recommended pricing model: Pay per usage. Under this model, users pay Apify platform resource costs for the run, and there is no custom developer charge from this Actor.

Pay-per-usage is less predictable before a run than a fixed per-record quote. Start with a limited-scope trial, review the run cost and OUTPUT counts, then scale only if the review density and platform usage match your workflow.

This README does not configure Store pricing, claim a fixed per-record price, or promise custom Pay Per Event charging. Any event names that appear in internal design material are not buyer billing terms unless a live pricing surface says so.

Release History

See ./CHANGELOG.md for version-by-version release notes and migration guidance.