Business Data Enricher — Clean, Match & Verify Listings

Pricing

Pay per usage

Business data enrichment against Overture Maps POI data. Cleans and deduplicates by name + location, assigns stable GERS global IDs, grades data quality, flags leads (no website, unbranded). Resale-safe records. Territory mode pulls in bulk and tracks openings, closures and rebrands over time.

Developer

Ryan Clinton

Maintained by Community

Business Data Enricher — messy lists in, clean businesses out

Clean, verify and enrich business lists at scale. Upload a messy CSV of businesses, suppliers, store locations or leads — this business data enrichment actor removes duplicates, verifies each record against global place data, fills in missing details (category, website, socials), and assigns a stable ID you can reuse in every future run. You get back clean records you own and can legally resell.

This is not a Google Maps scraper. You bring a dirty list (a CRM export, store locations, a supplier sheet, a local business data export you already paid for) and get back canonical, deduplicated, enriched businesses — verified against global place data, with a full audit trail behind every match.

What this is NOT. Not a Google Maps replacement — Overture Maps, the open place dataset behind this actor, is monthly-refreshed and carries no live reviews, ratings, opening hours, or popular times. For "today's hours and current star rating," Google Maps wins. This actor wins on bulk, legal resale, stable global IDs, and analytics — the four things a Maps scraper structurally cannot give you.

Who is this for?

  • Lead-generation agencies — find businesses with no website, no socials, or unbranded independents: ready-to-pitch lists straight out of the run.
  • Local SEO agencies — map a local market, benchmark competitor density, deduplicate a client's locations.
  • Business directories & marketplaces — dedupe listings and assign stable IDs so the same place never appears twice, and safely merge future data sources onto the same ID.
  • Data teams — clean and enrich a large business/place dataset against ground truth, with a stable join key for your warehouse.
  • Retail & franchise teams — territory coverage, brand penetration, and competitor density for a catchment area.
  • Market researchers — canonical place data you can legally build a product on.

What it does, in one example

Input — three messy rows:

Dominos Pizza, 54.581, -5.940
Domino's Pizza BT9, 54.581, -5.940
Maccies, 54.597, -5.930

Output — each row resolved to a clean, enriched record:

{
  "name": "Domino's Pizza",
  "category": "pizza_restaurant",
  "brand": "Domino's",
  "website": "dominos.co.uk",
  "socials": ["instagram.com/dominos"],
  "phone": "+44 28 …",
  "gersId": "08f2…",
  "confidence": 0.92,
  "leadSignals": []
}

…the two "Domino's" rows collapse onto one record, and "Maccies" resolves to McDonald's via the brand short-circuit.

Result: 3 rows → 2 verified businesses, 1 duplicate removed, 100% matched — each enriched with category, website, socials and a stable ID you can pass back next run.

How accurate is it? Every match carries a confidence score and the exact reasons behind it. Uncertain matches go to a review queue instead of being guessed, and rows that don't match are returned explicitly — nothing is ever silently dropped.

Enrichment fields like website, socials and phone are filled where the source carries them — coverage varies by place and region (densest in US/EU and urban areas). The actor never invents a value; a field the source doesn't have comes back empty. Match rates likewise depend on your input quality and region; the run summary reports yours.

Sample output — matched, ambiguous and unmatched rows side by side

What you get — cleans your list, stable IDs, resale-safe, review queue

Business Data Enricher vs a Google Maps scraper

Most place-data actors extract listings. This one resolves them — and does the things a scraper structurally can't on a list you already have:

Task | Google Maps scraper | Business Data Enricher
Get a list of places | Yes | Yes
Deduplicate your list | No | Yes
Stable IDs that survive re-runs | No | Yes
Legally resell the output | No | Yes
Territory / competitor analytics | No | Yes
Review queue for uncertain matches | No | Yes
Re-match your existing records | No | Yes
Live reviews, ratings, opening hours | Yes | No

The last row is the honest trade: for today's star rating and live hours, a Google Maps scraper wins. For cleaning, verifying and owning a list you already have, this does what a scraper can't.

Why choose this actor

  • Bulk — query over 100M+ places without result caps, pagination limits or blocking.
  • Legal resale — built on Overture Places under CDLA Permissive 2.0. Every record carries a resaleSafe flag and an attribution string. (We query the places theme only — never the share-alike ODbL themes.)
  • Stable global IDs — every matched row is stamped with a GERS ID, a persistent global fingerprint, so your data becomes joinable to any other dataset using the same IDs, forever. Re-runs are idempotent: pass the stored gersId back and the actor does a direct lookup instead of re-resolving.
  • Analytics — density, brand concentration, franchise footprint (per-brand saturation), market structure, whitespace and nearest-competitor analysis built right into the actor — impossible for a per-place scraper.
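The idempotent re-run described above can be sketched on the caller's side: keep the gersId from a previous run and attach it to the matching input row before the next one. This is a minimal illustration, not the actor's code. The helper name attach_known_ids and the placeholder ID values are made up for the example; the inputId, gersId and recordType fields are the ones this page names.

```python
from typing import Any


def attach_known_ids(
    rows: list[dict[str, Any]], previous: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    """Return copies of the input rows, adding a stored gersId where one exists.

    `previous` is assumed to be the dataset items of an earlier run.
    """
    known = {
        rec["inputId"]: rec["gersId"]
        for rec in previous
        if rec.get("recordType") == "resolution" and rec.get("gersId")
    }
    return [
        {**row, "gersId": known[row["id"]]} if row.get("id") in known else dict(row)
        for row in rows
    ]


rows = [
    {"id": "row-1", "name": "Dominos Pizza", "lat": 54.581, "lng": -5.9398},
    {"id": "row-3", "name": "Maccies", "lat": 54.5972, "lng": -5.9301},
]
# Placeholder results: the gersId value below is invented for illustration.
previous = [
    {"recordType": "resolution", "inputId": "row-1", "gersId": "gers-example-1"},
    {"recordType": "resolution", "inputId": "row-3", "gersId": None},
]

next_places = attach_known_ids(rows, previous)
```

Rows that never matched keep their original shape, so they go through normal resolution again on the next run.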

What you get — a messy list turned into clean, deduplicated, verified, enriched, resale-safe records

Clean and deduplicate your first business list in 60 seconds

Paste a list, press Start:

{
  "places": [
    { "id": "row-1", "name": "Dominos Pizza", "lat": 54.581, "lng": -5.9398 },
    { "id": "row-2", "name": "Domino's", "lat": 54.5811, "lng": -5.9399 },
    { "id": "row-3", "name": "Maccies", "lat": 54.5972, "lng": -5.9301 }
  ],
  "outputProfile": "enriched"
}

You get back canonical, deduplicated, enriched entities: the two Domino's rows collapse onto one GERS ID (an entity-group record documents the collapse), "Maccies" resolves to McDonald's via the brand short-circuit, and a run-summary record reports the coverage, dedup and review-queue headline. Every row is resale-safe.
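If your list lives in a CSV rather than JSON, a small script can produce the input shown above. This is a sketch under assumptions: the CSV column names (name, lat, lng, address) and the row-N id scheme are choices made for the example, not something the actor requires.

```python
import csv
import io
import json

# Sample data; the column names are an assumption for this example.
SAMPLE_CSV = """name,lat,lng
Dominos Pizza,54.581,-5.9398
Domino's,54.5811,-5.9399
Maccies,54.5972,-5.9301
"""


def csv_to_places(text: str) -> list[dict]:
    """Turn CSV text into the `places` array, one item per data row."""
    places = []
    for n, row in enumerate(csv.DictReader(io.StringIO(text)), start=1):
        place = {"id": f"row-{n}", "name": row["name"].strip()}
        if row.get("lat") and row.get("lng"):
            place["lat"] = float(row["lat"])
            place["lng"] = float(row["lng"])
        address = (row.get("address") or "").strip()
        if address:
            place["address"] = address
        places.append(place)
    return places


run_input = {"places": csv_to_places(SAMPLE_CSV), "outputProfile": "enriched"}
print(json.dumps(run_input, indent=2))
```

Giving every row an explicit id makes the echoed inputId easy to join back to the source file.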

Input modes (auto-detected)

You provide | Mode | What you get
places (BYO list) | resolution | one resolution record per input row + entity-group + review-item + run-summary
territoryQuery (a bbox) | territory | every canonical entity in the area + a territory analytics summary
any row carrying a gersId | idempotent re-match | direct GERS lookup, cascade skipped

Resolution input

Each item in places: { "name": "...", "lat": 54.58, "lng": -5.93 } or { "name": "...", "address": "..." }. Optional fields: id (echoed as inputId), category (improves the category gate), and gersId (idempotent re-match). Rows with no coordinates fall back to address-text matching at a lower, flagged confidence. That fallback is a full scan per row, so it is capped: supply lat/lng to resolve at scale. Rows past the cap come back unmatched with an explanation and are never silently dropped.

Territory input

territoryQuery: a bounding box "minLng,minLat,maxLng,maxLat", e.g. "-6.05,54.55,-5.80,54.65". Append a category filter after a pipe: "-6.05,54.55,-5.80,54.65 | coffee". Set outputProfile to territory, or just leave places empty.
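Because the bounding box is a plain string, it is easy to get the order wrong. The sketch below parses and sanity-checks a territoryQuery before you submit it. The Territory type and parse_territory_query function are hypothetical client-side helpers, not part of the actor.

```python
from typing import NamedTuple, Optional


class Territory(NamedTuple):
    min_lng: float
    min_lat: float
    max_lng: float
    max_lat: float
    category: Optional[str]


def parse_territory_query(query: str) -> Territory:
    """Parse "minLng,minLat,maxLng,maxLat" with an optional "| category" suffix."""
    bbox_part, _, category_part = query.partition("|")
    coords = [float(part) for part in bbox_part.split(",")]
    if len(coords) != 4:
        raise ValueError("expected minLng,minLat,maxLng,maxLat")
    min_lng, min_lat, max_lng, max_lat = coords
    if not (min_lng < max_lng and min_lat < max_lat):
        raise ValueError("bbox minimums must be below the maximums")
    if not (-180 <= min_lng and max_lng <= 180 and -90 <= min_lat and max_lat <= 90):
        raise ValueError("coordinates out of range")
    return Territory(min_lng, min_lat, max_lng, max_lat, category_part.strip() or None)


belfast_coffee = parse_territory_query("-6.05,54.55,-5.80,54.65 | coffee")
belfast_all = parse_territory_query("-6.05,54.55,-5.80,54.65")
```

The checks catch a reversed box or a missing number; they cannot tell whether latitude and longitude were swapped when both values happen to fall in valid ranges.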

Track what's changed in an area (event mode)

Run a territory with emitEvents: true and the actor diffs the current Overture release against an earlier one (over your bbox) and emits a commercial change feed — the thing a one-shot scrape can never give you:

  • Typed events per place: NEW_LOCATION, LOCATION_CLOSED, LOCATION_MOVED, REBRAND, CATEGORY_SHIFT, each with a severity score.
  • Brand expansion / contraction: which chains opened or closed net locations in the window (e.g. "Costa +4, Subway −2").
  • Market warnings: categories with a high closure rate, stated with the denominator ("8/12 closed"), never an investment verdict.
  • Successor candidates (opt-in includeSuccessors): a place closed and a new one opened at the same coordinates — flagged as a candidate with a confidence, never asserted.
  • A decision-first territory-digest record: openings, closures, expanding/contracting chains, warnings, lead count.
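To make the brand and closure-rate summaries above concrete, here is a sketch of how such figures could be derived from a list of typed events. The event record shape (eventType, brand, category keys) is an assumption for illustration, and the sample events are invented. The actor already emits these summaries itself, so this only shows the arithmetic.

```python
from collections import Counter


def brand_net_change(events: list[dict]) -> dict[str, int]:
    """Net openings minus closures per brand, dropping brands with no net change."""
    net: Counter = Counter()
    for ev in events:
        brand = ev.get("brand")
        if not brand:
            continue  # unbranded places do not count toward chain movement
        if ev["eventType"] == "NEW_LOCATION":
            net[brand] += 1
        elif ev["eventType"] == "LOCATION_CLOSED":
            net[brand] -= 1
    return {brand: n for brand, n in net.items() if n != 0}


def format_net(net: dict[str, int]) -> str:
    """Render net changes as a signed list, largest gain first."""
    ordered = sorted(net.items(), key=lambda kv: -kv[1])
    return ", ".join(f"{brand} {n:+d}" for brand, n in ordered)


def closure_rates(events: list[dict], totals: dict[str, int]) -> dict[str, str]:
    """Closures per category, always stated with the denominator."""
    closed = Counter(
        ev["category"] for ev in events if ev["eventType"] == "LOCATION_CLOSED"
    )
    return {
        cat: f"{count}/{totals[cat]} closed"
        for cat, count in closed.items()
        if totals.get(cat)
    }


# Invented sample events, using the event type names listed above.
events = [
    {"eventType": "NEW_LOCATION", "brand": "Costa", "category": "coffee"},
    {"eventType": "NEW_LOCATION", "brand": "Costa", "category": "coffee"},
    {"eventType": "LOCATION_CLOSED", "brand": "Subway", "category": "sandwich"},
    {"eventType": "LOCATION_CLOSED", "brand": None, "category": "sandwich"},
    {"eventType": "REBRAND", "brand": "Costa", "category": "coffee"},
]
net = brand_net_change(events)
rates = closure_rates(events, {"coffee": 10, "sandwich": 4})
```

Keeping the denominator in the string is what lets a reader judge whether a closure count is a signal or noise.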

Leave compareRelease blank to auto-diff against the prior public release (a ~1-month window, available immediately). Set a watchlistName to snapshot each run into your own private history and track change across a longer window than the two public releases allow — the first run captures a baseline, changes are reported from the next run. The watchlist also builds a per-entity category timeline (categoryChangeHistory) and remembers your analyst review decisions (reviewDecisions input), echoing them back on the matching changes so a disposition survives re-runs.

Built on open data you can resell, this turns "scrape a place list" into "monitor a market" on the same engine.

Matching you can trust

Every place is matched on location, name and category together — never on name alone — and each match comes back with a confidence score and the reasons behind it, broken into its parts so you can see why it matched (or didn't).

  • Close calls go to a review queue instead of being guessed — the actor never silently picks between two plausible candidates.
  • Nothing is silently dropped — rows that don't match are returned explicitly as unmatched with the best near-miss, so you always see what didn't resolve and why.
  • Fully deterministic — the same input always produces the same result. No black box, no model drift, no surprises.

You can tune precision vs recall with the matchProfile preset (strict / balanced / lenient) without touching anything else.
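The general idea of a deterministic, threshold-based match can be sketched in a few lines. This is not the actor's cascade: it covers only distance and name similarity (the category gate is omitted), it uses Python's difflib ratio as a stand-in similarity measure, and the mapping of the strong and weak thresholds to matched, review and unmatched is an assumption. The threshold values are the documented defaults for matchRadiusMeters and nameSimStrong / nameSimWeak.

```python
import math
import re
from difflib import SequenceMatcher

MATCH_RADIUS_M = 150    # default matchRadiusMeters
NAME_SIM_STRONG = 0.89  # default nameSimStrong
NAME_SIM_WEAK = 0.83    # default nameSimWeak


def haversine_m(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance in metres between two WGS84 points."""
    r = 6_371_000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def normalize(name: str) -> str:
    """Lowercase and strip punctuation so "Domino's" and "Dominos" compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()


def name_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def classify(row: dict, candidate: dict) -> tuple[str, list[str]]:
    """Return a status plus the reasons behind it (assumed status names)."""
    dist = haversine_m(row["lat"], row["lng"], candidate["lat"], candidate["lng"])
    sim = name_similarity(row["name"], candidate["name"])
    reasons = [f"distance={dist:.0f}m (limit {MATCH_RADIUS_M}m)", f"nameSim={sim:.2f}"]
    if dist > MATCH_RADIUS_M or sim < NAME_SIM_WEAK:
        return "unmatched", reasons
    if sim >= NAME_SIM_STRONG:
        return "matched", reasons
    return "review", reasons  # close call: leave it to a human


candidate = {"name": "Domino's Pizza", "lat": 54.5811, "lng": -5.9399}
near = classify({"name": "Dominos Pizza", "lat": 54.581, "lng": -5.9398}, candidate)
far = classify({"name": "Maccies", "lat": 54.5972, "lng": -5.9301}, candidate)
```

Because every branch is a fixed comparison, the same input always yields the same status and the same reasons, which is the property the section above describes.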

Output profiles (outputProfile)

  • enriched (default) — the full record: match, canonical, quality, lifecycle, leadSignals, digitalPresence, resaleSafe, agentContract.
  • names — the lean display-name surface: { inputId, gersId, name, normalizedCategory, confidence (with decomposed components), ambiguity, runnerUpGap, status }. The right profile if all you persist is "canonical name + a stable key + a score to threshold on."
  • gers_only: { inputId, gersId, confidence, status }. The minimal join key for a warehouse that already holds the names.
  • audit — adds every candidate considered and why it was rejected. For tuning thresholds and proving matches.
  • territory — bulk-pull canonical entities + the analytics summary.

Record types

Discriminate on recordType: resolution | entity-group | review-item | run-summary | canonical-entity | territory-summary. The dataset ships decision-first views — Matched, Review queue, Unmatched, Run summary — and a KV SUMMARY record mirroring the coverage/dedup/review headline.
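When you read the dataset yourself rather than through the built-in views, splitting items by recordType is the first step. A minimal sketch, using only the record type names listed above; the sample items are invented and carry no other real fields.

```python
from collections import defaultdict

RECORD_TYPES = {
    "resolution",
    "entity-group",
    "review-item",
    "run-summary",
    "canonical-entity",
    "territory-summary",
}


def partition_by_type(items: list[dict]) -> dict[str, list[dict]]:
    """Group dataset items by recordType, failing loudly on an unknown type."""
    buckets: dict[str, list[dict]] = defaultdict(list)
    for item in items:
        record_type = item.get("recordType")
        if record_type not in RECORD_TYPES:
            raise ValueError(f"unexpected recordType: {record_type!r}")
        buckets[record_type].append(item)
    return dict(buckets)


# Invented sample items for illustration.
items = [
    {"recordType": "resolution", "inputId": "row-1"},
    {"recordType": "resolution", "inputId": "row-2"},
    {"recordType": "entity-group"},
    {"recordType": "run-summary"},
]
by_type = partition_by_type(items)
```

Raising on an unknown type is a deliberate choice here: it surfaces a new record type immediately instead of letting it slip past downstream code.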

Data quality grades, match confidence and lead signals

  • Reason chain: match.matchReason[] reads back the exact thresholds the cascade branched on.
  • Per-attribute corroboration: match.matchEvidence is null for an attribute unless your input row carries that field, so nothing is fabricated on data you never gave.
  • Data-quality axis: quality (grade A–F, completeness, issues) is distinct from match.confidence. A confidently matched place can still have a defect-laden record; the two questions get two answers.
  • Lifecycle band: lifecycle.status is a descriptive band over evidence (operating status, low confidence), never a fabricated "closed" verdict.
  • Lead signals: leadSignals[] (NO_WEBSITE, NO_INSTAGRAM, UNBRANDED, INDEPENDENT, …) are derived from already-fetched data. "Dentists in this metro with no website" is a ready-to-sell list at no extra cost.
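The fields above combine naturally into filters. The sketch below pulls a lead list and, separately, finds confidently matched places whose record quality is poor. The nesting (match.confidence, quality.grade, leadSignals) follows the field paths named in the list; the 0.8 confidence cut-off, the grade set and the sample records are assumptions for illustration.

```python
def leads_with(records: list[dict], signal: str, min_confidence: float = 0.8) -> list[dict]:
    """Resolution records that carry `signal` and clear the confidence cut-off."""
    return [
        rec
        for rec in records
        if rec.get("recordType") == "resolution"
        and signal in rec.get("leadSignals", [])
        and rec.get("match", {}).get("confidence", 0.0) >= min_confidence
    ]


def confident_but_poor(records: list[dict], min_confidence: float = 0.8) -> list[dict]:
    """Well-matched places whose data-quality grade is still D or F."""
    return [
        rec
        for rec in records
        if rec.get("match", {}).get("confidence", 0.0) >= min_confidence
        and rec.get("quality", {}).get("grade") in {"D", "F"}
    ]


# Invented sample records for illustration.
records = [
    {"recordType": "resolution", "inputId": "a", "match": {"confidence": 0.93},
     "quality": {"grade": "D"}, "leadSignals": ["NO_WEBSITE", "INDEPENDENT"]},
    {"recordType": "resolution", "inputId": "b", "match": {"confidence": 0.55},
     "quality": {"grade": "A"}, "leadSignals": ["NO_WEBSITE"]},
    {"recordType": "resolution", "inputId": "c", "match": {"confidence": 0.95},
     "quality": {"grade": "A"}, "leadSignals": []},
    {"recordType": "run-summary"},
]

no_website = [rec["inputId"] for rec in leads_with(records, "NO_WEBSITE")]
needs_cleanup = [rec["inputId"] for rec in confident_but_poor(records)]
```

Record "a" appears in both lists, which illustrates why match confidence and data quality are reported as separate axes.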

Key inputs

Input | Default | Notes
matchProfile | balanced | strict (precision-first) / balanced / lenient (recall-first). A preset threshold pack, not a rule engine.
matchRadiusMeters | 150 | tighten (e.g. 50) for premise-accurate matches.
nameSimStrong / nameSimWeak | 0.89 / 0.83 | power-user overrides; matchProfile sets these.
overtureRelease | 2026-05-20.0 | the us-west-2 bucket retains only the latest ~2 releases.
minConfidence | 0.5 | drop ground-truth candidates below this Overture confidence.
includeLifecycle / emitLeadSignals | true | cheap, on by default.
includeMarketContext / includeGlobalBrandStats / includeGraphEdges | false | opt-in extra reads.
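A small builder can catch a mistyped option before a run starts. This is a client-side sketch: build_run_input is a hypothetical helper, the validation rules are assumptions, and only input keys named in the table above are used.

```python
import json

ALLOWED_PROFILES = ("strict", "balanced", "lenient")


def build_run_input(
    places: list[dict],
    match_profile: str = "balanced",
    match_radius_meters: int = 150,
    min_confidence: float = 0.5,
) -> dict:
    """Assemble an actor input, rejecting values the table above does not allow."""
    if match_profile not in ALLOWED_PROFILES:
        raise ValueError(f"matchProfile must be one of {ALLOWED_PROFILES}")
    if match_radius_meters <= 0:
        raise ValueError("matchRadiusMeters must be positive")
    if not 0.0 <= min_confidence <= 1.0:
        raise ValueError("minConfidence must be between 0 and 1")
    return {
        "places": places,
        "outputProfile": "enriched",
        "matchProfile": match_profile,
        "matchRadiusMeters": match_radius_meters,
        "minConfidence": min_confidence,
    }


# A precision-first run with a tight radius for premise-accurate matches.
strict_input = build_run_input(
    [{"id": "row-1", "name": "Dominos Pizza", "lat": 54.581, "lng": -5.9398}],
    match_profile="strict",
    match_radius_meters=50,
)
print(json.dumps(strict_input, indent=2))
```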

What this is not (stated up front)

  • Not a Google Maps replacement — no live reviews, ratings, or hours.
  • Coverage is uneven by region — best in US / EU / urban areas; thin-coverage regions match lower. The territory summary surfaces a coverageConfidence signal so you know where to trust the data.
  • Not a legal-entity → trading-name resolver — the cascade matches names as written. A divergent legal name vs trading name ("Bushmills Hotels Ltd" → "The Bushmills Inn") lands in the review queue, exactly as a plain fuzzy+spatial resolver would. The brand short-circuit rescues chains; divergent-name independents still need a human.
  • Not a geocoder — no-coord rows get flagged address-text matching (capped — each is a full scan), not precise geocoding.

Cost

Free to run during launch. No proxies and no per-place fees — reads of the public Overture data are anonymous and sponsored by AWS, so you pay only Apify platform compute.

Attribution

Built on Overture Maps Places data, CDLA Permissive 2.0. Every output record carries the attribution string and a resaleSafe flag.

Deliver results to Slack or Notion (MCP connectors)

Optionally pipe each run's decisions — the resolution/territory digest plus the ranked review worklist — straight into your own Slack channel or Notion workspace. You never hand this actor a token: you connect Slack/Notion once in Apify Console → Settings → MCP connectors (Notion is one click), and Apify proxies the credentials. The actor only ever receives a connector id.

  • Notion — set notionConnector. Get a one-page resolution report (digest + top review items), or a page per review item with notionArchiveProfile: per-review.
  • Slack — set slackConnector (and optionally slackChannel). The digest is posted, plus review items at or above slackMinReviewPriority (default 50) so the channel stays signal-only. (Slack connectors need you to register your own Slack OAuth app.)
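To preview which review items would clear the Slack threshold, you can apply the same cut-off to the review records yourself. A minimal sketch: the priority field name and the sample items are assumptions, and the default of 50 mirrors slackMinReviewPriority.

```python
def slack_worthy(review_items: list[dict], min_priority: int = 50) -> list[dict]:
    """Review items at or above the threshold, highest priority first."""
    kept = [item for item in review_items if item.get("priority", 0) >= min_priority]
    return sorted(kept, key=lambda item: item["priority"], reverse=True)


# Invented sample review items; "priority" is a hypothetical field name.
review_items = [
    {"recordType": "review-item", "inputId": "row-7", "priority": 35},
    {"recordType": "review-item", "inputId": "row-2", "priority": 50},
    {"recordType": "review-item", "inputId": "row-9", "priority": 88},
]
posted = [item["inputId"] for item in slack_worthy(review_items)]
```

Raising the threshold shortens the list without reordering it, so the channel stays focused on the most pressing items.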

Only the decisions are delivered — never the bulk resolution rows. Leave the connector fields empty and the run behaves exactly as before. The delivery outcome is reported back on the run-summary's deliveries block.