Product Matcher avatar

Product Matcher

Pricing

from $8.00 / 1,000 successful matches

Go to Apify Store
Product Matcher

Product Matcher

Cross-shop product matcher: returns a confident match or nothing, never a wrong pair. Give it a sample product URL and target e-shop domains; it finds the same product on each, verified by EAN check-digit, spec agreement, AI variant judge and photo match, with a 0-1 confidence score.

Pricing

from $8.00 / 1,000 successful matches

Rating

0.0

(0)

Developer

Jan Hilgard

Jan Hilgard

Maintained by Community

Actor stats

0

Bookmarked

2

Total users

1

Monthly active users

2 days ago

Last modified

Share

Cross-shop product matching that would rather return nothing than the wrong pair.

Give it a sample product and a list of shops. It finds the same product on each shop — discovers the candidate itself, verifies it with several independent signals, and returns a confidence score with a reason for every decision. When it is not sure, it says so and returns no URL. No silent errors. No "closest guess".

Most matchers hand you the nearest result and a number. This one fails loud on an uncertain match. I built it that way because a wrong pair is worse than no pair — a bad match quietly poisons a price feed or a catalogue mapping for days before anyone notices, and by then you have already made decisions on it.

Keywords: product matching, cross-shop product matching, confidence-scored product matching, variant-aware matching, EAN / GTIN matching, product deduplication, competitor catalog matching.


What a real run looks like

Sample: Apple iPhone 17 512 GB, Lavender. Target shop: one large electronics retailer. The shop sells many things that look almost right. Here is what the matcher does with the candidates it found:

✗ iPhone 17 256 GB Lavender rejected — different capacity (256 vs 512 GB)
✗ iPhone 17 Pro 512 GB Lavender rejected — different model (Pro ≠ base 17)
✗ iPhone 17 512 GB (refurbished) rejected — opened-box / refurbished, not a new-retail listing
✗ iPhone 17 512 GB Orange rejected — color + photo mismatch (triple camera)
✗ iPhone 17 512 GB (no image) rejected — placeholder image, photo check inconclusive
✓ iPhone 17 512 GB Lavender MATCH confidence 0.97
consensus: specs + AI variant judge + photo (3 independent signals)
reason: same model, capacity and color; product photo confirms the variant

Five rejections with reasons before one confident match — that's the product. Anything can return the sixth line. The value is the five lines above it, each with a reason you can read.


Why it is different

  • No silent errors. An uncertain candidate is returned as matched_url: null with a reason, never as a maybe-match. The bar to call something a match is deliberately high (precision-first). You can trust a true without re-checking it by hand.
  • Discovery + match, not just match. It does not need you to already have both catalogs. From the sample alone it finds the candidate on each target shop (web search → sitemap → the shop's own search box), then verifies it. A plain matcher that only compares two finished datasets cannot do the finding part.
  • Variant-aware. 512 GB vs 256 GB, base vs Pro, lavender vs orange, new vs refurbished — these are different products, and they are exactly where naive name/price matching breaks. The whole design exists to get the variant right.
  • Every decision is explainable. Confidence score, which signals fired, which layer decided, and a short human reason — per shop, in the output row.

How it works

Per shop, from input to decision:

  1. Find candidates on the shop. The actor goes and finds the products that could be the match on the target shop itself — discovery, not pairing a dataset you already have.
  2. Cheap pre-filter. Obviously-different products are dropped with cheap checks before any expensive verification runs, so effort — and error — is not spent on things that were never going to match.
  3. Verify the survivors with independent signals. Each remaining candidate is checked by several independent signals (the signals below), not one test.
  4. Consensus. The confidence score is how many of those independent signals agree, not one model's opinion (see the scoring below).
  5. Decide or reject. Above the threshold it returns the match with a readable reason; below it, it returns null — never a guess.

Why this is accurate: no decision rests on a single signal, the deterministic EAN/GTIN check takes precedence wherever a code exists, and the visual check catches a wrong variant that text alone would miss. That is why it would rather return nothing than the wrong pair.

For each target domain it discovers candidate URLs, then verifies the best ones with independent signals and takes a consensus — the score is not one model's opinion, it is how many independent checks agree:

SignalWhat it checks
eanEAN / GTIN match, including check-digit validity
specsstructured spec agreement (brand, model, capacity, …)
ai_judgean AI variant judge reads both listings: same product or not, and why
photoindependent product-photo comparison of the matched variant

confidence_score is 01 and reflects the consensus:

EAN match ........................ 0.99
3 independent signals ............ 0.97
2 independent signals ............ 0.92
1 signal ......................... 0.85
below threshold → url: null ...... "uncertain", returned but not a match

A candidate that no positive layer confirms is returned with matched_url: null. That is a feature, not a gap.

Mechanically: the actor is a thin client. The heavy work (headed browser, search, LLM judging, photo comparison) runs on the data.hilgard.cz service. The actor starts a resumable job (POST /api/match/jobs{ jobId }), streams its progress over SSE (GET /api/match/jobs/:id/stream?cursor=N), and writes results to the dataset. A match can run minutes to hours, so the job runs server-side independently of the connection: if the stream drops, the actor reconnects to the same job from the last event it saw (cursor) — no lost work, no duplicate logs. If the service has no job API, it falls back to the legacy synchronous POST /api/match stream.


Input

{
"productPairs": [
{
"url": "https://www.alza.cz/apple-iphone-17-512gb",
"domains": ["datart.cz", "mall.cz"] // 1–20 target domains
}
],
"includeDebug": false, // optional, keep the _debug block
"requestTimeoutSecs": 3600, // optional, total per-pair budget across reconnects
"maxReconnects": 20, // optional, give up after N stalled reconnects
"reconnectBackoffSecs": 3 // optional, delay between reconnects
}

Each item in productPairs is one matching job. url must be an http(s) URL and domains must contain 1–20 non-empty domains. Pairs are processed sequentially. requestTimeoutSecs is the total budget for one pair across all reconnects; maxReconnects only bounds consecutive reconnects that make no progress, so a healthy long-running match is not cut short.

Output

One dataset row per target domain, with the sample product's profile inlined:

{
"source_url": "https://www.alza.cz/apple-iphone-17-512gb",
"source_domain": "alza.cz",
"product_name": "Apple iPhone 17 512 GB",
"brand": "Apple",
"model": "iPhone 17 512GB",
"ean": "1949483485734",
"source_price": 27990,
"currency": "CZK",
"image_url": "https://...",
"matched_domain": "datart.cz",
"matched_url": "https://datart.cz/iphone-17-512gb", // null when no confident match
"confidence_score": 0.99,
"method": "sitemap", // sitemap | brave | google | null
"reason": "EAN match",
"decided_by": "ean", // ean | ai_variant_judge | photo | null
"signals": ["ean"], // ean | specs | ai_judge | photo
"matched_fields": [{ "field": "ean", "agree": true }],
"price_candidate": 26490, // observed price on the matched shop
"scraped_at": "2026-06-18T10:00:00.000Z",
"success": true // false when matched_url is null
}

When includeDebug is true, each row also carries the service's _debug block. If a pair fails entirely, one placeholder row per requested domain is written with success: false and an error message.


Use cases

  • Pricing intelligence. Map your product to the same product on competitor shops, then compare prices on pairs you can trust. price_candidate is the observed price on the matched shop; the comparison is yours to make.
  • Competitor monitoring. Track whether — and where — a given product appears across a set of shops over time.
  • Catalog deduplication / matching. Link the same product across two catalogs, variant-correct, with a confidence you can threshold on.

It is built for correctness on the hard cases (variants, refurbished, near-misses), not for sheer volume. Pick it when getting the pair right matters more than getting a pair at any cost.


Pricing — two tiers

This actor uses Pay Per Event, with two events:

  • shop-checked — a small fee for each shop we check.
  • successful-match — the main fee, charged additionally when a product is confirmed on a shop (a success: true row with a matched_url, confidence above the threshold).

The shop-checked fee pays for discovery, not for a miss. Most matchers take two finished catalogs you already have and pair them. This one does not assume you have the other side — it goes and finds the product on each shop itself: web search, anti-bot fetch, vision, multi-signal verification (EAN/GTIN, specs, AI variant judge, photo). That search is the expensive part, and it runs in full whether or not the product turns out to be on the shop. So you pay a small fee per shop checked for that work, and the main fee only when a shop yields a confirmed match.

To be clear upfront: a shop where we find nothing still costs the shop-checked fee, because the discovery work was done there too. A confirmed match costs shop-checked + successful-match.

No flat fee. No per-SLA pricing. You are billed per event, for work actually done. The current price of each event is in the Apify Console pricing tab.


When NOT to use this

I would rather you not waste money, so:

  • You already have both full catalogs (your products + theirs), cleanly keyed. Then you do not need discovery — a plain dataset-to-dataset matcher is cheaper. This actor earns its price when you don't have the other side and need it found and verified.
  • You only have a handful of one-off products. Just paste them into the data.hilgard.cz UI and read the result. No actor run needed.
  • You want the cheapest possible "good enough" matching at huge volume. This one optimizes for being right on hard variants, not for being the cheapest. If a few wrong pairs in a thousand are acceptable to you, a lighter tool will cost less.

If the correctness story above is not what you need, one of these is the better call.


About

Built on the data.hilgard.cz matching engine — the same self-healing stack that does anti-bot fetching, model-by-meaning extraction, and independent verification, applied to cross-shop matching.

By Jan Hilgard (founder of Hosting90, acquired; core contributor to vllm-mlx). The precision-first stance is deliberate: I would rather ship a matcher you can trust on the hard cases than one that always answers.

Development

npm install
npm run build # tsc → dist/
npm run start:dev # tsx src/main.ts, reads .actor/INPUT.json