Product Detail & Price Parser
Pricing
$2.00 / 1,000 URLs parsed
Reliable product detail & price extraction from a URL, even on shops behind heavy anti-bot protection. Returns name, price, currency, stock, EAN and image with per-field confidence and a fail-loud reliable flag — never silently wrong. Built on the data.hilgard.cz self-healing engine.
Developer
Jan Hilgard
Maintained by Community
Reliable product & price extraction from a URL, built to fail loud rather than be silently wrong.
Give it a product URL. It returns a clean, structured record — name, price, currency,
stock, EAN, image — with a confidence on every field and a single reliable flag.
When a field does not clear the bar you get reliable: false and a null with a stated
reason, not a guessed value. It never hands you a plausible-looking price that is quietly
wrong, and never an empty cell dressed up as a successful parse. That guarantee is the
product; the engine underneath is just how it is kept.
Most scrapers are a CSS selector and a hope. The selector breaks the day the shop ships a redesign, and you find out weeks later from a wrong price in a feed. This one is built the opposite way: it cross-checks what it extracted, scores its own confidence, and refuses to call a shaky extraction reliable. A wrong price is worse than a missing one — a missing one you notice, a wrong one you act on.
Keywords: no silent errors, per-field confidence, validated extraction, confidence-scored price, EAN / GTIN check-digit, product detail extraction, price scraping, structured product data, e-commerce price parser, self-healing scraper, Cloudflare scraper, DataDome bypass, anti-bot scraping, scrape behind WAF.
Why this beats self-healing / AI scrapers
Self-healing and LLM scrapers adapt to a layout change — that part is table-stakes now,
this one does it too. But adapting is not the same as being right: a scraper can re-find
"a price" after a redesign and still hand you a plausible-but-wrong field — the wrong
variant, a stale crossed-out price, a swapped currency — and say nothing. That silent
wrong value is the expensive one. The difference here is the guarantee, not the healing:
every field carries its own confidence, an EAN/GTIN must pass its check digit, and below
the bar the field comes back reliable: false instead of a guess. An honest fail you can
catch beats a confident mistake you cannot. Self-healing is how the field gets re-found;
the confidence and validation are what decide whether you are allowed to trust it.
What a real run looks like
One URL in, one structured row out — with the engine's own confidence attached.
(a) The layout changed. A selector-based scraper would silently break. This doesn't.
    URL: https://shop.example/iphone-17-512gb   (shop redesigned its price markup last week)
    ✓ product_name  "Apple iPhone 17 512 GB Lavender"
    ✓ price         27990             field confidence: high
    ✓ currency      "CZK"
    ✓ ean           "1949483485739"   field confidence: high (check-digit valid)
    reliable: true    overall confidence: 0.96
    note: HTML price node moved; the Vision pass read the price off the
          rendered page and the two agreed, so the extraction self-healed.
A hard-coded selector would have returned null (or worse, grabbed the crossed-out
old price). The Vision cross-check caught the moved element and confirmed it.
(b) The page is ambiguous. It does NOT guess — it fails loud.
    URL: https://shop.example/landing/iphone-deals   (a promo page, not one product)
    ✗ price         null              field confidence: low
    reliable: false   overall confidence: 0.28
    success: false
    note: multiple prices on the page, none clearly the product's; returned
          unreliable with low confidence rather than a guessed number.
The value is that second row. A cheaper tool would have returned a price here — some price on the page — and you would have trusted it.
Why it is different
- No silent errors. Every field carries a confidence, the whole record carries a reliable flag, and codes are validated: an EAN/GTIN must pass its check digit to be trusted. Below the confidence bar the field is returned null with low confidence and reliable: false, never a confident-looking wrong value.
- Wrong input gets an honest answer too. Point it at a URL that is not a product detail page (a category listing, a promo landing page, a real-estate advert) and it does not hallucinate a product out of whatever it found. It returns success: false with error: "not a product detail page (entity: …)". Even a bad input gets the truth, not a fairy tale. Same no-silent-errors rule, applied to the page itself.
- Self-healing: a mechanism that serves the correctness above, not the headline. The engine cross-checks the HTML reading against a Vision reading of the rendered page, so a layout change that breaks a selector is caught instead of silently returning the wrong node.
- Gets through where plain scrapers hit a wall. The heavy anti-bot protections — DataDome, Cloudflare, Akamai, PerimeterX and similar — return a challenge page instead of data to a simple fetch. This reaches the actual product page through them, so you get the product, not a CAPTCHA. (Anti-bot is an arms race, so this is a capability that handles these protections, not a guarantee against any one named vendor.)
- Predictable mid price. One flat price per URL parsed. Not the cheapest unreliable scraper where a few wrong values per thousand are "fine", and not an enterprise per-seat contract. You pay one predictable price and get validated data or an honest confidence — the self-healing and the validation are why it sits in the middle, not at the bottom.
How it works
Per URL, from input to record:
- Fetch — get through to the real page. Reach the product page through the heavy anti-bot / WAF layers (DataDome, Cloudflare, Akamai, PerimeterX and the like) and render JS-driven content, so the extractor sees the actual product the way a real browser would — not a challenge page, an interstitial, or an empty SPA shell.
- Extract by meaning. Read the product fields by what they mean on the page, not by a fixed selector path that a redesign would break.
- Cross-check & validate. Reconcile independent readings of the same field (HTML and Vision) — this HTML+Vision cross-check is the self-healing part, and it is here to serve correctness — score a per-field confidence, and run deterministic checks where they exist — an EAN/GTIN must pass its check digit.
- Fail loud below the bar. When confidence does not clear the threshold, the field is returned null and the record is marked reliable: false, with the reason, rather than emitting a guess.
This is the principle, not a recipe: the actor deliberately exposes no thresholds, weights, or model names, because those are tuned on the engine side and change.
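The one deterministic check named above, the EAN/GTIN check digit, follows the public GS1 mod-10 rule, so it can be illustrated without knowing anything about the engine's internals. The sketch below is that standard rule only; the function name isValidGtin is hypothetical, and nothing here is taken from the engine's actual code.

```typescript
// Standard GS1 mod-10 check for GTIN-8/12/13/14 (an EAN-13 is a GTIN-13).
// Illustrative only: the engine's own validation code is not published.
function isValidGtin(code: string): boolean {
  // Only plain digit strings of the four GS1 lengths are candidates.
  if (!/^(\d{8}|\d{12}|\d{13}|\d{14})$/.test(code)) return false;

  const digits = code.split("").map(Number);
  const checkDigit = digits.pop() as number;

  // Walking right to left over the remaining digits,
  // the weights alternate 3, 1, 3, 1, ...
  let sum = 0;
  digits.reverse().forEach((digit, index) => {
    sum += digit * (index % 2 === 0 ? 3 : 1);
  });

  // The check digit brings the weighted sum up to a multiple of 10.
  return (10 - (sum % 10)) % 10 === checkDigit;
}
```

Running it on the EAN from the sample run, 1949483485739, gives a weighted sum of 141 and an expected check digit of 9, which matches the last digit. A consumer can use the same check as a second line of defence on rows it stores.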
Mechanically: the actor is a thin client. The heavy work (anti-bot fetch, headed
browser, model-by-meaning extraction, Vision cross-check, validation) runs on the
data.hilgard.cz service. The actor sends the URL to
POST /api/parse, streams the engine's progress over SSE, and writes one validated
row to the dataset. A fresh extraction can take minutes; if the stream drops before a
result, the actor retries the parse within the per-URL budget.
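For readers who have not handled SSE by hand, here is a minimal sketch of splitting a streamed text buffer into complete events, following the general Server-Sent Events wire format (event: and data: lines, a blank line between events, lines starting with a colon as comments). The names parseSse and SseEvent are hypothetical, and the event names the service really emits are not documented here, so treat any specific names as assumptions.

```typescript
interface SseEvent {
  event: string; // "message" when the block carries no event: line
  data: string;
}

// Splits a buffer into the complete events it contains plus the unfinished
// tail, which the caller keeps and prepends to the next network chunk.
function parseSse(buffer: string): { events: SseEvent[]; rest: string } {
  const blocks = buffer.replace(/\r\n/g, "\n").split("\n\n");
  const rest = blocks.pop() ?? ""; // the last block may still be incomplete
  const events: SseEvent[] = [];

  for (const block of blocks) {
    let event = "message";
    const dataLines: string[] = [];

    for (const line of block.split("\n")) {
      if (line === "" || line.charAt(0) === ":") continue; // comment or keep-alive
      const colon = line.indexOf(":");
      const field = colon === -1 ? line : line.slice(0, colon);
      let value = colon === -1 ? "" : line.slice(colon + 1);
      if (value.charAt(0) === " ") value = value.slice(1); // one optional space

      if (field === "event") event = value;
      else if (field === "data") dataLines.push(value);
    }

    // Blocks without any data line (for example pure comments) emit nothing.
    if (dataLines.length > 0) events.push({ event, data: dataLines.join("\n") });
  }

  return { events, rest };
}
```

If the connection closes while rest is non-empty and no final result event has arrived, that is the "stream dropped before a result" case in which the actor retries.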
Input
    {
      "urls": [
        "https://www.alza.cz/apple-iphone-17-512gb"  // 1–N product detail URLs
      ],
      "forceFresh": false,        // optional, force a fresh extraction (skip cache)
      "includeDebug": false,      // optional, keep the full _debug block in each row
      "requestTimeoutSecs": 360,  // optional, per-URL budget across retries (hard-capped at 600)
      "maxRetries": 3,            // optional, re-parse attempts after a dropped stream
      "retryBackoffSecs": 3       // optional, delay between retries
    }
Each entry in urls must be an http(s) URL and is parsed independently, in order.
forceFresh: false (the default) lets the service serve a fast cached extraction when
it has one; set it true to force a full fresh self-healing extraction every time.
requestTimeoutSecs is the per-URL budget across retries (default 360s); a single parse
is hard-capped at 600s no matter what you set, so one pathologically slow page is aborted
at the cap and reported as a fail-loud error (per-URL time budget exceeded) — and a URL
cut off this way produces no result, so it is not charged.
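The input rules above can be pictured as a small normalization step. This is a sketch, not the actor's real code: normalizeInput and ActorInput are hypothetical names, treating 3 as the default for maxRetries and retryBackoffSecs is an assumption based on the sample input, and whether the real actor clamps an oversized timeout or rejects it is not stated (clamping is assumed here).

```typescript
interface ActorInput {
  urls: string[];
  forceFresh: boolean;
  includeDebug: boolean;
  requestTimeoutSecs: number;
  maxRetries: number;
  retryBackoffSecs: number;
}

const HARD_CAP_SECS = 600; // the stated hard cap

function normalizeInput(raw: Partial<ActorInput>): ActorInput {
  const urls = raw.urls ?? [];
  if (urls.length === 0) throw new Error("urls must contain at least one entry");

  // Every entry must be an http(s) URL.
  for (const url of urls) {
    if (!/^https?:\/\//i.test(url)) throw new Error("not an http(s) URL: " + url);
  }

  return {
    urls,
    forceFresh: raw.forceFresh ?? false, // documented default
    includeDebug: raw.includeDebug ?? false,
    // Documented default of 360 s; assumed to be clamped to the 600 s cap.
    requestTimeoutSecs: Math.min(raw.requestTimeoutSecs ?? 360, HARD_CAP_SECS),
    maxRetries: raw.maxRetries ?? 3, // assumed default
    retryBackoffSecs: raw.retryBackoffSecs ?? 3, // assumed default
  };
}
```

Under these assumptions, passing requestTimeoutSecs: 900 yields an effective budget of 600, and an ftp:// entry is rejected before any parse is attempted.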
Output
One dataset row per URL:
    {
      "url": "https://www.alza.cz/apple-iphone-17-512gb",
      "domain": "alza.cz",
      "product_name": "Apple iPhone 17 512 GB",
      "price": 27990,
      "currency": "CZK",
      "in_stock": true,
      "ean": "1949483485739",
      "image_url": "https://...",
      "reliable": true,
      "overall_confidence": 0.96,
      "fields": {
        "product_name": { "status": "confirmed", "score": 1 },
        "price": { "status": "high", "score": 0.97 },
        "currency": { "status": "confirmed", "score": 1 },
        "in_stock": { "status": "confirmed", "score": 1 },
        "ean": { "status": "confirmed", "score": 1 },
        "image_url": { "status": "single", "score": 1 }
      },
      "entity": "product",
      "scraped_at": "2026-06-18T10:00:00.000Z",
      "success": true
    }
The product payload is exactly the fields the engine reliably yields — product_name,
price, currency, in_stock, ean, image_url. It does not invent brand /
model / specs columns; you get what the page actually supports, validated, rather
than padded with empty fields. domain is normalized without a www. prefix.
reliable and overall_confidence (0–1) are the top-level trust signals. Under
fields, every product field carries its own status and numeric score (0–1) — and
every field appears, including the ones that came back empty: a missing ean is
{ "status": "absent", "score": null }, never silently omitted, so a null always has a
stated reason.
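Because every field is always present under fields, a consumer can apply its own, stricter bar on top of the engine's. The sketch below lists the fields whose score is null or under a caller-chosen minimum. The names FieldTrust, FieldName and fieldsBelow are hypothetical, the status is typed as a plain string because the full set of status values is not listed here, and minScore is the consumer's own choice, not the engine's undisclosed threshold.

```typescript
interface FieldTrust {
  status: string; // values seen in this document: "confirmed", "high", "single", "absent"
  score: number | null; // 0 to 1, or null when the field is absent
}

type FieldName =
  | "product_name"
  | "price"
  | "currency"
  | "in_stock"
  | "ean"
  | "image_url";

const FIELD_NAMES: FieldName[] = [
  "product_name", "price", "currency", "in_stock", "ean", "image_url",
];

// Returns the fields that are absent (null score) or score below minScore.
function fieldsBelow(
  fields: Record<FieldName, FieldTrust>,
  minScore: number,
): FieldName[] {
  return FIELD_NAMES.filter((name) => {
    const score = fields[name].score;
    return score === null || score < minScore;
  });
}
```

For a row shaped like the sample above but with an absent ean, a minScore of 0.9 would flag only ean, while 0.98 would also flag price (0.97).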
success vs reliable: success says the parse technically produced a product
record from a product page; reliable says that record cleared the engine's trust
threshold. They diverge when a product was extracted but the engine is not confident in
it — then success is true while reliable is false, and the per-field scores stay
low, so a hedge never reads as a clean hit. A page that is not a product, or that yields
no record at all, is success: false with the reason in error.
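On the consuming side, the two flags give three outcomes worth routing differently. A minimal sketch, assuming only the success, reliable and error fields described above; the ParsedRow shape is trimmed to what the function needs, and the verdict labels are hypothetical.

```typescript
interface ParsedRow {
  url: string;
  success: boolean;
  reliable: boolean;
  error?: string; // present when success is false
}

type Verdict = "trusted" | "needs-review" | "failed";

// success: false             -> no usable product record (see row.error)
// success: true, reliable: false -> a record exists but did not clear the bar
// success: true, reliable: true  -> a record that cleared the engine's bar
function triage(row: ParsedRow): Verdict {
  if (!row.success) return "failed";
  return row.reliable ? "trusted" : "needs-review";
}
```

A feed pipeline might publish only "trusted" rows automatically, queue "needs-review" rows for a human, and log the error of "failed" rows.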
When includeDebug is true, each row also carries the service's full _debug block
plus the raw security block (detected protection + fetch strategy; kept out of the
default output because its vendor reading can look odd without that context). If a URL
fails entirely, a single placeholder row is written with success: false,
reliable: false, and an error message.
Pricing
This actor uses Pay Per Event, with a single event:
- url-parsed: one flat fee per product URL parsed.
You are charged once per URL that the engine actually parses — whether the result comes
back reliable: true or, honestly, reliable: false. The fee pays for the work that
ran either way: the anti-bot fetch, the model-by-meaning extraction, the Vision
cross-check, and the validation. A URL that hard-fails before producing any result
(network error, retries exhausted) is not charged.
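Since the fee is flat per parsed URL, a run's cost has a simple upper bound: the number of URLs times the per-URL price. The sketch below uses the $2.00 per 1,000 URLs figure from the store listing as its default, which is an assumption that may be out of date; the function name estimateMaxCostUsd is hypothetical. It is an upper bound because URLs that hard-fail before producing a result are not charged.

```typescript
// Upper bound on the cost of a run: every submitted URL is assumed to be parsed.
// usdPerThousand defaults to the listed price at the time of writing (assumption).
function estimateMaxCostUsd(urlCount: number, usdPerThousand: number = 2.0): number {
  if (urlCount < 0 || Math.floor(urlCount) !== urlCount) {
    throw new Error("urlCount must be a non-negative integer");
  }
  return (urlCount * usdPerThousand) / 1000;
}
```

At the listed price, 250 URLs cost at most $0.50 and 1,000 URLs at most $2.00.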
No flat monthly fee. No per-seat pricing. One predictable price per processed URL, and in return you get validated data or an honest confidence — never a silent mistake. Reach, self-healing, and validation together are why this is not the cheapest option: it gets through to data on protected shops where a selector-based scraper just hits a challenge page, recovers when a layout shifts, and validates every field before it trusts it. The current price of the event is in the Apify Console pricing tab.
When NOT to use this
I would rather you not waste money, so:
- You want the cheapest possible scraper at huge volume and a few silently-wrong values per thousand are acceptable. This one spends real work self-healing and validating every field; if that correctness is not worth a mid price to you, a lighter selector-based scraper will cost less.
- You only have a handful of one-off URLs. Just paste them into the data.hilgard.cz UI and read the result. No actor run needed.
If the correctness story above is not what you need, one of these is the better call.
About
Built on the data.hilgard.cz parsing engine — the same self-healing stack that does anti-bot fetching, model-by-meaning extraction, and independent verification, here applied to single-URL product detail & price extraction.
By Jan Hilgard (founder of Hosting90, acquired; core contributor to vllm-mlx). The precision-first stance is deliberate: I would rather return an honest "not sure" than a confident wrong price you build decisions on.
Development
    npm install
    npm run build      # tsc → dist/
    npm run start:dev  # tsx src/main.ts, reads .actor/INPUT.json