Pricing

$2.00 / 1,000 URLs parsed

Product Detail & Price Parser

Reliable product detail & price extraction from a URL, even on shops behind heavy anti-bot protection. Returns name, price, currency, stock, EAN and image with per-field confidence and a fail-loud reliable flag — never silently wrong. Built on the data.hilgard.cz self-healing engine.

Developer

Jan Hilgard

Why this beats self-healing / AI scrapers

Self-healing and LLM scrapers adapt to a layout change. That part is table stakes now, and this actor does it too. But adapting is not the same as being right: a scraper can re-find "a price" after a redesign and still hand you a plausible but wrong field (the wrong variant, a stale crossed-out price, a swapped currency) and say nothing. That silent wrong value is the expensive one. The difference here is the guarantee, not the healing: every field carries its own confidence, an EAN/GTIN must pass its check digit, and below the confidence bar a field comes back null with reliable: false instead of a guess. An honest failure you can catch beats a confident mistake you cannot. Self-healing is how the field gets re-found; confidence and validation decide whether you are allowed to trust it.

What a real run looks like

One URL in, one structured row out — with the engine's own confidence attached.

(a) The layout changed. A selector-based scraper would silently break. This doesn't.

URL:  https://shop.example/iphone-17-512gb   (shop redesigned its price markup last week)

✓  product_name   "Apple iPhone 17 512 GB Lavender"
✓  price          27990            field confidence: high
✓  currency       "CZK"
✓  ean            "1949483485739"  field confidence: high (check-digit valid)
   reliable:      true             overall confidence: 0.96
   note:          HTML price node moved; the Vision pass read the price off the
                  rendered page and the two agreed — extraction self-healed.

A hard-coded selector would have returned null (or worse, grabbed the crossed-out old price). The Vision cross-check caught the moved element and confirmed it.

(b) The page is ambiguous. It does NOT guess — it fails loud.

URL:  https://shop.example/landing/iphone-deals   (a promo page, not one product)

✗  price          null             field confidence: low
   reliable:      false            overall confidence: 0.28
   success:       false
   note:          multiple prices on the page, none clearly the product's — returned
                  unreliable with low confidence rather than a guessed number.

The value is that second row. A cheaper tool would have returned a price here — some price on the page — and you would have trusted it.

Why it is different

No silent errors. Every field carries a confidence, the whole record carries a reliable flag, and codes are validated — an EAN/GTIN must pass its check digit to be trusted. Below the confidence bar the field is returned null with low confidence and reliable: false, never a confident-looking wrong value.
Wrong input gets an honest answer too. Point it at a URL that is not a product detail page — a category listing, a promo landing page, a real-estate advert — and it does not hallucinate a product out of whatever it found. It returns success: false with error: "not a product detail page (entity: …)". Even a bad input gets the truth, not a fairy tale. Same no-silent-errors rule, applied to the page itself.
Self-healing — a mechanism that serves the correctness above, not the headline. The engine cross-checks the HTML reading against a Vision reading of the rendered page, so a layout change that breaks a selector is caught instead of silently returning the wrong node.
Gets through where plain scrapers hit a wall. The heavy anti-bot protections — DataDome, Cloudflare, Akamai, PerimeterX and similar — return a challenge page instead of data to a simple fetch. This reaches the actual product page through them, so you get the product, not a CAPTCHA. (Anti-bot is an arms race, so this is a capability that handles these protections, not a guarantee against any one named vendor.)
Predictable mid price. One flat price per URL parsed. Not the cheapest unreliable scraper where a few wrong values per thousand are "fine", and not an enterprise per-seat contract. You pay one predictable price and get validated data or an honest confidence — the self-healing and the validation are why it sits in the middle, not at the bottom.

How it works

Per URL, from input to record:

Fetch — get through to the real page. Reach the product page through the heavy anti-bot / WAF layers (DataDome, Cloudflare, Akamai, PerimeterX and the like) and render JS-driven content, so the extractor sees the actual product the way a real browser would — not a challenge page, an interstitial, or an empty SPA shell.
Extract by meaning. Read the product fields by what they mean on the page, not by a fixed selector path that a redesign would break.
Cross-check & validate. Reconcile independent readings of the same field (HTML and Vision), score a per-field confidence, and run deterministic checks where they exist: an EAN/GTIN must pass its check digit. The HTML+Vision cross-check is the self-healing part, and it is there to serve correctness.
Fail loud below the bar. When confidence does not clear the threshold, the field is returned null and the record is marked reliable: false, with the reason — rather than emitting a guess.

This is the principle, not a recipe: the actor deliberately exposes no thresholds, weights, or model names, because those are tuned on the engine side and change.
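
The one deterministic check named above, the EAN/GTIN check digit, can be reproduced on the consumer side. The sketch below uses the standard GS1 mod-10 calculation; the function name is hypothetical and the engine's own implementation is not shown in this README, so treat it as an illustration rather than the engine's code.

```python
def gtin_check_digit_ok(code: str) -> bool:
    """Return True if `code` is a GTIN-8/12/13/14 whose last digit
    matches the standard GS1 mod-10 check digit."""
    if not (code.isascii() and code.isdigit()):
        return False
    if len(code) not in (8, 12, 13, 14):
        return False
    digits = [int(c) for c in code]
    body, check = digits[:-1], digits[-1]
    # Walking from the rightmost body digit, weights alternate 3, 1, 3, 1, ...
    total = sum(d * (3 if i % 2 == 0 else 1)
                for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check
```

The EAN used in the examples in this README, 1949483485739, passes this check; changing its last digit makes it fail.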

Mechanically: the actor is a thin client. The heavy work (anti-bot fetch, headed browser, model-by-meaning extraction, Vision cross-check, validation) runs on the data.hilgard.cz service. The actor sends the URL to POST /api/parse, streams the engine's progress over SSE, and writes one validated row to the dataset. A fresh extraction can take minutes; if the stream drops before a result, the actor retries the parse within the per-URL budget.
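
The retry behaviour described above can be pictured as a loop bounded by both a retry count and a time budget. This is a minimal sketch, not the actor's source: parse_once is a hypothetical callable standing in for one POST /api/parse attempt that returns None when the stream drops, and the "retries exhausted" error text is an assumption. The sketch also only checks the budget between attempts; it does not interrupt an attempt that is already running.

```python
import time
from typing import Callable, Optional


def parse_with_budget(
    parse_once: Callable[[], Optional[dict]],
    budget_secs: float = 360,
    max_retries: int = 3,
    backoff_secs: float = 3,
    now: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call parse_once until it yields a result, retries run out,
    or the per-URL budget is spent."""
    # 600 s is the hard cap stated in the Input section.
    deadline = now() + min(budget_secs, 600)
    attempts = 1 + max_retries  # first try plus re-parse attempts
    for attempt in range(attempts):
        if now() >= deadline:
            return {"success": False, "reliable": False,
                    "error": "per-URL time budget exceeded"}
        result = parse_once()  # hypothetical: None means the stream dropped
        if result is not None:
            return result
        if attempt < attempts - 1:
            # Never sleep past the deadline.
            sleep(min(backoff_secs, max(0.0, deadline - now())))
    return {"success": False, "reliable": False,
            "error": "retries exhausted"}  # assumed wording
```

Passing now and sleep as parameters keeps the loop testable with a fake clock.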

Input

{
  "urls": [
    "https://www.alza.cz/apple-iphone-17-512gb"   // 1–N product detail URLs
  ],
  "forceFresh": false,          // optional, force a fresh extraction (skip cache)
  "includeDebug": false,        // optional, keep the full _debug block in each row
  "requestTimeoutSecs": 360,    // optional, per-URL budget across retries (hard-capped at 600)
  "maxRetries": 3,              // optional, re-parse attempts after a dropped stream
  "retryBackoffSecs": 3         // optional, delay between retries
}

Each entry in urls must be an http(s) URL; entries are parsed independently, in order. forceFresh: false (the default) lets the service serve a fast cached extraction when it has one; set it to true to force a full fresh self-healing extraction every time. requestTimeoutSecs is the per-URL budget across retries (default 360s). A single parse is hard-capped at 600s no matter what you set, so one pathologically slow page is aborted at the cap and reported as a fail-loud error (per-URL time budget exceeded). A URL cut off this way produces no result, so it is not charged.
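
If you build the input programmatically, it can help to apply the same rules before starting a run. The helper below is a sketch under assumptions: build_input is a hypothetical name, and clamping requestTimeoutSecs on the client simply mirrors the 600s cap described above (how the actor itself treats larger values is not specified here).

```python
from urllib.parse import urlparse

HARD_CAP_SECS = 600  # cap stated in this README


def build_input(urls, force_fresh=False, request_timeout_secs=360):
    """Validate URLs and return an input object shaped like the example above."""
    bad = []
    for u in urls:
        parts = urlparse(u)
        # Require an http(s) scheme and a host part.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            bad.append(u)
    if bad:
        raise ValueError(f"not http(s) URLs: {bad}")
    return {
        "urls": list(urls),
        "forceFresh": force_fresh,
        "requestTimeoutSecs": min(request_timeout_secs, HARD_CAP_SECS),
    }
```

Rejecting bad URLs up front avoids spending a run on entries that cannot be parsed.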

Output

One dataset row per URL:

{
  "url": "https://www.alza.cz/apple-iphone-17-512gb",
  "domain": "alza.cz",
  "product_name": "Apple iPhone 17 512 GB",
  "price": 27990,
  "currency": "CZK",
  "in_stock": true,
  "ean": "1949483485739",
  "image_url": "https://...",
  "reliable": true,
  "overall_confidence": 0.96,
  "fields": {
    "product_name": { "status": "confirmed", "score": 1 },
    "price":        { "status": "high",      "score": 0.97 },
    "currency":     { "status": "confirmed", "score": 1 },
    "in_stock":     { "status": "confirmed", "score": 1 },
    "ean":          { "status": "confirmed", "score": 1 },
    "image_url":    { "status": "single",    "score": 1 }
  },
  "entity": "product",
  "scraped_at": "2026-06-18T10:00:00.000Z",
  "success": true
}

The product payload is exactly the fields the engine reliably yields — product_name, price, currency, in_stock, ean, image_url. It does not invent brand / model / specs columns; you get what the page actually supports, validated, rather than padded with empty fields. domain is normalized without a www. prefix.

reliable and overall_confidence (0–1) are the top-level trust signals. Under fields, every product field carries its own status and numeric score (0–1) — and every field appears, including the ones that came back empty: a missing ean is { "status": "absent", "score": null }, never silently omitted, so a null always has a stated reason.
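
Because every field always appears under fields, a consumer can scan one row for anything it does not want to trust. A minimal sketch, assuming the row shape shown above: fields_below is a hypothetical helper and min_score is a bar you choose yourself, not the engine's internal threshold.

```python
def fields_below(row: dict, min_score: float) -> dict:
    """Map each field whose score is missing or below min_score to its status."""
    weak = {}
    for name, info in row.get("fields", {}).items():
        score = info.get("score")
        # A null score (for example status "absent") always counts as weak.
        if score is None or score < min_score:
            weak[name] = info.get("status")
    return weak
```

For a row whose ean is { "status": "absent", "score": null }, the result includes "ean": "absent" regardless of the bar you pick.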

success vs reliable: success says the parse technically produced a product record from a product page; reliable says that record cleared the engine's trust threshold. They diverge when a product was extracted but the engine is not confident in it — then success is true while reliable is false, and the per-field scores stay low, so a hedge never reads as a clean hit. A page that is not a product, or that yields no record at all, is success: false with the reason in error.
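
The three cases above map onto a small decision function. This is a sketch of how a consumer might branch on a row; the labels "trusted", "hedged" and "failed" are hypothetical names chosen for illustration, not values the actor emits.

```python
def classify(row: dict) -> str:
    """Bucket a dataset row by its success and reliable flags."""
    if not row.get("success"):
        # Not a product page, or no record at all; the reason is in row["error"].
        return "failed"
    if not row.get("reliable"):
        # A product record exists but did not clear the trust threshold.
        return "hedged"
    return "trusted"
```

A typical pipeline would write only "trusted" rows downstream and queue "hedged" rows for review.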

When includeDebug is true, each row also carries the service's full _debug block plus the raw security block (detected protection + fetch strategy; kept out of the default output because its vendor reading can look odd without that context). If a URL fails entirely, a single placeholder row is written with success: false, reliable: false, and an error message.

Pricing

This actor uses Pay Per Event, with a single event:

url-parsed — one flat fee per product URL parsed.

You are charged once per URL that the engine actually parses — whether the result comes back reliable: true or, honestly, reliable: false. The fee pays for the work that ran either way: the anti-bot fetch, the model-by-meaning extraction, the Vision cross-check, and the validation. A URL that hard-fails before producing any result (network error, retries exhausted) is not charged.
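
Since the fee is flat per parsed URL, a run's cost is simple arithmetic. The sketch below assumes the $2.00 per 1,000 price listed at the top of this page; check the Apify Console for the current value. estimate_cost is a hypothetical helper, and urls_parsed should count only URLs the engine actually parsed, not hard failures.

```python
from decimal import Decimal


def estimate_cost(urls_parsed: int, price_per_1000: str = "2.00") -> Decimal:
    """Estimated charge in USD for a number of parsed URLs."""
    cost = Decimal(urls_parsed) * Decimal(price_per_1000) / Decimal(1000)
    # Keep three decimals so the per-URL price of 0.002 is not rounded away.
    return cost.quantize(Decimal("0.001"))
```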

No flat monthly fee. No per-seat pricing. One predictable price per processed URL, and in return you get validated data or an honest confidence — never a silent mistake. Reach, self-healing, and validation together are why this is not the cheapest option: it gets through to data on protected shops where a selector-based scraper just hits a challenge page, recovers when a layout shifts, and validates every field before it trusts it. The current price of the event is in the Apify Console pricing tab.

When NOT to use this

I would rather you not waste money, so:

You want the cheapest possible scraper at huge volume and a few silently-wrong values per thousand are acceptable. This one spends real work self-healing and validating every field; if that correctness is not worth a mid price to you, a lighter selector-based scraper will cost less.
You only have a handful of one-off URLs. Just paste them into the data.hilgard.cz UI and read the result. No actor run needed.

If the correctness story above is not what you need, one of these alternatives is the better call.

About

Built on the data.hilgard.cz parsing engine — the same self-healing stack that does anti-bot fetching, model-by-meaning extraction, and independent verification, here applied to single-URL product detail & price extraction.

By Jan Hilgard (founder of Hosting90, acquired; core contributor to vllm-mlx). The precision-first stance is deliberate: I would rather return an honest "not sure" than a confident wrong price you build decisions on.

Development

npm install
npm run build      # tsc → dist/
npm run start:dev  # tsx src/main.ts, reads .actor/INPUT.json
