AI Web Scraper — URL to JSON with Confidence

Pricing

from $4.00 / 1,000 extracted results

Extract structured data from any website into typed JSON matching your schema, with a confidence score on every field. AI-powered, RAG-ready, with built-in schema validation and grounding to catch hallucinations.

Rating: 0.0 (0)
Developer: Emploice Mushwashans
Maintained by Community
Actor stats: 0 bookmarked, 2 total users, 1 monthly active user
Last modified: 3 days ago

AI Web Scraper — URL to Structured JSON with Confidence Scores

Extract structured data from any website into clean, typed JSON that matches your own schema — with a per-field confidence score on every value. This AI-powered web scraper turns a URL and a JSON Schema into validated, LLM-ready structured data, so you know which fields to trust before you use them.

Unlike other AI scrapers and website-to-JSON tools that hand you data and hope it's right, this actor verifies every field against the source page and tells you how confident it is:

  • Your schema, enforced. Provide a standard JSON Schema; output is validated against it and automatically repaired if the model strays. No more missing or mistyped fields.
  • 📊 Per-field confidence (0–1). Every extracted value gets a score blending the model's token probability (logprobs) with a grounding check against the page text.
  • 🔍 Grounding catches hallucinations. If a value doesn't actually appear on the page, its confidence drops — so AI-invented data is easy to filter out.
  • 🤖 RAG-ready output. Clean, typed JSON built for AI pipelines, RAG systems, training datasets, and LLM agents.
  • 💸 Cheap and fast. Lightweight HTTP fetching keeps runs inexpensive. You pay per successful result.

What this AI scraper does

Give it one or more URLs and a JSON Schema describing the fields you want. For each page, it fetches the HTML, extracts the data your schema asks for using an LLM (OpenAI or Gemini), validates the result, and scores how trustworthy each field is. It suits structured data extraction at scale where silent errors are expensive: product data, article metadata, company info, and public datasets.
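
To make the "validates the result" step concrete, here is a minimal sketch of a schema check for flat objects. It is an illustration only: the actor's real validator and repair logic are not described on this page, `validate_flat` is a hypothetical helper, and it covers just the `type` and `required` keywords rather than full JSON Schema.

```python
# Hypothetical sketch: checks only "required" and per-property "type"
# for a flat object. A real pipeline would likely use a full JSON Schema validator.
_JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "object": dict,
    "array": list,
}


def _matches(value, expected):
    """Return True if value has the JSON type named by expected."""
    if expected == "null":
        return value is None
    if expected in ("number", "integer") and isinstance(value, bool):
        return False  # bool is a subclass of int in Python, so reject it explicitly
    py_type = _JSON_TYPES.get(expected)
    return py_type is not None and isinstance(value, py_type)


def validate_flat(data, schema):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in data:
            errors.append(f"missing required field: {name}")
    for name, value in data.items():
        expected = properties.get(name, {}).get("type")
        if not isinstance(expected, str):
            continue  # no simple type declared for this field; skip it
        if not _matches(value, expected):
            errors.append(f"{name}: expected {expected}, got {type(value).__name__}")
    return errors


schema = {
    "type": "object",
    "properties": {"title": {"type": "string"}, "price": {"type": "number"}},
    "required": ["title"],
}
print(validate_flat({"title": "The Great Gatsby", "price": 12.99}, schema))
print(validate_flat({"price": "12.99"}, schema))
```

A non-empty error list is the kind of signal that could trigger a repair attempt, for example by re-prompting the model with the errors attached.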


What you get back

One dataset item per URL:

{
  "url": "https://example.com/product/123",
  "data": {
    "title": "The Great Gatsby",
    "price": 12.99,
    "author": "F. Scott Fitzgerald"
  },
  "_confidence": {
    "title": 0.99,
    "price": 1.0,
    "author": 0.97
  },
  "_grounded": true,
  "_meta": {
    "provider": "openai",
    "model": "gpt-4o-mini",
    "validation_failed": false,
    "repair_attempts": 0,
    "logprobs_available": true,
    "input_truncated": false,
    "usage": { "input_tokens": 812, "output_tokens": 34, "total_tokens": 846 }
  }
}
  • data — your extracted object, typed per your schema.
  • _confidence — score per field. 1.0 = found verbatim on the page and the model was certain; low values = weak support, treat with suspicion.
  • _grounded is true only when every field meets your grounding threshold. Filter your dataset on this for high-trust rows.
  • _meta — provider and model used, whether validation/repair fired, whether input was truncated, token usage.
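
The fields above lend themselves to a simple post-filter on the dataset. The sketch below assumes items shaped like the example output; `high_trust_rows` and `min_confidence` are illustrative names, not part of the actor.

```python
def high_trust_rows(items, min_confidence=0.8):
    """Keep items that are grounded and whose every field clears min_confidence."""
    kept = []
    for item in items:
        if item.get("_meta", {}).get("error"):
            continue  # the page could not be fetched or extracted
        scores = item.get("_confidence", {})
        if not item.get("_grounded") or not scores:
            continue
        if all(score >= min_confidence for score in scores.values()):
            kept.append(item)
    return kept


items = [
    {
        "url": "https://example.com/product/123",
        "data": {"title": "The Great Gatsby", "price": 12.99},
        "_confidence": {"title": 0.99, "price": 1.0},
        "_grounded": True,
        "_meta": {},
    },
    {
        "url": "https://example.com/product/456",
        "data": {"title": "Unknown"},
        "_confidence": {"title": 0.31},
        "_grounded": False,
        "_meta": {},
    },
]
print([item["url"] for item in high_trust_rows(items)])
```

Using a `min_confidence` stricter than the run's `groundingThreshold` is one way to tighten the filter after the fact without re-running the extraction.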

Input

Field | Required | Description
urls | Yes | List of page URLs to extract from.
extractionSchema | Yes | A JSON Schema object (type: object with properties) describing what to extract.
provider | No | openai (default, supports logprobs) or gemini (cheapest).
model | No | Model for the chosen provider. Defaults: OpenAI gpt-4o-mini, Gemini gemini-2.5-flash-lite.
useLogprobs | No | Add the model's token-probability signal to confidence (default true). Auto-falls back to grounding-only on Gemini 2.5.
groundingThreshold | No | A field counts as grounded only at/above this score (default 0.6).
openaiApiKey | Conditional | Your OpenAI key (required when provider is openai and no built-in key is set).
geminiApiKey | Conditional | Your Google AI Studio key (required when provider is gemini).
maxInputChars | No | Cleaned page text is truncated to this many chars before extraction, bounding token cost (default 12000).
requestTimeoutSecs | No | HTTP fetch timeout per URL (default 30).
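
As an illustration of what maxInputChars bounds, the following sketch strips markup from a page and truncates the remaining text. It is an assumption about the general approach, not the actor's actual cleaning code; `clean_and_truncate` is a hypothetical helper built on the standard library's `html.parser`.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping script, style, and noscript content."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def clean_and_truncate(html, max_input_chars=12000):
    """Return (text, truncated) where text is at most max_input_chars long."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse all runs of whitespace so the character budget is spent on content.
    text = " ".join(" ".join(parser.parts).split())
    truncated = len(text) > max_input_chars
    return text[:max_input_chars], truncated


page = "<html><head><style>p{}</style></head><body><h1>The Great Gatsby</h1><p>Price: $12.99</p></body></html>"
print(clean_and_truncate(page))
print(clean_and_truncate(page, max_input_chars=8))
```

The second return value plays the same role as `_meta.input_truncated` in the output: fields that live past the cutoff cannot be extracted or grounded, so a truncated run is worth re-checking with a larger limit.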

Example input

{
  "urls": ["https://www.gutenberg.org/ebooks/64317"],
  "extractionSchema": {
    "type": "object",
    "properties": {
      "title": { "type": "string", "description": "Book title" },
      "author": { "type": "string" },
      "language": { "type": "string" }
    },
    "required": ["title"]
  },
  "groundingThreshold": 0.6
}
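
When building this input programmatically, a few local checks can catch mistakes before a paid run starts. The sketch below is a hypothetical helper (`build_run_input` is not part of the actor); the key names come from the input table, and the checks mirror the constraints stated there.

```python
def build_run_input(urls, extraction_schema, provider="openai", grounding_threshold=0.6):
    """Assemble the actor input dict, raising ValueError on obvious mistakes."""
    if not urls:
        raise ValueError("urls must contain at least one URL")
    if extraction_schema.get("type") != "object" or not extraction_schema.get("properties"):
        raise ValueError("extractionSchema must be type 'object' with properties")
    if provider not in ("openai", "gemini"):
        raise ValueError("provider must be 'openai' or 'gemini'")
    if not 0.0 <= grounding_threshold <= 1.0:
        raise ValueError("groundingThreshold must be between 0 and 1")
    return {
        "urls": list(urls),
        "extractionSchema": extraction_schema,
        "provider": provider,
        "groundingThreshold": grounding_threshold,
    }


run_input = build_run_input(
    ["https://www.gutenberg.org/ebooks/64317"],
    {
        "type": "object",
        "properties": {"title": {"type": "string", "description": "Book title"}},
        "required": ["title"],
    },
)
print(sorted(run_input))
```

The resulting dict can be passed as the run input through the Apify console or an API client. With the `apify-client` Python package that would typically look like `ApifyClient(token).actor(actor_id).call(run_input=run_input)`, where `actor_id` is this actor's ID, which is not stated on this page.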

How confidence is calculated

For each field, up to two signals are combined (no extra API calls):

  1. Grounding (always on) — fuzzy string match of the extracted value against the cleaned page text (rapidfuzz). Catches values the model invented that aren't on the page.
  2. Logprobs (when supported) — the model's averaged token probability for the value, from the same extraction call. Catches values the model itself was unsure about.

field_confidence = 0.6 × grounding + 0.4 × logprob when logprobs are available, otherwise grounding alone.
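
The formula can be sketched in a few lines. This is an approximation under stated assumptions: the actor uses rapidfuzz for grounding, while the sketch substitutes a crude standard-library stand-in (`difflib`) so it runs without dependencies, and it assumes "averaged token probability" means the mean of exp(logprob) over the value's tokens. All function names are hypothetical.

```python
import math
from difflib import SequenceMatcher


def grounding_score(value, page_text):
    """Crude stand-in for a fuzzy match: 1.0 if the value appears verbatim,
    otherwise the fraction of the value covered by its longest shared substring."""
    needle = str(value).strip().lower()
    haystack = page_text.lower()
    if not needle:
        return 0.0
    if needle in haystack:
        return 1.0
    matcher = SequenceMatcher(None, needle, haystack, autojunk=False)
    match = matcher.find_longest_match(0, len(needle), 0, len(haystack))
    return match.size / len(needle)


def mean_token_probability(token_logprobs):
    """Average of exp(logprob) over the tokens of one value, or None if empty."""
    if not token_logprobs:
        return None
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)


def field_confidence(grounding, logprob=None):
    """0.6 * grounding + 0.4 * logprob when logprobs exist, else grounding alone."""
    if logprob is None:
        return grounding
    return 0.6 * grounding + 0.4 * logprob


page_text = "The Great Gatsby by F. Scott Fitzgerald. Price: $12.99"
g = grounding_score("The Great Gatsby", page_text)
p = mean_token_probability([-0.01, -0.02, -0.005])
print(round(field_confidence(g, p), 3))
print(field_confidence(grounding_score("Moby Dick", page_text)))
```

Because grounding carries the larger weight, a value that is absent from the page stays well below a 0.6 threshold even when the model was fully certain: 0.6 × 0 + 0.4 × 1 = 0.4.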

Provider note: OpenAI (gpt-4o-mini, the default) returns per-token logprobs, so you get the full combined score. Gemini's 2.5 models do not expose logprobs — with provider: gemini the actor automatically falls back to grounding-only confidence (still effective at catching hallucinations).

Grounding is a strong signal but not infallible: a value that happens to appear elsewhere on the page can score high even if it's the wrong field. Treat confidence as a strong filter, not a guarantee.

Use cases

  • AI & RAG pipelines — build clean, typed datasets for retrieval-augmented generation and LLM agents, dropping low-confidence rows automatically.
  • Training data extraction — turn web pages into structured JSON for model training and fine-tuning.
  • Price, spec & product data — extract e-commerce and catalog data where silent errors are expensive.
  • Article & content metadata — pull titles, authors, dates, and summaries from news, blogs, and docs.
  • Company & public datasets — extract structured records from directories, government portals, and open data.
  • Any pipeline where "the scraper returned something" isn't enough — you need to know if it's right.

Which websites work?

This actor reads server-rendered HTML, so it works on any site that ships its content in the page source:

  • Works well: news sites, blogs, documentation, GitHub, Wikipedia, government and public-data portals, company/marketing pages, forums, and many server-rendered e-commerce and listing pages.
  • Not supported in v1: sites that build content with JavaScript in the browser (e.g. Yelp, Zillow, many single-page apps) or that sit behind aggressive bot protection (Cloudflare/DataDome). These return empty or blocked responses. JavaScript rendering is planned as a future opt-in.

If you're unsure, just try a single URL — the _meta.error field tells you immediately if a page couldn't be fetched.
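
A quick local pre-check is also possible before spending a run: fetch the raw HTML and look for a piece of text you can see on the rendered page. The sketch below is a rough heuristic with hypothetical helper names; it misses text that is split across tags or encoded as HTML entities, and a positive result does not rule out bot protection on later requests.

```python
import urllib.request


def fetch_html(url, timeout=30):
    """Fetch a page's raw HTML without executing any JavaScript."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def appears_in_raw_html(html, probe):
    """True if probe occurs in the HTML source, ignoring case and whitespace runs."""
    def normalize(text):
        return " ".join(text.lower().split())

    return normalize(probe) in normalize(html)


server_rendered = "<html><body><h1>The  Great\nGatsby</h1></body></html>"
client_rendered = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
print(appears_in_raw_html(server_rendered, "The Great Gatsby"))
print(appears_in_raw_html(client_rendered, "The Great Gatsby"))
```

For a live page, `appears_in_raw_html(fetch_html(url), "some visible text")` returning False suggests the content is built in the browser and falls outside what v1 supports.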

Frequently asked questions

How is this different from other AI web scrapers? Most return raw JSON with no quality signal. This actor validates output against your JSON Schema, repairs it if needed, and attaches a confidence score to every field so you can trust — or filter — the results.

What does the confidence score mean? A 0–1 value per field. 1.0 means the value was found on the page and the model was certain; low values mean weak support. Filter on _grounded to keep only high-trust rows.

Which LLM does it use? OpenAI (gpt-4o-mini) by default for the richest confidence scoring, or Google Gemini for the lowest cost. You can supply your own API key.

Can it scrape Yelp / Zillow / Amazon? Only if the page content is in the raw HTML. JavaScript-rendered and bot-protected sites are not supported in v1 (see "Which websites work?").

Is the output ready for RAG / LLM pipelines? Yes — output is clean, typed JSON keyed to your schema, designed to drop straight into RAG systems, vector stores, and agents.

Limitations (v1)

  • Static HTML only — no headless browser, by design, to keep runs cheap. JavaScript-rendered sites are not yet supported.
  • No proxy rotation — heavily bot-protected sites may block requests.
  • Best for page-level object extraction, not deep multi-level crawling.