Pricing

from $20.00 / 1,000 pages processed

AI Web Scraper — Structured Data Extraction

Extract structured JSON from public webpages using your own field schema. No CSS selectors. Ideal for products, jobs, articles, listings, RAG, and agents.

Developer

Muhammad Afzal

Last modified

10 days ago

AI Web Scraper — Extract Structured Data From Any URL

Extract structured data from web pages with an LLM and your own field schema—without writing CSS selectors or maintaining site-specific extraction code. Give this AI web scraper URLs and field descriptions; receive clean JSON records for articles, products, jobs, directories, listings, research, and RAG pipelines.

Export results, run via API, schedule and monitor runs, or integrate with other tools and AI agents.

How it works

You provide one or more URLs and a list of fields (name + short description).
The actor fetches each page, converts it to clean text, and asks an LLM to return JSON matching your fields.
You get one row per record (or one row per repeating item in list mode).

Because extraction follows semantic field instructions instead of fixed selectors, it can tolerate many layout changes. Results still depend on the text the actor can fetch and the selected model's interpretation.
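
The steps above can be pictured with a minimal sketch of how a prompt might be assembled from page text and a field list. This is an illustration only: the actor's real prompt is not published, and `buildExtractionPrompt`, its wording, and the default of 20000 characters are all hypothetical.

```javascript
// Hypothetical sketch: none of these names come from the actor's source code.
function buildExtractionPrompt(pageText, fields, { listMode = false, maxContentChars = 20000 } = {}) {
  // Truncate the page text so the prompt stays within a cost budget.
  const text = pageText.slice(0, maxContentChars);

  // One instruction line per field: name, type (string if omitted), description.
  const fieldLines = fields
    .map((f) => `- ${f.name} (${f.type ?? 'string'}): ${f.description}`)
    .join('\n');

  // List mode asks for many records per page, otherwise a single record.
  const shape = listMode
    ? 'Return a JSON array with one object per repeating item.'
    : 'Return a single JSON object.';

  return [
    'Extract the following fields from the page text.',
    fieldLines,
    shape,
    'Use null for any field that is not present.',
    '--- PAGE TEXT ---',
    text,
  ].join('\n');
}

const prompt = buildExtractionPrompt(
  'Quote: "Be yourself." by Oscar Wilde',
  [
    { name: 'text', description: 'the full quote text' },
    { name: 'author', description: 'who said it' },
  ],
  { listMode: true, maxContentChars: 500 },
);
console.log(prompt);
```

The sketch shows why field descriptions matter: they are passed to the model nearly verbatim, and anything cut off by the character limit is invisible to it.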

What can the AI web scraper extract?

Define string, number, boolean, object, or array-like fields for the page content you need. Common extraction targets include:

Product names, prices, availability, SKUs, and descriptions
Job titles, companies, locations, salary text, and requirements
Article titles, authors, dates, summaries, and topics
Directory listings, business details, categories, and profile URLs
Research facts, tables represented in page text, and source metadata
Repeating search-result or catalog cards with listMode

Input

Field	Type	Description
`startUrls`	array	The page URLs to extract from.
`fields`	array	What to extract — `[{ "name": "title", "description": "the product title", "type": "string" }]`.
`listMode`	boolean	`true` = one row per repeating item on the page (grids, listings). `false` = one row per page.
`maxItems`	integer	Cap on total output rows.
`maxCrawlPages`	integer	Cap on pages fetched.
`maxContentChars`	integer	How much page text to send to the model (cost control).
`proxyConfiguration`	object	Apify proxy settings (datacenter by default).

Example input

{
  "startUrls": [{ "url": "https://quotes.toscrape.com" }],
  "fields": [
    { "name": "text", "description": "the full quote text" },
    { "name": "author", "description": "who said it" },
    { "name": "tags", "description": "list of tag labels", "type": "array" }
  ],
  "listMode": true
}

API key (required)

Extraction runs through OpenRouter — set a single environment variable on the actor (Console → Settings → Environment variables):

OPENROUTER_API_KEY = sk-or-...

The extraction model is chosen internally and is not an input option, which keeps the public input surface small and predictable. You pay OpenRouter directly for model usage; the actor's pay-per-event (PPE) charges cover the extraction layer.

Output

Every row contains source_url, scraped_at, and error, plus your fields:

{
  "text": "The world as we have created it is a process of our thinking.",
  "author": "Albert Einstein",
  "tags": ["change", "deep-thoughts", "thinking", "world"],
  "source_url": "https://quotes.toscrape.com",
  "scraped_at": "2026-06-07T12:00:00.000Z",
  "error": null
}
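
Because list mode can produce many rows for one page, it can help to summarize a dataset by source_url before using it. The sketch below assumes only the three built-in columns described above; `summarizeBySource` is a hypothetical helper, and the error message in the sample data is invented, since the actor's actual error strings are not documented here.

```javascript
// Count rows and failed rows per source URL.
function summarizeBySource(rows) {
  const summary = {};
  for (const row of rows) {
    if (!summary[row.source_url]) {
      summary[row.source_url] = { rows: 0, errors: 0 };
    }
    summary[row.source_url].rows += 1;
    // A row counts as failed when `error` is anything other than null/undefined.
    if (row.error != null) summary[row.source_url].errors += 1;
  }
  return summary;
}

// Sample data: two good rows from one page, one failed row from another.
const sampleRows = [
  { text: 'Quote A', source_url: 'https://quotes.toscrape.com', error: null },
  { text: 'Quote B', source_url: 'https://quotes.toscrape.com', error: null },
  { source_url: 'https://example.com/missing', error: 'fetch failed (hypothetical message)' },
];
console.log(summarizeBySource(sampleRows));
```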

Pricing (Pay Per Event)

Event	When
`actor-start`	Once per run.
`page-processed`	Each page successfully fetched and extracted (one LLM call).

Failed pages (fetch error, model error, missing key) are not charged.

The maximum number of chargeable page events is controlled by maxCrawlPages; maxItems limits the total rows returned. OpenRouter usage is billed separately through your own account. Review the live Store pricing panel for current Apify event prices.
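
As a rough worked example, the upper bound on page-processed charges follows directly from maxCrawlPages. The sketch assumes the "from $20.00 / 1,000" headline rate applies; the real rate may differ by plan, and the actor-start price is not stated on this page, so both it and OpenRouter usage are left out.

```javascript
// Assumption: headline rate of $20.00 per 1,000 page-processed events.
const PRICE_PER_1000_PAGES_USD = 20;

// Upper bound on page-processed charges for one run.
// Excludes the actor-start event and OpenRouter model usage.
function maxPageCostUsd(maxCrawlPages, pricePer1000 = PRICE_PER_1000_PAGES_USD) {
  return (maxCrawlPages * pricePer1000) / 1000;
}

console.log(maxPageCostUsd(250)); // 5
```

Since failed pages are not charged, the actual figure can be lower than this bound.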

Run the AI web scraper through the API

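import { ApifyClient } from 'apify-client';

// Authenticate with an Apify API token read from the environment.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the actor and wait for the run to finish (top-level await needs an ES module).
// Replace USERNAME with the actor owner's Apify username.
const run = await client.actor('USERNAME/ai-web-extractor').call({
  startUrls: [{ url: 'https://quotes.toscrape.com' }],
  fields: [
    { name: 'text', description: 'Full quote text', type: 'string' },
    { name: 'author', description: 'Name of the quote author', type: 'string' },
    { name: 'tags', description: 'Tag labels', type: 'array' },
  ],
  listMode: true, // one row per quote on the page
  maxItems: 100, // cap on total output rows
});

// Read the extracted rows from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);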

Use the Apify API, schedules, webhooks, dataset exports, or Apify MCP to connect the records to a spreadsheet, database, RAG workflow, or AI agent.

Use cases

RAG / AI pipelines — turn arbitrary pages into clean structured records.
Long-tail sites — scrape sites with no dedicated actor.
Listings & directories — pull every item from a results page with listMode.
Monitoring — schedule extraction of the same fields over time.

When to use—and when not to use—this actor

Use it for a small or medium collection of known URLs when you need flexible, schema-guided extraction and a dedicated site actor does not exist. It is particularly useful for prototypes, long-tail websites, changing page layouts, and agent workflows where the requested fields vary.

Do not use it for authenticated pages, actions such as form submission, pixel-perfect browser automation, or very large commodity crawls where deterministic selectors are cheaper. It fetches the URLs you provide; it is not a general search engine or unlimited site crawler. JavaScript-heavy or strongly protected pages may require a dedicated browser-based actor.

Tips

Write clear field descriptions — they're the instructions the model follows.
Use listMode for pages with many repeating records; turn it off for single detail pages.
If a JavaScript-heavy page returns little text, a larger maxContentChars cannot recover content that was never present in the fetched response; use a dedicated browser actor instead.
Use explicit field names such as productPriceUsd rather than vague names such as value.
State desired formats in descriptions, for example “ISO 8601 date” or “number without currency symbol.”
Start with one representative URL before scaling to a larger batch.
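
The naming and description tips can be checked mechanically before a run. The following is a small sketch, not part of the actor: `lintFields`, the list of vague names, and the 10-character threshold are arbitrary assumptions chosen for illustration.

```javascript
// Hypothetical list of names that say little about the content they hold.
const VAGUE_NAMES = new Set(['value', 'data', 'info', 'item', 'field']);

// Return human-readable warnings for a `fields` array.
function lintFields(fields) {
  const warnings = [];
  for (const f of fields) {
    if (VAGUE_NAMES.has(f.name.toLowerCase())) {
      warnings.push(`${f.name}: name is vague`);
    }
    if (!f.description || f.description.trim().length < 10) {
      warnings.push(`${f.name}: description is missing or very short`);
    }
  }
  return warnings;
}

const exampleFields = [
  // Explicit name, with the desired format stated in the description.
  { name: 'productPriceUsd', description: 'Price as a number without currency symbol', type: 'number' },
  { name: 'publishedAt', description: 'Publication date as an ISO 8601 date', type: 'string' },
  // Vague name and terse description: both trigger a warning.
  { name: 'value', description: 'price' },
];
console.log(lintFields(exampleFields));
```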

Output quality and limitations

LLM extraction is probabilistic. Validate required fields, spot-check high-value records, and inspect error before downstream use. Pages can contain misleading text or prompt-like content; treat all extracted content as untrusted data. The actor does not guarantee completeness, factual accuracy, or successful access to every website.
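
One way to act on this advice is to gate each row before it reaches downstream systems. The sketch below assumes the output shape shown in the Output section; `validateRecord` and the choice of which fields count as required are hypothetical and depend on your own schema.

```javascript
// Return a list of problems for one output row; an empty list means the row passed.
function validateRecord(row, requiredFields) {
  const problems = [];
  if (row.error != null) {
    problems.push(`extraction error: ${row.error}`);
  }
  for (const name of requiredFields) {
    const v = row[name];
    // Treat null, undefined, blank strings, and empty arrays as missing.
    const empty =
      v == null ||
      (typeof v === 'string' && v.trim() === '') ||
      (Array.isArray(v) && v.length === 0);
    if (empty) problems.push(`missing required field: ${name}`);
  }
  return problems;
}

const goodRow = { text: 'A quote', author: 'Albert Einstein', tags: ['world'], error: null };
const badRow = { text: '  ', author: null, tags: [], error: null };
console.log(validateRecord(goodRow, ['text', 'author'])); // []
console.log(validateRecord(badRow, ['text', 'author', 'tags']));
```

A check like this only catches missing values; it cannot confirm that a present value is factually correct, so spot-checking still applies.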

FAQ

Can this scrape any website?

It can process many publicly accessible pages, but “any URL” does not mean every page is reachable. Login walls, bot protection, JavaScript-only rendering, robots restrictions, and unsupported content can limit extraction.

What is list mode?

Enable listMode when one page contains repeated records such as product cards or search results. Disable it for a single article, product detail, company profile, or other one-record page.

Is this suitable for RAG data extraction?

Yes, when the RAG pipeline needs concise structured metadata or facts from known pages. Keep source URLs with every record and validate important fields before indexing.

Responsible scraping

Only process content you are authorized to access. Respect website terms, robots policies where applicable, copyright, privacy, and data-protection laws. Do not use the actor to bypass access controls or collect sensitive personal data. Keep OPENROUTER_API_KEY in a secret environment variable.
