AI Web Scraper — Structured Data From Any URL

Pricing

from $20.00 / 1,000 pages processed

Developer: Muhammad Afzal (maintained by Community)

Extract structured data from any website using an LLM and your own field schema — no CSS selectors, no per-site code. Give it URLs and the fields you want; get back clean JSON rows. Built for the messy long tail of sites that off-the-shelf scrapers don't cover: blogs, job boards, product pages, directories, listings, and more.

Export results, run via API, schedule and monitor runs, or integrate with other tools and AI agents.


How it works

  1. You provide one or more URLs and a list of fields (name + short description).
  2. The actor fetches each page, converts it to clean text, and asks an LLM to return JSON matching your fields.
  3. You get one row per record (or one row per repeating item in list mode).

No selectors to maintain. Because extraction works from the page text rather than its markup, a change to a site's HTML usually does not break your fields.
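The three steps above can be sketched roughly as follows. This is an illustration only, not the actor's actual code: the helper names (html_to_text, build_prompt, parse_rows), the prompt wording, and the default type of "string" for fields that omit one are all assumptions, and the LLM call itself is left out.

```python
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html, max_chars):
    """Hypothetical helper: reduce HTML to plain text, capped at max_chars."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)[:max_chars]


def build_prompt(fields, page_text, list_mode):
    """Hypothetical helper: turn the field list into model instructions."""
    lines = [
        # Assumption: a field without "type" is treated as a string.
        f'- {f["name"]} ({f.get("type", "string")}): {f["description"]}'
        for f in fields
    ]
    shape = (
        "a JSON array with one object per repeating item"
        if list_mode
        else "a single JSON object"
    )
    return (
        f"Extract the following fields and reply with {shape} only.\n"
        + "\n".join(lines)
        + "\n\nPage text:\n"
        + page_text
    )


def parse_rows(model_reply, list_mode):
    """Hypothetical helper: normalize the model's JSON reply to a list of rows."""
    data = json.loads(model_reply)
    if list_mode:
        return data if isinstance(data, list) else [data]
    return [data]
```

The field descriptions end up verbatim in the prompt, which is why precise descriptions matter more than field names.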


Input

Field | Type | Description
--- | --- | ---
startUrls | array | The page URLs to extract from.
fields | array | What to extract, e.g. [{ "name": "title", "description": "the product title", "type": "string" }].
listMode | boolean | ON = one row per repeating item on the page (grids, listings). OFF = one row per page.
model | string | OpenRouter model slug (default openai/gpt-4o-mini).
maxItems | integer | Cap on total output rows.
maxCrawlPages | integer | Cap on pages fetched.
maxContentChars | integer | How much page text to send to the model (cost control).
proxyConfiguration | object | Apify proxy settings (datacenter by default).

Example input

{
  "startUrls": [{ "url": "https://quotes.toscrape.com" }],
  "fields": [
    { "name": "text", "description": "the full quote text" },
    { "name": "author", "description": "who said it" },
    { "name": "tags", "description": "list of tag labels", "type": "array" }
  ],
  "listMode": true,
  "model": "openai/gpt-4o-mini"
}
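For running the actor from code, a minimal sketch might look like the following. It assumes the official apify-client Python package is installed and that your Apify token is stored in an APIFY_TOKEN environment variable. The helper names build_run_input and run_actor are hypothetical, the name and description check is this sketch's own rather than something the actor is known to enforce, and the actor ID is a placeholder you would copy from the actor's store page.

```python
import os


def build_run_input(urls, fields, list_mode=False,
                    model="openai/gpt-4o-mini", max_items=None):
    """Hypothetical helper: assemble the input object described above."""
    for field in fields:
        # This sketch's own sanity check, not necessarily the actor's.
        if not field.get("name") or not field.get("description"):
            raise ValueError(f"field needs a name and a description: {field!r}")
    run_input = {
        "startUrls": [{"url": u} for u in urls],
        "fields": fields,
        "listMode": list_mode,
        "model": model,
    }
    if max_items is not None:
        run_input["maxItems"] = max_items
    return run_input


def run_actor(actor_id, run_input):
    """Start a run, wait for it to finish, and return the dataset rows."""
    # Third-party dependency, imported lazily: pip install apify-client
    from apify_client import ApifyClient

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


if __name__ == "__main__":
    quotes_input = build_run_input(
        ["https://quotes.toscrape.com"],
        [
            {"name": "text", "description": "the full quote text"},
            {"name": "author", "description": "who said it"},
        ],
        list_mode=True,
    )
    # Placeholder ID: replace with the real "<username>/<actor-name>".
    # rows = run_actor("<username>/<actor-name>", quotes_input)
    print(quotes_input)
```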

API key (required)

Extraction runs through OpenRouter — set a single environment variable on the actor (Console → Settings → Environment variables):

OPENROUTER_API_KEY = sk-or-...

Pick any model via the model input. Inexpensive models such as openai/gpt-4o-mini or google/gemini-2.5-flash handle most structured extraction well. You pay OpenRouter directly for model usage; the actor's Pay Per Event (PPE) charges cover only the extraction layer.


Output

Every row contains source_url, scraped_at, error, plus your fields:

{
  "text": "The world as we have created it is a process of our thinking.",
  "author": "Albert Einstein",
  "tags": ["change", "deep-thoughts", "thinking", "world"],
  "source_url": "https://quotes.toscrape.com",
  "scraped_at": "2026-06-07T12:00:00.000Z",
  "error": null
}
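Since every row carries an error field, downstream code can separate usable records from failures before loading them anywhere. The sketch below assumes that error is null on success (as in the example above) and holds some non-empty value on failure; the exact shape of a failure value is not documented here, and split_rows is a hypothetical helper name.

```python
def split_rows(rows):
    """Separate successful rows from failed ones.

    Assumption: a falsy `error` (null/None) means the row is usable.
    """
    meta_keys = {"source_url", "scraped_at", "error"}
    ok, failed = [], []
    for row in rows:
        if row.get("error"):
            # Keep just enough to retry or report the failure.
            failed.append({"source_url": row.get("source_url"),
                           "error": row["error"]})
        else:
            # Drop bookkeeping keys, keeping only the user-defined fields.
            ok.append({k: v for k, v in row.items() if k not in meta_keys})
    return ok, failed
```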

Pricing (Pay Per Event)

Event | When
--- | ---
actor-start | Once per run.
page-processed | Each page successfully fetched and extracted (one LLM call).

Failed pages (fetch error, model error, missing key) are not charged.
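As a rough budgeting aid, the actor-side charge can be estimated from the number of successfully processed pages. The sketch below uses the listed starting price of $20.00 per 1,000 pages as its default, which may not be the price that applies to your plan. The actor-start price is not stated on this page, so it is a parameter that defaults to zero. OpenRouter model charges are billed separately and are not included. The function name estimate_actor_cost is hypothetical.

```python
from decimal import Decimal


def estimate_actor_cost(pages_ok,
                        price_per_1000=Decimal("20.00"),
                        actor_start_price=Decimal("0")):
    """Estimate the Pay Per Event charge for one run, in USD.

    pages_ok counts only successfully processed pages, since failed
    pages are not charged. Both prices are assumptions to adjust.
    """
    page_cost = Decimal(pages_ok) * price_per_1000 / Decimal(1000)
    return (page_cost + actor_start_price).quantize(Decimal("0.01"))


if __name__ == "__main__":
    # 300 pages attempted, 50 failed: only 250 are billable.
    print(estimate_actor_cost(300 - 50))
```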


Use cases

  • RAG / AI pipelines — turn arbitrary pages into clean structured records.
  • Long-tail sites — scrape sites with no dedicated actor.
  • Listings & directories — pull every item from a results page with listMode.
  • Monitoring — schedule extraction of the same fields over time.

Tips

  • Write clear field descriptions — they're the instructions the model follows.
  • Use listMode for pages with many repeating records; turn it off for single detail pages.
  • For JS-heavy sites where fields come back empty, increase maxContentChars or use a more capable model.