AI Web Scraper — Structured Data From Any URL
Pricing
from $20.00 / 1,000 page processeds
AI Web Scraper — Structured Data From Any URL
Extract structured data from any website using an LLM and your own field schema — no CSS selectors. Give it URLs and the fields you want; get clean JSON rows back. Works on blogs, job boards, product pages, listings, and more.
Pricing
from $20.00 / 1,000 page processeds
Rating
0.0
(0)
Developer
Muhammad Afzal
Maintained by CommunityActor stats
0
Bookmarked
1
Total users
0
Monthly active users
a day ago
Last modified
Categories
Share
Extract structured data from any website using an LLM and your own field schema — no CSS selectors, no per-site code. Give it URLs and the fields you want; get back clean JSON rows. Built for the messy long tail of sites that off-the-shelf scrapers don't cover: blogs, job boards, product pages, directories, listings, and more.
Export results, run via API, schedule and monitor runs, or integrate with other tools and AI agents.
How it works
- You provide one or more URLs and a list of fields (name + short description).
- The actor fetches each page, converts it to clean text, and asks an LLM to return JSON matching your fields.
- You get one row per record (or one row per repeating item in list mode).
No selectors to maintain. When a site changes its HTML, the LLM still finds your fields.
Input
| Field | Type | Description |
|---|---|---|
startUrls | array | The page URLs to extract from. |
fields | array | What to extract — [{ "name": "title", "description": "the product title", "type": "string" }]. |
listMode | boolean | ON = one row per repeating item on the page (grids, listings). OFF = one row per page. |
model | string | OpenRouter model slug (default openai/gpt-4o-mini). |
maxItems | integer | Cap on total output rows. |
maxCrawlPages | integer | Cap on pages fetched. |
maxContentChars | integer | How much page text to send to the model (cost control). |
proxyConfiguration | object | Apify proxy settings (datacenter by default). |
Example input
{"startUrls": [{ "url": "https://quotes.toscrape.com" }],"fields": [{ "name": "text", "description": "the full quote text" },{ "name": "author", "description": "who said it" },{ "name": "tags", "description": "list of tag labels", "type": "array" }],"listMode": true,"model": "openai/gpt-4o-mini"}
API key (required)
Extraction runs through OpenRouter — set a single environment variable on the actor (Console → Settings → Environment variables):
OPENROUTER_API_KEY = sk-or-...
Pick any model via the model input — cheap models like openai/gpt-4o-mini or google/gemini-2.5-flash handle most structured extraction well. You pay OpenRouter directly for model usage; the actor's PPE events cover the extraction layer.
Output
Every row contains source_url, scraped_at, error, plus your fields:
{"text": "The world as we have created it is a process of our thinking.","author": "Albert Einstein","tags": ["change", "deep-thoughts", "thinking", "world"],"source_url": "https://quotes.toscrape.com","scraped_at": "2026-06-07T12:00:00.000Z","error": null}
Pricing (Pay Per Event)
| Event | When |
|---|---|
actor-start | Once per run. |
page-processed | Each page successfully fetched and extracted (one LLM call). |
Failed pages (fetch error, model error, missing key) are not charged.
Use cases
- RAG / AI pipelines — turn arbitrary pages into clean structured records.
- Long-tail sites — scrape sites with no dedicated actor.
- Listings & directories — pull every item from a results page with
listMode. - Monitoring — schedule extraction of the same fields over time.
Tips
- Write clear field descriptions — they're the instructions the model follows.
- Use
listModefor pages with many repeating records; turn it off for single detail pages. - For JS-heavy sites where text is missing, increase
maxContentCharsor use a richer model.