LLM Web Scraper
Pricing
from $2.00 / 1,000 scraped pages
Turn any website into structured JSON using AI. Supports OpenAI GPT-4 and Anthropic Claude. Built in Rust to minimize compute costs while waiting for LLM responses. Extract data without selectors.
Developer: Daniel Rosen
Turn any website into structured JSON data using OpenAI (GPT) or Anthropic (Claude) models. This Actor fetches HTML, cleans it, and uses a Large Language Model (LLM) to extract specific data fields based on a schema you provide.
What This Does
Traditional scraping requires writing brittle CSS selectors or Regex for every field. This Actor uses semantic understanding to locate and extract data, making it resilient to layout changes.
- Fetches: Downloads the webpage (supports User-Agent rotation).
- Cleans: Prunes scripts, styles, and ads to minimize token usage.
- Extracts: Sends the content to an LLM alongside your JSON schema.
- Validates: Returns structured JSON matching your definition.
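The four steps above can be sketched as follows (illustrative only: the Actor itself is written in Rust, and every name here is hypothetical, not its internal API):

```python
import json
import re

# Tags the pipeline prunes before sending content to the LLM.
STRIP_TAGS = ("script", "style", "nav")

def clean_html(html: str) -> str:
    """Remove tags that cost tokens but carry no extractable data."""
    for tag in STRIP_TAGS:
        html = re.sub(rf"<{tag}\b.*?</{tag}>", "", html, flags=re.S | re.I)
    return html

def build_prompt(cleaned: str, schema: dict) -> str:
    """Combine the cleaned page text with the user-supplied schema."""
    return (
        "Extract data matching this JSON schema:\n"
        + json.dumps(schema)
        + "\n\nPage content:\n"
        + cleaned
    )

html = "<html><script>track()</script><main>Price: $9.99</main></html>"
cleaned = clean_html(html)
```

Cleaning first matters because LLM providers bill per input token, so every stripped tag directly reduces cost.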
Use Cases
- E-commerce: Extract product details (price, specs, availability) from diverse layouts.
- News Aggregation: Normalize article content, authors, and dates into a standard format.
- Lead Generation: Extract contact info and company details from "About Us" pages.
- Financial Data: Parse unstructured tables and reports into usable JSON.
Input
Provide the target URL and a JSON schema defining the data you want, along with either an OpenAI or Anthropic API key.
```json
{
  "url": "https://news.ycombinator.com",
  "schema": {
    "stories": [
      {
        "title": "string",
        "points": "number",
        "author": "string"
      }
    ]
  },
  "openaiApiKey": "sk-...",
  "model": "gpt-4o",
  "maxTokens": 2000,
  "selector": "table.itemlist"
}
```
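Note that the schema uses plain type-name strings ("string", "number") rather than formal JSON Schema. A minimal sketch of checking extracted data against this style of schema (assuming string/number/boolean leaf names; the Actor's actual validation logic may differ):

```python
# Map the schema's leaf type names to Python types (assumed naming).
TYPE_MAP = {"string": str, "number": (int, float), "boolean": bool}

def matches_schema(value, schema):
    # Lists: every item must match the single element schema.
    if isinstance(schema, list):
        return isinstance(value, list) and all(
            matches_schema(v, schema[0]) for v in value
        )
    # Objects: every declared key must be present and match.
    if isinstance(schema, dict):
        return isinstance(value, dict) and all(
            k in value and matches_schema(value[k], s)
            for k, s in schema.items()
        )
    # Leaves: a type-name string such as "string" or "number".
    return isinstance(value, TYPE_MAP[schema])
```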
Configuration Parameters
| Field | Type | Required | Description |
|---|---|---|---|
| url | String | Yes | The target URL to scrape. |
| schema | JSON | Yes | The structure you want the AI to extract. |
| openaiApiKey | String | No* | OpenAI API Key (Required if using GPT models). |
| anthropicApiKey | String | No* | Anthropic API Key (Required if using Claude models). |
| model | String | No | Model selection (e.g., gpt-4o, claude-3-5-sonnet). |
| selector | String | No | CSS selector to limit the scope (e.g., main#content). |
| instructions | String | No | Specific guidance for the AI (e.g., "Exclude ads"). |
| maxTokens | Integer | No | Limit response size (Default: 4096). |
*One of the two API keys is required.
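The either/or key requirement can be expressed as a small pre-flight check (a sketch using the field names from the table above; the function itself is hypothetical):

```python
def validate_input(actor_input: dict) -> None:
    """Reject input that is missing required fields or both API keys."""
    for field in ("url", "schema"):
        if field not in actor_input:
            raise ValueError(f"Missing required field: {field}")
    # Exactly one provider is used per run, but at least one key must exist.
    if not (actor_input.get("openaiApiKey") or actor_input.get("anthropicApiKey")):
        raise ValueError("Provide either openaiApiKey or anthropicApiKey")
```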
Output
The Actor outputs a JSON object containing the extracted data and metadata about the run.
```json
{
  "url": "https://news.ycombinator.com",
  "success": true,
  "tokensUsed": 1450,
  "model": "gpt-4o",
  "data": {
    "stories": [
      { "title": "Rust vs C++", "points": 156, "author": "dev_user" },
      { "title": "New AI Model", "points": 42, "author": "ai_researcher" }
    ]
  }
}
```
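Because each scraped page produces one such record, token usage and success rate across a run can be aggregated client-side (a hypothetical helper, not part of the Actor's output):

```python
def summarize_runs(records: list[dict]) -> dict:
    """Aggregate token usage and success count across scraped pages."""
    return {
        "pages": len(records),
        "successful": sum(1 for r in records if r.get("success")),
        "tokensUsed": sum(r.get("tokensUsed", 0) for r in records),
    }
```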
Optimization & Costs
LLM scraping involves two costs: the Apify run cost and your external LLM API usage. To minimize both:
- Use Selectors: Always provide a CSS `selector` (e.g., `div.product-details`) if possible. This discards headers, footers, and sidebars before sending text to the AI, significantly reducing your token bill.
- Choose the Right Model: `gpt-4o` and `claude-3-5-sonnet` offer the best balance of speed and intelligence. Smaller models are cheaper but may struggle with complex schemas.
- Clean Input: The Actor automatically removes `<script>`, `<style>`, and `<nav>` tags to ensure high-quality context for the AI.
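The savings from a selector can be estimated with the common rough heuristic of ~4 characters per token (illustrative page and numbers only; real tokenizers vary by model):

```python
import re

def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return len(text) // 4

# A page where most characters live in navigation chrome, not the data.
page = "<header>" + "menu " * 200 + "</header><div class='product'>Widget $9.99</div>"

# Simulate scoping to selector "div.product": keep only the matching element.
scoped = re.search(r"<div class='product'>.*?</div>", page, re.S).group()
```

Here the scoped fragment is a small fraction of the full page, and since LLM providers bill per input token, that ratio translates directly into cost.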