LLM Web Scraper
Under maintenance

Pricing

from $2.00 / 1,000 scraped pages

Turn any website into structured JSON using AI. Supports OpenAI GPT-4 and Anthropic Claude. Built in Rust to minimize compute costs while waiting for LLM responses. Extract data without selectors.


Rating: 0.0 (0 reviews)

Developer: Daniel Rosen (Maintained by Community)

Actor stats: 0 bookmarked · 1 total user · 0 monthly active users · last modified 2 days ago

Turn any website into structured JSON data using OpenAI (GPT) or Anthropic (Claude) models. This Actor fetches HTML, cleans it, and uses a Large Language Model (LLM) to extract specific data fields based on a schema you provide.

What This Does

Traditional scraping requires writing brittle CSS selectors or Regex for every field. This Actor uses semantic understanding to locate and extract data, making it resilient to layout changes.

  1. Fetches: Downloads the webpage (supports User-Agent rotation).
  2. Cleans: Prunes scripts, styles, and ads to minimize token usage.
  3. Extracts: Sends the content to an LLM alongside your JSON schema.
  4. Validates: Returns structured JSON matching your definition.
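The "Cleans" step above can be sketched in Python. This is a simplified illustration, not the Actor's actual implementation (which is built in Rust); it uses a regex-based pruner rather than a full HTML parser:

```python
import re

def clean_html(html: str) -> str:
    """Prune noisy elements and strip tags, keeping only the text content.
    Simplified sketch of the cleaning step; a real implementation would
    use a proper HTML parser."""
    # Remove entire <script>, <style>, and <nav> elements, including contents
    html = re.sub(r"<(script|style|nav)\b[^>]*>.*?</\1>", " ", html,
                  flags=re.IGNORECASE | re.DOTALL)
    # Drop all remaining tags, keeping their inner text
    text = re.sub(r"<[^>]+>", " ", html)
    # Collapse whitespace to minimize the tokens sent to the LLM
    return re.sub(r"\s+", " ", text).strip()

page = ("<html><head><style>p{color:red}</style></head><body>"
        "<nav>Menu</nav><p>Hello <b>world</b></p>"
        "<script>alert(1)</script></body></html>")
print(clean_html(page))  # → Hello world
```

Stripping markup this way is what keeps the token bill low: only the visible text, not the surrounding HTML, is sent to the model.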

Use Cases

  • E-commerce: Extract product details (price, specs, availability) from diverse layouts.
  • News Aggregation: Normalize article content, authors, and dates into a standard format.
  • Lead Generation: Extract contact info and company details from "About Us" pages.
  • Financial Data: Parse unstructured tables and reports into usable JSON.

Input

Provide a target URL, a JSON schema defining the data you want, and either an OpenAI or an Anthropic API key.

{
  "url": "https://news.ycombinator.com",
  "schema": {
    "stories": [
      {
        "title": "string",
        "points": "number",
        "author": "string"
      }
    ]
  },
  "openaiApiKey": "sk-...",
  "model": "gpt-4o",
  "maxTokens": 2000,
  "selector": "table.itemlist"
}

Configuration Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| url | String | Yes | The target URL to scrape. |
| schema | JSON | Yes | The structure you want the AI to extract. |
| openaiApiKey | String | No* | OpenAI API key (required if using GPT models). |
| anthropicApiKey | String | No* | Anthropic API key (required if using Claude models). |
| model | String | No | Model selection (e.g., gpt-4o, claude-3-5-sonnet). |
| selector | String | No | CSS selector to limit the scope (e.g., main#content). |
| instructions | String | No | Specific guidance for the AI (e.g., "Exclude ads"). |
| maxTokens | Integer | No | Limit response size (default: 4096). |

*One of the two API keys is required.
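The required-field rules, including the either-or key requirement in the footnote, can be checked client-side before starting a run. The helper below is a hypothetical sketch, not part of the Actor:

```python
def validate_input(actor_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the input looks valid.
    Field names follow the Actor's input table (hypothetical helper)."""
    errors = []
    if not actor_input.get("url"):
        errors.append("url is required")
    if not isinstance(actor_input.get("schema"), dict):
        errors.append("schema must be a JSON object")
    # The footnote rule: one of the two API keys is required
    if not (actor_input.get("openaiApiKey") or actor_input.get("anthropicApiKey")):
        errors.append("provide openaiApiKey or anthropicApiKey")
    return errors

print(validate_input({}))
# → ['url is required', 'schema must be a JSON object',
#    'provide openaiApiKey or anthropicApiKey']
```

Failing fast on a missing key locally is cheaper than paying for a run that errors out.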

Output

The Actor outputs a JSON object containing the extracted data and metadata about the run.

{
  "url": "https://news.ycombinator.com",
  "success": true,
  "tokensUsed": 1450,
  "model": "gpt-4o",
  "data": {
    "stories": [
      { "title": "Rust vs C++", "points": 156, "author": "dev_user" },
      { "title": "New AI Model", "points": 42, "author": "ai_researcher" }
    ]
  }
}
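A downstream consumer might handle a record like this as follows (a minimal sketch using the sample output above):

```python
import json

# The sample output record documented above
record = json.loads("""{
  "url": "https://news.ycombinator.com",
  "success": true,
  "tokensUsed": 1450,
  "model": "gpt-4o",
  "data": {"stories": [
    {"title": "Rust vs C++", "points": 156, "author": "dev_user"},
    {"title": "New AI Model", "points": 42, "author": "ai_researcher"}
  ]}
}""")

# Always check the success flag before trusting the extracted data
if record["success"]:
    for story in record["data"]["stories"]:
        print(f'{story["points"]:>4}  {story["title"]}')
```

Because the data already matches the schema you supplied, no post-hoc parsing or selector maintenance is needed.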

Optimization & Costs

LLM scraping involves two costs: the Apify run cost and your external LLM API usage. To minimize both:

  1. Use Selectors: Always provide a CSS selector (e.g., div.product-details) if possible. This discards headers, footers, and sidebars before sending text to the AI, significantly reducing your token bill.
  2. Choose the Right Model: gpt-4o and claude-3-5-sonnet offer the best balance of speed and intelligence. Smaller models are cheaper but may struggle with complex schemas.
  3. Clean Input: The Actor automatically removes <script>, <style>, and <nav> tags to ensure high-quality context for the AI.
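A rough back-of-the-envelope estimate of the two cost components might look like this. The $2.00 / 1,000 pages figure comes from the Actor's pricing above; the per-token LLM rate is a hypothetical assumption, not a quoted price:

```python
APIFY_COST_PER_PAGE = 2.00 / 1000   # from the Actor's pricing above
LLM_COST_PER_1K_TOKENS = 0.005      # hypothetical blended rate (assumption)

def estimated_cost(pages: int, avg_tokens_per_page: int) -> float:
    """Rough per-run cost: Apify usage plus external LLM token spend."""
    apify = pages * APIFY_COST_PER_PAGE
    llm = pages * avg_tokens_per_page / 1000 * LLM_COST_PER_1K_TOKENS
    return round(apify + llm, 2)

# 1,000 pages at ~1,450 tokens each (the tokensUsed figure from the sample output)
print(estimated_cost(1000, 1450))  # → 9.25
```

Note that the LLM share typically dominates, which is why scoping with a selector (fewer tokens per page) usually saves more than any Apify-side tuning.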