LLM Web Scraper
Under maintenance

Pricing

from $2.00 / 1,000 scraped pages

Turn any website into structured JSON using AI. Supports OpenAI GPT-4 and Anthropic Claude. Built in Rust to minimize compute costs while waiting for LLM responses. Extract data without selectors.


Rating: 0.0 (0 reviews)

Developer: Daniel Rosen (Maintained by Community)

Actor stats: 0 bookmarked · 1 total user · 0 monthly active users · last modified 2 days ago

Turn any website into structured JSON data using OpenAI (GPT) or Anthropic (Claude) models. This Actor fetches HTML, cleans it, and uses a Large Language Model (LLM) to extract specific data fields based on a schema you provide.

What This Does

Traditional scraping requires writing brittle CSS selectors or Regex for every field. This Actor uses semantic understanding to locate and extract data, making it resilient to layout changes.

  1. Fetches: Downloads the webpage (supports User-Agent rotation).
  2. Cleans: Prunes scripts, styles, and ads to minimize token usage.
  3. Extracts: Sends the content to an LLM alongside your JSON schema.
  4. Validates: Returns structured JSON matching your definition.
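The "Cleans" step above can be sketched in Python. This is a simplified illustration, not the Actor's actual implementation (which is built in Rust); it uses a regex-based pruner rather than a full HTML parser:

```python
import re

def clean_html(html: str) -> str:
    """Prune noisy elements and strip tags, keeping only the text content.
    Simplified sketch of the cleaning step; a real implementation would
    use a proper HTML parser."""
    # Remove entire <script>, <style>, and <nav> elements, including contents
    html = re.sub(r"<(script|style|nav)\b[^>]*>.*?</\1>", " ", html,
                  flags=re.IGNORECASE | re.DOTALL)
    # Drop all remaining tags, keeping their inner text
    text = re.sub(r"<[^>]+>", " ", html)
    # Collapse whitespace to minimize the tokens sent to the LLM
    return re.sub(r"\s+", " ", text).strip()

page = ("<html><head><style>p{color:red}</style></head><body>"
        "<nav>Menu</nav><p>Hello <b>world</b></p>"
        "<script>alert(1)</script></body></html>")
print(clean_html(page))  # → Hello world
```

Stripping markup this way is what keeps the token bill low: only the visible text, not the surrounding HTML, is sent to the model.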

Use Cases

  • E-commerce: Extract product details (price, specs, availability) from diverse layouts.
  • News Aggregation: Normalize article content, authors, and dates into a standard format.
  • Lead Generation: Extract contact info and company details from "About Us" pages.
  • Financial Data: Parse unstructured tables and reports into usable JSON.

Input

Provide a target URL, a JSON schema defining the data you want, and either an OpenAI or an Anthropic API key.

{
  "url": "https://news.ycombinator.com",
  "schema": {
    "stories": [
      {
        "title": "string",
        "points": "number",
        "author": "string"
      }
    ]
  },
  "openaiApiKey": "sk-...",
  "model": "gpt-4o",
  "maxTokens": 2000,
  "selector": "table.itemlist"
}

Configuration Parameters

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| url | String | Yes | The target URL to scrape. |
| schema | JSON | Yes | The structure you want the AI to extract. |
| openaiApiKey | String | No* | OpenAI API key (required if using GPT models). |
| anthropicApiKey | String | No* | Anthropic API key (required if using Claude models). |
| model | String | No | Model selection (e.g., gpt-4o, claude-3-5-sonnet). |
| selector | String | No | CSS selector to limit the scope (e.g., main#content). |
| instructions | String | No | Specific guidance for the AI (e.g., "Exclude ads"). |
| maxTokens | Integer | No | Limit response size (default: 4096). |

*One of the two API keys is required.
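The required-field rules, including the either-or key requirement in the footnote, can be checked client-side before starting a run. The helper below is a hypothetical sketch, not part of the Actor:

```python
def validate_input(actor_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the input looks valid.
    Field names follow the Actor's input table (hypothetical helper)."""
    errors = []
    if not actor_input.get("url"):
        errors.append("url is required")
    if not isinstance(actor_input.get("schema"), dict):
        errors.append("schema must be a JSON object")
    # The footnote rule: one of the two API keys is required
    if not (actor_input.get("openaiApiKey") or actor_input.get("anthropicApiKey")):
        errors.append("provide openaiApiKey or anthropicApiKey")
    return errors

print(validate_input({}))
# → ['url is required', 'schema must be a JSON object',
#    'provide openaiApiKey or anthropicApiKey']
```

Failing fast on a missing key locally is cheaper than paying for a run that errors out.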

Output

The Actor outputs a JSON object containing the extracted data and metadata about the run.

{
  "url": "https://news.ycombinator.com",
  "success": true,
  "tokensUsed": 1450,
  "model": "gpt-4o",
  "data": {
    "stories": [
      { "title": "Rust vs C++", "points": 156, "author": "dev_user" },
      { "title": "New AI Model", "points": 42, "author": "ai_researcher" }
    ]
  }
}
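A downstream consumer might handle a record like this as follows (a minimal sketch using the sample output above):

```python
import json

# The sample output record documented above
record = json.loads("""{
  "url": "https://news.ycombinator.com",
  "success": true,
  "tokensUsed": 1450,
  "model": "gpt-4o",
  "data": {"stories": [
    {"title": "Rust vs C++", "points": 156, "author": "dev_user"},
    {"title": "New AI Model", "points": 42, "author": "ai_researcher"}
  ]}
}""")

# Always check the success flag before trusting the extracted data
if record["success"]:
    for story in record["data"]["stories"]:
        print(f'{story["points"]:>4}  {story["title"]}')
```

Because the data already matches the schema you supplied, no post-hoc parsing or selector maintenance is needed.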

Optimization & Costs

LLM scraping involves two costs: the Apify run cost and your external LLM API usage. To minimize both:

  1. Use Selectors: Always provide a CSS selector (e.g., div.product-details) if possible. This discards headers, footers, and sidebars before sending text to the AI, significantly reducing your token bill.
  2. Choose the Right Model: gpt-4o and claude-3-5-sonnet offer the best balance of speed and intelligence. Smaller models are cheaper but may struggle with complex schemas.
  3. Clean Input: The Actor automatically removes <script>, <style>, and <nav> tags to ensure high-quality context for the AI.
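A rough back-of-the-envelope estimate of the two cost components might look like this. The $2.00 / 1,000 pages figure comes from the Actor's pricing above; the per-token LLM rate is a hypothetical assumption, not a quoted price:

```python
APIFY_COST_PER_PAGE = 2.00 / 1000   # from the Actor's pricing above
LLM_COST_PER_1K_TOKENS = 0.005      # hypothetical blended rate (assumption)

def estimated_cost(pages: int, avg_tokens_per_page: int) -> float:
    """Rough per-run cost: Apify usage plus external LLM token spend."""
    apify = pages * APIFY_COST_PER_PAGE
    llm = pages * avg_tokens_per_page / 1000 * LLM_COST_PER_1K_TOKENS
    return round(apify + llm, 2)

# 1,000 pages at ~1,450 tokens each (the tokensUsed figure from the sample output)
print(estimated_cost(1000, 1450))  # → 9.25
```

Note that the LLM share typically dominates, which is why scoping with a selector (fewer tokens per page) usually saves more than any Apify-side tuning.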