Structured Extract

Pricing

$50.00 / 1,000 structured extractions

Only pay when it works: $0.05 per verified extraction, with nothing charged on failures or retries. Extract structured JSON from any webpage using your own schema. AJV-validated output guaranteed. Compatible with Groq, OpenAI, Together AI, and Ollama.

Rating: 0.0 (0)

Developer: Herbert Yeboah (Maintained by Community)

Actor stats: 0 bookmarked · 0 total users · 0 monthly active users · last modified 4 days ago

Structured Data Extractor

Extract structured JSON from any webpage using a Groq-compatible LLM.

Provide a URL + a JSON Schema → get back validated, structured data. Works with Groq (free), OpenAI, Together AI, Fireworks AI, and Ollama.

What It Does

  1. Scrapes the page at your URL using a fast, lightweight crawler (CheerioCrawler; no browser or JavaScript rendering)
  2. Strips all HTML, navigation, scripts, and boilerplate → clean plain text
  3. Prompts a Groq-compatible LLM to extract data matching your schema
  4. Validates the response with AJV (JSON Schema validator)
  5. Retries up to 3 times if the LLM returns invalid JSON, injecting the error back into the prompt
  6. Returns validated structured data in the Apify dataset

Charge: $0.05 per successful extraction. Nothing charged on failure.
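The strip-to-plain-text step above can be sketched in a few lines. This is a simplified, regex-based stand-in for illustration only; the Actor itself parses HTML properly via CheerioCrawler rather than with regexes:

```typescript
// Simplified sketch of step 2: reduce raw HTML to clean plain text.
// Not production-grade HTML parsing -- just the idea.
function htmlToText(html: string): string {
  return html
    // drop script/style/nav blocks wholesale
    .replace(/<(script|style|nav)\b[\s\S]*?<\/\1>/gi, " ")
    // drop all remaining tags
    .replace(/<[^>]+>/g, " ")
    // collapse whitespace
    .replace(/\s+/g, " ")
    .trim();
}

const html = `<html><head><style>p{color:red}</style></head>
<body><nav><a href="/">Home</a></nav><p>Widget Pro costs $29.99</p></body></html>`;
console.log(htmlToText(html)); // "Widget Pro costs $29.99"
```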


Input Schema

Field          Type    Required  Default                  Description
url            string  yes       -                        Page to scrape
output_schema  object  yes       -                        JSON Schema defining the data to extract
groq_api_key   string  yes       -                        API key (Groq, OpenAI, Together AI, etc.)
model          string  no        llama-3.3-70b-versatile  Model name
base_url       string  no        Groq endpoint            For OpenAI-compatible providers

Usage Examples

Example 1: Groq (default, free tier)

Get a free API key at console.groq.com.

{
  "url": "https://example.com/product/widget-pro",
  "groq_api_key": "gsk_YOUR_GROQ_KEY_HERE",
  "output_schema": {
    "type": "object",
    "required": ["name", "price"],
    "properties": {
      "name": { "type": "string" },
      "price": { "type": "number" },
      "description": { "type": "string" },
      "in_stock": { "type": "boolean" }
    }
  }
}

Output:

{
  "url": "https://example.com/product/widget-pro",
  "extracted": {
    "name": "Widget Pro",
    "price": 29.99,
    "description": "The best widget on the market.",
    "in_stock": true
  },
  "model": "llama-3.3-70b-versatile",
  "attempts": 1
}

Example 2: OpenAI-compatible endpoint (Together AI, Fireworks AI)

Use any OpenAI-compatible provider by setting base_url:

{
  "url": "https://jobs.lever.co/anthropic/engineer",
  "groq_api_key": "YOUR_TOGETHER_AI_KEY",
  "base_url": "https://api.together.xyz/v1",
  "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo",
  "output_schema": {
    "type": "object",
    "required": ["title", "company", "location", "salary_range"],
    "properties": {
      "title": { "type": "string" },
      "company": { "type": "string" },
      "location": { "type": "string" },
      "salary_range": { "type": "string" },
      "remote": { "type": "boolean" },
      "requirements": {
        "type": "array",
        "items": { "type": "string" }
      }
    }
  }
}

Other compatible endpoints:

  • Fireworks AI: https://api.fireworks.ai/inference/v1
  • OpenAI: https://api.openai.com/v1

Example 3: Ollama (local, completely free)

Run models locally at zero cost with Ollama:

# Start Ollama with a model
ollama serve
ollama pull llama3.3

{
  "url": "https://news.ycombinator.com/item?id=12345",
  "groq_api_key": "ollama",
  "base_url": "http://localhost:11434/v1",
  "model": "llama3.3",
  "output_schema": {
    "type": "object",
    "required": ["title", "score", "comments_count"],
    "properties": {
      "title": { "type": "string" },
      "score": { "type": "integer" },
      "comments_count": { "type": "integer" },
      "author": { "type": "string" },
      "url": { "type": "string" }
    }
  }
}

Note: When running the Actor on the Apify platform, localhost is not reachable, so Ollama must be exposed through a remote endpoint. For local testing, run the Actor with apify run and point base_url at localhost.


Common Use Cases

Use Case              Schema Fields
Product extraction    name, price, description, in_stock, SKU
Job postings          title, company, location, salary, requirements
News articles         headline, author, published_date, summary, tags
Real estate listings  address, price, bedrooms, bathrooms, sqft
Restaurant menus      restaurant_name, items (name, price, description)
Resume parsing        name, email, skills, experience, education
Event listings        name, date, venue, ticket_price, organizer

How Retry Logic Works

The Actor uses the same retry-with-feedback pattern as constrained.py from the DagPipe core library:

  1. Attempt 1: Send text + schema → LLM responds → AJV validates
  2. On failure: Inject the exact AJV error message into the next prompt → retry
  3. Attempt 2: LLM receives error and corrects → validate again
  4. After 3 failures: the run fails with a descriptive error message (no charge is incurred)

This approach reliably extracts valid structured data even from smaller/cheaper models.
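The loop above can be sketched as follows. All names here are hypothetical and the validator is a simplified stand-in; the real Actor validates with AJV and calls an OpenAI-compatible endpoint rather than a mock:

```typescript
// Sketch of retry-with-feedback extraction (hypothetical helper names).
type Validator = (data: unknown) => string | null; // null = valid, otherwise error text

async function extractWithRetry(
  ask: (prompt: string) => Promise<string>,
  basePrompt: string,
  validate: Validator,
  maxAttempts = 3,
): Promise<{ data: unknown; attempts: number }> {
  let prompt = basePrompt;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await ask(prompt);
    try {
      const data = JSON.parse(raw);
      const error = validate(data);
      if (error === null) return { data, attempts: attempt };
      // Feed the exact validation error back into the next prompt.
      prompt = `${basePrompt}\n\nYour previous answer failed validation: ${error}\nReturn corrected JSON only.`;
    } catch (e) {
      prompt = `${basePrompt}\n\nYour previous answer was not valid JSON: ${(e as Error).message}\nReturn corrected JSON only.`;
    }
  }
  throw new Error(`Extraction failed after ${maxAttempts} attempts`);
}

// Mock LLM that answers with a schema violation once, then correctly.
const answers = ['{"price": "oops"}', '{"price": 29.99}'];
const mockAsk = async () => answers.shift() ?? "{}";
const validate: Validator = (d) =>
  typeof (d as any)?.price === "number" ? null : "price must be a number";

extractWithRetry(mockAsk, "Extract the price as JSON.", validate)
  .then((r) => console.log(r)); // { data: { price: 29.99 }, attempts: 2 }
```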


Pricing

  • $0.05 per successful extraction (Pay-Per-Event)
  • Free if extraction fails: you're never charged for failed attempts
  • Groq's free tier provides 30 requests/minute at zero cost to you

Technical Details

  • Scraper: CheerioCrawler (zero-JS, fast, reliable)
  • Validation: AJV v8 + ajv-formats (JSON Schema Draft-07/2019/2020 compatible)
  • LLM client: OpenAI SDK (works with any OpenAI-compatible endpoint)
  • Retry strategy: Error-feedback prompting (same pattern as DagPipe constrained.py)
  • Language: TypeScript, Node.js 20+
  • Tests: 9 Vitest tests, all passing

Built With

DagPipe: a zero-cost, crash-proof LLM pipeline orchestrator.

$ pip install dagpipe-core