Structured Extract

Pricing

$50.00 / 1,000 structured extractions

Only pay when it works: $0.05 per verified extraction, with nothing charged on failures or retries. Extract structured JSON from any webpage using your own schema. AJV-validated output guaranteed. Compatible with Groq, OpenAI, Together AI, and Ollama.

Rating: 0.0 (0)

Developer: Herbert Yeboah (Maintained by Community)

Actor stats: 0 bookmarked · 0 total users · 0 monthly active users · last modified 4 days ago

Structured Data Extractor

Extract structured JSON from any webpage using a Groq-compatible LLM.

Provide a URL + a JSON Schema → get back validated, structured data. Works with Groq (free), OpenAI, Together AI, Fireworks AI, and Ollama.

What It Does

  1. Scrapes the page at your URL using a fast, lightweight crawler (CheerioCrawler; no browser or JavaScript rendering)
  2. Strips all HTML, navigation, scripts, and boilerplate → clean plain text
  3. Prompts a Groq-compatible LLM to extract data matching your schema
  4. Validates the response with AJV (JSON Schema validator)
  5. Retries up to 3 times if the LLM returns invalid JSON, injecting the error back into the prompt
  6. Returns validated structured data in the Apify dataset

Charge: $0.05 per successful extraction. Nothing charged on failure.
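The strip-to-plain-text step above can be sketched in a few lines. This is a simplified, regex-based stand-in for illustration only; the Actor itself parses HTML properly via CheerioCrawler rather than with regexes:

```typescript
// Simplified sketch of step 2: reduce raw HTML to clean plain text.
// Not production-grade HTML parsing -- just the idea.
function htmlToText(html: string): string {
  return html
    // drop script/style/nav blocks wholesale
    .replace(/<(script|style|nav)\b[\s\S]*?<\/\1>/gi, " ")
    // drop all remaining tags
    .replace(/<[^>]+>/g, " ")
    // collapse whitespace
    .replace(/\s+/g, " ")
    .trim();
}

const html = `<html><head><style>p{color:red}</style></head>
<body><nav><a href="/">Home</a></nav><p>Widget Pro costs $29.99</p></body></html>`;
console.log(htmlToText(html)); // "Widget Pro costs $29.99"
```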


Input Schema

Field          Type    Required  Default                  Description
url            string  yes       -                        Page to scrape
output_schema  object  yes       -                        JSON Schema defining the data to extract
groq_api_key   string  yes       -                        API key (Groq, OpenAI, Together AI, etc.)
model          string  no        llama-3.3-70b-versatile  Model name
base_url       string  no        Groq endpoint            For OpenAI-compatible providers

Usage Examples

Example 1: Groq (default, free tier)

Get a free API key at console.groq.com.

{
  "url": "https://example.com/product/widget-pro",
  "groq_api_key": "gsk_YOUR_GROQ_KEY_HERE",
  "output_schema": {
    "type": "object",
    "required": ["name", "price"],
    "properties": {
      "name": { "type": "string" },
      "price": { "type": "number" },
      "description": { "type": "string" },
      "in_stock": { "type": "boolean" }
    }
  }
}

Output:

{
  "url": "https://example.com/product/widget-pro",
  "extracted": {
    "name": "Widget Pro",
    "price": 29.99,
    "description": "The best widget on the market.",
    "in_stock": true
  },
  "model": "llama-3.3-70b-versatile",
  "attempts": 1
}

Example 2: OpenAI-compatible endpoint (Together AI, Fireworks AI)

Use any OpenAI-compatible provider by setting base_url:

{
  "url": "https://jobs.lever.co/anthropic/engineer",
  "groq_api_key": "YOUR_TOGETHER_AI_KEY",
  "base_url": "https://api.together.xyz/v1",
  "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo",
  "output_schema": {
    "type": "object",
    "required": ["title", "company", "location", "salary_range"],
    "properties": {
      "title": { "type": "string" },
      "company": { "type": "string" },
      "location": { "type": "string" },
      "salary_range": { "type": "string" },
      "remote": { "type": "boolean" },
      "requirements": {
        "type": "array",
        "items": { "type": "string" }
      }
    }
  }
}

Other compatible endpoints:

  • Fireworks AI: https://api.fireworks.ai/inference/v1
  • OpenAI: https://api.openai.com/v1

Example 3: Ollama (local, completely free)

Run models locally at zero cost with Ollama:

# Start Ollama with a model
ollama serve
ollama pull llama3.3

{
  "url": "https://news.ycombinator.com/item?id=12345",
  "groq_api_key": "ollama",
  "base_url": "http://localhost:11434/v1",
  "model": "llama3.3",
  "output_schema": {
    "type": "object",
    "required": ["title", "score", "comments_count"],
    "properties": {
      "title": { "type": "string" },
      "score": { "type": "integer" },
      "comments_count": { "type": "integer" },
      "author": { "type": "string" },
      "url": { "type": "string" }
    }
  }
}

Note: When running the Actor on the Apify platform, localhost is not reachable, so Ollama must be exposed through a remote endpoint. For local testing, run the Actor with apify run and point base_url at localhost.


Common Use Cases

Use Case              Schema Fields
Product extraction    name, price, description, in_stock, SKU
Job postings          title, company, location, salary, requirements
News articles         headline, author, published_date, summary, tags
Real estate listings  address, price, bedrooms, bathrooms, sqft
Restaurant menus      restaurant_name, items (name, price, description)
Resume parsing        name, email, skills, experience, education
Event listings        name, date, venue, ticket_price, organizer

How Retry Logic Works

The Actor uses the same retry-with-feedback pattern as constrained.py from the DagPipe core library:

  1. Attempt 1: Send text + schema → LLM responds → AJV validates
  2. On failure: Inject the exact AJV error message into the next prompt → retry
  3. Attempt 2: LLM receives error and corrects → validate again
  4. After 3 failures: the run fails with a descriptive error message (no charge is incurred)

This approach reliably extracts valid structured data even from smaller/cheaper models.
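The loop above can be sketched as follows. All names here are hypothetical and the validator is a simplified stand-in; the real Actor validates with AJV and calls an OpenAI-compatible endpoint rather than a mock:

```typescript
// Sketch of retry-with-feedback extraction (hypothetical helper names).
type Validator = (data: unknown) => string | null; // null = valid, otherwise error text

async function extractWithRetry(
  ask: (prompt: string) => Promise<string>,
  basePrompt: string,
  validate: Validator,
  maxAttempts = 3,
): Promise<{ data: unknown; attempts: number }> {
  let prompt = basePrompt;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await ask(prompt);
    try {
      const data = JSON.parse(raw);
      const error = validate(data);
      if (error === null) return { data, attempts: attempt };
      // Feed the exact validation error back into the next prompt.
      prompt = `${basePrompt}\n\nYour previous answer failed validation: ${error}\nReturn corrected JSON only.`;
    } catch (e) {
      prompt = `${basePrompt}\n\nYour previous answer was not valid JSON: ${(e as Error).message}\nReturn corrected JSON only.`;
    }
  }
  throw new Error(`Extraction failed after ${maxAttempts} attempts`);
}

// Mock LLM that answers with a schema violation once, then correctly.
const answers = ['{"price": "oops"}', '{"price": 29.99}'];
const mockAsk = async () => answers.shift() ?? "{}";
const validate: Validator = (d) =>
  typeof (d as any)?.price === "number" ? null : "price must be a number";

extractWithRetry(mockAsk, "Extract the price as JSON.", validate)
  .then((r) => console.log(r)); // { data: { price: 29.99 }, attempts: 2 }
```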


Pricing

  • $0.05 per successful extraction (Pay-Per-Event)
  • Free if extraction fails: you're never charged for failed attempts
  • Groq's free tier provides 30 requests/minute at zero cost to you

Technical Details

  • Scraper: CheerioCrawler (zero-JS, fast, reliable)
  • Validation: AJV v8 + ajv-formats (JSON Schema Draft-07/2019/2020 compatible)
  • LLM client: OpenAI SDK (works with any OpenAI-compatible endpoint)
  • Retry strategy: Error-feedback prompting (same pattern as DagPipe constrained.py)
  • Language: TypeScript, Node.js 20+
  • Tests: 9 Vitest tests, all passing

Built With

DagPipe: a zero-cost, crash-proof LLM pipeline orchestrator.

$ pip install dagpipe-core