Structured Extract Actor

Extract structured data from any webpage using AI, powered by multiple LLM providers and built on the high-performance Bun runtime.

This Apify Actor intelligently extracts structured JSON data from web pages based on natural language descriptions. Instead of writing complex selectors or parsing logic, simply describe what data you want to extract and let AI do the work.

πŸš€ Features

  • Multi-LLM Support: Choose from OpenAI, Anthropic, or Google AI models
  • Smart Schema Generation: Automatically creates Zod validation schemas from your data descriptions
  • High Performance: Built on Bun runtime for faster execution
  • Browser Automation: Uses Playwright for reliable page rendering
  • Structured Output: Returns clean, validated JSON data
  • Standby Mode: HTTP server mode for programmatic access

πŸ“‹ Input Parameters

Field       Type     Required   Description
url         string   βœ…         The webpage URL to extract data from
prompt      string   βœ…         Natural language description of the data to extract
provider    string   βœ…         AI provider: openai, anthropic, or google
modelName   string   ⚠️         Specific model name (optional, uses sensible defaults)
apiKey      string   ⚠️         Provider API key (can use environment variables)

Default Models by Provider

  • OpenAI: gpt-4o-mini
  • Anthropic: claude-3-7-sonnet-latest
  • Google: gemini-2.5-flash-lite
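
If modelName is omitted, the Actor falls back to these defaults. A minimal sketch of such a lookup (a hypothetical helper mirroring the list above, not the Actor's actual source):

// Hypothetical default-model lookup mirroring the list above.
const DEFAULT_MODELS: Record<string, string> = {
  openai: 'gpt-4o-mini',
  anthropic: 'claude-3-7-sonnet-latest',
  google: 'gemini-2.5-flash-lite',
};

function resolveModel(provider: string, modelName?: string): string {
  return modelName ?? DEFAULT_MODELS[provider];
}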

πŸ”§ How It Works

  1. Schema Generation: The Actor uses your natural language prompt to generate a strict Zod validation schema
  2. Page Loading: Playwright loads the target webpage with full JavaScript rendering
  3. AI Extraction: LLM-Scraper processes the page content and extracts data according to the schema
  4. Validation: Results are validated against the generated schema for consistency
  5. Output: Clean, structured JSON data is returned to the dataset
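
For orientation, here is a simplified TypeScript sketch of that pipeline built on the playwright, zod, and llm-scraper packages. The hard-coded schema stands in for the one the Actor generates from your prompt, and the exact calls are an approximation rather than the Actor's actual source:

import { chromium } from 'playwright';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
import LLMScraper from 'llm-scraper';

// Step 1 (normally AI-generated from your prompt): a strict Zod schema.
const schema = z.object({
  contacts: z.array(z.object({
    name: z.string(),
    email: z.string(),
  })),
});

// Step 2: load the target page with full JavaScript rendering.
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

// Steps 3-4: the LLM extracts data, which is validated against the schema.
const scraper = new LLMScraper(openai.chat('gpt-4o-mini'));
const { data } = await scraper.run(page, schema, { format: 'html' });

// Step 5: `data` is clean, structured JSON matching the schema.
console.log(data);
await browser.close();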

πŸ” Providing the LLM API key (safer via headers)

  • Prefer sending the provider API key via request headers when using the standby HTTP server. The server reads the key from headers and ignores any apiKey value in the request body.
  • Keep apiKey in the input for batch/platform runs where headers are not available; it still works there.
  • Precedence when resolving the API key (sketched below):
    1. X-API-Key: <API_KEY> header
    2. apiKey field in input (primarily for batch/platform runs)
    3. Environment variables (OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_GENERATIVE_AI_API_KEY)
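
A sketch of that resolution order (a hypothetical helper, not the Actor's actual code):

// Maps each provider to the environment variable holding its key.
const ENV_KEYS: Record<string, string> = {
  openai: 'OPENAI_API_KEY',
  anthropic: 'ANTHROPIC_API_KEY',
  google: 'GOOGLE_GENERATIVE_AI_API_KEY',
};

function resolveApiKey(
  provider: string,
  headers: Record<string, string | undefined>, // header names normalized to lowercase
  input: { apiKey?: string },
): string | undefined {
  return (
    headers['x-api-key'] ??           // 1. X-API-Key header
    input.apiKey ??                   // 2. apiKey field in input
    process.env[ENV_KEYS[provider]]   // 3. environment variable
  );
}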

Example (standby server) β€” header-based key

curl -X POST http://localhost:3000 \
  -H "Content-Type: application/json" \
  -H "X-API-Key: PROVIDER_API_KEY" \
  -d '{
    "url": "https://example.com",
    "provider": "openai",
    "prompt": "extract contact information"
  }'
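
The same request from TypeScript, for example in a Bun or Node script (this assumes the standby server is listening on localhost:3000 as above):

// Programmatic equivalent of the curl example above.
const response = await fetch('http://localhost:3000', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-API-Key': process.env.OPENAI_API_KEY ?? '',
  },
  body: JSON.stringify({
    url: 'https://example.com',
    provider: 'openai',
    prompt: 'extract contact information',
  }),
});
const result = await response.json();
console.log(result);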

Example β€” apiKey in input

When calling the Actor as a standalone Actor run (where you can't provide custom headers), you can still use the apiKey field in the Actor input:

{
  "url": "https://example.com",
  "provider": "google",
  "apiKey": "YOUR_PROVIDER_API_KEY",
  "prompt": "extract news headlines"
}

Or configure the API key via environment variables instead:

  • OPENAI_API_KEY
  • ANTHROPIC_API_KEY
  • GOOGLE_GENERATIVE_AI_API_KEY

🎯 Example Usage

Extract Apify Store Data

{
  "url": "https://apify.com/store",
  "provider": "anthropic",
  "prompt": "extract a list of actors on apify, every actor has a name, author, rating/stars and number of users using the actor"
}

Extract E-commerce Products

{
  "url": "https://example-shop.com/products",
  "provider": "openai",
  "prompt": "extract all products with their names, prices, descriptions, and availability status"
}

Extract News Articles

{
  "url": "https://news-site.com",
  "provider": "google",
  "prompt": "extract all news articles with headline, author, publication date, and summary"
}

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Input Schema   β”‚ -> β”‚ Schema Generator β”‚ -> β”‚   LLM Scraper   β”‚
β”‚ (Natural Lang)  β”‚    β”‚   (AI-powered)   β”‚    β”‚  (Playwright)   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚                       β”‚
                                v                       v
                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                       β”‚ Zod Validation  β”‚     β”‚   Structured    β”‚
                       β”‚     Schema      β”‚     β”‚      JSON       β”‚
                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

🚦 Running the Actor

Via Apify Console

  1. Open the Actor in Apify Store
  2. Configure input parameters
  3. Click "Start"

Via Apify CLI

apify call structured-extract --input '{
  "url": "https://example.com",
  "provider": "anthropic",
  "prompt": "extract product information"
}'

Via API

curl -X POST https://api.apify.com/v2/acts/structured-extract/runs \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "provider": "openai",
    "prompt": "extract contact information",
    "apiKey": "YOUR_PROVIDER_API_KEY"
  }'

Omit the apiKey field if you configure the provider key via an environment variable on the Actor.
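
Alternatively, from JavaScript/TypeScript you can call the Actor through the official apify-client package. A minimal sketch (the Actor ID placeholder matches the curl example above):

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the run and wait for it to finish.
const run = await client.actor('structured-extract').call({
  url: 'https://example.com',
  provider: 'openai',
  prompt: 'extract contact information',
  apiKey: 'YOUR_PROVIDER_API_KEY', // or configure an environment variable on the Actor
});

// Read the extracted items from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);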

πŸ“Š Output Format

The Actor outputs structured JSON data validated by automatically generated Zod schemas. Example output:

{
  "items": [
    {
      "name": "Web Scraper",
      "author": "Apify",
      "rating": 4.8,
      "users": 12450
    }
  ],
  "meta": {
    "sourceUrl": "https://apify.com/store",
    "scrapedAt": "2024-01-15T10:30:00Z"
  }
}
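
For illustration, a generated schema equivalent to this output might look like the following (hypothetical; the real schema is derived from your prompt at run time):

import { z } from 'zod';

// Hypothetical auto-generated schema for the example output above.
const outputSchema = z.object({
  items: z.array(z.object({
    name: z.string(),
    author: z.string(),
    rating: z.number(),
    users: z.number(),
  })),
  meta: z.object({
    sourceUrl: z.string().url(),
    scrapedAt: z.string().datetime(),
  }),
});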

πŸ’‘ Best Practices

  1. Be Specific: Provide detailed descriptions of the data you want to extract
  2. Include Context: Mention the website type and data structure in your prompt
  3. Handle Rate Limits: Be mindful of API rate limits for your chosen AI provider
  4. Validate Output: The Actor automatically validates output against the generated schema

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

πŸ“ License

This project is licensed under the ISC License.

πŸ†˜ Support


Built with ❀️ using the Apify Platform and powered by cutting-edge AI technology.