# Structured Extract Actor
Extract structured data from any webpage using AI, powered by multiple LLM providers and built on the high-performance Bun runtime.
This Apify Actor intelligently extracts structured JSON data from web pages based on natural language descriptions. Instead of writing complex selectors or parsing logic, simply describe what data you want to extract and let AI do the work.
## Features
- Multi-LLM Support: Choose from OpenAI, Anthropic, or Google AI models
- Smart Schema Generation: Automatically creates Zod validation schemas from your data descriptions
- High Performance: Built on Bun runtime for faster execution
- Browser Automation: Uses Playwright for reliable page rendering
- Structured Output: Returns clean, validated JSON data
- Standby Mode: HTTP server mode for programmatic access
## Input Parameters
| Field | Type | Required | Description |
|---|---|---|---|
| `url` | string | Yes | The webpage URL to extract data from |
| `prompt` | string | Yes | Natural language description of the data to extract |
| `provider` | string | Yes | AI provider: `openai`, `anthropic`, or `google` |
| `modelName` | string | No | Specific model name (optional; sensible defaults are used) |
| `apiKey` | string | No | Provider API key (can use environment variables) |
### Default Models by Provider

- OpenAI: `gpt-4o-mini`
- Anthropic: `claude-3-7-sonnet-latest`
- Google: `gemini-2.5-flash-lite`
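As a rough sketch of the fallback behavior described above (the `resolveModel` helper and `Provider` type are illustrative, not the Actor's actual code):

```typescript
// Hypothetical helper illustrating the default-model fallback.
type Provider = "openai" | "anthropic" | "google";

const DEFAULT_MODELS: Record<Provider, string> = {
  openai: "gpt-4o-mini",
  anthropic: "claude-3-7-sonnet-latest",
  google: "gemini-2.5-flash-lite",
};

// Use the caller-supplied modelName when present, otherwise the provider default.
function resolveModel(provider: Provider, modelName?: string): string {
  return modelName ?? DEFAULT_MODELS[provider];
}
```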
## How It Works
1. Schema Generation: The Actor uses your natural language prompt to generate a strict Zod validation schema
2. Page Loading: Playwright loads the target webpage with full JavaScript rendering
3. AI Extraction: LLM-Scraper processes the page content and extracts data according to the schema
4. Validation: Results are validated against the generated schema for consistency
5. Output: Clean, structured JSON data is written to the dataset
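To make the validation step concrete, here is a minimal, dependency-free sketch of what enforcing a generated schema amounts to. The real Actor uses Zod; this hand-rolled `validateItem` with an assumed `name`/`author`/`rating` shape is purely illustrative:

```typescript
// Illustrative stand-in for a generated Zod schema; not the Actor's real code.
interface ActorItem {
  name: string;
  author: string;
  rating: number;
}

// Reject any extracted item that does not match the expected shape.
function validateItem(raw: unknown): ActorItem {
  const obj = raw as Record<string, unknown>;
  if (typeof obj?.name !== "string") throw new Error("name must be a string");
  if (typeof obj?.author !== "string") throw new Error("author must be a string");
  if (typeof obj?.rating !== "number") throw new Error("rating must be a number");
  return { name: obj.name, author: obj.author, rating: obj.rating };
}
```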
## Providing the LLM API Key (safer via headers)

- Prefer sending the provider API key via request headers when using the standby HTTP server. The server reads the key from the headers and ignores any `apiKey` value in the request body.
- Keep `apiKey` in the input for batch/platform runs where headers are not available; it still works there.
- Precedence when resolving the API key:
  1. `X-API-Key: <API_KEY>` header
  2. `apiKey` field in the input (primarily for batch/platform runs)
  3. Environment variables (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_GENERATIVE_AI_API_KEY`)
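The precedence above could be implemented roughly like this (a sketch; `resolveApiKey` and `KeySources` are hypothetical names, not the server's actual code):

```typescript
// Illustrative sketch of the key-resolution order; not the Actor's real implementation.
interface KeySources {
  headers?: Record<string, string>;        // normalized lowercase header names
  inputApiKey?: string;                    // `apiKey` field from the Actor input
  env?: Record<string, string | undefined>; // process environment
}

// Environment variable expected for each provider.
const ENV_VARS: Record<string, string> = {
  openai: "OPENAI_API_KEY",
  anthropic: "ANTHROPIC_API_KEY",
  google: "GOOGLE_GENERATIVE_AI_API_KEY",
};

function resolveApiKey(provider: string, src: KeySources): string | undefined {
  return (
    src.headers?.["x-api-key"] ??     // the header wins over everything else
    src.inputApiKey ??                // then the input field
    src.env?.[ENV_VARS[provider]]     // finally the environment variable
  );
}
```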
### Example (standby server): header-based key

```shell
curl -X POST http://localhost:3000 \
  -H "Content-Type: application/json" \
  -H "X-API-Key: PROVIDER_API_KEY" \
  -d '{"url": "https://example.com", "provider": "openai", "prompt": "extract contact information"}'
```
### Example: `apiKey` in input

When calling the Actor as a standalone Actor run (where you can't provide custom headers), you may still use the `apiKey` field in the Actor input:

```json
{
  "url": "https://example.com",
  "provider": "google",
  "apiKey": "YOUR_PROVIDER_API_KEY",
  "prompt": "extract news headlines"
}
```
Or configure the API key via environment variables instead:

- `OPENAI_API_KEY`
- `ANTHROPIC_API_KEY`
- `GOOGLE_GENERATIVE_AI_API_KEY`
## Example Usage
### Extract Apify Store Data

```json
{
  "url": "https://apify.com/store",
  "provider": "anthropic",
  "prompt": "extract a list of actors on apify, every actor has a name, author, rating/stars and number of users using the actor"
}
```

### Extract E-commerce Products

```json
{
  "url": "https://example-shop.com/products",
  "provider": "openai",
  "prompt": "extract all products with their names, prices, descriptions, and availability status"
}
```

### Extract News Articles

```json
{
  "url": "https://news-site.com",
  "provider": "google",
  "prompt": "extract all news articles with headline, author, publication date, and summary"
}
```
## Architecture

```
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│   Input Schema   │ -> │ Schema Generator │ -> │   LLM Scraper    │
│  (Natural Lang)  │    │   (AI-powered)   │    │   (Playwright)   │
└──────────────────┘    └──────────────────┘    └──────────────────┘
                                 |                       |
                                 v                       v
                        ┌──────────────────┐    ┌──────────────────┐
                        │  Zod Validation  │    │    Structured    │
                        │      Schema      │    │       JSON       │
                        └──────────────────┘    └──────────────────┘
```
## Running the Actor
### Via Apify Console

1. Open the Actor in the Apify Store
2. Configure the input parameters
3. Click "Start"
### Via Apify CLI

```shell
apify call structured-extract --input '{
  "url": "https://example.com",
  "provider": "anthropic",
  "prompt": "extract product information"
}'
```
### Via API

```shell
curl -X POST https://api.apify.com/v2/acts/structured-extract/runs \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "provider": "openai", "prompt": "extract contact information", "apiKey": "YOUR_PROVIDER_API_KEY"}'
```

Instead of passing `apiKey` in the payload, you can configure the provider key as an environment variable on the Actor.
## Output Format
The Actor outputs structured JSON data validated by automatically generated Zod schemas. Example output:
```json
{
  "items": [
    {
      "name": "Web Scraper",
      "author": "Apify",
      "rating": 4.8,
      "users": 12450
    }
  ],
  "meta": {
    "sourceUrl": "https://apify.com/store",
    "scrapedAt": "2024-01-15T10:30:00Z"
  }
}
```
## Best Practices
- Be Specific: Provide detailed descriptions of the data you want to extract
- Include Context: Mention the website type and data structure in your prompt
- Handle Rate Limits: Be mindful of API rate limits for your chosen AI provider
- Validate Output: The Actor automatically validates output against the generated schema
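For the rate-limit point, a common approach is exponential backoff between retries. A small sketch (illustrative only; the Actor does not necessarily retry this way):

```typescript
// Compute an exponential backoff schedule with a cap, e.g. for retrying
// rate-limited provider calls. Purely illustrative; not part of the Actor.
function backoffDelaysMs(attempts: number, baseMs = 1000, capMs = 30000): number[] {
  return Array.from({ length: attempts }, (_, i) => Math.min(baseMs * 2 ** i, capMs));
}
```

For three attempts with the defaults this yields delays of 1 s, 2 s, and 4 s, doubling each time until the 30 s cap is reached.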
## Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
This project is licensed under the ISC License.
## Support

Built with ❤️ using the Apify Platform and powered by cutting-edge AI technology.