# Structured Extract Actor
Extract structured data from any webpage using AI, powered by multiple LLM providers and built on the high-performance Bun runtime.
This Apify Actor intelligently extracts structured JSON data from web pages based on natural language descriptions. Instead of writing complex selectors or parsing logic, simply describe what data you want to extract and let AI do the work.
## Features
- Multi-LLM Support: Choose from OpenAI, Anthropic, or Google AI models
- Smart Schema Generation: Automatically creates Zod validation schemas from your data descriptions
- High Performance: Built on Bun runtime for faster execution
- Browser Automation: Uses Playwright for reliable page rendering
- Structured Output: Returns clean, validated JSON data
- Standby Mode: HTTP server mode for programmatic access
## Input Parameters
| Field | Type | Required | Description |
|---|---|---|---|
| `url` | string | Yes | The webpage URL to extract data from |
| `prompt` | string | Yes | Natural language description of the data to extract |
| `provider` | string | Yes | AI provider: `openai`, `anthropic`, or `google` |
| `modelName` | string | No | Specific model name (optional; sensible defaults are used) |
| `apiKey` | string | No | Provider API key (can use environment variables) |
### Default Models by Provider

- OpenAI: `gpt-4o-mini`
- Anthropic: `claude-3-7-sonnet-latest`
- Google: `gemini-2.5-flash-lite`
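As a rough sketch of the fallback behavior described above (the `resolveModel` helper and `Provider` type are illustrative, not the Actor's actual code):

```typescript
// Hypothetical helper illustrating the default-model fallback.
type Provider = "openai" | "anthropic" | "google";

const DEFAULT_MODELS: Record<Provider, string> = {
  openai: "gpt-4o-mini",
  anthropic: "claude-3-7-sonnet-latest",
  google: "gemini-2.5-flash-lite",
};

// Use the caller-supplied modelName when present, otherwise the provider default.
function resolveModel(provider: Provider, modelName?: string): string {
  return modelName ?? DEFAULT_MODELS[provider];
}
```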
## How It Works
1. Schema Generation: The Actor uses your natural language prompt to generate a strict Zod validation schema
2. Page Loading: Playwright loads the target webpage with full JavaScript rendering
3. AI Extraction: LLM-Scraper processes the page content and extracts data according to the schema
4. Validation: Results are validated against the generated schema for consistency
5. Output: Clean, structured JSON data is written to the dataset
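To make the validation step concrete, here is a minimal, dependency-free sketch of what enforcing a generated schema amounts to. The real Actor uses Zod; this hand-rolled `validateItem` with an assumed `name`/`author`/`rating` shape is purely illustrative:

```typescript
// Illustrative stand-in for a generated Zod schema; not the Actor's real code.
interface ActorItem {
  name: string;
  author: string;
  rating: number;
}

// Reject any extracted item that does not match the expected shape.
function validateItem(raw: unknown): ActorItem {
  const obj = raw as Record<string, unknown>;
  if (typeof obj?.name !== "string") throw new Error("name must be a string");
  if (typeof obj?.author !== "string") throw new Error("author must be a string");
  if (typeof obj?.rating !== "number") throw new Error("rating must be a number");
  return { name: obj.name, author: obj.author, rating: obj.rating };
}
```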
## Providing the LLM API Key (safer via headers)

- Prefer sending the provider API key via request headers when using the standby HTTP server. The server reads the key from the headers and ignores any `apiKey` value in the request body.
- Keep `apiKey` in the input for batch/platform runs where headers are not available; it still works there.
- Precedence when resolving the API key:
  1. `X-API-Key: <API_KEY>` header
  2. `apiKey` field in the input (primarily for batch/platform runs)
  3. Environment variables (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_GENERATIVE_AI_API_KEY`)
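The precedence above could be implemented roughly like this (a sketch; `resolveApiKey` and `KeySources` are hypothetical names, not the server's actual code):

```typescript
// Illustrative sketch of the key-resolution order; not the Actor's real implementation.
interface KeySources {
  headers?: Record<string, string>;        // normalized lowercase header names
  inputApiKey?: string;                    // `apiKey` field from the Actor input
  env?: Record<string, string | undefined>; // process environment
}

// Environment variable expected for each provider.
const ENV_VARS: Record<string, string> = {
  openai: "OPENAI_API_KEY",
  anthropic: "ANTHROPIC_API_KEY",
  google: "GOOGLE_GENERATIVE_AI_API_KEY",
};

function resolveApiKey(provider: string, src: KeySources): string | undefined {
  return (
    src.headers?.["x-api-key"] ??     // the header wins over everything else
    src.inputApiKey ??                // then the input field
    src.env?.[ENV_VARS[provider]]     // finally the environment variable
  );
}
```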
### Example (standby server): header-based key

```shell
curl -X POST http://localhost:3000 \
  -H "Content-Type: application/json" \
  -H "X-API-Key: PROVIDER_API_KEY" \
  -d '{"url": "https://example.com", "provider": "openai", "prompt": "extract contact information"}'
```
### Example: `apiKey` in input

When calling the Actor as a standalone Actor run (where you can't provide custom headers), you may still use the `apiKey` field in the Actor input:

```json
{
  "url": "https://example.com",
  "provider": "google",
  "apiKey": "YOUR_PROVIDER_API_KEY",
  "prompt": "extract news headlines"
}
```
Or configure the API key via environment variables instead:

- `OPENAI_API_KEY`
- `ANTHROPIC_API_KEY`
- `GOOGLE_GENERATIVE_AI_API_KEY`
## Example Usage
### Extract Apify Store Data

```json
{
  "url": "https://apify.com/store",
  "provider": "anthropic",
  "prompt": "extract a list of actors on apify, every actor has a name, author, rating/stars and number of users using the actor"
}
```

### Extract E-commerce Products

```json
{
  "url": "https://example-shop.com/products",
  "provider": "openai",
  "prompt": "extract all products with their names, prices, descriptions, and availability status"
}
```

### Extract News Articles

```json
{
  "url": "https://news-site.com",
  "provider": "google",
  "prompt": "extract all news articles with headline, author, publication date, and summary"
}
```
## Architecture

```
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│   Input Schema   │ -> │ Schema Generator │ -> │   LLM Scraper    │
│  (Natural Lang)  │    │   (AI-powered)   │    │   (Playwright)   │
└──────────────────┘    └──────────────────┘    └──────────────────┘
                                 |                       |
                                 v                       v
                        ┌──────────────────┐    ┌──────────────────┐
                        │  Zod Validation  │    │    Structured    │
                        │      Schema      │    │       JSON       │
                        └──────────────────┘    └──────────────────┘
```
## Running the Actor
### Via Apify Console

1. Open the Actor in the Apify Store
2. Configure the input parameters
3. Click "Start"
### Via Apify CLI

```shell
apify call structured-extract --input '{
  "url": "https://example.com",
  "provider": "anthropic",
  "prompt": "extract product information"
}'
```
### Via API

```shell
curl -X POST https://api.apify.com/v2/acts/structured-extract/runs \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "provider": "openai", "prompt": "extract contact information", "apiKey": "YOUR_PROVIDER_API_KEY"}'
```

Instead of passing `apiKey` in the payload, you can configure the provider key as an environment variable on the Actor.
## Output Format
The Actor outputs structured JSON data validated by automatically generated Zod schemas. Example output:
```json
{
  "items": [
    {
      "name": "Web Scraper",
      "author": "Apify",
      "rating": 4.8,
      "users": 12450
    }
  ],
  "meta": {
    "sourceUrl": "https://apify.com/store",
    "scrapedAt": "2024-01-15T10:30:00Z"
  }
}
```
## Best Practices
- Be Specific: Provide detailed descriptions of the data you want to extract
- Include Context: Mention the website type and data structure in your prompt
- Handle Rate Limits: Be mindful of API rate limits for your chosen AI provider
- Validate Output: The Actor automatically validates output against the generated schema
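For the rate-limit point, a common approach is exponential backoff between retries. A small sketch (illustrative only; the Actor does not necessarily retry this way):

```typescript
// Compute an exponential backoff schedule with a cap, e.g. for retrying
// rate-limited provider calls. Purely illustrative; not part of the Actor.
function backoffDelaysMs(attempts: number, baseMs = 1000, capMs = 30000): number[] {
  return Array.from({ length: attempts }, (_, i) => Math.min(baseMs * 2 ** i, capMs));
}
```

For three attempts with the defaults this yields delays of 1 s, 2 s, and 4 s, doubling each time until the 30 s cap is reached.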
## Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
This project is licensed under the ISC License.
## Support

Built with ❤️ using the Apify Platform and powered by cutting-edge AI technology.