Hybrid Vision Spider | AI-Powered Universal Web Scraper

AI-driven hybrid web scraper that merges Playwright automation and Vision intelligence to extract structured data from any dynamic site.

Features

  • Hybrid Scraping: Combines fast HTML parsing with AI-powered visual analysis
  • Multi-Engine Support: Choose Chromium, Firefox, or Camoufox (stealth) browsers
  • Schema-Based Extraction: Define your desired output structure using JSON Schema
  • Intelligent Heuristics: Auto-detect emails, phone numbers, prices, dates, and outbound URLs when present
  • Token Budget Control: Set limits on Vision API usage to control costs
  • Proxy Support: Built-in Apify proxy integration for anti-bot protection
  • Flexible Modes: HTML-only, Vision-only, or Hybrid strategies
  • Per-Run Secrets: Override the OpenAI key on a per-run basis via openAiApiKey

What the spider captures

  • Structured fields – anything you describe in the schema input (product cards, job listings, knowledge panels, etc.).
  • Vision understanding – GPT-4o-mini Vision reads pricing tables, feature boxes, hero banners, or embedded text that pure HTML parsers miss.
  • Automatic heuristics – if your schema contains fields like email, phone, price, date, or externalUrl, the spider will auto-detect them directly from the HTML.
  • Raw artefacts – every run stores full HTML and PNG screenshots so you can debug and audit results.
  • Confidence telemetry – dataset items include per-field confidence scores, an average score, and a report of missing required fields.

Quick Start

Prerequisites

You'll need two API keys:

  1. Apify Token - Get it from Apify Console
  2. OpenAI API Key - Get it from OpenAI Platform

Setup

  1. Clone and install dependencies:

cd hybrid-vision-spider
npm install

  2. Configure API keys:

Copy the .env.example file to .env and add your keys:

$ cp .env.example .env

Edit .env (optional if you plan to pass openAiApiKey in the Actor input):

APIFY_TOKEN=apify_api_xxxxxxxxxxxxxxxxxxxxx
OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxxxxxxx

  3. Set up the Apify secret for production (or rely on per-run openAiApiKey):

$ apify login
$ apify secrets:add OPENAI_API_KEY "your-openai-api-key-here"

Local Development

Run the Actor locally using the test input in .actor/INPUT.json:

$ apify run

Or with custom input:

$ apify run --input-file my-input.json

To override the OpenAI credential for a single run:

$ apify run --input='{"urls":["https://example.com"],"mode":"vision-only","openAiApiKey":"sk-..."}'

Input fields explained

  • Start URLs (simple list) – paste URLs one per line in the UI when you need a quick crawl. The actor normalizes and deduplicates them automatically.
  • Advanced Request Sources – when you need HTTP method overrides, custom headers, or userData, use the advanced request list editor.
  • Mode – hybrid strikes the best balance (HTML heuristics first, Vision only when necessary). Fall back to html-only for static pages or vision-only for fully rendered experiences.
  • Max Results – set to 0 for unlimited. The spider comfortably handles 1,000+ records per run when your Vision budget allows it.
  • Vision Token Budget / Max Vision Pages – caps OpenAI usage so a runaway crawl can’t surprise your wallet.

Build TypeScript before running:

npm run build
npm start

Deployment to Apify Platform

  1. Log in to Apify:

$ apify login

  2. Deploy the Actor:

$ apify push

The Actor will be available in your Apify Console at: https://console.apify.com/actors

Output

  • Dataset view – Use the overview view linked from the Output tab to browse URL, method, structured data, confidence, missing fields, tokens, and timestamps.
  • Artifacts – Screenshots (screenshot-*.png) and HTML files (html-*.html) are grouped into collections in the default key-value store.
  • Run stats – A STATS record in the key-value store keeps totals for pages processed, tokens consumed, and error breakdowns.
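
For programmatic access to these outputs, here is a minimal sketch using the official apify-client package (npm install apify-client); the run ID is a placeholder, and STATS is the record name mentioned above:

// Sketch: inspect a finished run's outputs and artifacts with apify-client.
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

const run = await client.run('<RUN_ID>').get();
if (!run) throw new Error('Run not found');

// Run totals: pages processed, tokens consumed, error breakdown.
const store = client.keyValueStore(run.defaultKeyValueStoreId);
const stats = await store.getRecord('STATS');
console.log(stats?.value);

// Screenshot and HTML artifacts saved alongside the dataset.
const { items: keys } = await store.listKeys();
for (const { key } of keys) {
    if (key.startsWith('screenshot-') || key.startsWith('html-')) console.log(key);
}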

💰 Cost Control & Pricing

Understanding Costs

Hybrid Vision Spider uses a Pay-Per-Event pricing model with transparent token tracking:

  • HTML-only mode: ~0.001 credits/page (cheapest)
  • Hybrid mode: 0.001 - 0.05 credits/page (adaptive)
  • Vision-only mode: 0.02 - 0.10 credits/page (most accurate)

Budget Controls

  1. maxVisionPages: Hard limit on pages processed with Vision AI
  2. visionTokenBudget: Token budget for OpenAI API calls
  3. 90% warning: Automatic warning at 90% budget consumption
  4. Graceful degradation: Falls back to HTML-only when budget exhausted
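
A minimal sketch of how such a budget guard can behave (the interface and helper names below are illustrative, not the Actor's actual internals):

// Illustrative budget guard: counts Vision tokens and pages, warns at 90%,
// and degrades to HTML-only extraction once either limit is reached.
interface VisionBudget {
    tokenBudget: number;     // visionTokenBudget from the Actor input
    maxVisionPages: number;  // maxVisionPages from the Actor input
    tokensUsed: number;
    visionPagesUsed: number;
}

function canUseVision(b: VisionBudget): boolean {
    if (b.tokensUsed >= b.tokenBudget * 0.9) {
        console.warn(`Vision budget at ${Math.round((b.tokensUsed / b.tokenBudget) * 100)}% of visionTokenBudget`);
    }
    // Graceful degradation: refuse Vision once either cap is exhausted.
    return b.tokensUsed < b.tokenBudget && b.visionPagesUsed < b.maxVisionPages;
}

function recordVisionCall(b: VisionBudget, tokens: number): void {
    b.tokensUsed += tokens;
    b.visionPagesUsed += 1;
}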

Example Cost Calculation

100 pages × hybrid mode = ~2-5 credits
100 pages × vision-only = ~5-10 credits

🔒 Security & Privacy

API Key Management

  • Store secrets only in Apify Secrets (OPENAI_API_KEY)
  • Never log API keys or tokens
  • All logs automatically sanitized

Webhook Security

Optional webhook support with HMAC SHA-256 signature verification:

X-Signature: <hmac-sha256-signature>

⚖️ Compliance & Responsible Use

  • You are fully responsible for how you process the extracted data. By running the Actor you acknowledge that you will comply with GDPR, CCPA, AICPA SOC 2, and all other applicable local regulations.
  • Always respect websites' Terms of Service and robots.txt directives.
  • Store, secure, and delete personal data according to the legal framework governing your organization.

⚡ Performance Optimization

Heuristic Pre-filtering

Common fields (email, phone, price) are extracted via regex before any Vision API call:

  • ~75% confidence for regex matches
  • Saves tokens by skipping Vision for simple fields
  • Faster extraction for structured data
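
A rough sketch of this kind of regex pre-filtering (patterns simplified for illustration; the Actor's real heuristics may differ):

// Illustrative heuristic extractor: cheap regex passes over the page text
// before any Vision call is considered.
const HEURISTICS: Record<string, RegExp> = {
    email: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/,
    phone: /\+?\d[\d\s().-]{7,}\d/,
    price: /(?:\$|€|£)\s?\d+(?:[.,]\d{2})?/,
};

function extractHeuristics(text: string, schemaFields: string[]): Record<string, string> {
    const found: Record<string, string> = {};
    for (const field of schemaFields) {
        const match = HEURISTICS[field]?.exec(text);
        if (match) found[field] = match[0]; // confidence (~0.75) is attached upstream
    }
    return found;
}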

Deduplication

Automatic duplicate detection based on:

  • URL
  • Key fields (title, id, etc.)
  • Content hash (MD5)
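
A minimal sketch of how such a deduplication key can be composed (assuming the URL, key fields, and MD5 content hash listed above; names are illustrative):

// Illustrative dedup key: URL + selected key fields + an MD5 hash of the extracted data.
import { createHash } from 'node:crypto';

const seen = new Set<string>();

function isDuplicate(url: string, data: Record<string, unknown>): boolean {
    const keyFields = ['title', 'id'].map((f) => String(data[f] ?? '')).join('|');
    const contentHash = createHash('md5').update(JSON.stringify(data)).digest('hex');
    const key = `${url}|${keyFields}|${contentHash}`;
    if (seen.has(key)) return true;
    seen.add(key);
    return false;
}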

Adaptive Sampling

If multiple pages fail, the spider automatically switches to Vision-only mode for reliability.

📊 Output Schema

Dataset Item Structure

{
  "url": "string",
  "method": "html-only|html-heuristic|vision|vision-retry",
  "data": { /* extracted fields per schema */ },
  "confidence": { /* per-field confidence 0-1 */ },
  "confidenceAverage": 0.87,
  "missingFields": ["fieldName"],
  "tokensUsed": 0,
  "screenshotKey": "screenshot-*.png",
  "htmlKey": "html-*.html",
  "sources": {
    "heuristics": ["price"],
    "vision": ["title", "description"]
  },
  "timestamp": "ISO 8601",
  "error": "optional error message"
}

STATS.json

{
  "pagesProcessed": 100,
  "visionPagesUsed": 37,
  "totalTokens": 45678,
  "itemsExtracted": 97,
  "errors": 3,
  "avgTokensPerPage": 1234,
  "durationSec": 542
}

Input Configuration

Required Fields

  • urls (array of strings): List of URLs to scrape
  • schema (object): JSON Schema defining the expected output structure

Optional Fields

  • mode (string): Scraping strategy

    • hybrid (default): Try HTML first, use Vision as fallback
    • html-only: Fast, cost-free HTML parsing only
    • vision-only: AI-powered visual extraction
  • engine (string): Browser engine selection

    • chromium (default)
    • firefox
    • camoufox (stealth mode)
  • useProxy (boolean): Enable Apify proxy (default: false)

  • maxResults (integer): Maximum items to extract (default: 100, 0 = unlimited)

  • maxVisionPages (integer): Maximum pages to process with Vision API (default: 10)

  • visionTokenBudget (integer): Total token limit for Vision API calls (default: 50000)

  • openAiApiKey (string, nullable): Override the OpenAI API key for this run
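
To start the Actor programmatically with this input, a sketch using the official apify-client package could look like the following (replace <ACTOR_ID> with the Actor's ID or username/actor-name slug from Apify Console):

// Sketch: call the Actor with a typical input and read back the dataset items.
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

const run = await client.actor('<ACTOR_ID>').call({
    urls: ['https://example.com/products/item-1'],
    mode: 'hybrid',
    engine: 'chromium',
    maxResults: 100,
    maxVisionPages: 10,
    visionTokenBudget: 50000,
    schema: {
        type: 'object',
        properties: {
            title: { type: 'string', description: 'Product title' },
            price: { type: 'number', description: 'Product price' },
        },
        required: ['title', 'price'],
    },
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);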

Example Input

E-commerce Product Scraper

{
  "urls": [
    "https://example.com/products/item-1",
    "https://example.com/products/item-2"
  ],
  "mode": "hybrid",
  "engine": "chromium",
  "useProxy": false,
  "maxResults": 100,
  "maxVisionPages": 10,
  "visionTokenBudget": 50000,
  "schema": {
    "type": "object",
    "properties": {
      "title": {
        "type": "string",
        "description": "Product title"
      },
      "price": {
        "type": "number",
        "description": "Product price"
      },
      "description": {
        "type": "string",
        "description": "Product description"
      },
      "availability": {
        "type": "string",
        "description": "Stock status"
      }
    },
    "required": ["title", "price"]
  }
}

Documentation Scraper

{
  "urls": [
    "https://docs.example.com/page1",
    "https://docs.example.com/page2"
  ],
  "mode": "hybrid",
  "engine": "chromium",
  "useProxy": false,
  "maxResults": 50,
  "maxVisionPages": 5,
  "visionTokenBudget": 20000,
  "schema": {
    "type": "object",
    "properties": {
      "title": {
        "type": "string",
        "description": "Page title"
      },
      "description": {
        "type": "string",
        "description": "Meta description or summary"
      },
      "headings": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Main section headings"
      },
      "codeExamples": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Code snippets found on page"
      }
    },
    "required": ["title"]
  }
}

Output

Data is stored in the default dataset with the following structure:

{
  "url": "https://example.com/product",
  "timestamp": "2024-01-01T12:00:00.000Z",
  "method": "vision",
  "confidence": 0.95,
  "tokensUsed": 1250,
  "data": {
    "title": "Example Product",
    "price": 99.99,
    "description": "Product description...",
    "availability": "In Stock"
  }
}

Output Fields

  • url: The scraped page URL
  • timestamp: ISO 8601 timestamp of extraction
  • method: Extraction method used (html-only, html-heuristic, vision, or vision-retry)
  • confidence: Confidence score (0-1) for vision extractions
  • tokensUsed: Number of OpenAI tokens consumed
  • data: Extracted data matching your schema

How It Works

  1. URL Processing: Each URL is processed sequentially
  2. HTML Extraction: Fast initial attempt using CheerioCrawler (hybrid/html-only modes)
  3. Browser Rendering: If needed, launches Playwright to render JavaScript-heavy pages
  4. Screenshot Capture: Full-page screenshot for visual analysis
  5. Vision Analysis: Sends HTML + screenshot to OpenAI GPT-4o-mini for extraction
  6. Schema Validation: Validates extracted data against provided JSON Schema
  7. Data Storage: Saves validated results to Apify Dataset
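
Step 6 uses the AJV validator listed in the technical stack. A minimal sketch of schema validation that warns instead of failing (function name and warning format are illustrative):

// Validate extracted data against the input JSON Schema; return warnings so the crawl continues.
import Ajv from 'ajv';

const ajv = new Ajv({ allErrors: true });

function validateExtraction(schema: object, data: unknown): string[] {
    const validate = ajv.compile(schema);
    if (validate(data)) return [];
    return (validate.errors ?? []).map((e) => `${e.instancePath || '/'} ${e.message}`);
}

const warnings = validateExtraction(
    { type: 'object', properties: { title: { type: 'string' } }, required: ['title'] },
    { price: 9.99 },
);
console.log(warnings); // e.g. [ "/ must have required property 'title'" ]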

Cost Optimization

  • Use html-only mode when possible to avoid Vision API costs
  • Set appropriate maxVisionPages and visionTokenBudget limits
  • Vision API uses GPT-4o-mini for cost-effective extraction
  • HTML content is truncated to 200KB to reduce token usage

Example Costs

OpenAI GPT-4o-mini pricing (as of 2024):

  • Input: $0.15 per 1M tokens
  • Output: $0.60 per 1M tokens

Typical page processing:

  • HTML page: ~2,000-5,000 tokens
  • Screenshot: ~1,500-3,000 tokens (vision tokens)
  • Cost per page: ~$0.001-0.003
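
As a rough worked example using the figures above, a hybrid page that consumes about 5,000 HTML input tokens, 2,000 vision input tokens, and 500 output tokens costs approximately:

(7,000 × $0.15 + 500 × $0.60) / 1,000,000 ≈ $0.0014 per page

which sits inside the ~$0.001-0.003 range quoted above.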

Billing & PPE Events

The Actor emits pay-per-event (PPE) charge events for transparent billing tracking:

  • Event: extraction_succeeded
  • Metadata: { tokens: number, pages: number }
  • When: After each successful Vision API extraction

These events allow you to:

  • Track exact token consumption per run
  • Monitor costs in real-time via Apify Console
  • Set up alerts for budget limits
  • Analyze extraction efficiency

Example PPE event:

{
  "eventType": "extraction_succeeded",
  "metadata": {
    "tokens": 3450,
    "pages": 1
  }
}
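
A hedged sketch of how such an event can be emitted from Actor code, assuming a recent apify SDK version that exposes Actor.charge() (verify against your SDK version before relying on this):

// Sketch: charge one pay-per-event unit after a successful Vision extraction.
import { Actor } from 'apify';

await Actor.init();

// ...extraction happens here...
const tokensUsed = 3450; // placeholder value matching the example event above

await Actor.charge({ eventName: 'extraction_succeeded', count: 1 });
console.log('Charged extraction_succeeded', { tokens: tokensUsed, pages: 1 });

await Actor.exit();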

Security

API Key Management

The Actor implements multiple security layers for API keys:

Log Sanitization: all log output is automatically scrubbed of sensitive data:

  • OpenAI API keys (sk-***)
  • Apify tokens (apify_api_***)
  • Bearer tokens (Bearer ***)
  • Environment variables containing API keys
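
A simplified sketch of the kind of scrubbing involved (patterns are illustrative and intentionally loose; the Actor's actual sanitizer lives in its own source):

// Illustrative log sanitizer: mask common credential shapes before anything is logged.
const SECRET_PATTERNS: [RegExp, string][] = [
    [/sk-[A-Za-z0-9_-]{10,}/g, 'sk-***'],               // OpenAI API keys
    [/apify_api_[A-Za-z0-9]{10,}/g, 'apify_api_***'],   // Apify tokens
    [/Bearer\s+[A-Za-z0-9._-]+/g, 'Bearer ***'],        // Authorization headers
];

function sanitizeForLogSketch(message: string): string {
    return SECRET_PATTERNS.reduce((msg, [pattern, mask]) => msg.replace(pattern, mask), message);
}

console.log(sanitizeForLogSketch('calling OpenAI with sk-proj-abc123def456ghi789'));
// -> "calling OpenAI with sk-***"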

Secret Storage: Production API keys should be stored in Apify Secrets:

$ apify secrets:add OPENAI_API_KEY "your-key-here"

Never commit API keys to version control or logs.

Webhook Security

The Actor includes webhook signature verification utilities in src/security.ts:

HMAC Signature Verification:

import { verifyWebhookSignature } from './security';

// Compare the X-Signature header against the HMAC of the raw request body.
const isValid = verifyWebhookSignature(
  payloadString,
  signatureHeader,
  webhookSecret
);
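
For reference, a typical HMAC SHA-256 verification of this kind can be sketched with Node's crypto module (this mirrors, but is not, the Actor's own src/security.ts implementation):

// Recompute the HMAC of the raw payload and compare it to the X-Signature header in constant time.
import { createHmac, timingSafeEqual } from 'node:crypto';

function verifySignatureSketch(payload: string, signature: string, secret: string): boolean {
    const expected = createHmac('sha256', secret).update(payload).digest('hex');
    const a = Buffer.from(expected, 'utf8');
    const b = Buffer.from(signature, 'utf8');
    // timingSafeEqual throws on length mismatch, so guard first.
    return a.length === b.length && timingSafeEqual(a, b);
}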

Secret Sanitization:

import { sanitizeSecrets } from './security';

// Strip API keys and tokens from arbitrary data before logging it.
const safeData = sanitizeSecrets(requestData);
console.log(safeData);

Rate Limiting

The Actor respects rate limits and implements polite crawling:

  • Default Delay: 1 second between pages
  • Respects: robots.txt directives
  • Configurable: Adjust delay in src/main.ts

Rate limiting prevents:

  • Server overload
  • IP bans
  • API throttling
  • Violating Terms of Service

The built-in rateLimit() function ensures consistent delays between requests.
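
An illustrative delay helper in the spirit of rateLimit() (the Actor's own implementation lives in src/main.ts and may differ):

// Resolve after the given delay; awaited between page requests.
function rateLimitSketch(ms = 1000): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
}

await rateLimitSketch();     // default: 1 second between pages
await rateLimitSketch(2000); // slow down to 2 seconds for stricter sites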

Technical Stack

  • Runtime: Node.js 20+ with TypeScript
  • Crawler: Crawlee 3.x (Cheerio + Playwright)
  • Vision: OpenAI GPT-4o-mini
  • Validation: AJV JSON Schema validator
  • Python: 3.11 for Vision agent integration

Error Handling

  • Continues processing remaining URLs even if individual pages fail
  • Logs detailed error messages for debugging
  • Validates output against schema with warnings for non-conforming data
  • Tracks token usage and respects budget limits

Configuration Files

.env.example

Template for local environment variables:

APIFY_TOKEN=your_apify_token_here
OPENAI_API_KEY=your_openai_api_key_here

.actor/INPUT.json

Test input for local development with sample URLs and schema.

apify.json

Apify platform configuration - references the OpenAI API key from Apify secrets.

Limitations

  • Vision API has rate limits and costs associated with usage
  • Complex pages may require higher token budgets
  • Screenshot size affects Vision API processing time
  • Proxy usage requires Apify paid plan

Troubleshooting

"OpenAI API key not found"

Symptoms: Actor fails immediately with authentication error

Solution:

  • Local Development: Create .env file with OPENAI_API_KEY=sk-...
  • Production: Store in Apify Secrets:
    $ apify secrets:add OPENAI_API_KEY "sk-..."
  • Verification: Check .actor/actor.json references @openai_api_key

"Rate limit exceeded"

Symptoms: 429 errors from OpenAI API

Solutions:

  1. Reduce maxVisionPages to limit concurrent requests
  2. Increase rate limit delays in src/main.ts:
    await rateLimit(2000); // 2 seconds
  3. Upgrade OpenAI API tier for higher limits
  4. Implement exponential backoff for retries
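
A minimal sketch of such a backoff wrapper (the 429 status check and names are assumptions, not the Actor's actual code):

// Retry a Vision call with exponentially growing delays when OpenAI returns 429.
async function withBackoff<T>(fn: () => Promise<T>, maxRetries = 4): Promise<T> {
    for (let attempt = 0; ; attempt++) {
        try {
            return await fn();
        } catch (err: any) {
            const isRateLimit = err?.status === 429; // assumption: the error exposes an HTTP status
            if (!isRateLimit || attempt >= maxRetries) throw err;
            const delayMs = 1000 * 2 ** attempt; // 1s, 2s, 4s, 8s...
            console.warn(`429 from OpenAI, retrying in ${delayMs} ms`);
            await new Promise((resolve) => setTimeout(resolve, delayMs));
        }
    }
}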

"Token budget exceeded"

Symptoms: Actor stops processing pages mid-run

Solutions:

  1. Increase visionTokenBudget in input (default: 50000)
  2. Switch to html-only mode for simple pages
  3. Reduce maxVisionPages to process fewer pages
  4. HTML is auto-truncated to 200KB to minimize tokens

"Schema validation failed"

Symptoms: Warning logs about schema mismatches

Solutions:

  1. Review py/vision_agent.py confidence scores
  2. Simplify schema - remove optional fields
  3. Add more descriptive field descriptions
  4. Check if required fields are too strict
  5. Validate schema using JSON Schema validator

"Python process exited with code 1"

Symptoms: Vision extraction fails with Python errors

Solutions:

  1. Check stderr logs for detailed error messages
  2. Verify Python 3.11 is available in Docker
  3. Ensure py/requirements.txt dependencies are installed
  4. Check HTML truncation isn't breaking JSON parsing
  5. Validate screenshot is valid PNG format

"API keys visible in logs"

Symptoms: Sensitive data appears in Actor logs

Solutions:

  • This should never happen - the sanitizeForLog() function scrubs all keys
  • If it does, report immediately as a security issue
  • Rotate compromised API keys immediately
  • Check custom logging doesn't bypass sanitization

Memory or Timeout Issues

Symptoms: Actor crashes or times out

Solutions:

  1. Reduce maxVisionPages to lower memory usage
  2. Increase requestHandlerTimeoutSecs for slow pages
  3. Use html-only mode to avoid browser overhead
  4. Process URLs in smaller batches
  5. Upgrade Actor memory allocation in Apify Console

Resources

Support

For issues or questions:

License

Apache 2.0