Hybrid Vision Spider | AI-Powered Universal Web Scraper

AI-driven hybrid web scraper that merges Playwright automation and Vision intelligence to extract structured data from any dynamic site.

Features

  • Hybrid Scraping: Combines fast HTML parsing with AI-powered visual analysis
  • Multi-Engine Support: Choose Chromium, Firefox, or Camoufox (stealth) browsers
  • Schema-Based Extraction: Define your desired output structure using JSON Schema
  • Intelligent Heuristics: Auto-detect emails, phone numbers, prices, dates, and outbound URLs when present
  • Token Budget Control: Set limits on Vision API usage to control costs
  • Proxy Support: Built-in Apify proxy integration for anti-bot protection
  • Flexible Modes: HTML-only, Vision-only, or Hybrid strategies
  • Per-Run Secrets: Override the OpenAI key on a per-run basis via openAiApiKey

What the spider captures

  • Structured fields – anything you describe in the schema input (product cards, job listings, knowledge panels, etc.).
  • Vision understanding – GPT-4o-mini Vision reads pricing tables, feature boxes, hero banners, or embedded text that pure HTML parsers miss.
  • Automatic heuristics – if your schema contains fields like email, phone, price, date, or externalUrl, the spider will auto-detect them directly from the HTML.
  • Raw artefacts – every run stores full HTML and PNG screenshots so you can debug and audit results.
  • Confidence telemetry – dataset items include per-field confidence scores, an average score, and a report of missing required fields.

Quick Start

Prerequisites

You'll need two API keys:

  1. Apify Token - Get it from Apify Console
  2. OpenAI API Key - Get it from OpenAI Platform

Setup

  1. Clone and install dependencies:

cd hybrid-vision-spider
npm install

  2. Configure API keys:

Copy the .env.example file to .env and add your keys:

$ cp .env.example .env

Edit .env (optional if you plan to pass openAiApiKey in the Actor input):

APIFY_TOKEN=apify_api_xxxxxxxxxxxxxxxxxxxxx
OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxxxxxxx

  3. Set up the Apify secret for production (or rely on per-run openAiApiKey):

$ apify login
$ apify secrets:add OPENAI_API_KEY "your-openai-api-key-here"

Local Development

Run the Actor locally using the test input in .actor/INPUT.json:

$ apify run

Or with custom input:

$ apify run --input-file my-input.json

To override the OpenAI credential for a single run:

$ apify run --input='{"urls":["https://example.com"],"mode":"vision-only","openAiApiKey":"sk-..."}'

Input fields explained

  • Start URLs (simple list) – paste URLs one per line in the UI when you need a quick crawl. The actor normalizes and deduplicates them automatically.
  • Advanced Request Sources – when you need HTTP method overrides, custom headers, or userData, use the advanced request list editor.
  • Mode – hybrid strikes the best balance (HTML heuristics first, Vision only when necessary). Fall back to html-only for static pages or vision-only for fully rendered experiences.
  • Max Results – set to 0 for unlimited. The spider comfortably handles 1,000+ records per run when your Vision budget allows it.
  • Vision Token Budget / Max Vision Pages – caps OpenAI usage so a runaway crawl can’t surprise your wallet.

Build TypeScript before running:

npm run build
npm start

Deployment to Apify Platform

  1. Log in to Apify:

$ apify login

  2. Deploy the Actor:

$ apify push

The Actor will be available in your Apify Console at: https://console.apify.com/actors

Output

  • Dataset view – Use the overview view linked from the Output tab to browse URL, method, structured data, confidence, missing fields, tokens, and timestamps.
  • Artifacts – Screenshots (screenshot-*.png) and HTML files (html-*.html) are grouped into collections in the default key-value store.
  • Run stats – A STATS record in the key-value store keeps totals for pages processed, tokens consumed, and error breakdowns.
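
For programmatic access to these outputs, here is a minimal sketch using the official apify-client package (npm install apify-client); the run ID is a placeholder, and STATS is the record name mentioned above:

// Sketch: inspect a finished run's outputs and artifacts with apify-client.
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

const run = await client.run('<RUN_ID>').get();
if (!run) throw new Error('Run not found');

// Run totals: pages processed, tokens consumed, error breakdown.
const store = client.keyValueStore(run.defaultKeyValueStoreId);
const stats = await store.getRecord('STATS');
console.log(stats?.value);

// Screenshot and HTML artifacts saved alongside the dataset.
const { items: keys } = await store.listKeys();
for (const { key } of keys) {
    if (key.startsWith('screenshot-') || key.startsWith('html-')) console.log(key);
}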

💰 Cost Control & Pricing

Understanding Costs

Hybrid Vision Spider uses a Pay-Per-Event pricing model with transparent token tracking:

  • HTML-only mode: ~0.001 credits/page (cheapest)
  • Hybrid mode: 0.001 - 0.05 credits/page (adaptive)
  • Vision-only mode: 0.02 - 0.10 credits/page (most accurate)

Budget Controls

  1. maxVisionPages: Hard limit on pages processed with Vision AI
  2. visionTokenBudget: Token budget for OpenAI API calls
  3. 90% warning: Automatic warning at 90% budget consumption
  4. Graceful degradation: Falls back to HTML-only when budget exhausted
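
A minimal sketch of how such a budget guard can behave (the interface and helper names below are illustrative, not the Actor's actual internals):

// Illustrative budget guard: counts Vision tokens and pages, warns at 90%,
// and degrades to HTML-only extraction once either limit is reached.
interface VisionBudget {
    tokenBudget: number;     // visionTokenBudget from the Actor input
    maxVisionPages: number;  // maxVisionPages from the Actor input
    tokensUsed: number;
    visionPagesUsed: number;
}

function canUseVision(b: VisionBudget): boolean {
    if (b.tokensUsed >= b.tokenBudget * 0.9) {
        console.warn(`Vision budget at ${Math.round((b.tokensUsed / b.tokenBudget) * 100)}% of visionTokenBudget`);
    }
    // Graceful degradation: refuse Vision once either cap is exhausted.
    return b.tokensUsed < b.tokenBudget && b.visionPagesUsed < b.maxVisionPages;
}

function recordVisionCall(b: VisionBudget, tokens: number): void {
    b.tokensUsed += tokens;
    b.visionPagesUsed += 1;
}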

Example Cost Calculation

100 pages × hybrid mode = ~2-5 credits
100 pages × vision-only = ~5-10 credits

🔒 Security & Privacy

API Key Management

  • Store secrets only in Apify Secrets (OPENAI_API_KEY)
  • Never log API keys or tokens
  • All logs automatically sanitized

Webhook Security

Optional webhook support with HMAC SHA-256 signature verification:

X-Signature: <hmac-sha256-signature>

⚖️ Compliance & Responsible Use

  • You are fully responsible for how you process the extracted data. By running the Actor you acknowledge that you will comply with GDPR, CCPA, AICPA SOC 2, and all other applicable local regulations.
  • Always respect websites' Terms of Service and robots.txt directives.
  • Store, secure, and delete personal data according to the legal framework governing your organization.

⚡ Performance Optimization

Heuristic Pre-filtering

Common fields (email, phone, price) are extracted via regex before any Vision API call:

  • ~75% confidence for regex matches
  • Saves tokens by skipping Vision for simple fields
  • Faster extraction for structured data
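
A rough sketch of this kind of regex pre-filtering (patterns simplified for illustration; the Actor's real heuristics may differ):

// Illustrative heuristic extractor: cheap regex passes over the page text
// before any Vision call is considered.
const HEURISTICS: Record<string, RegExp> = {
    email: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/,
    phone: /\+?\d[\d\s().-]{7,}\d/,
    price: /(?:\$|€|£)\s?\d+(?:[.,]\d{2})?/,
};

function extractHeuristics(text: string, schemaFields: string[]): Record<string, string> {
    const found: Record<string, string> = {};
    for (const field of schemaFields) {
        const match = HEURISTICS[field]?.exec(text);
        if (match) found[field] = match[0]; // confidence (~0.75) is attached upstream
    }
    return found;
}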

Deduplication

Automatic duplicate detection based on:

  • URL
  • Key fields (title, id, etc.)
  • Content hash (MD5)
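
A minimal sketch of how such a deduplication key can be composed (assuming the URL, key fields, and MD5 content hash listed above; names are illustrative):

// Illustrative dedup key: URL + selected key fields + an MD5 hash of the extracted data.
import { createHash } from 'node:crypto';

const seen = new Set<string>();

function isDuplicate(url: string, data: Record<string, unknown>): boolean {
    const keyFields = ['title', 'id'].map((f) => String(data[f] ?? '')).join('|');
    const contentHash = createHash('md5').update(JSON.stringify(data)).digest('hex');
    const key = `${url}|${keyFields}|${contentHash}`;
    if (seen.has(key)) return true;
    seen.add(key);
    return false;
}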

Adaptive Sampling

If multiple pages fail, the spider automatically switches to Vision-only mode for reliability.

📊 Output Schema

Dataset Item Structure

{
  "url": "string",
  "method": "html-only|html-heuristic|vision|vision-retry",
  "data": { /* extracted fields per schema */ },
  "confidence": { /* per-field confidence 0-1 */ },
  "confidenceAverage": 0.87,
  "missingFields": ["fieldName"],
  "tokensUsed": 0,
  "screenshotKey": "screenshot-*.png",
  "htmlKey": "html-*.html",
  "sources": {
    "heuristics": ["price"],
    "vision": ["title", "description"]
  },
  "timestamp": "ISO 8601",
  "error": "optional error message"
}

STATS.json

{
  "pagesProcessed": 100,
  "visionPagesUsed": 37,
  "totalTokens": 45678,
  "itemsExtracted": 97,
  "errors": 3,
  "avgTokensPerPage": 1234,
  "durationSec": 542
}

Input Configuration

Required Fields

  • urls (array of strings): List of URLs to scrape
  • schema (object): JSON Schema defining the expected output structure

Optional Fields

  • mode (string): Scraping strategy

    • hybrid (default): Try HTML first, use Vision as fallback
    • html-only: Fast, cost-free HTML parsing only
    • vision-only: AI-powered visual extraction
  • engine (string): Browser engine selection

    • chromium (default)
    • firefox
    • camoufox (stealth mode)
  • useProxy (boolean): Enable Apify proxy (default: false)

  • maxResults (integer): Maximum items to extract (default: 100, 0 = unlimited)

  • maxVisionPages (integer): Maximum pages to process with Vision API (default: 10)

  • visionTokenBudget (integer): Total token limit for Vision API calls (default: 50000)

  • openAiApiKey (string, nullable): Override the OpenAI API key for this run
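
To start the Actor programmatically with this input, a sketch using the official apify-client package could look like the following (replace <ACTOR_ID> with the Actor's ID or username/actor-name slug from Apify Console):

// Sketch: call the Actor with a typical input and read back the dataset items.
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

const run = await client.actor('<ACTOR_ID>').call({
    urls: ['https://example.com/products/item-1'],
    mode: 'hybrid',
    engine: 'chromium',
    maxResults: 100,
    maxVisionPages: 10,
    visionTokenBudget: 50000,
    schema: {
        type: 'object',
        properties: {
            title: { type: 'string', description: 'Product title' },
            price: { type: 'number', description: 'Product price' },
        },
        required: ['title', 'price'],
    },
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);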

Example Input

E-commerce Product Scraper

{
  "urls": [
    "https://example.com/products/item-1",
    "https://example.com/products/item-2"
  ],
  "mode": "hybrid",
  "engine": "chromium",
  "useProxy": false,
  "maxResults": 100,
  "maxVisionPages": 10,
  "visionTokenBudget": 50000,
  "schema": {
    "type": "object",
    "properties": {
      "title": {
        "type": "string",
        "description": "Product title"
      },
      "price": {
        "type": "number",
        "description": "Product price"
      },
      "description": {
        "type": "string",
        "description": "Product description"
      },
      "availability": {
        "type": "string",
        "description": "Stock status"
      }
    },
    "required": ["title", "price"]
  }
}

Documentation Scraper

{
  "urls": [
    "https://docs.example.com/page1",
    "https://docs.example.com/page2"
  ],
  "mode": "hybrid",
  "engine": "chromium",
  "useProxy": false,
  "maxResults": 50,
  "maxVisionPages": 5,
  "visionTokenBudget": 20000,
  "schema": {
    "type": "object",
    "properties": {
      "title": {
        "type": "string",
        "description": "Page title"
      },
      "description": {
        "type": "string",
        "description": "Meta description or summary"
      },
      "headings": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Main section headings"
      },
      "codeExamples": {
        "type": "array",
        "items": {"type": "string"},
        "description": "Code snippets found on page"
      }
    },
    "required": ["title"]
  }
}

Output

Data is stored in the default dataset with the following structure:

{
  "url": "https://example.com/product",
  "timestamp": "2024-01-01T12:00:00.000Z",
  "method": "vision",
  "confidence": 0.95,
  "tokensUsed": 1250,
  "data": {
    "title": "Example Product",
    "price": 99.99,
    "description": "Product description...",
    "availability": "In Stock"
  }
}

Output Fields

  • url: The scraped page URL
  • timestamp: ISO 8601 timestamp of extraction
  • method: Extraction method used (html-only, html-heuristic, vision, or vision-retry)
  • confidence: Confidence score (0-1) for vision extractions
  • tokensUsed: Number of OpenAI tokens consumed
  • data: Extracted data matching your schema

How It Works

  1. URL Processing: Each URL is processed sequentially
  2. HTML Extraction: Fast initial attempt using CheerioCrawler (hybrid/html-only modes)
  3. Browser Rendering: If needed, launches Playwright to render JavaScript-heavy pages
  4. Screenshot Capture: Full-page screenshot for visual analysis
  5. Vision Analysis: Sends HTML + screenshot to OpenAI GPT-4o-mini for extraction
  6. Schema Validation: Validates extracted data against provided JSON Schema
  7. Data Storage: Saves validated results to Apify Dataset
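
Step 6 uses the AJV validator listed in the technical stack. A minimal sketch of schema validation that warns instead of failing (function name and warning format are illustrative):

// Validate extracted data against the input JSON Schema; return warnings so the crawl continues.
import Ajv from 'ajv';

const ajv = new Ajv({ allErrors: true });

function validateExtraction(schema: object, data: unknown): string[] {
    const validate = ajv.compile(schema);
    if (validate(data)) return [];
    return (validate.errors ?? []).map((e) => `${e.instancePath || '/'} ${e.message}`);
}

const warnings = validateExtraction(
    { type: 'object', properties: { title: { type: 'string' } }, required: ['title'] },
    { price: 9.99 },
);
console.log(warnings); // e.g. [ "/ must have required property 'title'" ]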

Cost Optimization

  • Use html-only mode when possible to avoid Vision API costs
  • Set appropriate maxVisionPages and visionTokenBudget limits
  • Vision API uses GPT-4o-mini for cost-effective extraction
  • HTML content is truncated to 200KB to reduce token usage

Example Costs

OpenAI GPT-4o-mini pricing (as of 2024):

  • Input: $0.15 per 1M tokens
  • Output: $0.60 per 1M tokens

Typical page processing:

  • HTML page: ~2,000-5,000 tokens
  • Screenshot: ~1,500-3,000 tokens (vision tokens)
  • Cost per page: ~$0.001-0.003
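
As a rough worked example using the figures above, a hybrid page that consumes about 5,000 HTML input tokens, 2,000 vision input tokens, and 500 output tokens costs approximately:

(7,000 × $0.15 + 500 × $0.60) / 1,000,000 ≈ $0.0014 per page

which sits inside the ~$0.001-0.003 range quoted above.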

Billing & PPE Events

The Actor emits pay-per-event (PPE) charge events for transparent billing tracking:

  • Event: extraction_succeeded
  • Metadata: { tokens: number, pages: number }
  • When: After each successful Vision API extraction

These events allow you to:

  • Track exact token consumption per run
  • Monitor costs in real-time via Apify Console
  • Set up alerts for budget limits
  • Analyze extraction efficiency

Example PPE event:

{
  "eventType": "extraction_succeeded",
  "metadata": {
    "tokens": 3450,
    "pages": 1
  }
}
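
A hedged sketch of how such an event can be emitted from Actor code, assuming a recent apify SDK version that exposes Actor.charge() (verify against your SDK version before relying on this):

// Sketch: charge one pay-per-event unit after a successful Vision extraction.
import { Actor } from 'apify';

await Actor.init();

// ...extraction happens here...
const tokensUsed = 3450; // placeholder value matching the example event above

await Actor.charge({ eventName: 'extraction_succeeded', count: 1 });
console.log('Charged extraction_succeeded', { tokens: tokensUsed, pages: 1 });

await Actor.exit();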

Security

API Key Management

The Actor implements multiple security layers for API keys:

Log Sanitization: all log output is automatically scrubbed of sensitive data:

  • OpenAI API keys (sk-***)
  • Apify tokens (apify_api_***)
  • Bearer tokens (Bearer ***)
  • Environment variables containing API keys
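
A simplified sketch of the kind of scrubbing involved (patterns are illustrative and intentionally loose; the Actor's actual sanitizer lives in its own source):

// Illustrative log sanitizer: mask common credential shapes before anything is logged.
const SECRET_PATTERNS: [RegExp, string][] = [
    [/sk-[A-Za-z0-9_-]{10,}/g, 'sk-***'],               // OpenAI API keys
    [/apify_api_[A-Za-z0-9]{10,}/g, 'apify_api_***'],   // Apify tokens
    [/Bearer\s+[A-Za-z0-9._-]+/g, 'Bearer ***'],        // Authorization headers
];

function sanitizeForLogSketch(message: string): string {
    return SECRET_PATTERNS.reduce((msg, [pattern, mask]) => msg.replace(pattern, mask), message);
}

console.log(sanitizeForLogSketch('calling OpenAI with sk-proj-abc123def456ghi789'));
// -> "calling OpenAI with sk-***"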

Secret Storage: Production API keys should be stored in Apify Secrets:

$ apify secrets:add OPENAI_API_KEY "your-key-here"

Never commit API keys to version control or logs.

Webhook Security

The Actor includes webhook signature verification utilities in src/security.ts:

HMAC Signature Verification:

import { verifyWebhookSignature } from './security';

// Compare the X-Signature header against the HMAC of the raw request body.
const isValid = verifyWebhookSignature(
  payloadString,
  signatureHeader,
  webhookSecret
);
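
For reference, a typical HMAC SHA-256 verification of this kind can be sketched with Node's crypto module (this mirrors, but is not, the Actor's own src/security.ts implementation):

// Recompute the HMAC of the raw payload and compare it to the X-Signature header in constant time.
import { createHmac, timingSafeEqual } from 'node:crypto';

function verifySignatureSketch(payload: string, signature: string, secret: string): boolean {
    const expected = createHmac('sha256', secret).update(payload).digest('hex');
    const a = Buffer.from(expected, 'utf8');
    const b = Buffer.from(signature, 'utf8');
    // timingSafeEqual throws on length mismatch, so guard first.
    return a.length === b.length && timingSafeEqual(a, b);
}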

Secret Sanitization:

import { sanitizeSecrets } from './security';

// Strip API keys and tokens from arbitrary data before logging it.
const safeData = sanitizeSecrets(requestData);
console.log(safeData);

Rate Limiting

The Actor respects rate limits and implements polite crawling:

  • Default Delay: 1 second between pages
  • Respects: robots.txt directives
  • Configurable: Adjust delay in src/main.ts

Rate limiting prevents:

  • Server overload
  • IP bans
  • API throttling
  • Violating Terms of Service

The built-in rateLimit() function ensures consistent delays between requests.
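
An illustrative delay helper in the spirit of rateLimit() (the Actor's own implementation lives in src/main.ts and may differ):

// Resolve after the given delay; awaited between page requests.
function rateLimitSketch(ms = 1000): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
}

await rateLimitSketch();     // default: 1 second between pages
await rateLimitSketch(2000); // slow down to 2 seconds for stricter sites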

Technical Stack

  • Runtime: Node.js 20+ with TypeScript
  • Crawler: Crawlee 3.x (Cheerio + Playwright)
  • Vision: OpenAI GPT-4o-mini
  • Validation: AJV JSON Schema validator
  • Python: 3.11 for Vision agent integration

Error Handling

  • Continues processing remaining URLs even if individual pages fail
  • Logs detailed error messages for debugging
  • Validates output against schema with warnings for non-conforming data
  • Tracks token usage and respects budget limits

Configuration Files

.env.example

Template for local environment variables:

APIFY_TOKEN=your_apify_token_here
OPENAI_API_KEY=your_openai_api_key_here

.actor/INPUT.json

Test input for local development with sample URLs and schema.

apify.json

Apify platform configuration - references the OpenAI API key from Apify secrets.

Limitations

  • Vision API has rate limits and costs associated with usage
  • Complex pages may require higher token budgets
  • Screenshot size affects Vision API processing time
  • Proxy usage requires Apify paid plan

Troubleshooting

"OpenAI API key not found"

Symptoms: Actor fails immediately with authentication error

Solution:

  • Local Development: Create .env file with OPENAI_API_KEY=sk-...
  • Production: Store in Apify Secrets:
    $ apify secrets:add OPENAI_API_KEY "sk-..."
  • Verification: Check .actor/actor.json references @openai_api_key

"Rate limit exceeded"

Symptoms: 429 errors from OpenAI API

Solutions:

  1. Reduce maxVisionPages to limit concurrent requests
  2. Increase rate limit delays in src/main.ts:
    await rateLimit(2000); // 2 seconds
  3. Upgrade OpenAI API tier for higher limits
  4. Implement exponential backoff for retries
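
A minimal sketch of such a backoff wrapper (the 429 status check and names are assumptions, not the Actor's actual code):

// Retry a Vision call with exponentially growing delays when OpenAI returns 429.
async function withBackoff<T>(fn: () => Promise<T>, maxRetries = 4): Promise<T> {
    for (let attempt = 0; ; attempt++) {
        try {
            return await fn();
        } catch (err: any) {
            const isRateLimit = err?.status === 429; // assumption: the error exposes an HTTP status
            if (!isRateLimit || attempt >= maxRetries) throw err;
            const delayMs = 1000 * 2 ** attempt; // 1s, 2s, 4s, 8s...
            console.warn(`429 from OpenAI, retrying in ${delayMs} ms`);
            await new Promise((resolve) => setTimeout(resolve, delayMs));
        }
    }
}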

"Token budget exceeded"

Symptoms: Actor stops processing pages mid-run

Solutions:

  1. Increase visionTokenBudget in input (default: 50000)
  2. Switch to html-only mode for simple pages
  3. Reduce maxVisionPages to process fewer pages
  4. HTML is auto-truncated to 200KB to minimize tokens

"Schema validation failed"

Symptoms: Warning logs about schema mismatches

Solutions:

  1. Review py/vision_agent.py confidence scores
  2. Simplify schema - remove optional fields
  3. Add more descriptive field descriptions
  4. Check if required fields are too strict
  5. Validate schema using JSON Schema validator

"Python process exited with code 1"

Symptoms: Vision extraction fails with Python errors

Solutions:

  1. Check stderr logs for detailed error messages
  2. Verify Python 3.11 is available in Docker
  3. Ensure py/requirements.txt dependencies are installed
  4. Check HTML truncation isn't breaking JSON parsing
  5. Validate screenshot is valid PNG format

"API keys visible in logs"

Symptoms: Sensitive data appears in Actor logs

Solutions:

  • This should never happen - the sanitizeForLog() function scrubs all keys
  • If it does, report immediately as a security issue
  • Rotate compromised API keys immediately
  • Check custom logging doesn't bypass sanitization

Memory or Timeout Issues

Symptoms: Actor crashes or times out

Solutions:

  1. Reduce maxVisionPages to lower memory usage
  2. Increase requestHandlerTimeoutSecs for slow pages
  3. Use html-only mode to avoid browser overhead
  4. Process URLs in smaller batches
  5. Upgrade Actor memory allocation in Apify Console

Resources

Support

For issues or questions:

License

Apache 2.0