Hybrid Vision Spider | AI-Powered Universal Web Scraper (BETA)
AI-driven hybrid web scraper that merges Playwright automation and Vision intelligence to extract structured data from any dynamic site. Schema-aware, proxy-ready, budget-safe, and fully compatible with Apify datasets.
Pricing: from $13.00 / 1,000 results
Developer: Țugui Dragoș
Features
- Hybrid Scraping: Combines fast HTML parsing with AI-powered visual analysis
- Multi-Engine Support: Choose between Chromium or Firefox browsers
- Schema-Based Extraction: Define your desired output structure using JSON Schema
- Intelligent Heuristics: Auto-detect emails, phone numbers, prices, dates, and outbound URLs when present
- Token Budget Control: Set limits on Vision API usage to control costs
- Proxy Support: Built-in Apify proxy integration for anti-bot protection
- Flexible Modes: HTML-only, Vision-only, or Hybrid strategies
- Per-Run Secrets: Override the OpenAI key on a per-run basis via `openAiApiKey`
What the spider captures
- Structured fields – anything you describe in the `schema` input (product cards, job listings, knowledge panels, etc.)
- Vision understanding – GPT-4o-mini Vision reads pricing tables, feature boxes, hero banners, or embedded text that pure HTML parsers miss.
- Automatic heuristics – if your schema contains fields like `email`, `phone`, `price`, `date`, or `externalUrl`, the spider will auto-detect them directly from the HTML.
- Raw artefacts – every run stores full HTML and PNG screenshots so you can debug and audit results.
- Confidence telemetry – dataset items include per-field confidence scores, an average score, and a report of missing required fields.
Quick Start
Prerequisites
You'll need two API keys:
- Apify Token - Get it from Apify Console
- OpenAI API Key - Get it from OpenAI Platform
Setup
- Clone and install dependencies:
cd hybrid-vision-spider
npm install
- Configure API keys:
Copy the .env.example file to .env and add your keys:
$ cp .env.example .env
Edit .env (optional if you plan to pass openAiApiKey in the Actor input):
APIFY_TOKEN=apify_api_xxxxxxxxxxxxxxxxxxxxx
OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxxxxxxx
- Set up Apify secret for production (or rely on per-run `openAiApiKey`):
apify login
apify secrets:add OPENAI_API_KEY "your-openai-api-key-here"
Local Development
Run the Actor locally using the test input in .actor/INPUT.json:
$ apify run
Or with custom input:
$ apify run --input-file my-input.json
To override the OpenAI credential for a single run:
$ apify run --input='{"urls":["https://example.com"],"mode":"vision-only","openAiApiKey":"sk-..."}'
Input fields explained
- Start URLs (simple list) – paste URLs one per line in the UI when you need a quick crawl. The actor normalizes and deduplicates them automatically.
- Advanced Request Sources – when you need HTTP method overrides, custom headers, or `userData`, use the advanced request list editor.
- Mode – `hybrid` strikes the best balance (HTML heuristics first, Vision only when necessary). Fall back to `html-only` for static pages or `vision-only` for fully rendered experiences.
- Max Results – set to `0` for unlimited. The spider comfortably handles 1,000+ records per run when your Vision budget allows it.
- Vision Token Budget / Max Vision Pages – cap OpenAI usage so a runaway crawl can't surprise your wallet.
Build TypeScript before running:
npm run build
npm start
Deployment to Apify Platform
- Login to Apify:
$ apify login
- Deploy the Actor:
$ apify push
The Actor will be available in your Apify Console at:
https://console.apify.com/actors
Output
- Dataset view – Use the `overview` view linked from the Output tab to browse URL, method, structured data, confidence, missing fields, tokens, and timestamps.
- Artifacts – Screenshots (`screenshot-*.png`) and HTML files (`html-*.html`) are grouped into collections in the default key-value store.
- Run stats – A `STATS` record in the key-value store keeps totals for pages processed, tokens consumed, and error breakdowns.
💰 Cost Control & Pricing
Understanding Costs
Hybrid Vision Spider uses a Pay-Per-Event pricing model with transparent token tracking:
- HTML-only mode: ~0.001 credits/page (cheapest)
- Hybrid mode: 0.001 - 0.05 credits/page (adaptive)
- Vision-only mode: 0.02 - 0.10 credits/page (most accurate)
Budget Controls
- maxVisionPages: Hard limit on pages processed with Vision AI
- visionTokenBudget: Token budget for OpenAI API calls
- 90% warning: Automatic warning at 90% budget consumption
- Graceful degradation: Falls back to HTML-only when budget exhausted
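The budget gating described above can be sketched as follows. This is an illustrative model, not the Actor's actual internals; the function and field names are assumptions.

```typescript
interface BudgetState {
  tokensUsed: number;
  visionPagesUsed: number;
}

// Decide whether the next page may use Vision, enforcing the page cap,
// the token budget, and the 90% warning threshold.
function decideMode(
  state: BudgetState,
  visionTokenBudget: number,
  maxVisionPages: number,
): "vision" | "html-only" {
  if (state.visionPagesUsed >= maxVisionPages) return "html-only";
  if (state.tokensUsed >= visionTokenBudget) return "html-only";
  if (state.tokensUsed >= 0.9 * visionTokenBudget) {
    // Warn once consumption crosses 90% of the budget.
    console.warn(
      `Vision budget at ${Math.round((state.tokensUsed / visionTokenBudget) * 100)}%`,
    );
  }
  return "vision";
}
```

The point of the fallback is that a run never aborts: once either limit is hit, remaining pages are simply processed in HTML-only mode.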
Example Cost Calculation
100 pages × hybrid mode = ~2-5 credits
100 pages × vision-only = ~5-10 credits
🔒 Security & Privacy
API Key Management
- Store secrets only in Apify Secrets (`OPENAI_API_KEY`)
- Never log API keys or tokens
- All logs automatically sanitized
Webhook Security
Optional webhook support with HMAC SHA-256 signature verification:
X-Signature: <hmac-sha256-signature>
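A receiver can verify that header with Node's built-in crypto module. This is a sketch under assumptions (hex-encoded signature, raw request body as the signed payload); the constant-time comparison avoids timing attacks.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Recompute the HMAC over the raw payload and compare it to the
// X-Signature header value in constant time.
function verifySignature(payload: string, signature: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(payload).digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signature);
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return a.length === b.length && timingSafeEqual(a, b);
}
```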
⚖️ Compliance & Responsible Use
- You are fully responsible for how you process the extracted data. By running the actor you acknowledge you will follow GDPR, CCPA, SOC 2 (AICPA), and all applicable local regulations.
- Always respect websites' Terms of Service and robots.txt directives.
- Store, secure, and delete personal data according to the legal framework governing your organization.
⚡ Performance Optimization
Heuristic Pre-filtering
Common fields (email, phone, price) extracted via regex before Vision API:
- ~75% confidence for regex matches
- Saves tokens by skipping Vision for simple fields
- Faster extraction for structured data
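A simplified version of this pre-filtering step is sketched below. The patterns are illustrative and far looser than production-grade validators; they only show how common fields can be pulled from page text before any Vision call.

```typescript
// Extract common fields from raw text with cheap regexes so Vision
// only runs for fields the heuristics could not find.
function extractHeuristics(text: string): Record<string, string> {
  const patterns: Record<string, RegExp> = {
    email: /[\w.+-]+@[\w-]+\.[\w.-]+/,
    phone: /\+?\d[\d\s().-]{7,}\d/,
    price: /[$€£]\s?\d+(?:[.,]\d{2})?/,
  };
  const out: Record<string, string> = {};
  for (const [field, re] of Object.entries(patterns)) {
    const m = text.match(re);
    if (m) out[field] = m[0].trim();
  }
  return out;
}
```

Any field resolved this way is recorded with reduced confidence (~0.75 in the Actor's telemetry), reflecting that regex matches are cheaper but less certain than Vision extraction.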
Deduplication
Automatic duplicate detection based on:
- URL
- Key fields (title, id, etc.)
- Content hash (MD5)
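One plausible way to combine those three signals into a single dedup fingerprint is shown below; the key fields chosen and the separator format are assumptions, not the Actor's actual scheme.

```typescript
import { createHash } from "node:crypto";

// Build a fingerprint from URL, selected key fields, and an MD5 hash
// of the full extracted record; identical fingerprints are duplicates.
function dedupKey(url: string, data: Record<string, unknown>): string {
  const keyFields = ["title", "id"]
    .map((f) => String(data[f] ?? ""))
    .join("|");
  const contentHash = createHash("md5").update(JSON.stringify(data)).digest("hex");
  return `${url}|${keyFields}|${contentHash}`;
}
```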
Adaptive Sampling
If multiple pages fail → automatically switches to Vision-only mode for reliability.
📊 Output Schema
Dataset Item Structure
{
  "url": "string",
  "method": "html-only|html-heuristic|vision|vision-retry",
  "data": { /* extracted fields per schema */ },
  "confidence": { /* per-field confidence 0-1 */ },
  "confidenceAverage": 0.87,
  "missingFields": ["fieldName"],
  "tokensUsed": 0,
  "screenshotKey": "screenshot-*.png",
  "htmlKey": "html-*.html",
  "sources": {
    "heuristics": ["price"],
    "vision": ["title", "description"]
  },
  "timestamp": "ISO 8601",
  "error": "optional error message"
}
STATS.json
{
  "pagesProcessed": 100,
  "visionPagesUsed": 37,
  "totalTokens": 45678,
  "itemsExtracted": 97,
  "errors": 3,
  "avgTokensPerPage": 1234,
  "durationSec": 542
}
Input Configuration
Required Fields
- urls (array of strings): List of URLs to scrape
- schema (object): JSON Schema defining the expected output structure
Optional Fields
- mode (string): Scraping strategy
  - `hybrid` (default): Try HTML first, use Vision as fallback
  - `html-only`: Fast, cost-free HTML parsing only
  - `vision-only`: AI-powered visual extraction
- engine (string): Browser engine selection
  - `chromium` (default)
  - `firefox`
  - `camoufox` (stealth mode)
- useProxy (boolean): Enable Apify proxy (default: false)
- maxResults (integer): Maximum items to extract (default: 100, 0 = unlimited)
- maxVisionPages (integer): Maximum pages to process with Vision API (default: 10)
- visionTokenBudget (integer): Total token limit for Vision API calls (default: 50000)
- openAiApiKey (string, nullable): Override the OpenAI API key for this run
Example Input
E-commerce Product Scraper
{
  "urls": [
    "https://example.com/products/item-1",
    "https://example.com/products/item-2"
  ],
  "mode": "hybrid",
  "engine": "chromium",
  "useProxy": false,
  "maxResults": 100,
  "maxVisionPages": 10,
  "visionTokenBudget": 50000,
  "schema": {
    "type": "object",
    "properties": {
      "title": { "type": "string", "description": "Product title" },
      "price": { "type": "number", "description": "Product price" },
      "description": { "type": "string", "description": "Product description" },
      "availability": { "type": "string", "description": "Stock status" }
    },
    "required": ["title", "price"]
  }
}
Documentation Scraper
{
  "urls": [
    "https://docs.example.com/page1",
    "https://docs.example.com/page2"
  ],
  "mode": "hybrid",
  "engine": "chromium",
  "useProxy": false,
  "maxResults": 50,
  "maxVisionPages": 5,
  "visionTokenBudget": 20000,
  "schema": {
    "type": "object",
    "properties": {
      "title": { "type": "string", "description": "Page title" },
      "description": { "type": "string", "description": "Meta description or summary" },
      "headings": {
        "type": "array",
        "items": { "type": "string" },
        "description": "Main section headings"
      },
      "codeExamples": {
        "type": "array",
        "items": { "type": "string" },
        "description": "Code snippets found on page"
      }
    },
    "required": ["title"]
  }
}
Output
Data is stored in the default dataset with the following structure:
{
  "url": "https://example.com/product",
  "timestamp": "2024-01-01T12:00:00.000Z",
  "method": "vision",
  "confidence": 0.95,
  "tokensUsed": 1250,
  "data": {
    "title": "Example Product",
    "price": 99.99,
    "description": "Product description...",
    "availability": "In Stock"
  }
}
Output Fields
- url: The scraped page URL
- timestamp: ISO 8601 timestamp of extraction
- method: Extraction method used (`html`, `vision`, or `hybrid`)
- confidence: Confidence score (0-1) for vision extractions
- tokensUsed: Number of OpenAI tokens consumed
- data: Extracted data matching your schema
How It Works
- URL Processing: Each URL is processed sequentially
- HTML Extraction: Fast initial attempt using CheerioCrawler (hybrid/html-only modes)
- Browser Rendering: If needed, launches Playwright to render JavaScript-heavy pages
- Screenshot Capture: Full-page screenshot for visual analysis
- Vision Analysis: Sends HTML + screenshot to OpenAI GPT-4o-mini for extraction
- Schema Validation: Validates extracted data against provided JSON Schema
- Data Storage: Saves validated results to Apify Dataset
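The Actor performs step 6 with AJV against the full JSON Schema. The simplified stand-in below only checks required fields and primitive types, but it shows the shape of that validation step; names here are illustrative.

```typescript
interface SimpleSchema {
  required?: string[];
  properties?: Record<string, { type: string }>;
}

// Return a list of problems; an empty list means the extracted record
// conforms to the (simplified) schema.
function validateAgainstSchema(
  data: Record<string, unknown>,
  schema: SimpleSchema,
): string[] {
  const problems: string[] = [];
  for (const field of schema.required ?? []) {
    if (data[field] === undefined) problems.push(`missing required field: ${field}`);
  }
  for (const [field, spec] of Object.entries(schema.properties ?? {})) {
    const value = data[field];
    // typeof cannot distinguish arrays, so skip array-typed fields here.
    if (value !== undefined && spec.type !== "array" && typeof value !== spec.type) {
      problems.push(`${field}: expected ${spec.type}, got ${typeof value}`);
    }
  }
  return problems;
}
```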
Cost Optimization
- Use `html-only` mode when possible to avoid Vision API costs
- Set appropriate `maxVisionPages` and `visionTokenBudget` limits
- Vision API uses GPT-4o-mini for cost-effective extraction
- HTML content is truncated to 200KB to reduce token usage
Example Costs
OpenAI GPT-4o-mini pricing (as of 2024):
- Input: $0.15 per 1M tokens
- Output: $0.60 per 1M tokens
Typical page processing:
- HTML page: ~2,000-5,000 tokens
- Screenshot: ~1,500-3,000 tokens (vision tokens)
- Cost per page: ~$0.001-0.003
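Those per-page figures follow directly from the per-token prices quoted above; a tiny helper makes the arithmetic explicit.

```typescript
// Back-of-envelope USD cost from the 2024 GPT-4o-mini prices quoted
// above: $0.15 per 1M input tokens, $0.60 per 1M output tokens.
function estimateCostUsd(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1_000_000) * 0.15 + (outputTokens / 1_000_000) * 0.6;
}

// e.g. a page with 4,000 HTML tokens + 2,000 vision tokens in and
// ~500 tokens out: estimateCostUsd(6000, 500) ≈ 0.0012, i.e. within
// the ~$0.001-0.003 per-page range stated above.
```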
Billing & PPE Events
The Actor emits Platform Performance Events (PPE) for transparent billing tracking:
- Event: `extraction_succeeded`
- Metadata: `{ tokens: number, pages: number }`
- When: After each successful Vision API extraction
These events allow you to:
- Track exact token consumption per run
- Monitor costs in real-time via Apify Console
- Set up alerts for budget limits
- Analyze extraction efficiency
Example PPE event:
{
  "eventType": "extraction_succeeded",
  "metadata": {
    "tokens": 3450,
    "pages": 1
  }
}
Security
API Key Management
The Actor implements multiple security layers for API keys:
Log Sanitization: All logs automatically scrub sensitive data
- OpenAI API keys (`sk-***`)
- Apify tokens (`apify_api_***`)
- Bearer tokens (`Bearer ***`)
- Environment variables containing API keys
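An illustrative scrubber covering those patterns is shown below; the Actor's actual sanitizeForLog() implementation may differ.

```typescript
// Replace credential-shaped substrings with redacted placeholders
// before a message reaches the logs.
function sanitizeForLog(message: string): string {
  return message
    .replace(/sk-[A-Za-z0-9_-]+/g, "sk-***")
    .replace(/apify_api_[A-Za-z0-9]+/g, "apify_api_***")
    .replace(/Bearer\s+\S+/g, "Bearer ***");
}
```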
Secret Storage: Production API keys should be stored in Apify Secrets:
$ apify secrets:add OPENAI_API_KEY "your-key-here"
Never commit API keys to version control or logs.
Webhook Security
The Actor includes webhook signature verification utilities in src/security.ts:
HMAC Signature Verification:
import { verifyWebhookSignature } from './security';

const isValid = verifyWebhookSignature(
  payloadString,
  signatureHeader,
  webhookSecret,
);
Secret Sanitization:
import { sanitizeSecrets } from './security';

const safeData = sanitizeSecrets(requestData);
console.log(safeData);
Rate Limiting
The Actor respects rate limits and implements polite crawling:
- Default Delay: 1 second between pages
- Respects: robots.txt directives
- Configurable: Adjust delay in `src/main.ts`
Rate limiting prevents:
- Server overload
- IP bans
- API throttling
- Violating Terms of Service
The built-in rateLimit() function ensures consistent delays between requests.
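A minimal sketch of such a delay helper is below; the Actor's real implementation lives in its source, so treat this as an assumption about its shape.

```typescript
// Resolve after the given delay; awaited between page visits to keep
// a consistent gap between requests.
function rateLimit(ms = 1000): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// usage between page visits:
// await rateLimit();     // default 1 second
// await rateLimit(2000); // slower, politer crawl
```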
Technical Stack
- Runtime: Node.js 20+ with TypeScript
- Crawler: Crawlee 3.x (Cheerio + Playwright)
- Vision: OpenAI GPT-4o-mini
- Validation: AJV JSON Schema validator
- Python: 3.11 for Vision agent integration
Error Handling
- Continues processing remaining URLs even if individual pages fail
- Logs detailed error messages for debugging
- Validates output against schema with warnings for non-conforming data
- Tracks token usage and respects budget limits
Configuration Files
.env.example
Template for local environment variables:
APIFY_TOKEN=your_apify_token_here
OPENAI_API_KEY=your_openai_api_key_here
.actor/INPUT.json
Test input for local development with sample URLs and schema.
apify.json
Apify platform configuration - references the OpenAI API key from Apify secrets.
Limitations
- Vision API has rate limits and costs associated with usage
- Complex pages may require higher token budgets
- Screenshot size affects Vision API processing time
- Proxy usage requires Apify paid plan
Troubleshooting
"OpenAI API key not found"
Symptoms: Actor fails immediately with authentication error
Solution:
- Local Development: Create `.env` file with `OPENAI_API_KEY=sk-...`
- Production: Store in Apify Secrets:
$ apify secrets:add OPENAI_API_KEY "sk-..."
- Verification: Check `.actor/actor.json` references `@openai_api_key`
"Rate limit exceeded"
Symptoms: 429 errors from OpenAI API
Solutions:
- Reduce `maxVisionPages` to limit concurrent requests
- Increase rate limit delays in `src/main.ts`: `await rateLimit(2000); // 2 seconds`
- Upgrade OpenAI API tier for higher limits
- Implement exponential backoff for retries
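One way to implement the suggested backoff around a Vision call is sketched below; the retry count, base delay, and cap are arbitrary choices, not values from the Actor.

```typescript
// Retry an async operation with exponentially growing delays
// (base, 2*base, 4*base, ...), capped at 30 seconds.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 4,
  baseMs = 1000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // out of retries: propagate
      const delay = Math.min(baseMs * 2 ** attempt, 30_000);
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}
```

A 429 from OpenAI would be caught and retried after a growing pause instead of failing the whole page.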
"Token budget exceeded"
Symptoms: Actor stops processing pages mid-run
Solutions:
- Increase `visionTokenBudget` in input (default: 50000)
- Switch to `html-only` mode for simple pages
- Reduce `maxVisionPages` to process fewer pages
- HTML is auto-truncated to 200KB to minimize tokens
"Schema validation failed"
Symptoms: Warning logs about schema mismatches
Solutions:
- Review `py/vision_agent.py` confidence scores
- Simplify schema – remove optional fields
- Add more descriptive field descriptions
- Check if required fields are too strict
- Validate schema using JSON Schema validator
"Python process exited with code 1"
Symptoms: Vision extraction fails with Python errors
Solutions:
- Check stderr logs for detailed error messages
- Verify Python 3.11 is available in Docker
- Ensure `py/requirements.txt` dependencies are installed
- Check HTML truncation isn't breaking JSON parsing
- Validate screenshot is valid PNG format
"API keys visible in logs"
Symptoms: Sensitive data appears in Actor logs
Solutions:
- This should never happen – the `sanitizeForLog()` function scrubs all keys
- If it does, report it immediately as a security issue
- Rotate compromised API keys immediately
- Check custom logging doesn't bypass sanitization
Memory or Timeout Issues
Symptoms: Actor crashes or times out
Solutions:
- Reduce `maxVisionPages` to lower memory usage
- Increase `requestHandlerTimeoutSecs` for slow pages
- Use `html-only` mode to avoid browser overhead
- Process URLs in smaller batches
- Upgrade Actor memory allocation in Apify Console
Resources
Support
For issues or questions:
License
Apache 2.0