Hybrid Vision Spider | AI-Powered Universal Web Scraper
Pricing
from $13.00 / 1,000 results
AI-driven hybrid web scraper that merges Playwright and Vision intelligence to extract structured data from any dynamic site. Schema-aware, proxy-ready, budget-safe, and fully compatible with Apify datasets.
Rating
5.0
(2)
Developer

Țugui Dragoș
Actor stats
2
Bookmarked
7
Total users
3
Monthly active users
7 days ago
Last modified
Hybrid Vision Spider is an advanced web scraper that combines traditional HTML parsing with AI-powered visual understanding to extract structured data from any webpage. Simply provide URLs and define what data you want using a JSON Schema - the Actor handles the rest.
Recommendation
For accurate and comprehensive data extraction, we recommend using hybrid or vision-only mode with an OpenAI API key. The html-only mode uses regex patterns and can only extract basic fields (title, links, email, phone, etc.), while Vision AI can understand page content and extract any structured data you define in your schema.
| Mode | Best For | Accuracy | Requires API Key |
|---|---|---|---|
| html-only | Basic data (title, links, emails) | Medium | No |
| hybrid | Most use cases | High | Yes |
| vision-only | Complex/visual data | Highest | Yes |
Perfect for:
- E-commerce product data extraction
- News and blog article scraping
- Contact information gathering
- Price monitoring and comparison
- Any structured data extraction task
Key Features
| Feature | Description |
|---|---|
| Hybrid Extraction | Combines fast HTML parsing with AI Vision for maximum accuracy |
| AI Vision (GPT-4) | "Sees" the page like a human - extracts data from images, complex layouts, and dynamic content |
| JSON Schema Output | Define exactly what data you want using standard JSON Schema |
| Three Modes | Choose between html-only (fast), vision-only (accurate), or hybrid (balanced) |
| Smart Heuristics | Auto-detects emails, phone numbers, prices, and dates |
| Deduplication | Automatically removes duplicate results |
| Proxy Support | Built-in Apify Proxy integration for anti-bot protection |
| Confidence Scores | Know how reliable each extracted field is |
How It Works
The Three Operating Modes
The Actor offers three extraction modes, each with different capabilities and requirements:
| Mode | API Key Required | What It Can Extract | Best For |
|---|---|---|---|
| html-only | ❌ No | Only: email, phone, price, url, date | Simple data, fast extraction, no AI costs |
| hybrid | ✅ Yes (OpenAI) | Everything in your schema | Balanced speed/accuracy, cost-effective |
| vision-only | ✅ Yes (OpenAI) | Everything in your schema | Complex layouts, images, dynamic content |
What Each Mode Can Extract
HTML-Only Mode (Heuristics)
Uses regex patterns and smart heuristics to extract only these field types:
- 📧 Email addresses - Detects email patterns in text and links
- 📞 Phone numbers - Recognizes various phone formats
- 💰 Prices - Extracts currency values and amounts
- 🔗 URLs - Finds links and web addresses
- 📅 Dates - Parses date formats
⚠️ Important: If your schema includes fields like `productName`, `description`, `rating`, etc., they will be empty in `html-only` mode because heuristics cannot extract arbitrary text content.
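To make the limitation concrete, here is a minimal sketch of regex-based field detection. The patterns and the `extract_heuristics` helper are illustrative assumptions, not the Actor's actual expressions; the point is only that a field with no matching pattern (such as `productName`) stays empty.

```python
import re

# Illustrative patterns only; the Actor's real expressions are not documented here.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "price": re.compile(r"[$€£]\s?\d+(?:[.,]\d{2})?"),
    "url": re.compile(r"https?://[^\s\"'<>]+"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def extract_heuristics(text: str, schema: dict) -> dict:
    """Return the first pattern match for each schema field, or None."""
    result = {}
    for field in schema.get("properties", {}):
        pattern = PATTERNS.get(field)  # None for fields without a heuristic
        match = pattern.search(text) if pattern else None
        result[field] = match.group(0) if match else None
    return result


sample = "Contact sales@example.com or +1 (555) 123-4567. Now $99.99."
schema = {"properties": {"email": {}, "phone": {}, "price": {}, "productName": {}}}
extracted = extract_heuristics(sample, schema)
print(extracted)
```

In this sketch `productName` comes back as `None`, which mirrors why arbitrary text fields need hybrid or vision-only mode.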
Hybrid Mode (Recommended)
- First tries HTML heuristics for supported fields
- Then uses Vision AI to fill in missing/complex fields
- Combines results for maximum accuracy
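The three steps above can be sketched as a small merge function. This is an assumption about how such a pipeline could be wired, not the Actor's source code; `vision_fn` is a hypothetical callback standing in for the Vision AI call.

```python
from typing import Callable


def hybrid_extract(
    schema: dict,
    heuristic_data: dict,
    vision_fn: Callable[[list], dict],
) -> tuple:
    """Keep heuristic values and ask the vision step only for missing fields."""
    fields = list(schema.get("properties", {}))
    merged = {name: heuristic_data.get(name) for name in fields}
    missing = [name for name in fields if merged[name] in (None, "")]
    if missing:
        vision_data = vision_fn(missing)  # hypothetical Vision AI call
        for name in missing:
            merged[name] = vision_data.get(name)
    return merged, missing


def fake_vision(requested_fields: list) -> dict:
    # Stand-in for a real model call; returns canned values for the demo.
    canned = {"productName": "Example Product"}
    return {name: canned.get(name) for name in requested_fields}


schema = {"properties": {"price": {}, "productName": {}}}
merged, sent_to_vision = hybrid_extract(schema, {"price": "$99.99"}, fake_vision)
print(merged, sent_to_vision)
```

Only the fields the heuristics could not fill are passed to the vision step, which is what keeps hybrid mode cheaper than vision-only mode.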
Vision-Only Mode
- Sends a screenshot to GPT-4 Vision
- AI "sees" the page like a human
- Can extract any data visible on the page
- Best for complex layouts, images, or dynamic content
Why You Need an OpenAI API Key
| Without API Key | With API Key |
|---|---|
| Can only use html-only mode | Can use all three modes |
| Limited to 5 field types (email, phone, price, url, date) | Extract any data from your schema |
| Fast but limited | AI "sees" the page and understands context |
| Free (no AI costs) | Pay-per-use OpenAI pricing |
Get your API key: OpenAI Platform
Defining Your Schema Correctly
For best results, follow these guidelines:
{
  "type": "object",
  "properties": {
    "productName": {
      "type": "string",
      "description": "The full product name as shown on the page" // ✅ Good: descriptive
    },
    "price": {
      "type": "number",
      "description": "Current price in USD" // ✅ Good: specific
    },
    "rating": {
      "type": "number" // ⚠️ Missing description - AI may not know what to look for
    }
  },
  "required": ["productName", "price"] // ✅ Mark essential fields as required
}
Tips:
- ✅ Add `description` to every field - tells the AI exactly what to look for
- ✅ Use `required` for essential fields - Actor will retry if these are missing
- ✅ Use specific field names - `productName` is better than `name`
- ✅ Match field names to heuristics - `email`, `phone`, `price`, `url`, `date` work in all modes
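These tips can be checked mechanically before a run. The `lint_schema` helper below is a hypothetical pre-flight check written for this guide, not part of the Actor; it flags properties without a `description` and `required` entries that are not defined.

```python
def lint_schema(schema: dict) -> list:
    """Return human-readable warnings for common schema mistakes."""
    warnings = []
    props = schema.get("properties", {})
    for name, spec in props.items():
        if not spec.get("description"):
            warnings.append(f"{name}: missing description")
    for name in schema.get("required", []):
        if name not in props:
            warnings.append(f"{name}: required but not defined in properties")
    return warnings


schema = {
    "type": "object",
    "properties": {
        "productName": {"type": "string", "description": "The full product name"},
        "price": {"type": "number", "description": "Current price in USD"},
        "rating": {"type": "number"},
    },
    "required": ["productName", "price"],
}
schema_warnings = lint_schema(schema)
print(schema_warnings)
```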
How to Use
Step 1: Add Your URLs
Enter the URLs you want to scrape. You can either:
- Simple list: Paste URLs one per line in the "Start URLs" field
- Advanced: Use the Request List editor for custom headers or methods
Step 2: Define Your Output Schema
Create a JSON Schema that describes the data you want to extract. For example:
{
  "type": "object",
  "properties": {
    "title": { "type": "string", "description": "Product name" },
    "price": { "type": "number", "description": "Price in USD" },
    "description": { "type": "string", "description": "Product description" }
  },
  "required": ["title", "price"]
}
Step 3: Configure Settings
- Choose your scraping mode (hybrid recommended for most cases)
- Set limits to control costs (max results, vision pages, token budget)
- Add your OpenAI API key (required for hybrid and vision-only modes)
Step 4: Run and Get Results
Click "Start" and wait for the Actor to finish. Your structured data will be available in the Dataset.
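The same four steps can also be scripted. The sketch below assumes the input keys `urlList`, `mode`, `schema`, and `openAiApiKey` as they appear elsewhere in this README; the Actor ID and tokens are placeholders you must supply, and `run_actor` relies on the `apify-client` Python package being installed.

```python
def build_run_input(urls, schema, mode="hybrid", openai_api_key=None):
    """Assemble the Actor input; key names are taken from this README's examples."""
    if mode not in {"html-only", "hybrid", "vision-only"}:
        raise ValueError(f"unknown mode: {mode}")
    if mode != "html-only" and not openai_api_key:
        raise ValueError("hybrid and vision-only modes need an OpenAI API key")
    run_input = {"urlList": "\n".join(urls), "mode": mode, "schema": schema}
    if openai_api_key:
        run_input["openAiApiKey"] = openai_api_key
    return run_input


def run_actor(apify_token, actor_id, run_input):
    """Start the Actor and return its dataset items (needs network access)."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(apify_token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


product_schema = {
    "type": "object",
    "properties": {"title": {"type": "string", "description": "Product name"}},
    "required": ["title"],
}
demo_input = build_run_input(
    ["https://example.com/product/123"], product_schema, mode="html-only"
)
print(demo_input)
# items = run_actor("<APIFY_TOKEN>", "<ACTOR_ID>", demo_input)  # placeholders
```

Validating the mode and key locally avoids starting a run that would fail for a missing API key.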
Input Configuration
URLs
| Field | Type | Description |
|---|---|---|
| Start URLs (simple list) | Text | Paste URLs one per line. The Actor normalizes and deduplicates automatically. |
| Advanced Request Sources | Array | For advanced users: supports custom HTTP methods, headers, and userData. |
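The exact normalization rules the Actor applies are not documented here. As an assumption-laden illustration, a one-URL-per-line list could be normalized and deduplicated like this (lowercased host, fragment dropped, trailing slash trimmed):

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, trim a trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def dedupe_urls(url_list: str) -> list:
    """Split a one-URL-per-line string and keep the first of each normalized URL."""
    seen, result = set(), []
    for line in url_list.splitlines():
        if not line.strip():
            continue  # skip blank lines
        url = normalize_url(line)
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


pasted = "https://Example.com/a/\nhttps://example.com/a#top\n\nhttps://example.com/b"
unique_urls = dedupe_urls(pasted)
print(unique_urls)
```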
Extraction Settings
| Field | Type | Default | Description |
|---|---|---|---|
| Output Schema | JSON | See below | JSON Schema defining the data structure you want to extract. Required. |
| Scraping Mode | Select | hybrid | hybrid = HTML first, Vision fallback; html-only = Fast, no AI; vision-only = Full AI extraction |
| Vision Model | Select | gpt-4o-mini | OpenAI model: gpt-4o-mini (fast/cheap), gpt-4o (balanced), gpt-4-turbo (most capable) |
API Key Configuration
⚠️ IMPORTANT: The `openAiApiKey` is REQUIRED for `hybrid` and `vision-only` modes!
| Field | Type | Required | Description |
|---|---|---|---|
| OpenAI API Key | Secret | Yes for hybrid/vision-only | Your OpenAI API key (format: sk-...). Get one at platform.openai.com/api-keys |
Mode Requirements:
| Mode | API Key | What Happens Without It |
|---|---|---|
| html-only | ❌ Not needed | Works normally, extracts only: email, phone, price, url, date |
| hybrid | ✅ Required | ❌ Will fail - Cannot call Vision AI for missing fields |
| vision-only | ✅ Required | ❌ Will fail - Cannot process any pages |
Limits & Budget
| Field | Type | Default | Description |
|---|---|---|---|
| Max Results | Integer | 100 | Maximum items to extract. Set to 0 for unlimited. |
| Max Vision API Pages | Integer | 10 | Maximum pages to process with Vision API. Controls AI costs. |
| Vision Token Budget | Integer | 50,000 | Maximum tokens for all Vision API calls. Prevents runaway costs. |
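A rough model of how the two Vision limits interact is sketched below. The `VisionBudget` class is hypothetical and only mirrors the defaults in the table above; the Actor's own accounting may differ (for example in whether a page that overshoots the token budget is still processed).

```python
from dataclasses import dataclass


@dataclass
class VisionBudget:
    """Tracks Vision usage against a page cap and a token cap."""

    max_pages: int = 10
    token_budget: int = 50_000
    pages_used: int = 0
    tokens_used: int = 0

    def can_process(self) -> bool:
        # Both limits must still have headroom before another Vision call.
        return (
            self.pages_used < self.max_pages
            and self.tokens_used < self.token_budget
        )

    def record(self, tokens: int) -> None:
        self.pages_used += 1
        self.tokens_used += tokens


budget = VisionBudget(max_pages=2, token_budget=3_000)
processed = 0
for page_tokens in [1250, 1850, 1420]:  # made-up per-page token counts
    if not budget.can_process():
        break
    budget.record(page_tokens)
    processed += 1
print(processed, budget.tokens_used)
```

Whichever limit is reached first stops further Vision calls, so lowering either one caps the OpenAI spend of a run.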
Proxy & Browser
| Field | Type | Default | Description |
|---|---|---|---|
| Proxy Configuration | Object | Apify Residential | Configure proxy for anti-bot protection and geo-targeting. |
| Browser Engine | Select | chromium | Choose between chromium or firefox. |
Advanced
| Field | Type | Description |
|---|---|---|
| Webhook Callback URL | URL | Optional URL to receive progress updates (HTTPS recommended). |
Output Format
Dataset Structure
Each extracted item in the Dataset contains:
{
  "url": "https://example.com/product/123",
  "method": "hybrid",
  "data": {
    "title": "Example Product",
    "price": 99.99,
    "description": "Product description..."
  },
  "confidence": {
    "title": 0.95,
    "price": 0.90,
    "description": 0.85
  },
  "confidenceAverage": 0.90,
  "missingFields": [],
  "tokensUsed": 1250,
  "timestamp": "2024-01-15T10:30:00.000Z"
}
| Field | Description |
|---|---|
| url | The scraped page URL |
| method | Extraction method used: html-only, html-heuristic, vision, or vision-retry |
| data | Your extracted data matching the schema |
| confidence | Per-field confidence scores (0-1) |
| confidenceAverage | Overall extraction confidence |
| missingFields | List of required fields that couldn't be extracted |
| tokensUsed | OpenAI tokens consumed for this page |
| timestamp | ISO 8601 extraction timestamp |
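The `confidenceAverage` and `missingFields` fields make it straightforward to triage results after a run. The helper below is a hypothetical post-processing step, and the 0.8 threshold is an arbitrary choice for illustration.

```python
def split_by_quality(items, min_confidence=0.8):
    """Separate items that look complete from those needing manual review."""
    accepted, review = [], []
    for item in items:
        complete = not item.get("missingFields")
        confident = item.get("confidenceAverage", 0.0) >= min_confidence
        (accepted if complete and confident else review).append(item)
    return accepted, review


items = [
    {"url": "https://example.com/a", "confidenceAverage": 0.91, "missingFields": []},
    {"url": "https://example.com/b", "confidenceAverage": 0.95, "missingFields": ["price"]},
    {"url": "https://example.com/c", "confidenceAverage": 0.55, "missingFields": []},
]
accepted, review = split_by_quality(items)
print([i["url"] for i in accepted], [i["url"] for i in review])
```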
Key-Value Store
The Actor also stores artifacts for debugging:
- Screenshots: `screenshot-{hash}.png` - Full-page screenshots
- HTML: `html-{hash}.html` - Raw HTML content
- Stats: `STATS` - Run statistics (pages processed, tokens used, errors)
Examples
Example 1: News Article Extraction
Extract structured data from news articles:
Input:
{
  "urlList": "https://www.bbc.com/news/article",
  "mode": "hybrid",
  "schema": {
    "type": "object",
    "properties": {
      "headline": { "type": "string", "description": "The main headline of the article" },
      "author": { "type": "string", "description": "The author or journalist name" },
      "publishDate": { "type": "string", "description": "Publication date of the article" },
      "summary": { "type": "string", "description": "Brief summary or lead paragraph" },
      "category": { "type": "string", "description": "News category (e.g., Politics, Technology, Sports)" }
    },
    "required": ["headline", "publishDate"]
  }
}
Expected Output:
{
  "url": "https://www.bbc.com/news/article",
  "method": "hybrid",
  "data": {
    "headline": "Breaking: Major Climate Agreement Reached at Summit",
    "author": "Jane Smith",
    "publishDate": "2024-12-05",
    "summary": "World leaders have agreed on a landmark climate deal that aims to reduce global emissions by 50% by 2030.",
    "category": "Environment"
  },
  "confidence": {
    "headline": 0.98,
    "author": 0.85,
    "publishDate": 0.95,
    "summary": 0.90,
    "category": 0.88
  },
  "confidenceAverage": 0.91,
  "missingFields": [],
  "tokensUsed": 1150,
  "timestamp": "2024-12-05T14:30:00.000Z"
}
Example 2: Company/Business Page
Extract company information from about pages:
Input:
{
  "urlList": "https://example.com/about",
  "mode": "vision-only",
  "schema": {
    "type": "object",
    "properties": {
      "companyName": { "type": "string", "description": "Official company name" },
      "description": { "type": "string", "description": "Company description or mission statement" },
      "services": {
        "type": "array",
        "items": { "type": "string" },
        "description": "List of services or products offered"
      },
      "teamMembers": {
        "type": "array",
        "items": { "type": "string" },
        "description": "Names of key team members or leadership"
      },
      "contactInfo": {
        "type": "object",
        "properties": {
          "email": { "type": "string", "description": "Contact email" },
          "phone": { "type": "string", "description": "Contact phone number" },
          "address": { "type": "string", "description": "Physical address" }
        },
        "description": "Company contact information"
      }
    },
    "required": ["companyName", "description"]
  }
}
Expected Output:
{
  "url": "https://example.com/about",
  "method": "vision",
  "data": {
    "companyName": "TechVentures Inc.",
    "description": "We are a leading technology consulting firm helping businesses transform through innovative digital solutions.",
    "services": ["Cloud Migration", "AI Integration", "Custom Software Development", "Data Analytics"],
    "teamMembers": ["John Doe - CEO", "Sarah Johnson - CTO", "Mike Chen - VP Engineering"],
    "contactInfo": {
      "email": "contact@techventures.com",
      "phone": "+1 (555) 123-4567",
      "address": "123 Innovation Drive, San Francisco, CA 94105"
    }
  },
  "confidence": {
    "companyName": 0.99,
    "description": 0.92,
    "services": 0.88,
    "teamMembers": 0.85,
    "contactInfo": 0.90
  },
  "confidenceAverage": 0.91,
  "missingFields": [],
  "tokensUsed": 1850,
  "timestamp": "2024-12-05T15:45:00.000Z"
}
Example 3: Job Listing
Extract job posting details from career pages:
Input:
{
  "urlList": "https://careers.example.com/job",
  "mode": "hybrid",
  "schema": {
    "type": "object",
    "properties": {
      "jobTitle": { "type": "string", "description": "The job position title" },
      "company": { "type": "string", "description": "Hiring company name" },
      "location": { "type": "string", "description": "Job location (city, remote, hybrid)" },
      "salary": { "type": "string", "description": "Salary range or compensation details" },
      "requirements": {
        "type": "array",
        "items": { "type": "string" },
        "description": "Required skills and qualifications"
      },
      "description": { "type": "string", "description": "Full job description and responsibilities" }
    },
    "required": ["jobTitle", "company", "location"]
  }
}
Expected Output:
{
  "url": "https://careers.example.com/job",
  "method": "hybrid",
  "data": {
    "jobTitle": "Senior Software Engineer",
    "company": "InnovateTech Solutions",
    "location": "Remote (US-based)",
    "salary": "$150,000 - $180,000 per year",
    "requirements": [
      "5+ years of experience in software development",
      "Proficiency in Python, JavaScript, and cloud technologies",
      "Experience with microservices architecture",
      "Strong communication and collaboration skills",
      "Bachelor's degree in Computer Science or equivalent"
    ],
    "description": "We are seeking a Senior Software Engineer to join our growing team. You will be responsible for designing and implementing scalable backend systems, mentoring junior developers, and collaborating with cross-functional teams to deliver high-quality software solutions."
  },
  "confidence": {
    "jobTitle": 0.98,
    "company": 0.95,
    "location": 0.92,
    "salary": 0.88,
    "requirements": 0.90,
    "description": 0.93
  },
  "confidenceAverage": 0.93,
  "missingFields": [],
  "tokensUsed": 1420,
  "timestamp": "2024-12-05T16:20:00.000Z"
}
Tip: Fields named `phone`, `price`, and `date` are automatically detected using smart heuristics, even in `html-only` mode!
Pricing & Cost Considerations
Apify Platform Costs
Standard Apify platform usage fees apply based on compute units consumed.
OpenAI API Costs (External)
This Actor uses the OpenAI API for Vision extraction. You are responsible for OpenAI API costs.
| Model | Input Cost | Output Cost | Best For |
|---|---|---|---|
| gpt-4o-mini | $0.15/1M tokens | $0.60/1M tokens | Most use cases (recommended) |
| gpt-4o | $2.50/1M tokens | $10.00/1M tokens | Complex extractions |
| gpt-4-turbo | $10.00/1M tokens | $30.00/1M tokens | Maximum accuracy |
Typical costs per page:
- HTML-only mode: Free (no OpenAI calls)
- Hybrid mode: $0.001 - $0.005 per page
- Vision-only mode: $0.002 - $0.010 per page
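To see how the per-token prices translate into a per-page figure, the sketch below applies the table's rates to a token count. The split between input and output tokens is a made-up example (the Actor reports only a combined `tokensUsed`), so treat the result as an order-of-magnitude estimate rather than a quote.

```python
# USD per 1M tokens (input, output), copied from the table above.
PRICES_PER_1M = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
    "gpt-4-turbo": (10.00, 30.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated OpenAI cost in USD for one call."""
    input_price, output_price = PRICES_PER_1M[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical split: 1,000 input tokens and 250 output tokens for one page.
for model in PRICES_PER_1M:
    print(model, f"${estimate_cost(model, 1000, 250):.6f}")
```

Multiplying the per-page estimate by `maxVisionPages` gives a rough upper bound for the Vision cost of a run.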
Cost Control Tips
- Start with `html-only` mode for simple, static pages
- Use `hybrid` mode to minimize Vision API calls
- Set `maxVisionPages` to limit AI-processed pages
- Set `visionTokenBudget` to cap total token usage
- Use `gpt-4o-mini` (default) for cost-effective extraction
Limitations
- OpenAI API Required: Vision modes require a valid OpenAI API key
- Rate Limits: Subject to OpenAI API rate limits
- Complex Pages: Very complex layouts may require higher token budgets
- Dynamic Content: Some JavaScript-heavy sites may need `vision-only` mode
- Proxy Costs: Using Apify Proxy incurs additional platform costs
- No Link Following: The Actor processes only the provided URLs (no crawling)
Security & Compliance
- API Keys: Your OpenAI API key is stored securely and never logged
- Data Privacy: Extracted data is stored only in your Apify account
- Compliance: You are responsible for ensuring your use complies with:
  - Target website Terms of Service
  - robots.txt directives
  - GDPR, CCPA, and other applicable regulations
Support
Need help? Here's how to get support:
- Documentation: Check the Apify Documentation
- Discord: Join the Apify Discord Community
- Forum: Ask questions on the Apify Community Forum
- Issues: Report bugs through Apify Console support
Resources
- JSON Schema Guide - Learn how to write extraction schemas
- OpenAI Vision API - Understand Vision capabilities
- Apify Proxy - Configure proxy settings
Built with Apify SDK and OpenAI GPT-4 Vision
