Hybrid Vision Spider | AI-Powered Universal Web Scraper

Pricing

from $13.00 / 1,000 results

AI-driven hybrid web scraper that merges Playwright and Vision intelligence to extract structured data from any dynamic site. Schema-aware, proxy-ready, budget-safe, and fully compatible with Apify datasets.

Rating

5.0 (2)

Developer

Țugui Dragoș (Maintained by Community)

Actor stats

Bookmarked: 2
Total users: 7
Monthly active users: 3
Last modified: 7 days ago

Hybrid Vision Spider is an advanced web scraper that combines traditional HTML parsing with AI-powered visual understanding to extract structured data from any webpage. Simply provide URLs and define what data you want using a JSON Schema - the Actor handles the rest.

Recommendation

For accurate and comprehensive data extraction, we recommend hybrid or vision-only mode with an OpenAI API key. The html-only mode relies on regex patterns and can extract only the heuristic field types (email, phone, price, URL, date), while Vision AI can interpret page content and extract any structured data you define in your schema.

| Mode | Best For | Accuracy | Requires API Key |
|------|----------|----------|------------------|
| html-only | Basic data (emails, phones, prices, URLs, dates) | Medium | No |
| hybrid | Most use cases | High | Yes |
| vision-only | Complex/visual data | Highest | Yes |

Perfect for:

  • E-commerce product data extraction
  • News and blog article scraping
  • Contact information gathering
  • Price monitoring and comparison
  • Any structured data extraction task

Key Features

| Feature | Description |
|---------|-------------|
| Hybrid Extraction | Combines fast HTML parsing with AI Vision for maximum accuracy |
| AI Vision (GPT-4) | "Sees" the page like a human - extracts data from images, complex layouts, and dynamic content |
| JSON Schema Output | Define exactly what data you want using standard JSON Schema |
| Three Modes | Choose between html-only (fast), vision-only (accurate), or hybrid (balanced) |
| Smart Heuristics | Auto-detects emails, phone numbers, prices, and dates |
| Deduplication | Automatically removes duplicate results |
| Proxy Support | Built-in Apify Proxy integration for anti-bot protection |
| Confidence Scores | Know how reliable each extracted field is |

How It Works

The Three Operating Modes

The Actor offers three extraction modes, each with different capabilities and requirements:

| Mode | API Key Required | What It Can Extract | Best For |
|------|------------------|---------------------|----------|
| html-only | ❌ No | Only: email, phone, price, url, date | Simple data, fast extraction, no AI costs |
| hybrid | ✅ Yes (OpenAI) | Everything in your schema | Balanced speed/accuracy, cost-effective |
| vision-only | ✅ Yes (OpenAI) | Everything in your schema | Complex layouts, images, dynamic content |

What Each Mode Can Extract

HTML-Only Mode (Heuristics)

Uses regex patterns and smart heuristics to extract only these field types:

  • 📧 Email addresses - Detects email patterns in text and links
  • 📞 Phone numbers - Recognizes various phone formats
  • 💰 Prices - Extracts currency values and amounts
  • 🔗 URLs - Finds links and web addresses
  • 📅 Dates - Parses date formats

⚠️ Important: If your schema includes fields like productName, description, rating, etc., they will be empty in html-only mode because heuristics cannot extract arbitrary text content.
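
To make the limitation concrete, here is a minimal Python sketch of name-based regex heuristics. The patterns and the function name `extract_heuristic_fields` are assumptions made for illustration; the Actor's actual patterns are not documented here and are likely more thorough.

```python
import re

# Illustrative patterns only (assumptions); not the Actor's real rules.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "price": re.compile(r"[$€£]\s?\d+(?:[.,]\d{3})*(?:[.,]\d{2})?"),
    "url": re.compile(r"https?://[^\s\"'<>]+"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def extract_heuristic_fields(text: str, schema: dict) -> dict:
    """Return the first regex match for each schema property with a known heuristic name.

    Properties without a matching heuristic (e.g. productName) come back as None.
    """
    result = {}
    for name in schema.get("properties", {}):
        pattern = PATTERNS.get(name)
        match = pattern.search(text) if pattern else None
        result[name] = match.group(0) if match else None
    return result
```

With a schema containing `productName`, `email`, and `price`, this sketch fills `email` and `price` but leaves `productName` as `None`, which mirrors the behavior described in the warning above.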

Hybrid Mode

  1. First tries HTML heuristics for supported fields
  2. Then uses Vision AI to fill in missing/complex fields
  3. Combines results for maximum accuracy
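
The three steps above can be sketched roughly as follows. This is a hypothetical outline, not the Actor's source: `hybrid_extract`, `html_extract`, and `vision_extract` are illustrative names, and letting heuristic values take precedence over vision values is an assumption.

```python
from typing import Callable, Dict, List, Optional

Fields = Dict[str, Optional[object]]


def hybrid_extract(
    field_names: List[str],
    html_extract: Callable[[List[str]], Fields],
    vision_extract: Callable[[List[str]], Fields],
) -> Dict[str, object]:
    """Run heuristics first, then ask the vision extractor only for the gaps."""
    # Step 1: cheap HTML heuristics for every requested field.
    html_fields = html_extract(field_names)
    found = {k: v for k, v in html_fields.items() if v is not None}

    # Step 2: call the (costly) vision extractor only for fields still missing.
    missing = [name for name in field_names if name not in found]
    vision_fields = vision_extract(missing) if missing else {}

    # Step 3: combine; heuristic values are kept, vision fills the gaps (assumption).
    merged = dict(found)
    for name in missing:
        if vision_fields.get(name) is not None:
            merged[name] = vision_fields[name]

    return {
        "data": merged,
        "missingFields": [n for n in field_names if n not in merged],
    }
```

Skipping the vision call when nothing is missing is what would keep hybrid mode cheaper than vision-only on pages the heuristics already cover.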

Vision-Only Mode

  • Sends a screenshot to GPT-4 Vision
  • AI "sees" the page like a human
  • Can extract any data visible on the page
  • Best for complex layouts, images, or dynamic content

Why You Need an OpenAI API Key

| Without API Key | With API Key |
|-----------------|--------------|
| Can only use html-only mode | Can use all three modes |
| Limited to 5 field types (email, phone, price, url, date) | Extract any data from your schema |
| Fast but limited | AI "sees" the page and understands context |
| Free (no AI costs) | Pay-per-use OpenAI pricing |

Get your API key: OpenAI Platform (platform.openai.com/api-keys)

Defining Your Schema Correctly

For best results, follow these guidelines. The // annotations in the example are explanatory only; remove them before use, because JSON does not allow comments.

{
  "type": "object",
  "properties": {
    "productName": {
      "type": "string",
      "description": "The full product name as shown on the page" // ✅ Good: descriptive
    },
    "price": {
      "type": "number",
      "description": "Current price in USD" // ✅ Good: specific
    },
    "rating": {
      "type": "number" // ⚠️ Missing description - AI may not know what to look for
    }
  },
  "required": ["productName", "price"] // ✅ Mark essential fields as required
}

Tips:

  • ✅ Add description to every field - tells the AI exactly what to look for
  • ✅ Use required for essential fields - Actor will retry if these are missing
  • ✅ Use specific field names - productName is better than name
  • ✅ Match field names to heuristics - email, phone, price, url, date work in all modes
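
Before a run, the first two tips can be checked mechanically. The helper below is a small sketch using only the Python standard library; `lint_schema` is a hypothetical name and is not part of the Actor.

```python
def lint_schema(schema: dict) -> list:
    """Return human-readable warnings for common schema omissions."""
    warnings = []
    properties = schema.get("properties", {})

    # Tip 1: every field should carry a description for the AI to follow.
    for name, spec in properties.items():
        if not spec.get("description"):
            warnings.append(f"{name}: missing description")

    # Tip 2: every required field must actually be defined in properties.
    for name in schema.get("required", []):
        if name not in properties:
            warnings.append(f"{name}: listed in required but not defined in properties")

    return warnings
```

Run against the annotated schema above (with the comments removed), this would flag only `rating`, the one field without a description.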

How to Use

Step 1: Add Your URLs

Enter the URLs you want to scrape, using either of these methods:

  • Simple list: Paste URLs one per line in the "Start URLs" field
  • Advanced: Use the Request List editor for custom headers or methods

Step 2: Define Your Output Schema

Create a JSON Schema that describes the data you want to extract. For example:

{
  "type": "object",
  "properties": {
    "title": { "type": "string", "description": "Product name" },
    "price": { "type": "number", "description": "Price in USD" },
    "description": { "type": "string", "description": "Product description" }
  },
  "required": ["title", "price"]
}

Step 3: Configure Settings

  • Choose your scraping mode (hybrid recommended for most cases)
  • Set limits to control costs (max results, vision pages, token budget)
  • Add your OpenAI API key (required for hybrid and vision-only modes)

Step 4: Run and Get Results

Click "Start" and wait for the Actor to finish. Your structured data will be available in the Dataset.

Input Configuration

URLs

| Field | Type | Description |
|-------|------|-------------|
| Start URLs (simple list) | Text | Paste URLs one per line. The Actor normalizes and deduplicates automatically. |
| Advanced Request Sources | Array | For advanced users: supports custom HTTP methods, headers, and userData. |
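
If you want to preview how a pasted list might collapse, the sketch below normalizes and deduplicates URLs with Python's standard library. The specific rules (lowercasing scheme and host, dropping fragments, treating an empty path as `/`) are assumptions; the Actor's own normalization rules are not specified here.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(raw: str) -> str:
    """Lowercase scheme and host, drop the fragment, default an empty path to '/'."""
    parts = urlsplit(raw.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def parse_url_list(text: str) -> list:
    """Split a one-URL-per-line string, skip blanks, and keep first occurrences only."""
    seen = set()
    urls = []
    for line in text.splitlines():
        if not line.strip():
            continue
        url = normalize_url(line)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```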

Extraction Settings

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| Output Schema | JSON | See below | JSON Schema defining the data structure you want to extract. Required. |
| Scraping Mode | Select | hybrid | hybrid = HTML first, Vision fallback; html-only = Fast, no AI; vision-only = Full AI extraction |
| Vision Model | Select | gpt-4o-mini | OpenAI model: gpt-4o-mini (fast/cheap), gpt-4o (balanced), gpt-4-turbo (most capable) |

API Key Configuration

⚠️ IMPORTANT: The openAiApiKey is REQUIRED for hybrid and vision-only modes!

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| OpenAI API Key | Secret | Yes for hybrid/vision-only | Your OpenAI API key (format: sk-...). Get one at platform.openai.com/api-keys |

Mode Requirements:

| Mode | API Key | What Happens Without It |
|------|---------|-------------------------|
| html-only | ❌ Not needed | Works normally, extracts only: email, phone, price, url, date |
| hybrid | Required | Will fail - Cannot call Vision AI for missing fields |
| vision-only | Required | Will fail - Cannot process any pages |
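
A client-side check can catch a missing key before a run is started. The sketch below builds an input object using the field names that appear in this README (`urlList`, `mode`, `schema`, `openAiApiKey`); the helper `build_run_input` is hypothetical, and joining several URLs with newlines in `urlList` is an assumption based on the "one per line" description.

```python
VISION_MODES = {"hybrid", "vision-only"}
ALL_MODES = VISION_MODES | {"html-only"}


def build_run_input(url_list, schema, mode="hybrid", openai_api_key=None):
    """Assemble Actor input and fail early if the mode needs a key that is absent."""
    if mode not in ALL_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode in VISION_MODES and not openai_api_key:
        raise ValueError(f"mode {mode!r} requires an OpenAI API key")

    run_input = {
        "urlList": "\n".join(url_list),  # assumption: one URL per line
        "mode": mode,
        "schema": schema,
    }
    if openai_api_key:
        run_input["openAiApiKey"] = openai_api_key
    return run_input
```

Raising locally is cheaper than discovering the failure after the Actor has already started and consumed compute units.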

Limits & Budget

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| Max Results | Integer | 100 | Maximum items to extract. Set to 0 for unlimited. |
| Max Vision API Pages | Integer | 10 | Maximum pages to process with Vision API. Controls AI costs. |
| Vision Token Budget | Integer | 50,000 | Maximum tokens for all Vision API calls. Prevents runaway costs. |
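
One way to picture how the two vision limits interact is a simple guard that is consulted before each Vision call. This is an illustrative model under stated assumptions, not the Actor's implementation: the class `VisionBudget` is hypothetical, and whether the real Actor stops exactly at the limit or after exceeding it is not documented here.

```python
from dataclasses import dataclass


@dataclass
class VisionBudget:
    """Tracks vision usage against a page cap and a token cap (defaults from the table)."""

    max_vision_pages: int = 10
    vision_token_budget: int = 50_000
    pages_used: int = 0
    tokens_used: int = 0

    def can_process(self) -> bool:
        # Both limits must still have headroom before another Vision call is made.
        return (
            self.pages_used < self.max_vision_pages
            and self.tokens_used < self.vision_token_budget
        )

    def record(self, tokens: int) -> None:
        # Call after each Vision request with the tokens it consumed.
        self.pages_used += 1
        self.tokens_used += tokens
```

Under this model, whichever limit is reached first ends Vision processing, so a low page cap protects against many cheap pages while the token budget protects against a few expensive ones.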

Proxy & Browser

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| Proxy Configuration | Object | Apify Residential | Configure proxy for anti-bot protection and geo-targeting. |
| Browser Engine | Select | chromium | Choose between chromium or firefox. |

Advanced

| Field | Type | Description |
|-------|------|-------------|
| Webhook Callback URL | URL | Optional URL to receive progress updates (HTTPS recommended). |

Output Format

Dataset Structure

Each extracted item in the Dataset contains:

{
  "url": "https://example.com/product/123",
  "method": "hybrid",
  "data": {
    "title": "Example Product",
    "price": 99.99,
    "description": "Product description..."
  },
  "confidence": {
    "title": 0.95,
    "price": 0.90,
    "description": 0.85
  },
  "confidenceAverage": 0.90,
  "missingFields": [],
  "tokensUsed": 1250,
  "timestamp": "2024-01-15T10:30:00.000Z"
}

| Field | Description |
|-------|-------------|
| url | The scraped page URL |
| method | Extraction method used: html-only, html-heuristic, vision, or vision-retry |
| data | Your extracted data matching the schema |
| confidence | Per-field confidence scores (0-1) |
| confidenceAverage | Overall extraction confidence |
| missingFields | List of required fields that couldn't be extracted |
| tokensUsed | OpenAI tokens consumed for this page |
| timestamp | ISO 8601 extraction timestamp |
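
When post-processing the Dataset, the `confidence` and `missingFields` fields can drive a simple review filter. The sketch below works on plain dictionaries shaped like the item above; `needs_review`, `average_confidence`, and the 0.8 threshold are illustrative choices, and treating `confidenceAverage` as an arithmetic mean is an assumption that happens to match the sample values.

```python
def needs_review(item: dict, min_confidence: float = 0.8) -> bool:
    """Flag items with missing required fields or any low-confidence field."""
    if item.get("missingFields"):
        return True
    scores = item.get("confidence", {}).values()
    return any(score < min_confidence for score in scores)


def average_confidence(item: dict) -> float:
    """Arithmetic mean of per-field confidence, rounded to two decimals."""
    scores = list(item.get("confidence", {}).values())
    return round(sum(scores) / len(scores), 2) if scores else 0.0
```

Raising `min_confidence` trades more manual review for fewer bad records reaching downstream systems.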

Key-Value Store

The Actor also stores artifacts for debugging:

  • Screenshots: screenshot-{hash}.png - Full-page screenshots
  • HTML: html-{hash}.html - Raw HTML content
  • Stats: STATS - Run statistics (pages processed, tokens used, errors)

Examples

Example 1: News Article Extraction

Extract structured data from news articles:

Input:

{
  "urlList": "https://www.bbc.com/news/article",
  "mode": "hybrid",
  "schema": {
    "type": "object",
    "properties": {
      "headline": { "type": "string", "description": "The main headline of the article" },
      "author": { "type": "string", "description": "The author or journalist name" },
      "publishDate": { "type": "string", "description": "Publication date of the article" },
      "summary": { "type": "string", "description": "Brief summary or lead paragraph" },
      "category": { "type": "string", "description": "News category (e.g., Politics, Technology, Sports)" }
    },
    "required": ["headline", "publishDate"]
  }
}

Expected Output:

{
  "url": "https://www.bbc.com/news/article",
  "method": "hybrid",
  "data": {
    "headline": "Breaking: Major Climate Agreement Reached at Summit",
    "author": "Jane Smith",
    "publishDate": "2024-12-05",
    "summary": "World leaders have agreed on a landmark climate deal that aims to reduce global emissions by 50% by 2030.",
    "category": "Environment"
  },
  "confidence": {
    "headline": 0.98,
    "author": 0.85,
    "publishDate": 0.95,
    "summary": 0.90,
    "category": 0.88
  },
  "confidenceAverage": 0.91,
  "missingFields": [],
  "tokensUsed": 1150,
  "timestamp": "2024-12-05T14:30:00.000Z"
}

Example 2: Company/Business Page

Extract company information from about pages:

Input:

{
  "urlList": "https://example.com/about",
  "mode": "vision-only",
  "schema": {
    "type": "object",
    "properties": {
      "companyName": { "type": "string", "description": "Official company name" },
      "description": { "type": "string", "description": "Company description or mission statement" },
      "services": {
        "type": "array",
        "items": { "type": "string" },
        "description": "List of services or products offered"
      },
      "teamMembers": {
        "type": "array",
        "items": { "type": "string" },
        "description": "Names of key team members or leadership"
      },
      "contactInfo": {
        "type": "object",
        "properties": {
          "email": { "type": "string", "description": "Contact email" },
          "phone": { "type": "string", "description": "Contact phone number" },
          "address": { "type": "string", "description": "Physical address" }
        },
        "description": "Company contact information"
      }
    },
    "required": ["companyName", "description"]
  }
}

Expected Output:

{
  "url": "https://example.com/about",
  "method": "vision",
  "data": {
    "companyName": "TechVentures Inc.",
    "description": "We are a leading technology consulting firm helping businesses transform through innovative digital solutions.",
    "services": ["Cloud Migration", "AI Integration", "Custom Software Development", "Data Analytics"],
    "teamMembers": ["John Doe - CEO", "Sarah Johnson - CTO", "Mike Chen - VP Engineering"],
    "contactInfo": {
      "email": "contact@techventures.com",
      "phone": "+1 (555) 123-4567",
      "address": "123 Innovation Drive, San Francisco, CA 94105"
    }
  },
  "confidence": {
    "companyName": 0.99,
    "description": 0.92,
    "services": 0.88,
    "teamMembers": 0.85,
    "contactInfo": 0.90
  },
  "confidenceAverage": 0.91,
  "missingFields": [],
  "tokensUsed": 1850,
  "timestamp": "2024-12-05T15:45:00.000Z"
}

Example 3: Job Listing

Extract job posting details from career pages:

Input:

{
  "urlList": "https://careers.example.com/job",
  "mode": "hybrid",
  "schema": {
    "type": "object",
    "properties": {
      "jobTitle": { "type": "string", "description": "The job position title" },
      "company": { "type": "string", "description": "Hiring company name" },
      "location": { "type": "string", "description": "Job location (city, remote, hybrid)" },
      "salary": { "type": "string", "description": "Salary range or compensation details" },
      "requirements": {
        "type": "array",
        "items": { "type": "string" },
        "description": "Required skills and qualifications"
      },
      "description": { "type": "string", "description": "Full job description and responsibilities" }
    },
    "required": ["jobTitle", "company", "location"]
  }
}

Expected Output:

{
  "url": "https://careers.example.com/job",
  "method": "hybrid",
  "data": {
    "jobTitle": "Senior Software Engineer",
    "company": "InnovateTech Solutions",
    "location": "Remote (US-based)",
    "salary": "$150,000 - $180,000 per year",
    "requirements": [
      "5+ years of experience in software development",
      "Proficiency in Python, JavaScript, and cloud technologies",
      "Experience with microservices architecture",
      "Strong communication and collaboration skills",
      "Bachelor's degree in Computer Science or equivalent"
    ],
    "description": "We are seeking a Senior Software Engineer to join our growing team. You will be responsible for designing and implementing scalable backend systems, mentoring junior developers, and collaborating with cross-functional teams to deliver high-quality software solutions."
  },
  "confidence": {
    "jobTitle": 0.98,
    "company": 0.95,
    "location": 0.92,
    "salary": 0.88,
    "requirements": 0.90,
    "description": 0.93
  },
  "confidenceAverage": 0.93,
  "missingFields": [],
  "tokensUsed": 1420,
  "timestamp": "2024-12-05T16:20:00.000Z"
}

Tip: Fields named email, phone, price, url, and date are detected automatically by the built-in heuristics, even in html-only mode.

Pricing & Cost Considerations

Apify Platform Costs

Standard Apify platform usage fees apply based on compute units consumed.

OpenAI API Costs (External)

This Actor uses the OpenAI API for Vision extraction. You are responsible for OpenAI API costs.

| Model | Input Cost | Output Cost | Best For |
|-------|------------|-------------|----------|
| gpt-4o-mini | $0.15/1M tokens | $0.60/1M tokens | Most use cases (recommended) |
| gpt-4o | $2.50/1M tokens | $10.00/1M tokens | Complex extractions |
| gpt-4-turbo | $10.00/1M tokens | $30.00/1M tokens | Maximum accuracy |

Typical costs per page:

  • HTML-only mode: Free (no OpenAI calls)
  • Hybrid mode: $0.001 - $0.005 per page
  • Vision-only mode: $0.002 - $0.010 per page
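
For a rough estimate of your own workload, the per-token prices in the table above can be plugged into a small calculator. This is a sketch: `estimate_cost` is a hypothetical helper, the prices are copied from this README and may be out of date, and the split between input and output tokens must come from your own measurements because the Dataset reports only a combined `tokensUsed`.

```python
# USD per 1M tokens as (input, output), copied from the pricing table in this README.
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
    "gpt-4-turbo": (10.00, 30.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated OpenAI cost in USD for one or more Vision calls."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000
```

For example, `estimate_cost("gpt-4o", 1000, 250)` prices a single page that used 1,000 input and 250 output tokens; multiply by your expected page count to size `visionTokenBudget`.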

Cost Control Tips

  1. Start with html-only mode for simple, static pages
  2. Use hybrid mode to minimize Vision API calls
  3. Set maxVisionPages to limit AI-processed pages
  4. Set visionTokenBudget to cap total token usage
  5. Use gpt-4o-mini (default) for cost-effective extraction

Limitations

  • OpenAI API Required: Vision modes require a valid OpenAI API key
  • Rate Limits: Subject to OpenAI API rate limits
  • Complex Pages: Very complex layouts may require higher token budgets
  • Dynamic Content: Some JavaScript-heavy sites may need vision-only mode
  • Proxy Costs: Using Apify Proxy incurs additional platform costs
  • No Link Following: The Actor processes only the provided URLs (no crawling)

Security & Compliance

  • API Keys: Your OpenAI API key is stored securely and never logged
  • Data Privacy: Extracted data is stored only in your Apify account
  • Compliance: You are responsible for ensuring your use complies with:
    • Target website Terms of Service
    • robots.txt directives
    • GDPR, CCPA, and other applicable regulations

Support

Need help? Here's how to get support:

Resources


Built with Apify SDK and OpenAI GPT-4 Vision