AI Web Scraper - Extract Data With Ease

Pricing

Pay per event


Developed by

Paco

Maintained by Community

AI Web Scraper enables scraping for everyone, including non-techies. It uses Google's Gemini LLM to scrape websites from natural language commands: it extracts data dynamically with no selector input needed, handles dynamic content and cookie consent, avoids bot detection, and outputs JSON or other formats.

Rating: 2.0 (1)


Last modified

13 days ago

AI Web Scraper

This AI Web Scraper is a powerful, flexible tool that uses Large Language Models (LLMs), specifically Google's Gemini, to intelligently extract data from web pages. Unlike traditional scrapers that rely on pre-defined selectors, this actor lets you specify the data you need in natural language, and it automatically adapts to extract it.

🚀 Performance Optimized: This version includes significant performance improvements: concurrent processing, intelligent content growth detection, memory optimization, and batch data processing for faster, more reliable scraping.

Disclaimer: The goal of this scraper is to make scraping possible for everyone, including non-techies. It is still quite experimental, since it relies on certain vision capabilities, so results can sometimes be inconsistent or not entirely what you'd expect.

What Does This Actor Do?

This actor automates the process of web scraping by combining browser automation with AI-powered element identification. Here’s a breakdown of its key capabilities:

  • Dynamic Data Extraction: You briefly specify the items you want to scrape, separated by commas, in natural language (e.g., "product name, product price"), and the actor intelligently identifies and extracts those values from the web page.
  • Intelligent Element Identification: Leveraging Google's Gemini LLM, it analyzes web page screenshots to pinpoint the location of relevant elements and labels them, even if the website's structure is unfamiliar.
  • Performance Optimized Scrolling: Uses intelligent content growth detection instead of fixed delays, making scrolling faster and more reliable while respecting timeout safeguards.
  • Concurrent Processing: Processes multiple screenshots simultaneously with controlled concurrency to maximize speed while respecting API rate limits.
  • Memory Efficient: Automatically clears screenshots from memory after processing to prevent memory accumulation during large scraping jobs.
  • Batch Data Processing: Saves data in optimized batches instead of individual items, reducing API overhead and improving performance.
  • Smart Data Consolidation: Automatically removes duplicate items and keeps only the most complete data. Groups items by their most discriminative fields (e.g., product name) and selects the version with the most complete information, ensuring cleaner datasets.
  • Robust Error Handling: Includes comprehensive error handling, timeouts, and resource cleanup to ensure reliable operation.
  • Flexible Data Structure: The actor returns the data in a structured JSON format, with labels derived from the user's instructions or the bounding boxes provided by Gemini itself, making it easy to use in your own applications, reports, or spreadsheets.
  • Avoids Bot Detection: Takes several measures to avoid bot detection (using realistic user-agents and headless browser settings).

How to Use the AI Web Scraper

Using this actor is straightforward:

  1. Create an Apify Account: Start with a free Apify account using your email.
  2. Open the AI Web Scraper: Go to the actor page.
  3. Provide Instructions and URLs: Input your desired instructions (e.g., "product name, product price") and one or more target URLs.
  4. Run the Actor: Click the "Start" button and wait for the data to be extracted.
  5. Download Your Data: Retrieve the scraped data in JSON format.

Input

To start scraping data, the actor accepts the following input parameters:

Core Parameters

  • start_urls: An array of at least two URLs of the web pages you want to scrape
  • instructions: A list of items you wish to scrape, separated by commas, e.g., "product name, product price"

Performance Parameters (Optional)

  • max_concurrent_screenshots: Maximum number of screenshots to process simultaneously (default: 4)
  • screenshot_timeout: Timeout in seconds for each screenshot analysis (default: 60)

Data Quality Parameters (Optional)

  • enable_smart_consolidation: Automatically remove duplicate items and keep only the most complete data (default: true, highly recommended)

Scrolling Options

  • has_infinite_scroll: Enable intelligent infinite scrolling with content growth detection
  • above_fold_only: Only capture content visible without scrolling

Other Options

  • save_screenshots: Save screenshots to key-value store for debugging
  • device_type: Choose between "desktop" or "mobile" viewport simulation
  • mobile_device_model: Specific mobile device to emulate when using mobile device type

Here’s an example of an input configuration in JSON format:

{
  "instructions": "Product name, Product price, SKU number, Product Dimensions",
  "start_urls": [
    "https://www.boontoon.com/metal-wall-hanging-of-lord-ganesha-divinity-and-elegance-bh-0848",
    "https://www.boontoon.com/circular-yellow-bag-with-floral-print-and-elephant-design-rja-0036",
    "https://www.ledlichtdiscounter.nl/1-fase-rail-connector-i-vorm-zwart.html"
  ],
  "max_concurrent_screenshots": 6,
  "screenshot_timeout": 60,
  "has_infinite_scroll": false,
  "above_fold_only": false,
  "device_type": "desktop",
  "enable_smart_consolidation": true
}

Output

The output from this Actor is stored in a dataset. You can view this data in the Apify UI or download it in JSON, CSV, or other formats. Here is the output corresponding to the input example provided above:

[
  {
    "url": "https://www.boontoon.com/metal-wall-hanging-of-lord-ganesha-divinity-and-elegance-bh-0848",
    "data": {
      "Product name": "Metal Wall Hanging Of Lord Ganesha- Divinity And Elegance",
      "Product price": "₹ 470.00 /piece",
      "SKU number": "0",
      "Product Dimensions": "0"
    }
  },
  {
    "url": "https://www.boontoon.com/circular-yellow-bag-with-floral-print-and-elephant-design-rja-0036",
    "data": {
      "Product name": "Circular Yellow Bag With Floral Print And Elephant Design",
      "Product price": "₹ 350.00 /piece",
      "SKU number": "0",
      "Product Dimensions": "0"
    }
  },
  {
    "url": "https://www.ledlichtdiscounter.nl/1-fase-rail-connector-i-vorm-zwart.html",
    "data": {
      "Product name": "1-Fase rail connector - I-vorm - Zwart",
      "Product price": "€ 1,95",
      "SKU number": "PLX849640"
    }
  }
]

How Can I Use the Data Extracted with AI Web Scraper?

  • Market Research: Extract product information, pricing, and customer reviews for competitive analysis.
  • Content Aggregation: Collect data for news aggregation, research, or blog content.
  • Financial Analysis: Gather financial metrics and performance data from various financial websites.
  • E-commerce Intelligence: Extract and monitor product and pricing information from online stores.
  • Lead Generation: Collect relevant information for potential business opportunities.

How Does the AI Web Scraper Work?

The AI Web Scraper combines advanced browser automation with Google's Gemini LLM to offer a cutting-edge solution for web scraping. This actor operates in multiple stages to ensure efficient, accurate, and flexible data extraction.

1. Input Configuration

  • User Instructions: The scraper accepts natural language instructions describing the data to extract, such as "product name, price, and dimensions."
  • Start URLs: A list of URLs serves as the input target for scraping.
  • Performance Parameters: Configurable concurrency limits (max_concurrent_screenshots) and timeout settings (screenshot_timeout) for optimal performance.

2. AI-Powered Screenshot Analysis

  • Direct Screenshot Processing: The scraper captures screenshots of web pages and uses Google's Gemini 2.0 Flash Lite model to directly analyze visual content and extract data.
  • Dynamic Schema Generation: Creates a dynamic Pydantic schema from user instructions using LLM processing to ensure structured data extraction.
  • No Selector Dependencies: Unlike traditional scrapers, this approach doesn't rely on CSS selectors or DOM parsing, making it resilient to website changes.
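According to the section above, the actor derives a structured schema from the free-text instructions (reportedly a Pydantic model built with LLM assistance). A minimal, dependency-free sketch of the same idea, where the function name `schema_from_instructions` and the JSON-schema-style output are illustrative assumptions, not the actor's actual code:

```python
import re

def schema_from_instructions(instructions: str) -> dict:
    """Turn a comma-separated instruction string into a simple
    JSON-schema-like dict with snake_case string fields."""
    fields = [f.strip() for f in instructions.split(",") if f.strip()]
    props = {
        re.sub(r"\W+", "_", f.lower()).strip("_"): {"type": "string"}
        for f in fields
    }
    return {"type": "object", "properties": props, "required": list(props)}
```

For example, "Product name, Product price" would yield an object schema with `product_name` and `product_price` string properties, which can then be handed to the LLM as the required response shape.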

3. Intelligent Content Growth Detection

  • Event-Driven Scrolling: Waits for actual content growth after scrolling operations instead of using fixed delays.
  • Timeout Safeguards: Uses configurable timeouts (max 3000ms for infinite scroll, 1500ms for regular scrolling) to prevent hanging.
  • Smart Scroll Detection: Monitors both scroll position changes and content height growth to determine when to continue or stop scrolling.
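The scroll decision described above can be sketched as a pure helper: keep scrolling only while the page grew or the viewport actually moved, and never past the timeout. The function name and signature are illustrative, not the actor's actual implementation:

```python
def should_keep_scrolling(prev_height: int, new_height: int,
                          prev_scroll_y: float, new_scroll_y: float,
                          elapsed_ms: float, timeout_ms: float = 3000) -> bool:
    """Continue only while the page is still producing content:
    either the document grew or the viewport actually moved,
    and the per-scroll timeout has not been exceeded."""
    if elapsed_ms >= timeout_ms:
        return False  # timeout safeguard: never hang on a stalled page
    grew = new_height > prev_height
    moved = new_scroll_y > prev_scroll_y
    return grew or moved
```

In a browser-automation loop, the heights and scroll offsets would come from the page (e.g., document height and scroll position measured before and after each scroll step).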

4. Performance-Optimized Processing

  • Controlled Concurrency: Uses semaphore-based concurrency control to process multiple screenshots simultaneously while respecting API rate limits.
  • Single Page Reuse: Reuses a single browser page instance across all URLs to reduce resource overhead.
  • Memory Management: Automatically clears screenshots from memory after processing to prevent accumulation during large jobs.
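Semaphore-based concurrency control of this kind might look like the following asyncio sketch, where `analyze_screenshot` is a stand-in that only simulates the Gemini call:

```python
import asyncio

async def analyze_screenshot(shot_id: int) -> dict:
    # Stand-in for the Gemini call; real code would send the image bytes.
    await asyncio.sleep(0)
    return {"screenshot": shot_id, "items": []}

async def process_all(shot_ids, max_concurrent: int = 4):
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(shot_id):
        async with sem:  # at most max_concurrent analyses in flight
            return await analyze_screenshot(shot_id)

    # gather preserves input order even though tasks finish out of order
    return await asyncio.gather(*(bounded(s) for s in shot_ids))

results = asyncio.run(process_all(range(10), max_concurrent=4))
```

Raising `max_concurrent` (the `max_concurrent_screenshots` input) trades higher throughput against the risk of hitting the LLM API's rate limits.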

5. Smart Data Consolidation

  • Duplicate Detection: Automatically identifies and removes duplicate items using discriminative field analysis.
  • Data Completion: Keeps the most complete version of each item when duplicates are found (e.g., retains records with more filled fields).
  • Intelligent Grouping: Groups items by their most identifying characteristics (typically product names or titles) for accurate deduplication.

6. Batch Data Processing

  • Optimized Saves: Processes data in batches of 100 items to reduce API overhead and improve performance.
  • Fallback Handling: Includes individual item fallback if batch processing fails, ensuring no data loss.
  • Structured Output: Each item is wrapped in a consistent JSON structure for easy integration.
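The batch-save-with-fallback behavior could be sketched like this, where `push_many` and `push_one` stand in for whatever dataset client the actor actually uses:

```python
def save_in_batches(items, push_many, push_one, batch_size: int = 100):
    """Push items in batches; if a batch fails, fall back to one-by-one
    so a single bad item cannot lose the whole batch."""
    saved = 0
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        try:
            push_many(batch)
            saved += len(batch)
        except Exception:
            for item in batch:
                try:
                    push_one(item)
                    saved += 1
                except Exception:
                    pass  # skip only the item that actually failed
    return saved
```

With a batch size of 100, saving 250 items costs three API calls instead of 250, and the fallback path still rescues every good item if a batch write fails.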

7. Resource Management

  • Comprehensive Cleanup: Ensures proper cleanup of browser instances, contexts, and pages in all scenarios.
  • Error Recovery: Continues processing even if individual screenshots or URLs fail, maximizing data extraction success.
  • Timeout Management: Implements configurable timeouts at multiple levels (page load, screenshot processing, LLM analysis).

Key Features:

  • AI-Driven Flexibility: Direct visual analysis eliminates the need for predefined selectors or DOM knowledge.
  • Performance Optimized: Concurrent processing, intelligent scrolling, memory management, and batch operations for maximum efficiency.
  • Robust Error Handling: Comprehensive timeout handling, resource cleanup, and graceful error recovery at every level.
  • Smart Data Quality: Automatic deduplication and data consolidation for cleaner, higher-quality datasets.
  • Memory Efficient: Proactive cleanup of screenshots and browser resources to prevent memory accumulation.
  • Configurable Performance: Adjustable concurrency limits and timeouts to balance speed with API rate limits and reliability.
  • Bot Detection Avoidance: Uses realistic user agents and optimized browser settings to minimize detection.
  • Scalable Architecture: Handles large scraping jobs efficiently through batch processing and resource optimization.

Advantages of This Approach:

  • User-Friendly: Simple natural language instructions make it accessible to non-technical users.
  • Highly Adaptable: Visual analysis approach works with any website structure, including dynamic and complex layouts.
  • Performance Optimized: Multiple optimization layers ensure fast, efficient processing even for large-scale scraping jobs.
  • Reliable: Multi-level error handling and timeout management ensure consistent operation across diverse websites.
  • Data Quality Focused: Built-in deduplication and consolidation produce cleaner, more useful datasets automatically.

Smart Data Consolidation

The AI Web Scraper includes an intelligent data consolidation system that automatically improves data quality by removing duplicates and keeping only the most complete information.

How It Works

When scraping pages with multiple screenshots (especially with infinite scroll), the same items often appear multiple times with varying levels of completeness. For example:

Before Consolidation:

[
  {"title": "MacBook Pro", "price": "$1,999", "discount_price": ""},
  {"title": "MacBook Pro", "price": "$1,999", "discount_price": "$1,799"},
  {"title": "MacBook Pro", "price": "", "rating": "4.8"}
]

After Smart Consolidation:

[
  {"title": "MacBook Pro", "price": "$1,999", "discount_price": "$1,799", "rating": "4.8"}
]

Key Benefits

  • Eliminates Duplicates: Automatically groups items that refer to the same entity
  • Maximizes Completeness: Keeps the version with the most complete information
  • Schema Agnostic: Works with any field combination you specify in your instructions
  • Zero Configuration: No setup required - it automatically adapts to your data structure
  • Cost Effective: Pure algorithmic approach with no additional API calls

Technical Details

The system uses advanced data distribution analysis to:

  1. Identify discriminative fields (e.g., product names, titles) for grouping
  2. Group similar items using the most reliable identifying characteristics
  3. Select the most complete version from each group based on non-empty field count
  4. Merge consistent values when all items in a group agree on a field value
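The steps above can be sketched as a simplified consolidation routine, here applied to the MacBook example from earlier in this section. The real actor analyzes field distributions to pick the discriminative field and handles conflicting values, so treat this first-non-empty-value merge as an illustration of the idea only:

```python
from collections import defaultdict

def consolidate(items, key_field):
    """Group records by a discriminative field and merge each group,
    keeping the first non-empty value seen for every field."""
    groups = defaultdict(list)
    for item in items:
        groups[item.get(key_field, "")].append(item)
    merged = []
    for group in groups.values():
        out = {}
        for item in group:
            for field, value in item.items():
                if value and not out.get(field):
                    out[field] = value
        merged.append(out)
    return merged

rows = [
    {"title": "MacBook Pro", "price": "$1,999", "discount_price": ""},
    {"title": "MacBook Pro", "price": "$1,999", "discount_price": "$1,799"},
    {"title": "MacBook Pro", "price": "", "rating": "4.8"},
]
```

Running `consolidate(rows, "title")` collapses the three partial records into a single record carrying the price, discount price, and rating together.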

This feature is enabled by default and highly recommended for cleaner, higher-quality datasets. You can disable it by setting enable_smart_consolidation: false in your input if needed.

Integrations

This Actor integrates with other Apify platform components and external services:

  • Webhooks: Automatically notify you when the scraping is complete or send the data to another application.
  • API: Control the Actor programmatically using the Apify API.
  • Cloud Services: Use Apify integrations to automatically store the data in services like Google Sheets, Google Drive, Slack, and others.

Scrape Any Web Data You Need with This Dynamic Scraper

This AI Web Scraper is your one-stop solution for scraping any data you need. Whether it's a product name, a price, a news headline, or a financial metric, this actor adapts to extract it by analyzing the context of your instructions.

Not What You Need? Build Your Own!

If this actor doesn't exactly meet your needs, you can use one of the scraper templates available in Python, JavaScript, and TypeScript to get started or check out our open-source library Crawlee.

You can also request a custom scraping solution from us.

Your Feedback

Your feedback is valuable to us. If you have any suggestions or find a bug, please create an issue on the Actor's Issues tab in the Apify Console.

FAQ

How much does AI Web Scraper cost?

This actor uses Apify's Pay-per-event pricing model. Apify also provides you with free monthly usage credits.

How can I use AI Web Scraper with the Apify API?

You can access the Apify API programmatically via RESTful HTTP endpoints or SDKs (apify-client NPM package for JavaScript, apify-client PyPI package for Python) to run, manage, and get the data out of any actor.
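For Python, here is a minimal sketch of the raw REST route that the SDKs wrap. It builds (but does not send) the request that starts an actor run via `POST /v2/acts/{actorId}/runs`; the actor ID and token below are placeholders you would replace with your own:

```python
import json
from urllib.parse import quote

API_BASE = "https://api.apify.com/v2"

def build_run_request(actor_id: str, token: str, run_input: dict):
    """Build the URL and JSON body for starting an actor run via the
    Apify REST API (POST /v2/acts/{actorId}/runs)."""
    url = f"{API_BASE}/acts/{quote(actor_id, safe='')}/runs?token={token}"
    body = json.dumps(run_input)
    return url, body

url, body = build_run_request(
    "username~ai-web-scraper",  # placeholder actor ID
    "MY_APIFY_TOKEN",           # placeholder API token
    {"instructions": "product name, product price",
     "start_urls": ["https://example.com/product"]},
)
```

The returned URL and body can be sent with any HTTP client; in practice the apify-client packages handle this (plus polling the run and fetching the dataset) for you.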

This actor only extracts data that is publicly available. Please ensure that you comply with the terms and conditions of websites you scrape, and you are responsible for ensuring your compliance with data privacy regulations such as GDPR.