AI Web Scraper - Extract Data With Ease
Pricing
Pay per event
AI Web Scraper makes scraping accessible to everyone, including non-techies! It uses Google's Gemini LLM to scrape websites from natural language commands. It extracts data dynamically with no selector input needed, handles dynamic content and cookie consent, avoids bot detection, and outputs JSON or other formats.
Rating
5.0
(2)
Developer

Paco
Actor stats
24
Bookmarked
865
Total users
42
Monthly active users
2 days ago
Last modified
AI Web Scraper
This AI Web Scraper is a powerful and flexible tool that uses the power of Large Language Models (LLMs), specifically Google's Gemini, to intelligently extract data from web pages. Unlike traditional scrapers that rely on pre-defined selectors, this actor lets you specify what data you need in natural language, and it automatically adapts to extract it.
🚀 Performance Optimized: This version includes significant performance improvements including concurrent processing, intelligent content growth detection, memory optimization, and batch data processing for faster and more reliable scraping.
Disclaimer: The goal of this scraper is to make scraping possible for everyone, including non-techies. It is still quite experimental, since it relies on certain vision capabilities, so results can sometimes be inconsistent or not entirely what you'd expect.
What Does This Actor Do?
This actor automates the process of web scraping by combining browser automation with AI-powered element identification. Here’s a breakdown of its key capabilities:
- Dynamic Data Extraction: You briefly specify the items you want to scrape, separated by commas, in natural language (e.g., "product name, product price"), and the actor intelligently identifies and extracts those values from the web page.
- Intelligent Element Identification: Leveraging Google's Gemini LLM, it analyzes web page screenshots to pinpoint the location of relevant elements and labels them, even if the website's structure is unfamiliar.
- Performance Optimized Scrolling: Uses intelligent content growth detection instead of fixed delays, making scrolling faster and more reliable while respecting timeout safeguards.
- Concurrent Processing: Processes multiple screenshots simultaneously with controlled concurrency to maximize speed while respecting API rate limits.
- Memory Efficient: Automatically clears screenshots from memory after processing to prevent memory accumulation during large scraping jobs.
- Batch Data Processing: Saves data in optimized batches instead of individual items, reducing API overhead and improving performance.
- Smart Data Consolidation: Automatically removes duplicate items and keeps only the most complete data. Groups items by their most discriminative fields (e.g., product name) and selects the version with the most complete information, ensuring cleaner datasets.
- Robust Error Handling: Includes comprehensive error handling, timeouts, and resource cleanup to ensure reliable operation.
- Flexible Data Structure: The actor returns the data in a structured JSON format, with labels derived from the user's instructions or the bounding boxes provided by Gemini itself, making it easy to use in your own applications, reports, or spreadsheets.
- Avoids Bot Detection: Takes several measures to avoid bot detection (using realistic user-agents and headless browser settings).
How to Use the AI Web Scraper
Using this actor is straightforward:
- Create an Apify Account: Start with a free Apify account using your email.
- Open the AI Web Scraper: Go to the actor page.
- Provide Instructions and URLs: Input your desired instructions (e.g., "product name, product price") and one or more target URLs.
- Run the Actor: Click the "Start" button and wait for the data to be extracted.
- Download Your Data: Retrieve the scraped data in JSON format.
Input
To start scraping data, the actor accepts the following input parameters:
Core Parameters
- start_urls: An array of at least two URLs of the web pages you want to scrape
- instructions: A list of items you wish to scrape, separated by commas, e.g., "product name, product price"
Performance Parameters (Optional)
- max_concurrent_screenshots: Maximum number of screenshots to process simultaneously (default: 4)
- screenshot_timeout: Timeout in seconds for each screenshot analysis (default: 60)
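To illustrate how these two parameters interact, here is a minimal Python sketch of bounded concurrency: a semaphore caps the number of in-flight analyses (as max_concurrent_screenshots does), and each analysis is wrapped in its own timeout (as screenshot_timeout is). The `analyze_screenshot` function is a hypothetical stand-in for the real Gemini call.

```python
import asyncio

async def analyze_screenshot(index: int) -> str:
    # Hypothetical stand-in for the real Gemini screenshot analysis call.
    await asyncio.sleep(0.01)
    return f"result-{index}"

async def process_screenshots(count: int, max_concurrent: int, timeout: float) -> list:
    # The semaphore mirrors max_concurrent_screenshots: at most
    # `max_concurrent` analyses run at the same time.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(i: int) -> str:
        async with semaphore:
            # screenshot_timeout bounds each individual analysis.
            return await asyncio.wait_for(analyze_screenshot(i), timeout=timeout)

    # gather preserves input order, so results line up with screenshots.
    return await asyncio.gather(*(bounded(i) for i in range(count)))

results = asyncio.run(process_screenshots(count=10, max_concurrent=4, timeout=60))
```

Raising the concurrency limit speeds up large pages but increases the chance of hitting API rate limits, which is why the default stays conservative.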
Data Quality Parameters (Optional)
enable_smart_consolidation: Automatically remove duplicate items and keep only the most complete data (default: true, highly recommended)
Scrolling Options
- has_infinite_scroll: Enable intelligent infinite scrolling with content growth detection
- above_fold_only: Only capture content visible without scrolling
Other Options
- save_screenshots: Save screenshots to key-value store for debugging
- device_type: Choose between "desktop" or "mobile" viewport simulation
- mobile_device_model: Specific mobile device to emulate when using mobile device type
Here’s an example of an input configuration in JSON format:
{"instructions": "Product name, Product price, SKU number, Product Dimensions","start_urls": ["https://www.boontoon.com/metal-wall-hanging-of-lord-ganesha-divinity-and-elegance-bh-0848","https://www.boontoon.com/circular-yellow-bag-with-floral-print-and-elephant-design-rja-0036","https://www.ledlichtdiscounter.nl/1-fase-rail-connector-i-vorm-zwart.html"],"max_concurrent_screenshots": 6,"screenshot_timeout": 60,"has_infinite_scroll": false,"above_fold_only": false,"device_type": "desktop","enable_smart_consolidation": true}
Output
The output from this Actor is stored in a dataset. You can view this data in the Apify UI or download it in JSON, CSV or other formats. Here is the example corresponding to the input example provided above:
{"url": "https://www.boontoon.com/metal-wall-hanging-of-lord-ganesha-divinity-and-elegance-bh-0848","data": {"Product name": "Metal Wall Hanging Of Lord Ganesha- Divinity And Elegance","Product price": "₹ 470.00 /piece","SKU number": "0","Product Dimensions": "0"}},{"url": "https://www.boontoon.com/circular-yellow-bag-with-floral-print-and-elephant-design-rja-0036","data": {"Product name": "Circular Yellow Bag With Floral Print And Elephant Design","Product price": "₹ 350.00 /piece","SKU number": "0","Product Dimensions": "0"}},{"url": "https://www.ledlichtdiscounter.nl/1-fase-rail-connector-i-vorm-zwart.html","data": {"Product name": "1-Fase rail connector - I-vorm - Zwart","Product price": "€ 1,95","SKU number": "PLX849640"}}
How Can I Use the Data Extracted with AI Web Scraper?
- Market Research: Extract product information, pricing, and customer reviews for competitive analysis.
- Content Aggregation: Collect data for news aggregation, research, or blog content.
- Financial Analysis: Gather financial metrics and performance data from various financial websites.
- E-commerce Intelligence: Extract and monitor product and pricing information from online stores.
- Lead Generation: Collect relevant information for potential business opportunities.
How Does the AI Web Scraper Work?
The AI Web Scraper combines advanced browser automation with Google's Gemini LLM to offer a cutting-edge solution for web scraping. This actor operates in multiple stages to ensure efficient, accurate, and flexible data extraction.
1. Input Configuration
- User Instructions: The scraper accepts natural language instructions describing the data to extract, such as "product name, price, and dimensions."
- Start URLs: A list of URLs serves as the input target for scraping.
2. AI-Powered Element Detection
- Screenshot Analysis: The scraper takes a screenshot of the web page and uses the Gemini LLM to identify bounding boxes around relevant elements, enabling dynamic and adaptive data extraction without requiring hardcoded selectors. When multiple URLs from the same domain are provided, only the first page is analyzed this way; subsequent URLs reuse the generated CSS selectors, keeping LLM costs low.
- Bounding Box Parsing: The bounding box coordinates returned by the AI model are mapped to the DOM structure of the web page.
3. Intelligent Content Growth Detection
- Instead of using fixed delays, the scraper waits for actual content growth after scrolling operations.
- Uses event-driven waits with timeout safeguards (max 3 seconds) to ensure optimal performance.
- Detects when pages have finished loading new content before proceeding with screenshot capture.
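The wait-for-growth idea can be sketched as a small polling loop. This is an illustrative approximation, not the actor's actual implementation: `measure_height` stands in for something like evaluating document.body.scrollHeight in a real browser session, and the simulated page below is hypothetical.

```python
import time

def wait_for_content_growth(measure_height, max_wait: float = 3.0, poll: float = 0.05) -> bool:
    """Poll a page-height callback until it grows or the timeout elapses.

    Returns True if new content appeared before `max_wait` seconds,
    False if the timeout safeguard fired first.
    """
    start_height = measure_height()
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        if measure_height() > start_height:
            return True  # content grew; safe to capture the next screenshot
        time.sleep(poll)
    return False  # no growth: page is likely done loading

# Simulated page whose height grows after a few measurements.
heights = iter([1000, 1000, 1000, 1400, 1400])
grew = wait_for_content_growth(lambda: next(heights), max_wait=1.0)
```

Compared with a fixed delay, this loop returns as soon as growth is detected, so fast pages are not penalized, while the `max_wait` cap keeps slow or static pages from stalling the run.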
4. Selector Extraction
- For each matched DOM element, the scraper generates a CSS selector and extracts its corresponding HTML content. This enables robust and reusable data extraction across similar pages.
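A selector can be derived by walking up from the matched element and describing each ancestor. The sketch below is a simplified, hypothetical version of that idea: each path step is a (tag, id, classes, nth-child) tuple, and an id lets the selector restart from a unique anchor.

```python
def css_selector_for(path):
    """Build a CSS selector from a simplified DOM path.

    `path` is a list of (tag, elem_id, classes, nth_child) tuples from
    root to target; this illustrates how a matched element can be turned
    into a selector that is reusable on similar pages.
    """
    parts = []
    for tag, elem_id, classes, nth in path:
        if elem_id:
            # An id is unique on the page, so the path can restart here.
            parts = [f"{tag}#{elem_id}"]
        elif classes:
            parts.append(tag + "".join(f".{c}" for c in classes))
        else:
            # Fall back to positional matching when nothing else identifies it.
            parts.append(f"{tag}:nth-child({nth})")
    return " > ".join(parts)

selector = css_selector_for([
    ("div", "products", [], 1),
    ("ul", None, ["grid"], 2),
    ("li", None, [], 3),
    ("span", None, ["price"], 1),
])
```

Class- and id-based steps generalize across sibling product cards, while nth-child steps pin down elements that have no distinguishing attributes.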
5. Data Extraction
- The scraper retrieves text or attributes (e.g., innerHTML, text) from elements identified by the selectors, ensuring the collected data aligns precisely with user instructions.
6. Output Formatting
- Extracted data is structured in JSON format. Each record contains the URL of the scraped page and the extracted data, making it easy to integrate into various applications.
7. Domain Optimization
- To enhance efficiency, the scraper caches selectors for domains it has processed. Subsequent pages from the same domain reuse these selectors, reducing the need for repeated AI analysis.
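The caching idea reduces the expensive AI step to once per domain. A minimal sketch (the cache layout and `detect` callback here are hypothetical, not the actor's internals):

```python
# Map each domain to the selectors the AI detected on its first page.
selector_cache = {}

def selectors_for(domain: str, detect):
    # Run the costly AI detection only on a cache miss;
    # later pages from the same domain reuse the stored selectors.
    if domain not in selector_cache:
        selector_cache[domain] = detect()
    return selector_cache[domain]

calls = 0
def detect():
    global calls
    calls += 1  # count how many times the "AI analysis" actually runs
    return {"Product name": "h1.title"}

first = selectors_for("boontoon.com", detect)
second = selectors_for("boontoon.com", detect)
```

With three same-domain URLs in the input example above, this pattern means one LLM analysis instead of three.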
Key Features:
- AI-Driven Flexibility: Removes the need for predefined selectors by leveraging AI to dynamically identify elements.
- Performance Optimized: Concurrent processing, intelligent scrolling, and memory management for faster operation.
- Robust Error Handling: Comprehensive timeout handling, resource cleanup, and graceful error recovery.
- Batch Processing: Efficient data saving in batches to reduce API overhead and improve performance.
- Memory Efficient: Automatic cleanup of screenshots and resources to prevent memory accumulation.
- Configurable Concurrency: Adjustable concurrent processing limits to balance speed and API rate limits.
- Bot Detection Avoidance: Implements strategies like custom user agents and headless browsing to minimize detection.
- Structured Outputs: Outputs are clean and easy to use in JSON, CSV, or other formats.
Advantages of This Approach:
- User-Friendly: Natural language instructions make it accessible to non-technical users.
- Highly Adaptable: Capable of handling unknown or dynamically loaded page structures.
- Performance Optimized: Concurrent processing, intelligent waits, and memory management for faster operation.
- Reliable: Comprehensive error handling and timeout management ensure consistent operation.
- Scalable: Configurable concurrency and batch processing handle large scraping jobs efficiently.
Smart Data Consolidation
The AI Web Scraper includes an intelligent data consolidation system that automatically improves data quality by removing duplicates and keeping only the most complete information.
How It Works
When scraping pages with multiple screenshots (especially with infinite scroll), the same items often appear multiple times with varying levels of completeness. For example:
Before Consolidation:
[{"title": "MacBook Pro", "price": "$1,999", "discount_price": ""},{"title": "MacBook Pro", "price": "$1,999", "discount_price": "$1,799"},{"title": "MacBook Pro", "price": "", "rating": "4.8"}]
After Smart Consolidation:
[{"title": "MacBook Pro", "price": "$1,999", "discount_price": "$1,799", "rating": "4.8"}]
Key Benefits
- Eliminates Duplicates: Automatically groups items that refer to the same entity
- Maximizes Completeness: Keeps the version with the most complete information
- Schema Agnostic: Works with any field combination you specify in your instructions
- Zero Configuration: No setup required; it automatically adapts to your data structure
- Cost Effective: Pure algorithmic approach with no additional API calls
Technical Details
The system uses advanced data distribution analysis to:
- Identify discriminative fields (e.g., product names, titles) for grouping
- Group similar items using the most reliable identifying characteristics
- Select the most complete version from each group based on non-empty field count
- Merge consistent values when all items in a group agree on a field value
This feature is enabled by default and highly recommended for cleaner, higher-quality datasets. You can disable it by setting enable_smart_consolidation: false in your input if needed.
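The core of the consolidation step can be sketched as group-then-merge. This is a simplified illustration, not the actor's exact algorithm: here the discriminative field is passed in explicitly as `key_field` (e.g. "title"), whereas the actor chooses it automatically via data distribution analysis.

```python
from collections import defaultdict

def consolidate(items, key_field):
    """Group records by a discriminative field, then merge each group.

    For every field, the first non-empty value wins, so the result is at
    least as complete as the most complete record in the group.
    """
    groups = defaultdict(list)
    for item in items:
        groups[item.get(key_field, "")].append(item)

    consolidated = []
    for group in groups.values():
        merged = {}
        for item in group:
            for field, value in item.items():
                if value and not merged.get(field):
                    merged[field] = value
        consolidated.append(merged)
    return consolidated

# The MacBook Pro example from above: three partial sightings of one product.
items = [
    {"title": "MacBook Pro", "price": "$1,999", "discount_price": ""},
    {"title": "MacBook Pro", "price": "$1,999", "discount_price": "$1,799"},
    {"title": "MacBook Pro", "price": "", "rating": "4.8"},
]
result = consolidate(items, "title")
```

Running this on the before-consolidation example yields the single merged record shown in the after-consolidation example: one "MacBook Pro" entry carrying the price, the discount price, and the rating.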
Integrations
This Actor integrates with other Apify platform components and other external services:
- Webhooks: Automatically notify you when the scraping is complete or send the data to another application.
- API: Control the Actor programmatically using the Apify API.
- Cloud Services: Use Apify integrations to automatically store the data in services like Google Sheets, Google Drive, Slack, and others.
Scrape Any Web Data You Need with This Dynamic Scraper
This AI Web Scraper is your one-stop solution for scraping any data you need. Whether it's a product name, a price, a news headline, or a financial metric, this actor adapts to extract it by analyzing the context of your instructions.
Not What You Need? Build Your Own!
If this actor doesn't exactly meet your needs, you can use one of the scraper templates available in Python, JavaScript, and TypeScript to get started or check out our open-source library Crawlee.
You can also request a custom scraping solution from us.
Your Feedback
Your feedback is valuable to us. If you have any suggestions or find a bug, please create an issue on the Actor's Issues tab in the Apify Console.
FAQ
How much does AI Web Scraper cost?
This actor uses Apify's pay-per-event pricing model. Apify also provides you with free monthly usage credits.
How can I use AI Web Scraper with the Apify API?
You can access the Apify API programmatically via RESTful HTTP endpoints or SDKs (apify-client NPM package for JavaScript, apify-client PyPI package for Python) to run, manage, and get the data out of any actor.
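As a rough sketch of the Python route, the snippet below builds the actor input described in the Input section and shows how a run would be started with the apify-client package. The actor id "username/ai-web-scraper" and the URL are placeholders; the network call is kept inside a function (and the import lazy) so nothing runs without a valid API token.

```python
def build_run_input(instructions, urls):
    # Assemble the actor input documented in the Input section above.
    return {
        "instructions": instructions,
        "start_urls": urls,
        "enable_smart_consolidation": True,
    }

def run_scraper(token, actor_id, run_input):
    # Requires `pip install apify-client` and a valid API token,
    # so this function is defined but not executed here.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    # Stream the scraped items out of the run's default dataset.
    yield from client.dataset(run["defaultDatasetId"]).iterate_items()

run_input = build_run_input("product name, product price", ["https://example.com/product"])
# for item in run_scraper("MY-API-TOKEN", "username/ai-web-scraper", run_input):
#     print(item)
```

The same payload works with the raw RESTful endpoints or the JavaScript client; only the transport differs.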
Is it legal to scrape data using the AI Web Scraper?
This actor only extracts data that is publicly available. Please ensure that you comply with the terms and conditions of websites you scrape, and you are responsible for ensuring your compliance with data privacy regulations such as GDPR.