AI Web Scraper - Extract Data With Ease
Pricing
Pay per event
AI Web Scraper makes scraping accessible to everyone, including non-techies! It uses Google's Gemini LLM to scrape websites from natural language commands. It extracts data dynamically with no selector input needed, handles dynamic content and cookie consent, avoids bot detection, and outputs JSON or other formats.
Rating
5.0
(2)
Developer

Paco
Actor stats
24
Bookmarked
865
Total users
42
Monthly active users
2 days ago
Last modified
AI Web Scraper
This AI Web Scraper is a powerful and flexible tool that uses the power of Large Language Models (LLMs), specifically Google's Gemini, to intelligently extract data from web pages. Unlike traditional scrapers that rely on pre-defined selectors, this actor lets you specify what data you need in natural language, and it automatically adapts to extract it.
🚀 Performance Optimized: This version includes significant performance improvements including concurrent processing, intelligent content growth detection, memory optimization, and batch data processing for faster and more reliable scraping.
Disclaimer: The goal of this scraper is to make scraping possible for everyone, including non-techies. It is still quite experimental, since it relies on certain vision capabilities, so results can sometimes be inconsistent or not entirely what you'd expect.
What Does This Actor Do?
This actor automates the process of web scraping by combining browser automation with AI-powered element identification. Here’s a breakdown of its key capabilities:
- Dynamic Data Extraction: You briefly specify the items you want to scrape, separated by commas, in natural language (e.g., "product name, product price"), and the actor intelligently identifies and extracts those values from the web page.
- Intelligent Element Identification: Leveraging Google's Gemini LLM, it analyzes web page screenshots to pinpoint the location of relevant elements and labels them, even if the website's structure is unfamiliar.
- Performance Optimized Scrolling: Uses intelligent content growth detection instead of fixed delays, making scrolling faster and more reliable while respecting timeout safeguards.
- Concurrent Processing: Processes multiple screenshots simultaneously with controlled concurrency to maximize speed while respecting API rate limits.
- Memory Efficient: Automatically clears screenshots from memory after processing to prevent memory accumulation during large scraping jobs.
- Batch Data Processing: Saves data in optimized batches instead of individual items, reducing API overhead and improving performance.
- Smart Data Consolidation: Automatically removes duplicate items and keeps only the most complete data. Groups items by their most discriminative fields (e.g., product name) and selects the version with the most complete information, ensuring cleaner datasets.
- Robust Error Handling: Includes comprehensive error handling, timeouts, and resource cleanup to ensure reliable operation.
- Flexible Data Structure: The actor returns the data in a structured JSON format, with labels derived from the user's instructions or the bounding boxes provided by Gemini itself, making it easy to use in your own applications, reports, or spreadsheets.
- Avoids Bot Detection: Takes several measures to avoid bot detection (using realistic user-agents and headless browser settings).
How to Use the AI Web Scraper
Using this actor is straightforward:
- Create an Apify Account: Start with a free Apify account using your email.
- Open the AI Web Scraper: Go to the actor page.
- Provide Instructions and URLs: Input your desired instructions (e.g., "product name, product price") and one or more target URLs.
- Run the Actor: Click the "Start" button and wait for the data to be extracted.
- Download Your Data: Retrieve the scraped data in JSON format.
Input
To start scraping data, the actor accepts the following input parameters:
Core Parameters
- start_urls: An array of at least two URLs of the web pages you want to scrape
- instructions: A list of items you wish to scrape, separated by commas, e.g., "product name, product price"
Performance Parameters (Optional)
- max_concurrent_screenshots: Maximum number of screenshots to process simultaneously (default: 4)
- screenshot_timeout: Timeout in seconds for each screenshot analysis (default: 60)
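To illustrate how these two parameters interact, here is a minimal Python sketch of bounded concurrency: a semaphore caps the number of in-flight analyses (as max_concurrent_screenshots does), and each analysis is wrapped in its own timeout (as screenshot_timeout is). The `analyze_screenshot` function is a hypothetical stand-in for the real Gemini call.

```python
import asyncio

async def analyze_screenshot(index: int) -> str:
    # Hypothetical stand-in for the real Gemini screenshot analysis call.
    await asyncio.sleep(0.01)
    return f"result-{index}"

async def process_screenshots(count: int, max_concurrent: int, timeout: float) -> list:
    # The semaphore mirrors max_concurrent_screenshots: at most
    # `max_concurrent` analyses run at the same time.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(i: int) -> str:
        async with semaphore:
            # screenshot_timeout bounds each individual analysis.
            return await asyncio.wait_for(analyze_screenshot(i), timeout=timeout)

    # gather preserves input order, so results line up with screenshots.
    return await asyncio.gather(*(bounded(i) for i in range(count)))

results = asyncio.run(process_screenshots(count=10, max_concurrent=4, timeout=60))
```

Raising the concurrency limit speeds up large pages but increases the chance of hitting API rate limits, which is why the default stays conservative.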
Data Quality Parameters (Optional)
enable_smart_consolidation: Automatically remove duplicate items and keep only the most complete data (default: true, highly recommended)
Scrolling Options
- has_infinite_scroll: Enable intelligent infinite scrolling with content growth detection
- above_fold_only: Only capture content visible without scrolling
Other Options
- save_screenshots: Save screenshots to key-value store for debugging
- device_type: Choose between "desktop" or "mobile" viewport simulation
- mobile_device_model: Specific mobile device to emulate when using mobile device type
Here’s an example of an input configuration in JSON format:
{"instructions": "Product name, Product price, SKU number, Product Dimensions","start_urls": ["https://www.boontoon.com/metal-wall-hanging-of-lord-ganesha-divinity-and-elegance-bh-0848","https://www.boontoon.com/circular-yellow-bag-with-floral-print-and-elephant-design-rja-0036","https://www.ledlichtdiscounter.nl/1-fase-rail-connector-i-vorm-zwart.html"],"max_concurrent_screenshots": 6,"screenshot_timeout": 60,"has_infinite_scroll": false,"above_fold_only": false,"device_type": "desktop","enable_smart_consolidation": true}
Output
The output from this Actor is stored in a dataset. You can view this data in the Apify UI or download it in JSON, CSV or other formats. Here is the example corresponding to the input example provided above:
{"url": "https://www.boontoon.com/metal-wall-hanging-of-lord-ganesha-divinity-and-elegance-bh-0848","data": {"Product name": "Metal Wall Hanging Of Lord Ganesha- Divinity And Elegance","Product price": "₹ 470.00 /piece","SKU number": "0","Product Dimensions": "0"}},{"url": "https://www.boontoon.com/circular-yellow-bag-with-floral-print-and-elephant-design-rja-0036","data": {"Product name": "Circular Yellow Bag With Floral Print And Elephant Design","Product price": "₹ 350.00 /piece","SKU number": "0","Product Dimensions": "0"}},{"url": "https://www.ledlichtdiscounter.nl/1-fase-rail-connector-i-vorm-zwart.html","data": {"Product name": "1-Fase rail connector - I-vorm - Zwart","Product price": "€ 1,95","SKU number": "PLX849640"}}
How Can I Use the Data Extracted with AI Web Scraper?
- Market Research: Extract product information, pricing, and customer reviews for competitive analysis.
- Content Aggregation: Collect data for news aggregation, research, or blog content.
- Financial Analysis: Gather financial metrics and performance data from various financial websites.
- E-commerce Intelligence: Extract and monitor product and pricing information from online stores.
- Lead Generation: Collect relevant information for potential business opportunities.
How Does the AI Web Scraper Work?
The AI Web Scraper combines advanced browser automation with Google's Gemini LLM to offer a cutting-edge solution for web scraping. This actor operates in multiple stages to ensure efficient, accurate, and flexible data extraction.
1. Input Configuration
- User Instructions: The scraper accepts natural language instructions describing the data to extract, such as "product name, price, and dimensions."
- Start URLs: A list of URLs serves as the input target for scraping.
2. AI-Powered Element Detection
- Screenshot Analysis: The scraper takes a screenshot of the web page and uses the Gemini LLM to identify bounding boxes around relevant elements, enabling dynamic and adaptive data extraction without requiring hardcoded selectors. When multiple URLs from the same domain are provided, only the first page is analyzed this way; subsequent URLs reuse the generated CSS selectors, keeping LLM costs low.
- Bounding Box Parsing: The bounding box coordinates returned by the AI model are mapped to the DOM structure of the web page.
3. Intelligent Content Growth Detection
- Instead of using fixed delays, the scraper waits for actual content growth after scrolling operations.
- Uses event-driven waits with timeout safeguards (max 3 seconds) to ensure optimal performance.
- Detects when pages have finished loading new content before proceeding with screenshot capture.
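The wait-for-growth idea can be sketched as a small polling loop. This is an illustrative approximation, not the actor's actual implementation: `measure_height` stands in for something like evaluating document.body.scrollHeight in a real browser session, and the simulated page below is hypothetical.

```python
import time

def wait_for_content_growth(measure_height, max_wait: float = 3.0, poll: float = 0.05) -> bool:
    """Poll a page-height callback until it grows or the timeout elapses.

    Returns True if new content appeared before `max_wait` seconds,
    False if the timeout safeguard fired first.
    """
    start_height = measure_height()
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        if measure_height() > start_height:
            return True  # content grew; safe to capture the next screenshot
        time.sleep(poll)
    return False  # no growth: page is likely done loading

# Simulated page whose height grows after a few measurements.
heights = iter([1000, 1000, 1000, 1400, 1400])
grew = wait_for_content_growth(lambda: next(heights), max_wait=1.0)
```

Compared with a fixed delay, this loop returns as soon as growth is detected, so fast pages are not penalized, while the `max_wait` cap keeps slow or static pages from stalling the run.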
4. Selector Extraction
- For each matched DOM element, the scraper generates a CSS selector and extracts its corresponding HTML content. This enables robust and reusable data extraction across similar pages.
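A selector can be derived by walking up from the matched element and describing each ancestor. The sketch below is a simplified, hypothetical version of that idea: each path step is a (tag, id, classes, nth-child) tuple, and an id lets the selector restart from a unique anchor.

```python
def css_selector_for(path):
    """Build a CSS selector from a simplified DOM path.

    `path` is a list of (tag, elem_id, classes, nth_child) tuples from
    root to target; this illustrates how a matched element can be turned
    into a selector that is reusable on similar pages.
    """
    parts = []
    for tag, elem_id, classes, nth in path:
        if elem_id:
            # An id is unique on the page, so the path can restart here.
            parts = [f"{tag}#{elem_id}"]
        elif classes:
            parts.append(tag + "".join(f".{c}" for c in classes))
        else:
            # Fall back to positional matching when nothing else identifies it.
            parts.append(f"{tag}:nth-child({nth})")
    return " > ".join(parts)

selector = css_selector_for([
    ("div", "products", [], 1),
    ("ul", None, ["grid"], 2),
    ("li", None, [], 3),
    ("span", None, ["price"], 1),
])
```

Class- and id-based steps generalize across sibling product cards, while nth-child steps pin down elements that have no distinguishing attributes.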
5. Data Extraction
- The scraper retrieves text or attributes (e.g., innerHTML, text) from elements identified by the selectors, ensuring the collected data aligns precisely with user instructions.
6. Output Formatting
- Extracted data is structured in JSON format. Each record contains the URL of the scraped page and the extracted data, making it easy to integrate into various applications.
7. Domain Optimization
- To enhance efficiency, the scraper caches selectors for domains it has processed. Subsequent pages from the same domain reuse these selectors, reducing the need for repeated AI analysis.
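The caching idea reduces the expensive AI step to once per domain. A minimal sketch (the cache layout and `detect` callback here are hypothetical, not the actor's internals):

```python
# Map each domain to the selectors the AI detected on its first page.
selector_cache = {}

def selectors_for(domain: str, detect):
    # Run the costly AI detection only on a cache miss;
    # later pages from the same domain reuse the stored selectors.
    if domain not in selector_cache:
        selector_cache[domain] = detect()
    return selector_cache[domain]

calls = 0
def detect():
    global calls
    calls += 1  # count how many times the "AI analysis" actually runs
    return {"Product name": "h1.title"}

first = selectors_for("boontoon.com", detect)
second = selectors_for("boontoon.com", detect)
```

With three same-domain URLs in the input example above, this pattern means one LLM analysis instead of three.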
Key Features:
- AI-Driven Flexibility: Removes the need for predefined selectors by leveraging AI to dynamically identify elements.
- Performance Optimized: Concurrent processing, intelligent scrolling, and memory management for faster operation.
- Robust Error Handling: Comprehensive timeout handling, resource cleanup, and graceful error recovery.
- Batch Processing: Efficient data saving in batches to reduce API overhead and improve performance.
- Memory Efficient: Automatic cleanup of screenshots and resources to prevent memory accumulation.
- Configurable Concurrency: Adjustable concurrent processing limits to balance speed and API rate limits.
- Bot Detection Avoidance: Implements strategies like custom user agents and headless browsing to minimize detection.
- Structured Outputs: Outputs are clean and easy to use in JSON, CSV, or other formats.
Advantages of This Approach:
- User-Friendly: Natural language instructions make it accessible to non-technical users.
- Highly Adaptable: Capable of handling unknown or dynamically loaded page structures.
- Performance Optimized: Concurrent processing, intelligent waits, and memory management for faster operation.
- Reliable: Comprehensive error handling and timeout management ensure consistent operation.
- Scalable: Configurable concurrency and batch processing handle large scraping jobs efficiently.
Smart Data Consolidation
The AI Web Scraper includes an intelligent data consolidation system that automatically improves data quality by removing duplicates and keeping only the most complete information.
How It Works
When scraping pages with multiple screenshots (especially with infinite scroll), the same items often appear multiple times with varying levels of completeness. For example:
Before Consolidation:
[{"title": "MacBook Pro", "price": "$1,999", "discount_price": ""},{"title": "MacBook Pro", "price": "$1,999", "discount_price": "$1,799"},{"title": "MacBook Pro", "price": "", "rating": "4.8"}]
After Smart Consolidation:
[{"title": "MacBook Pro", "price": "$1,999", "discount_price": "$1,799", "rating": "4.8"}]
Key Benefits
- Eliminates Duplicates: Automatically groups items that refer to the same entity
- Maximizes Completeness: Keeps the version with the most complete information
- Schema Agnostic: Works with any field combination you specify in your instructions
- Zero Configuration: No setup required; it automatically adapts to your data structure
- Cost Effective: Pure algorithmic approach with no additional API calls
Technical Details
The system uses advanced data distribution analysis to:
- Identify discriminative fields (e.g., product names, titles) for grouping
- Group similar items using the most reliable identifying characteristics
- Select the most complete version from each group based on non-empty field count
- Merge consistent values when all items in a group agree on a field value
This feature is enabled by default and highly recommended for cleaner, higher-quality datasets. You can disable it by setting enable_smart_consolidation: false in your input if needed.
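The core of the consolidation step can be sketched as group-then-merge. This is a simplified illustration, not the actor's exact algorithm: here the discriminative field is passed in explicitly as `key_field` (e.g. "title"), whereas the actor chooses it automatically via data distribution analysis.

```python
from collections import defaultdict

def consolidate(items, key_field):
    """Group records by a discriminative field, then merge each group.

    For every field, the first non-empty value wins, so the result is at
    least as complete as the most complete record in the group.
    """
    groups = defaultdict(list)
    for item in items:
        groups[item.get(key_field, "")].append(item)

    consolidated = []
    for group in groups.values():
        merged = {}
        for item in group:
            for field, value in item.items():
                if value and not merged.get(field):
                    merged[field] = value
        consolidated.append(merged)
    return consolidated

# The MacBook Pro example from above: three partial sightings of one product.
items = [
    {"title": "MacBook Pro", "price": "$1,999", "discount_price": ""},
    {"title": "MacBook Pro", "price": "$1,999", "discount_price": "$1,799"},
    {"title": "MacBook Pro", "price": "", "rating": "4.8"},
]
result = consolidate(items, "title")
```

Running this on the before-consolidation example yields the single merged record shown in the after-consolidation example: one "MacBook Pro" entry carrying the price, the discount price, and the rating.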
Integrations
This Actor integrates with other Apify platform components and other external services:
- Webhooks: Automatically notify you when the scraping is complete or send the data to another application.
- API: Control the Actor programmatically using the Apify API.
- Cloud Services: Use Apify integrations to automatically store the data in services like Google Sheets, Google Drive, Slack, and others.
Scrape Any Web Data You Need with This Dynamic Scraper
This AI Web Scraper is your one-stop solution for scraping any data you need. Whether it's a product name, a price, a news headline, or a financial metric, this actor adapts to extract it by analyzing the context of your instructions.
Not What You Need? Build Your Own!
If this actor doesn't exactly meet your needs, you can use one of the scraper templates available in Python, JavaScript, and TypeScript to get started or check out our open-source library Crawlee.
You can also request a custom scraping solution from us.
Your Feedback
Your feedback is valuable to us. If you have any suggestions or find a bug, please create an issue on the Actor's Issues tab in the Apify Console.
FAQ
How much does AI Web Scraper cost?
This actor uses Apify's pay-per-event pricing model. Apify also provides you with free monthly usage credits.
How can I use AI Web Scraper with the Apify API?
You can access the Apify API programmatically via RESTful HTTP endpoints or SDKs (apify-client NPM package for JavaScript, apify-client PyPI package for Python) to run, manage, and get the data out of any actor.
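As a rough sketch of the Python route, the snippet below builds the actor input described in the Input section and shows how a run would be started with the apify-client package. The actor id "username/ai-web-scraper" and the URL are placeholders; the network call is kept inside a function (and the import lazy) so nothing runs without a valid API token.

```python
def build_run_input(instructions, urls):
    # Assemble the actor input documented in the Input section above.
    return {
        "instructions": instructions,
        "start_urls": urls,
        "enable_smart_consolidation": True,
    }

def run_scraper(token, actor_id, run_input):
    # Requires `pip install apify-client` and a valid API token,
    # so this function is defined but not executed here.
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    # Stream the scraped items out of the run's default dataset.
    yield from client.dataset(run["defaultDatasetId"]).iterate_items()

run_input = build_run_input("product name, product price", ["https://example.com/product"])
# for item in run_scraper("MY-API-TOKEN", "username/ai-web-scraper", run_input):
#     print(item)
```

The same payload works with the raw RESTful endpoints or the JavaScript client; only the transport differs.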
Is it legal to scrape data using the AI Web Scraper?
This actor only extracts data that is publicly available. Please ensure that you comply with the terms and conditions of websites you scrape, and you are responsible for ensuring your compliance with data privacy regulations such as GDPR.