# Smart AI Web Scraper (`cockroachapi/smart-ai-web-scraper`) Actor

Unlock the power of Smart AI Web Scraper! Efficiently scrape dynamic content, simulate browser behavior, and extract targeted data.

- **URL**: https://apify.com/cockroachapi/smart-ai-web-scraper.md
- **Developed by:** [Cockroach API](https://apify.com/cockroachapi) (community)
- **Categories:** AI, Agents, Automation
- **Stats:** 17 total users, 4 monthly users, 100.0% runs succeeded, 3 bookmarks
- **User rating**: 5.00 out of 5 stars

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor itself is free to use; you pay only for Apify platform usage, which gets cheaper on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Smart AI Web Scraper

Unlock the power of **Smart AI Web Scraper**! Efficiently **scrape dynamic content**, **simulate browser behavior**, and **extract targeted data** without writing a single line of code.

### 🚀 Overview

The **Smart AI Web Scraper** is an intelligent, next-generation automation tool powered by Stagehand and built for seamless **AI data extraction**. Instead of relying on rigid CSS selectors or complex scripts, this **no-code web scraper** uses natural language processing (powered by large language models / LLMs) to navigate web pages, perform actions, and extract precisely what you need into structured JSON formats.

Whether you're looking to **scrape dynamic content** built with React/Vue, or you need to **simulate browser behavior** to bypass simple anti-bot measures, this **AI web scraper** handles it all efficiently.

### ✨ Features

- **Natural Language Actions:** Command the browser using plain English. E.g., *"Click the 'Load More' button"* or *"Scroll to the bottom of the page"*.
- **Intelligent Data Extraction:** Define the fields you want to extract (e.g., "Product Price", "Article Author"), and the underlying AI will locate and format the data.
- **Dynamic Content Handling:** Render and interact with the most complex, JavaScript-heavy single-page applications with ease, ensuring nothing is missed.
- **Structured JSON Output:** Perfect for **automation** pipelines, database ingestion, or integrating with your existing APIs.

### 💡 Actor Use Examples

Here are some ways you can use the **Smart AI Web Scraper** to **extract targeted data** effortlessly:

#### Example 1: E-commerce Product Extraction

- **Start URL:** `https://example-store.com/category/shoes`
- **Actions:**
  - `Click the 'Accept Cookies' button`
  - `Scroll down to load all products`
- **Extraction Fields:**
  - `productName` (String)
  - `price` (Number)
  - `inStock` (Boolean)
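
As a rough sketch, Example 1 could be turned into an Actor input object along these lines. The `build_input` helper and the `TYPE_LABELS` mapping are hypothetical and only for illustration: the mapping assumes that the informal labels used in these examples (String, Number, Boolean) correspond to the `text`, `number`, and `boolean` values of `dataType` listed in the input schema.

```python
from typing import Any, Dict, List

# Hypothetical mapping from the informal labels used in the examples
# to the `dataType` values listed in the Actor's input schema.
TYPE_LABELS = {
    "String": "text",
    "Number": "number",
    "Boolean": "boolean",
    "Array": "array",
}


def build_input(start_url: str, actions: List[str], fields: Dict[str, str]) -> Dict[str, Any]:
    """Assemble an Actor input object from plain-language actions and labeled fields."""
    return {
        "startUrl": start_url,
        # Each action is an object whose only required key is "action".
        "actions": [{"action": text} for text in actions],
        "fields": [
            {"fieldName": name, "dataType": TYPE_LABELS[label]}
            for name, label in fields.items()
        ],
    }


example_1 = build_input(
    "https://example-store.com/category/shoes",
    ["Click the 'Accept Cookies' button", "Scroll down to load all products"],
    {"productName": "String", "price": "Number", "inStock": "Boolean"},
)
```

The resulting dictionary can be passed as the run input in the client examples in the API section.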

#### Example 2: News Article Scraping

- **Start URL:** `https://news-site.com/latest`
- **Actions:**
  - `Click on the first article link`
- **Extraction Fields:**
  - `headline` (String)
  - `author` (String)
  - `publishedDate` (String)
  - `articleBody` (String)

#### Example 3: Real Estate Listings

- **Start URL:** `https://real-estate-site.com/search?city=NY`
- **Actions:**
  - `Click the 'Next Page' pagination button` (Repeated)
- **Extraction Fields:**
  - `propertyAddress` (String)
  - `price` (String)
  - `numberOfBedrooms` (Number)

### 🛠️ How it Works

This **LLM scraper** integrates cutting-edge AI with reliable, self-healing browser automation. Instead of hardcoded rules, the AI "sees" the page and navigates like a human, ensuring high accuracy and stability.

Forget constantly breaking scrapers due to minor UI updates. Our **Smart AI Web Scraper** adapts to visual and structural changes dynamically, ensuring your **automation** workflows remain uninterrupted.

### 📦 Output Format

The Actor outputs clean, validated **JSON** data directly into your Apify dataset. Each run generates structured results that match your requested fields.

### ⚡ Standby Mode (Real-time HTTP API)

This Actor supports **Standby Mode**, which allows it to run continuously as an HTTP server. This eliminates the container startup time, allowing you to extract data in real-time via REST API requests.

#### How to use Standby Mode

1. Deploy the Actor to the Apify Platform.
2. In the Apify Console, go to the Actor's **Settings** and ensure **Standby mode** is enabled (it should be by default).
3. Start the Actor in Standby mode.
4. Send an HTTP `POST` request to the Standby URL provided in the Apify Console.

#### Example Request

```bash
curl -X POST https://<STANDBY_URL> \
  -H "Content-Type: application/json" \
  -d '{
    "startUrl": "https://example.com",
    "actions": [
      {
        "action": "click the accept cookies button",
        "waitBeforeAction": 1,
        "waitAfterAction": 2
      }
    ],
    "fields": [
      {
        "fieldName": "title",
        "fieldDescription": "The main heading of the page",
        "dataType": "text"
      }
    ],
    "proxyConfiguration": {
      "useApifyProxy": true
    }
  }'
````

#### Example Response

```json
{
  "title": "Example Domain"
}
```

The response body contains the structured JSON data extracted by the AI. Because the Actor is already running, no container startup time is added to the request.
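
For Python callers, the same request could be sketched with the standard library only. This is a minimal illustration under stated assumptions, not the Actor's official client: the function names are hypothetical, the Standby URL is a placeholder you would copy from the Apify Console, the default timeout is arbitrary, and sending the API token as a `Bearer` header is an assumption about how your Standby endpoint is secured.

```python
import json
import urllib.request


def build_standby_request(standby_url: str, token: str, payload: dict) -> urllib.request.Request:
    """Create (but do not send) a POST request carrying the Actor input as JSON."""
    return urllib.request.Request(
        standby_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the endpoint accepts the Apify API token as a Bearer header.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def extract(standby_url: str, token: str, payload: dict, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON body of the response."""
    request = build_standby_request(standby_url, token, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = {
    "startUrl": "https://example.com",
    "fields": [{"fieldName": "title", "dataType": "text"}],
}
# Placeholder URL: replace it with the Standby URL shown in the Apify Console.
request = build_standby_request("https://my-standby-url.example", "<YOUR_API_TOKEN>", payload)
```

Calling `extract(...)` with a real Standby URL and token would perform the network request; building the request separately keeps the payload easy to inspect and test.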

# Actor input Schema

## `startUrl` (type: `string`):

URL for the scraper to visit

## `actions` (type: `array`):

Define a sequence of actions to perform on the page before extracting data.

## `fields` (type: `array`):

Define the fields you want to extract with their data types

## `proxyConfiguration` (type: `object`):

Proxy settings to avoid blocking. Defaults to US Residential proxies.

## Actor input object example

```json
{
  "startUrl": "https://www.scrapethissite.com/pages/",
  "actions": [
    {
      "action": "Click heading: Hockey Teams: Forms, Searching and Pagination",
      "waitAfterAction": 0.5,
      "waitBeforeAction": 0.5
    }
  ],
  "fields": [
    {
      "dataType": "text",
      "fieldName": "Team Name"
    },
    {
      "dataType": "number",
      "fieldName": "Year"
    },
    {
      "dataType": "number",
      "fieldName": "Wins"
    },
    {
      "dataType": "number",
      "fieldName": "Losses"
    },
    {
      "dataType": "number",
      "fieldName": "Win %"
    },
    {
      "dataType": "number",
      "fieldName": "Goals Against"
    },
    {
      "dataType": "number",
      "fieldName": "Goals For"
    }
  ],
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}
```
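
Before starting a run, it can help to check an input object locally. The sketch below mirrors the constraints stated in the OpenAPI specification later in this document (required `startUrl` and `fields`, a required `action` per action, wait times between 0 and 300 seconds, and the `dataType` enum). The `validate_input` function is a hypothetical helper, not part of any Apify library, and the platform may apply further checks that are not reproduced here.

```python
from typing import List

ALLOWED_DATA_TYPES = {"text", "number", "array", "boolean"}


def validate_input(actor_input: dict) -> List[str]:
    """Return a list of problems found; an empty list means these checks passed."""
    problems = []
    # Top-level required properties.
    for key in ("startUrl", "fields"):
        if key not in actor_input:
            problems.append(f"missing required property: {key}")
    # Each action needs an "action" string; waits must stay within 0..300 seconds.
    for i, action in enumerate(actor_input.get("actions", [])):
        if "action" not in action:
            problems.append(f"actions[{i}]: missing 'action'")
        for wait_key in ("waitBeforeAction", "waitAfterAction"):
            wait = action.get(wait_key, 0.5)
            if not 0 <= wait <= 300:
                problems.append(f"actions[{i}].{wait_key}: must be between 0 and 300")
    # Each field needs a name and a data type from the enum.
    for i, field in enumerate(actor_input.get("fields", [])):
        if "fieldName" not in field:
            problems.append(f"fields[{i}]: missing 'fieldName'")
        if field.get("dataType") not in ALLOWED_DATA_TYPES:
            problems.append(f"fields[{i}].dataType: expected one of {sorted(ALLOWED_DATA_TYPES)}")
    return problems


good = {
    "startUrl": "https://www.scrapethissite.com/pages/",
    "fields": [{"fieldName": "Year", "dataType": "number"}],
}
bad = {"fields": [{"fieldName": "title", "dataType": "string"}]}
```

Here `validate_input(good)` returns an empty list, while `validate_input(bad)` reports the missing `startUrl` and the unsupported `dataType`.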

# Actor output Schema

## `results` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrl": "https://www.scrapethissite.com/pages/",
    "actions": [
        {
            "action": "Click heading: Hockey Teams: Forms, Searching and Pagination",
            "waitAfterAction": 0.5,
            "waitBeforeAction": 0.5
        }
    ],
    "fields": [
        {
            "dataType": "text",
            "fieldName": "Team Name"
        },
        {
            "dataType": "number",
            "fieldName": "Year"
        },
        {
            "dataType": "number",
            "fieldName": "Wins"
        },
        {
            "dataType": "number",
            "fieldName": "Losses"
        },
        {
            "dataType": "number",
            "fieldName": "Win %"
        },
        {
            "dataType": "number",
            "fieldName": "Goals Against"
        },
        {
            "dataType": "number",
            "fieldName": "Goals For"
        }
    ],
    "proxyConfiguration": {
        "useApifyProxy": false
    }
};

// Run the Actor and wait for it to finish
const run = await client.actor("cockroachapi/smart-ai-web-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrl": "https://www.scrapethissite.com/pages/",
    "actions": [{
            "action": "Click heading: Hockey Teams: Forms, Searching and Pagination",
            "waitAfterAction": 0.5,
            "waitBeforeAction": 0.5,
        }],
    "fields": [
        {
            "dataType": "text",
            "fieldName": "Team Name",
        },
        {
            "dataType": "number",
            "fieldName": "Year",
        },
        {
            "dataType": "number",
            "fieldName": "Wins",
        },
        {
            "dataType": "number",
            "fieldName": "Losses",
        },
        {
            "dataType": "number",
            "fieldName": "Win %",
        },
        {
            "dataType": "number",
            "fieldName": "Goals Against",
        },
        {
            "dataType": "number",
            "fieldName": "Goals For",
        },
    ],
    "proxyConfiguration": { "useApifyProxy": False },
}

# Run the Actor and wait for it to finish
run = client.actor("cockroachapi/smart-ai-web-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrl": "https://www.scrapethissite.com/pages/",
  "actions": [
    {
      "action": "Click heading: Hockey Teams: Forms, Searching and Pagination",
      "waitAfterAction": 0.5,
      "waitBeforeAction": 0.5
    }
  ],
  "fields": [
    {
      "dataType": "text",
      "fieldName": "Team Name"
    },
    {
      "dataType": "number",
      "fieldName": "Year"
    },
    {
      "dataType": "number",
      "fieldName": "Wins"
    },
    {
      "dataType": "number",
      "fieldName": "Losses"
    },
    {
      "dataType": "number",
      "fieldName": "Win %"
    },
    {
      "dataType": "number",
      "fieldName": "Goals Against"
    },
    {
      "dataType": "number",
      "fieldName": "Goals For"
    }
  ],
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}' |
apify call cockroachapi/smart-ai-web-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=cockroachapi/smart-ai-web-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Smart AI Web Scraper",
        "description": "Unlock the power of Smart AI Web Scraper! Efficiently scrape dynamic content, simulate browser behavior, and extract targeted data.",
        "version": "1.0",
        "x-build-id": "to2YJz4NH3N6v1noH"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/cockroachapi~smart-ai-web-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-cockroachapi-smart-ai-web-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/cockroachapi~smart-ai-web-scraper/runs": {
            "post": {
                "operationId": "runs-sync-cockroachapi-smart-ai-web-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/cockroachapi~smart-ai-web-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-cockroachapi-smart-ai-web-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrl",
                    "fields"
                ],
                "properties": {
                    "startUrl": {
                        "title": "Start URL",
                        "type": "string",
                        "description": "URL for the scraper to visit"
                    },
                    "actions": {
                        "title": "Actions (Pre-extraction)",
                        "type": "array",
                        "description": "Define a sequence of actions to perform on the page before extracting data.",
                        "items": {
                            "type": "object",
                            "properties": {
                                "waitBeforeAction": {
                                    "title": "Wait for Timeout (in seconds) before the action",
                                    "type": "number",
                                    "description": "Number of seconds to wait before performing this action. Defaults to 0.5 seconds.",
                                    "prefill": 0.5,
                                    "minimum": 0,
                                    "maximum": 300
                                },
                                "action": {
                                    "title": "Action (Natural Language)",
                                    "type": "string",
                                    "description": "Describe the action to perform in natural language (e.g., 'Click the \"Load More\" button', 'Scroll to the bottom of the page', 'Hover over the product image').",
                                    "editor": "textfield"
                                },
                                "waitAfterAction": {
                                    "title": "Wait for Timeout (in seconds) after the action",
                                    "type": "number",
                                    "description": "Number of seconds to wait after performing this action. Defaults to 0.5 seconds.",
                                    "prefill": 0.5,
                                    "minimum": 0,
                                    "maximum": 300
                                }
                            },
                            "required": [
                                "action"
                            ]
                        }
                    },
                    "fields": {
                        "title": "Fields to Extract",
                        "type": "array",
                        "description": "Define the fields you want to extract with their data types",
                        "items": {
                            "type": "object",
                            "properties": {
                                "fieldName": {
                                    "title": "Field Name",
                                    "type": "string",
                                    "description": "Name of the field to extract",
                                    "editor": "textfield"
                                },
                                "dataType": {
                                    "title": "Data Type",
                                    "type": "string",
                                    "description": "Type of data for this field",
                                    "editor": "select",
                                    "enum": [
                                        "text",
                                        "number",
                                        "array",
                                        "boolean"
                                    ],
                                    "enumTitles": [
                                        "Text",
                                        "Number",
                                        "Array",
                                        "Boolean"
                                    ],
                                    "default": "text",
                                    "prefill": "text"
                                }
                            },
                            "required": [
                                "fieldName",
                                "dataType"
                            ]
                        }
                    },
                    "proxyConfiguration": {
                        "title": "Proxy Configuration",
                        "type": "object",
                        "description": "Proxy settings to avoid blocking. Defaults to US Residential proxies."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
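
If you prefer not to install a client library, the `run-sync-get-dataset-items` path from the specification above can be called directly. The following is a minimal standard-library sketch: the function names and the default timeout are hypothetical, and because the specification does not describe the response body, treating it as JSON is an assumption.

```python
import json
import urllib.parse
import urllib.request

# Server URL and path taken from the OpenAPI specification above.
API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/cockroachapi~smart-ai-web-scraper/run-sync-get-dataset-items"


def build_run_sync_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Create (but do not send) the POST request; the token goes in the query string."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        f"{API_BASE}{ACTOR_PATH}?{query}",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_and_get_items(token: str, actor_input: dict, timeout: float = 300.0):
    """Run the Actor synchronously and return the decoded response body."""
    request = build_run_sync_request(token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Assumption: the dataset items come back as JSON.
        return json.loads(response.read().decode("utf-8"))
```

Long-running extractions may exceed a synchronous request; in that case the `/runs` path, which returns information about the started run, is likely the better fit.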
