# AI Web Scraper — Structured Data From Any URL (`muhammadafzal/ai-web-extractor`) Actor

Extract structured data from any website using an LLM and your own field schema — no CSS selectors. Give it URLs and the fields you want; get clean JSON rows back. Works on blogs, job boards, product pages, listings, and more.

- **URL**: https://apify.com/muhammadafzal/ai-web-extractor.md
- **Developed by:** [Muhammad Afzal](https://apify.com/muhammadafzal) (community)
- **Categories:** AI, SEO tools, MCP servers
- **Stats:** 1 total user, 0 monthly users, 0.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

From $20.00 / 1,000 pages processed

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server that can be used as a website, API, or an MCP server.
The word "Actor" is always written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## AI Web Scraper — Structured Data From Any URL

Extract structured data from **any website** using an LLM and **your own field schema** — no CSS selectors, no per-site code. Give it URLs and the fields you want; get back clean JSON rows. Built for the messy long tail of sites that off-the-shelf scrapers don't cover: blogs, job boards, product pages, directories, listings, and more.

Export results, run via API, schedule and monitor runs, or integrate with other tools and AI agents.

---

### How it works

1. You provide one or more **URLs** and a list of **fields** (name + short description).
2. The Actor fetches each page, converts it to clean text, and asks an LLM to return JSON matching your fields.
3. You get one row per record (or one row per repeating item in **list mode**).

No selectors to maintain. When a site changes its HTML, the LLM still finds your fields.
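
The steps above can be sketched roughly as follows. This is not the Actor's actual code, which this README does not show: `build_extraction_prompt`, the prompt wording, and the assumption that a field without a `type` is treated as a string are all hypothetical and only illustrate how a field schema and truncated page text could be combined into one LLM request.

```python
def build_extraction_prompt(fields, page_text, max_content_chars=12000, list_mode=True):
    """Assemble an LLM prompt from a field schema and page text (hypothetical helper)."""
    # Truncate the page text, mirroring the role of `maxContentChars`.
    truncated = page_text[:max_content_chars]
    # One line per field: name, type (assumed to default to "string"), description.
    field_lines = [
        f'- {f["name"]} ({f.get("type", "string")}): {f["description"]}'
        for f in fields
    ]
    # List mode asks for one object per repeating item, otherwise one object per page.
    shape = (
        "a JSON array with one object per repeating item"
        if list_mode
        else "a single JSON object"
    )
    return (
        f"Extract the following fields and return {shape}.\n"
        + "\n".join(field_lines)
        + "\n\nPage text:\n"
        + truncated
    )


fields = [
    {"name": "text", "description": "the full quote text"},
    {"name": "tags", "description": "list of tag labels", "type": "array"},
]
prompt = build_extraction_prompt(fields, "x" * 50, max_content_chars=10)
print(prompt)
```

Because the field descriptions are passed to the model verbatim in a design like this, their wording directly shapes what comes back.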

---

### Input

| Field | Type | Description |
|-------|------|-------------|
| `startUrls` | array | The page URLs to extract from. |
| `fields` | array | What to extract — `[{ "name": "title", "description": "the product title", "type": "string" }]`. |
| `listMode` | boolean | ON = one row per repeating item on the page (grids, listings). OFF = one row per page. |
| `model` | string | OpenRouter model slug (default `openai/gpt-4o-mini`). |
| `maxItems` | integer | Cap on total output rows. |
| `maxCrawlPages` | integer | Cap on pages fetched. |
| `maxContentChars` | integer | How much page text to send to the model (cost control). |
| `proxyConfiguration` | object | Apify proxy settings (datacenter by default). |

#### Example input

```json
{
  "startUrls": [{ "url": "https://quotes.toscrape.com" }],
  "fields": [
    { "name": "text", "description": "the full quote text" },
    { "name": "author", "description": "who said it" },
    { "name": "tags", "description": "list of tag labels", "type": "array" }
  ],
  "listMode": true,
  "model": "openai/gpt-4o-mini"
}
````

***

### API key (required)

Extraction runs through **[OpenRouter](https://openrouter.ai)**. Set a single environment variable on the Actor (Console → Settings → Environment variables):

```
OPENROUTER_API_KEY = sk-or-...
```

Pick any model via the `model` input. Cheap models such as `openai/gpt-4o-mini` or `google/gemini-2.5-flash` handle most structured extraction well. You pay OpenRouter directly for model usage; the Actor's pay-per-event (PPE) charges cover the extraction layer.

***

### Output

Every row contains `source_url`, `scraped_at`, `error`, plus **your fields**:

```json
{
  "text": "The world as we have created it is a process of our thinking.",
  "author": "Albert Einstein",
  "tags": ["change", "deep-thoughts", "thinking", "world"],
  "source_url": "https://quotes.toscrape.com",
  "scraped_at": "2026-06-07T12:00:00.000Z",
  "error": null
}
```
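
Since every row carries an `error` field, downstream code can separate usable rows from failed ones before further processing. The sketch below assumes that a failed row has a non-null `error` value (the example above only shows the success case with `null`); `split_rows` is a hypothetical helper name, and the failed row in the usage example is made up for illustration.

```python
def split_rows(items):
    """Split dataset rows into (ok, failed) based on the `error` field."""
    ok, failed = [], []
    for item in items:
        # Assumption: a non-empty `error` value marks a failed extraction.
        if item.get("error"):
            failed.append(item)
        else:
            ok.append(item)
    return ok, failed


rows = [
    {"author": "Albert Einstein", "source_url": "https://quotes.toscrape.com", "error": None},
    {"source_url": "https://example.com/missing", "error": "fetch failed"},
]
ok_rows, failed_rows = split_rows(rows)
print(len(ok_rows), len(failed_rows))
```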

***

### Pricing (Pay Per Event)

| Event | When |
|-------|------|
| `actor-start` | Once per run. |
| `page-processed` | Each page successfully fetched and extracted (one LLM call). |

Failed pages (fetch error, model error, missing key) are **not charged**.
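
As a rough budgeting aid, the `page-processed` charge can be estimated from the listed "from $20.00 / 1,000" price. The sketch below covers only that event: the `actor-start` price is not stated in this document, OpenRouter model usage is billed separately, and the actual rate may be higher than the "from" price. The function name is hypothetical.

```python
def estimate_page_cost(pages_processed, price_per_1000_pages=20.00):
    """Estimate the `page-processed` charge in USD (excludes `actor-start` and model usage)."""
    return pages_processed * price_per_1000_pages / 1000


# 20 successfully processed pages at the listed starting price.
print(f"${estimate_page_cost(20):.2f}")
```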

***

### Use cases

- **RAG / AI pipelines** — turn arbitrary pages into clean structured records.
- **Long-tail sites** — scrape sites with no dedicated Actor.
- **Listings & directories** — pull every item from a results page with `listMode`.
- **Monitoring** — schedule extraction of the same fields over time.

***

### Tips

- Write clear field descriptions — they're the instructions the model follows.
- Use `listMode` for pages with many repeating records; turn it off for single detail pages.
- For JS-heavy sites where text is missing, increase `maxContentChars` or use a richer model.

# Actor input Schema

## `startUrls` (type: `array`):

The page URLs to extract data from. Use this when the user provides specific pages (article, product, listing, profile). Each URL is fetched and passed to the LLM for extraction.

## `fields` (type: `array`):

The data points to pull from each page. Each field has a name and a short description telling the model what to capture. Example: `[{"name":"title","description":"the product title"},{"name":"price","description":"price in USD as a number","type":"number"}]`.

## `listMode` (type: `boolean`):

Turn ON when each page contains MANY repeating items (e.g., a product grid, search results, a list of quotes); the Actor then returns one row per item. Turn OFF when each page is a single record (e.g., one product detail page).

## `model` (type: `string`):

Internal: OpenRouter model used for extraction. Managed by the actor; not user-editable.

## `maxItems` (type: `integer`):

Maximum number of output rows to produce across all pages (cost control).

## `maxCrawlPages` (type: `integer`):

Maximum number of pages to fetch and send to the LLM (cost control).

## `maxContentChars` (type: `integer`):

How much page text to send to the model (controls token cost). Longer = more complete but more expensive.

## `proxyConfiguration` (type: `object`):

Proxy settings for fetching pages. Datacenter proxy is enabled by default; switch to residential for sites that block datacenter IPs.

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://quotes.toscrape.com"
    }
  ],
  "fields": [
    {
      "name": "text",
      "description": "the full quote text"
    },
    {
      "name": "author",
      "description": "name of the person who said the quote"
    },
    {
      "name": "tags",
      "description": "list of tag labels on the quote",
      "type": "array"
    }
  ],
  "listMode": true,
  "model": "openai/gpt-4o-mini",
  "maxItems": 100,
  "maxCrawlPages": 20,
  "maxContentChars": 12000,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
```
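
Before starting a run, an input object like the one above can be checked locally. The sketch below is a hypothetical helper, not part of the Actor or the Apify client: the numeric bounds and the required `url` key are taken from the input schema in the OpenAPI specification of this Actor, while requiring `name` and `description` on every field is an assumption based on the field description rather than a documented rule.

```python
# Bounds copied from the Actor's input schema (minimum, maximum).
LIMITS = {
    "maxItems": (1, 10000),
    "maxCrawlPages": (1, 1000),
    "maxContentChars": (1000, 60000),
}


def validate_input(run_input):
    """Return a list of human-readable problems found in an input object."""
    problems = []
    for i, entry in enumerate(run_input.get("startUrls", [])):
        if not isinstance(entry, dict) or "url" not in entry:
            problems.append(f"startUrls[{i}] needs a 'url' key")
    for i, field in enumerate(run_input.get("fields", [])):
        # Assumption: every field should carry both a name and a description.
        missing = [key for key in ("name", "description") if key not in field]
        if missing:
            problems.append(f"fields[{i}] is missing: {', '.join(missing)}")
    for key, (low, high) in LIMITS.items():
        value = run_input.get(key)
        if value is None:
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not low <= value <= high:
            problems.append(f"{key} must be an integer between {low} and {high}")
    return problems


print(validate_input({"fields": [{"name": "text"}], "maxContentChars": 500}))
```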

# Actor output Schema

## `results` (type: `string`):

Link to the dataset containing all extracted rows.

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://quotes.toscrape.com"
        }
    ],
    "fields": [
        {
            "name": "text",
            "description": "the full quote text"
        },
        {
            "name": "author",
            "description": "name of the person who said the quote"
        },
        {
            "name": "tags",
            "description": "list of tag labels on the quote",
            "type": "array"
        }
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("muhammadafzal/ai-web-extractor").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": [{ "url": "https://quotes.toscrape.com" }],
    "fields": [
        {
            "name": "text",
            "description": "the full quote text",
        },
        {
            "name": "author",
            "description": "name of the person who said the quote",
        },
        {
            "name": "tags",
            "description": "list of tag labels on the quote",
            "type": "array",
        },
    ],
}

# Run the Actor and wait for it to finish
run = client.actor("muhammadafzal/ai-web-extractor").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://quotes.toscrape.com"
    }
  ],
  "fields": [
    {
      "name": "text",
      "description": "the full quote text"
    },
    {
      "name": "author",
      "description": "name of the person who said the quote"
    },
    {
      "name": "tags",
      "description": "list of tag labels on the quote",
      "type": "array"
    }
  ]
}' |
apify call muhammadafzal/ai-web-extractor --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=muhammadafzal/ai-web-extractor",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "AI Web Scraper — Structured Data From Any URL",
        "description": "Extract structured data from any website using an LLM and your own field schema — no CSS selectors. Give it URLs and the fields you want; get clean JSON rows back. Works on blogs, job boards, product pages, listings, and more.",
        "version": "1.0",
        "x-build-id": "SnzdAntHYgc5ef1qx"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/muhammadafzal~ai-web-extractor/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-muhammadafzal-ai-web-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/muhammadafzal~ai-web-extractor/runs": {
            "post": {
                "operationId": "runs-sync-muhammadafzal-ai-web-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/muhammadafzal~ai-web-extractor/run-sync": {
            "post": {
                "operationId": "run-sync-muhammadafzal-ai-web-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "startUrls": {
                        "title": "Start URLs",
                        "type": "array",
                        "description": "The page URLs to extract data from. Use this when the user provides specific pages (article, product, listing, profile). Each URL is fetched and passed to the LLM for extraction.",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "fields": {
                        "title": "Fields to Extract",
                        "type": "array",
                        "description": "The data points to pull from each page. Each field has a name and a short description telling the model what to capture. Example: [{\"name\":\"title\",\"description\":\"the product title\"},{\"name\":\"price\",\"description\":\"price in USD as a number\",\"type\":\"number\"}].",
                        "default": [
                            {
                                "name": "text",
                                "description": "the full quote text"
                            },
                            {
                                "name": "author",
                                "description": "name of the person who said the quote"
                            },
                            {
                                "name": "tags",
                                "description": "list of tag labels on the quote",
                                "type": "array"
                            }
                        ]
                    },
                    "listMode": {
                        "title": "Extract a List of Items per Page",
                        "type": "boolean",
                        "description": "Turn ON when each page contains MANY repeating items (e.g., a product grid, search results, a list of quotes) — the actor returns one row per item. Turn OFF when each page is a single record (e.g., one product detail page).",
                        "default": true
                    },
                    "model": {
                        "title": "Extraction Model",
                        "type": "string",
                        "description": "Internal: OpenRouter model used for extraction. Managed by the actor; not user-editable.",
                        "default": "openai/gpt-4o-mini"
                    },
                    "maxItems": {
                        "title": "Max Items",
                        "minimum": 1,
                        "maximum": 10000,
                        "type": "integer",
                        "description": "Maximum number of output rows to produce across all pages (cost control).",
                        "default": 100
                    },
                    "maxCrawlPages": {
                        "title": "Max Pages",
                        "minimum": 1,
                        "maximum": 1000,
                        "type": "integer",
                        "description": "Maximum number of pages to fetch and send to the LLM (cost control).",
                        "default": 20
                    },
                    "maxContentChars": {
                        "title": "Max Content Characters per Page",
                        "minimum": 1000,
                        "maximum": 60000,
                        "type": "integer",
                        "description": "How much page text to send to the model (controls token cost). Longer = more complete but more expensive.",
                        "default": 12000
                    },
                    "proxyConfiguration": {
                        "title": "Proxy Configuration",
                        "type": "object",
                        "description": "Proxy settings for fetching pages. Datacenter proxy is enabled by default; switch to residential for sites that block datacenter IPs.",
                        "default": {
                            "useApifyProxy": true
                        }
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
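
For projects that call the REST API directly instead of using a client library, the `run-sync-get-dataset-items` path from the specification above can be used with only the Python standard library. The sketch below builds the request without sending it; `build_run_request` is a hypothetical helper, and the response is assumed to be a JSON array of dataset items, as the endpoint summary suggests.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
RUN_SYNC_PATH = "/acts/muhammadafzal~ai-web-extractor/run-sync-get-dataset-items"


def build_run_request(token, run_input):
    """Build a POST request that runs the Actor and returns its dataset items."""
    # The specification passes the Apify token as a `token` query parameter.
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{API_BASE}{RUN_SYNC_PATH}?{query}",
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


run_input = {
    "startUrls": [{"url": "https://quotes.toscrape.com"}],
    "fields": [{"name": "text", "description": "the full quote text"}],
}
request = build_run_request("<YOUR_API_TOKEN>", run_input)
print(request.get_method(), request.full_url)
```

To send the request, pass it to `urllib.request.urlopen` and decode the response body with `json.load`. The call blocks until the run finishes, so set a generous timeout.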
