# AI Universal Scraper — Extract Anything from Any Page (`oriented_wallpaper/ai-universal-scraper`) Actor

Give any URL + the fields you want; an LLM (OpenAI or Anthropic) extracts clean structured JSON from the page. Works on any site.

- **URL**: https://apify.com/oriented_wallpaper/ai-universal-scraper.md
- **Developed by:** [Flash Scrape](https://apify.com/oriented_wallpaper) (community)
- **Categories:** Automation, AI, Agents
- **Stats:** 1 total users, 0 monthly users, 0.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage, which gets cheaper the higher subscription plan you have.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows (PowerShell)
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## 🤖 AI Universal Scraper — Extract Anything from Any Page

**Point it at any URL, tell it what you want, get clean structured JSON.** No selectors, no per-site setup, no brittle parsers. An LLM reads the page like a human and returns exactly the fields you asked for — so it works on sites no traditional scraper anticipates.

---

### ✨ Why it's different

Traditional scrapers break when a site changes its HTML. This one **understands the content**:

- 🌍 **Works on any page** — products, articles, listings, profiles, docs
- 🧠 **You describe the data in plain words** — `title`, `price`, `author`, `rating`… or a full instruction
- 📦 **Clean JSON out** — one object per page, or an array when a page lists many items
- 🔌 **Your model, your key** — bring **OpenAI** or **Anthropic**; you control cost & quality
- 💸 **Cost control** — cap how much page text is sent to the model

### 🎯 Use cases

- Scrape a competitor's product page → `name`, `price`, `availability`
- Turn any article into `{title, author, summary, date}`
- Pull every item from a listing page as structured rows
- Build datasets from sites that have no API and no existing scraper

### ⚙️ Input

| Field | Description |
|---|---|
| **URLs to scrape** | One or more page URLs |
| **Fields to extract** | The data points you want (e.g. `title`, `price`, `rating`) |
| **Extra instructions** | Optional plain-English guidance |
| **LLM provider** | `openai` or `anthropic` |
| **LLM API key** | Your key (stored encrypted) |
| **Model** | Optional override (defaults: `gpt-4o-mini` / `claude-haiku-4-5`) |

```json
{
  "startUrls": ["https://example.com/product/123"],
  "fields": ["name", "price", "rating", "in_stock"],
  "llmProvider": "openai",
  "apiKey": "sk-...",
  "maxChars": 12000
}
````

### 📤 Output (sample)

```json
{
  "url": "https://example.com/product/123",
  "name": "Wireless Headphones X200",
  "price": "$89.99",
  "rating": 4.6,
  "in_stock": true
}
```

When a page lists multiple items, you get one row per item. Export as **JSON, CSV, or Excel**.
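
The one-object-or-many behaviour described above can be sketched roughly as follows. This is an illustration only, not the Actor's actual code: `to_rows` is a hypothetical helper, and attaching the page `url` to every row is an assumption based on the sample output.

```python
from typing import Any


def to_rows(url: str, extracted: Any) -> list[dict]:
    """Turn one page's extraction result into a list of dataset rows."""
    # A single object becomes one row; an array becomes one row per item.
    items = extracted if isinstance(extracted, list) else [extracted]
    rows = []
    for item in items:
        if not isinstance(item, dict):
            # Skip anything that is not a JSON object (e.g. a stray string).
            continue
        # Assumption: each row carries the source URL, as in the sample output.
        rows.append({"url": url, **item})
    return rows
```

Called with a single object it returns a one-element list, and called with an array of objects it returns one row per object, which is the shape that exports cleanly to CSV or Excel.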

### ❓ FAQ

**Do I need an API key?** Yes, your own OpenAI or Anthropic key. You pay the model provider directly for tokens; this Actor handles the fetching, cleaning, prompting and parsing.

**How is cost controlled?** Only the first `maxChars` of cleaned page text is sent to the model (default 12,000). Lower it for cheaper runs, raise it for long pages.
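
As a rough illustration of the `maxChars` cap, the sketch below trims cleaned text to a character budget. `clip_page_text` is a hypothetical name, and the whitespace collapsing step is an assumption about what "cleaned" might involve; the Actor's real cleaning logic is not documented here.

```python
def clip_page_text(text: str, max_chars: int = 12000) -> str:
    """Keep only the first max_chars characters of cleaned page text."""
    # Assumption: collapse runs of whitespace so the character budget
    # is spent on content rather than on layout.
    cleaned = " ".join(text.split())
    # Slicing never fails when the text is shorter than the budget.
    return cleaned[:max_chars]
```

Because the cut is by position, fields that appear late on a long page can fall outside the budget, which is why raising `maxChars` helps on long pages.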

**Is it reliable?** The page is cleaned to text first, the model is asked for strict JSON, and the output is parsed defensively (handles code fences / stray text).
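
One common way to parse model replies defensively is sketched below. It is an assumption about the general technique, not the Actor's own parser: `parse_model_json` is a hypothetical helper that strips a Markdown code fence if present, tries a plain parse, then scans for the first decodable JSON value.

```python
import json
import re
from typing import Any

# Built from single backticks so this listing contains no literal fence.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_model_json(raw: str) -> Any:
    """Parse JSON from a model reply that may include fences or stray text."""
    text = raw.strip()
    # Case 1: the reply is wrapped in a Markdown code fence.
    fenced = FENCE_RE.search(text)
    if fenced:
        text = fenced.group(1).strip()
    # Case 2: what remains is already valid JSON.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Case 3: scan for the first position where a JSON object or array decodes.
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char in "{[":
            try:
                value, _end = decoder.raw_decode(text, index)
                return value
            except json.JSONDecodeError:
                continue
    raise ValueError("no JSON object or array found in model reply")
```

Raising `ValueError` when nothing decodes lets the caller decide whether to retry the model call or record the page as failed.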

***

Built by [Zakariae Belfkih](https://www.linkedin.com/in/zakariae-belfkih) · integration, automation & AI developer.

# Actor input Schema

## `startUrls` (type: `array`):

One or more page URLs to extract data from.

## `fields` (type: `array`):

The data points you want, e.g. "title", "price", "author", "rating". Leave empty if you use Instructions instead.

## `instructions` (type: `string`):

Natural-language guidance, e.g. "Extract every news item as a separate object with title and points."

## `llmProvider` (type: `string`):

Which model provider to use.

## `apiKey` (type: `string`):

Your OpenAI or Anthropic API key. Stored encrypted; used only to call the model.

## `model` (type: `string`):

Override the model. Defaults: OpenAI → gpt-4o-mini, Anthropic → claude-haiku-4-5.

## `maxChars` (type: `integer`):

How much page text to send to the model (controls cost). Default 12000.

## Actor input object example

```json
{
  "startUrls": [
    "https://example.com/product/123"
  ],
  "fields": [
    "title",
    "summary",
    "author"
  ],
  "llmProvider": "openai",
  "maxChars": 12000
}
```
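
Before starting a run, it can help to check an input object locally against the rules listed in the schema. The sketch below is a hypothetical client-side pre-check (`prepare_input` is not part of any Apify library). The required fields, the provider values, the defaults, and the `maxChars` minimum of 1000 come from the schema in this document; requiring either `fields` or `instructions` is an assumption inferred from the `fields` description.

```python
DEFAULT_MODELS = {"openai": "gpt-4o-mini", "anthropic": "claude-haiku-4-5"}


def prepare_input(raw: dict) -> dict:
    """Check required fields and fill in the documented defaults."""
    # The schema lists startUrls and apiKey as required.
    for key in ("startUrls", "apiKey"):
        if not raw.get(key):
            raise ValueError(f"missing required input field: {key}")
    provider = raw.get("llmProvider", "openai")
    if provider not in DEFAULT_MODELS:
        raise ValueError(f"llmProvider must be one of {sorted(DEFAULT_MODELS)}")
    max_chars = raw.get("maxChars", 12000)
    if not isinstance(max_chars, int) or max_chars < 1000:
        raise ValueError("maxChars must be an integer of at least 1000")
    # Assumption: at least one of fields / instructions must be given.
    if not raw.get("fields") and not raw.get("instructions"):
        raise ValueError("provide either fields or instructions")
    return {
        **raw,
        "llmProvider": provider,
        "model": raw.get("model") or DEFAULT_MODELS[provider],
        "maxChars": max_chars,
    }
```

Passing the example object above without an `apiKey` raises `ValueError`, which mirrors the schema's `required` list.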

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
// "apiKey" is required by the input schema; '<YOUR_LLM_API_KEY>' is a placeholder
// for your own OpenAI or Anthropic key
const input = {
    "startUrls": [
        "https://news.ycombinator.com"
    ],
    "fields": [
        "title",
        "summary",
        "author"
    ],
    "apiKey": "<YOUR_LLM_API_KEY>"
};

// Run the Actor and wait for it to finish
const run = await client.actor("oriented_wallpaper/ai-universal-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

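# Prepare the Actor input
# "apiKey" is required by the input schema; "<YOUR_LLM_API_KEY>" is a placeholder
# for your own OpenAI or Anthropic key.
run_input = {
    "startUrls": ["https://news.ycombinator.com"],
    "fields": [
        "title",
        "summary",
        "author",
    ],
    "apiKey": "<YOUR_LLM_API_KEY>",
}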

# Run the Actor and wait for it to finish
run = client.actor("oriented_wallpaper/ai-universal-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    "https://news.ycombinator.com"
  ],
  "fields": [
    "title",
    "summary",
    "author"
  ],
  "apiKey": "<YOUR_LLM_API_KEY>"
}' |
apify call oriented_wallpaper/ai-universal-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=oriented_wallpaper/ai-universal-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "AI Universal Scraper — Extract Anything from Any Page",
        "description": "Give any URL + the fields you want; an LLM (OpenAI or Anthropic) extracts clean structured JSON from the page. Works on any site.",
        "version": "0.1",
        "x-build-id": "kaUkegc70lrfgDO18"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/oriented_wallpaper~ai-universal-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-oriented_wallpaper-ai-universal-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/oriented_wallpaper~ai-universal-scraper/runs": {
            "post": {
                "operationId": "runs-sync-oriented_wallpaper-ai-universal-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/oriented_wallpaper~ai-universal-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-oriented_wallpaper-ai-universal-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrls",
                    "apiKey"
                ],
                "properties": {
                    "startUrls": {
                        "title": "URLs to scrape",
                        "type": "array",
                        "description": "One or more page URLs to extract data from.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "fields": {
                        "title": "Fields to extract",
                        "type": "array",
                        "description": "The data points you want, e.g. \"title\", \"price\", \"author\", \"rating\". Leave empty if you use Instructions instead.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "instructions": {
                        "title": "Extra instructions (optional)",
                        "type": "string",
                        "description": "Natural-language guidance, e.g. \"Extract every news item as a separate object with title and points.\""
                    },
                    "llmProvider": {
                        "title": "LLM provider",
                        "enum": [
                            "openai",
                            "anthropic"
                        ],
                        "type": "string",
                        "description": "Which model provider to use.",
                        "default": "openai"
                    },
                    "apiKey": {
                        "title": "LLM API key",
                        "type": "string",
                        "description": "Your OpenAI or Anthropic API key. Stored encrypted; used only to call the model."
                    },
                    "model": {
                        "title": "Model (optional)",
                        "type": "string",
                        "description": "Override the model. Defaults: OpenAI → gpt-4o-mini, Anthropic → claude-haiku-4-5."
                    },
                    "maxChars": {
                        "title": "Max characters per page",
                        "minimum": 1000,
                        "type": "integer",
                        "description": "How much page text to send to the model (controls cost). Default 12000.",
                        "default": 12000
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
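
For projects without a client library, the `run-sync-get-dataset-items` path from the specification above can be called with the Python standard library alone. This is a minimal sketch under a few assumptions: `build_request` and `run_actor` are hypothetical helper names, the 300-second timeout is arbitrary, and the response is assumed to be a JSON array of dataset items.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/oriented_wallpaper~ai-universal-scraper/run-sync-get-dataset-items"


def build_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Build the POST request for the synchronous run endpoint."""
    # The specification passes the Apify token as a query parameter.
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{API_BASE}{ACTOR_PATH}?{query}",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_actor(token: str, actor_input: dict, timeout: float = 300.0) -> list:
    """Send the request and return the parsed response body."""
    # Assumption: the endpoint responds with a JSON array of dataset items.
    request = build_request(token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to inspect the URL and body without spending platform or model credits.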
