# AI Web Scraper - Extract Any Website by Example (`oriented_wallpaper/ai-web-scraper`) Actor

AI web scraper that extracts any website by example — paste a URL and a value you see on the page (a price, title, or name) and it learns the HTML pattern and pulls every similar item as structured rows. No CSS selectors, no API key. Export CSV/JSON/Excel.

- **URL**: https://apify.com/oriented\_wallpaper/ai-web-scraper.md
- **Developed by:** [Flash Scrape](https://apify.com/oriented_wallpaper) (community)
- **Categories:** AI, Developer tools, Automation
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

from $0.000035 / actor start

This Actor is paid per event. You are not charged for Apify platform usage; instead, you pay a fixed price for specific events.
Since this Actor supports Apify Store discounts, the price decreases on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, built for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# MacOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see the Apify documentation as a [Markdown index](https://docs.apify.com/llms.txt) or as [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## AI Web Scraper - Extract Any Website by Example

A **no code web scraper** that turns **any website** into clean, structured data — without writing a single CSS selector, XPath, or line of code. This is **web scraping by example**: paste a URL, paste one or two values you can actually *see* on the page (a price, a title, a name), and the scraper learns the surrounding HTML pattern and pulls **every similar item** into rows you can export to CSV, JSON, or Excel. No API key. No fragile selectors to maintain.

If you have ever wanted to **scrape a website without coding**, this is the simplest way to do it: show the Actor what you want by example, and it figures out the rest.

### How to scrape any website by example (3 steps)

1. **Paste the page URL.** Use a list, category, or search page that has repeating items (products, quotes, listings, search results).
2. **Paste example values you can see on the page** — one per line. Optionally label them as `label: value` (for example `author: Albert Einstein`) so your output columns get clean names.
3. **Run it.** The scraper finds each example in the HTML, learns the wrapping tag and class, extracts every element matching that pattern, and zips the fields into structured rows.

That is the whole workflow. No browser extension to install, no point-and-click recorder that breaks on the next layout change, and no selector knowledge required. You teach the scraper by example and it generalizes the rule across the entire page.
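
The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Actor's actual code, which this README does not show. It assumes well-formed HTML, uses the standard-library `html.parser`, looks only at each element's own direct text, and builds the selector from the tag plus its first class. The names `TextCollector`, `learn_selector`, and `extract_matching` are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TextCollector(HTMLParser):
    """Records a (selector, text) pair for every element, selector = tag.class."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements as [selector, text_parts]
        self.found = []  # (selector, own_text) for every closed element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        selector = f"{tag}.{classes[0]}" if classes else tag
        self.stack.append([selector, []])

    def handle_data(self, data):
        # Text is credited to the innermost open element only.
        if self.stack:
            self.stack[-1][1].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        selector, parts = self.stack.pop()
        self.found.append((selector, "".join(parts).strip()))


def elements(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.found


def learn_selector(html, example):
    """Step 3a: find the example and return the wrapping tag.class."""
    for selector, text in elements(html):
        if example in text:
            return selector
    return None


def extract_matching(html, selector):
    """Step 3b: pull the text of every element matching the learned selector."""
    return [text for sel, text in elements(html) if sel == selector]


# Made-up page fragment for illustration.
PAGE = """
<div class="quote"><span class="text">Thinking changes the world.</span>
  <small class="author">Albert Einstein</small></div>
<div class="quote"><span class="text">Choices show what we are.</span>
  <small class="author">J.K. Rowling</small></div>
"""

rules = {
    "quote": learn_selector(PAGE, "Thinking changes"),
    "author": learn_selector(PAGE, "Albert Einstein"),
}
columns = {label: extract_matching(PAGE, sel) for label, sel in rules.items()}
# Step 3c: zip the per-field columns into one row per item.
rows = [dict(zip(columns, values)) for values in zip(*columns.values())]
print(rules)
print(rows)
```

Zipping columns by position only works when every item on the page has every field; a production scraper would need to group fields by their shared container instead.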

### What makes this different

Most "by-example" scrapers give you values and leave you guessing whether they're right. This one shows its work:

- **Confidence score on every row** — `_confidence` (0-1) plus `_fieldsFilled` tells you how reliable each extraction is, so you can trust or filter the output instead of eyeballing it.
- **The learned selector, exposed** — the run saves an `EXTRACTION_SCHEMA` (and logs it) showing the exact `tag.class` selector, detected type, match count, and confidence it inferred for each field. Full transparency, easy debugging.
- **Type detection + optional normalization** — it tags each field as number / price / percent / date / text and, with `normalizeValues` on, converts prices and numbers into real numbers in the output.
- **Multiple examples per field** — give the same label on several lines and the scraper uses them together for a more robust pattern (and higher confidence).
- **Pagination follow** — set `maxPages` and it follows `rel="next"` / "Next" links across pages automatically.
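
To make the type detection and normalization bullet concrete, here is a minimal sketch of how values might be classified and converted. The regular expressions, the supported currency symbols, the ISO-only date check, and the function names `detect_type` and `normalize` are all assumptions for illustration; the Actor's real rules may differ.

```python
import re

# Illustrative patterns only; the Actor's own detection rules are not documented here.
PRICE = re.compile(r"^[$€£]\s?\d[\d,]*(\.\d+)?$")
PERCENT = re.compile(r"^-?\d+(\.\d+)?\s?%$")
DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # ISO dates only, for brevity
NUMBER = re.compile(r"^-?\d[\d,]*(\.\d+)?$")


def detect_type(value):
    """Classify a scraped string as price, percent, date, number, or text."""
    value = value.strip()
    if PRICE.match(value):
        return "price"
    if PERCENT.match(value):
        return "percent"
    if DATE.match(value):
        return "date"
    if NUMBER.match(value):
        return "number"
    return "text"


def normalize(value):
    """Convert price / percent / number strings to float; leave others unchanged."""
    if detect_type(value) in {"price", "percent", "number"}:
        return float(re.sub(r"[^\d.\-]", "", value))
    return value


for raw in ["$1,299.00", "15%", "2024-05-01", "42", "Albert Einstein"]:
    print(raw, detect_type(raw), normalize(raw))
```

With `normalizeValues` left at its default of `false`, the output would keep the original strings, which corresponds to skipping the `normalize` step here.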

### What data you get

Every run returns one **row per extracted item**. Each row contains:

- `sourceUrl` — the page the item was extracted from.
- One column **per example you provided**, named by your label (e.g. `quote`, `author`, `price`, `title`).
- With metadata on (default): `_confidence`, `_fieldsFilled`, `_types`, and `_page`.

Because columns come from *your* labels, the output schema matches exactly what you asked for — no junk fields, no nested mess. Export the dataset to **CSV, JSON, or Excel** straight from the run.
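
If you fetch rows through the API instead of downloading an export from the run, a flat CSV can be produced locally. The sketch below assumes rows shaped like the ones described above; `rows_to_csv` is a hypothetical helper, and serializing the nested `_types` object as a JSON string is just one possible choice.

```python
import csv
import io
import json


def rows_to_csv(rows):
    """Write rows to CSV text, using the union of keys as the header."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Nested values (such as _types) are stored as JSON strings in one cell.
        writer.writerow({
            key: json.dumps(value) if isinstance(value, (dict, list)) else value
            for key, value in row.items()
        })
    return buffer.getvalue()


sample_rows = [
    {
        "sourceUrl": "https://quotes.toscrape.com/",
        "author": "Albert Einstein",
        "_confidence": 0.95,
        "_types": {"author": "text"},
    }
]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```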

### Input

| Field | Required | Description |
|---|---|---|
| `startUrls` | Yes | Pages to scrape — typically list / category / search pages with repeating items. |
| `examples` | Yes | Values visible on the page, one per line. Label them `author: Albert Einstein` to name the output columns. Repeat a label on multiple lines for a more robust pattern. |
| `maxItems` | No | Stop after this many rows across all URLs. Use `0` for no limit (default). |
| `maxPages` | No | Follow pagination ("Next" / `rel="next"`) up to this many pages per URL. Default `1`. |
| `normalizeValues` | No | Convert detected numbers / prices / percents into real numbers. Default `false`. |
| `includeMeta` | No | Add per-row `_confidence`, `_fieldsFilled`, `_types`, `_page`. Default `true`. |

#### Example input

```json
{
  "startUrls": [{ "url": "https://quotes.toscrape.com/" }],
  "examples": ["quote: process of our thinking", "author: Albert Einstein"],
  "maxItems": 0
}
````
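
The `label: value` convention in `examples` can be illustrated with a small parser. This is a sketch under assumptions: it splits on the first colon followed by a space, groups repeated labels together, and gives unlabeled values placeholder names such as `field_1`. The README does not say how the Actor names unlabeled columns, so that naming is hypothetical, as is the helper `parse_examples`.

```python
from collections import defaultdict


def parse_examples(lines):
    """Group example values by label; unlabeled values get placeholder names."""
    grouped = defaultdict(list)
    unlabeled = 0
    for line in lines:
        label, sep, value = line.partition(": ")
        if sep and label.strip() and value.strip():
            grouped[label.strip()].append(value.strip())
        else:
            # Hypothetical naming for values given without a label.
            unlabeled += 1
            grouped[f"field_{unlabeled}"].append(line.strip())
    return dict(grouped)


parsed = parse_examples([
    "quote: process of our thinking",
    "author: Albert Einstein",
    "author: J.K. Rowling",  # same label twice: both examples describe one field
    "$12.99",
])
print(parsed)
```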

### JSON output sample

For the input above, the scraper returns one row per quote on the page:

```json
[
  {
    "sourceUrl": "https://quotes.toscrape.com/",
    "quote": "The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.",
    "author": "Albert Einstein",
    "_confidence": 0.95,
    "_fieldsFilled": "2/2",
    "_types": { "quote": "text", "author": "text" },
    "_page": 1
  },
  {
    "sourceUrl": "https://quotes.toscrape.com/",
    "quote": "It is our choices, Harry, that show what we truly are, far more than our abilities.",
    "author": "J.K. Rowling",
    "_confidence": 0.95,
    "_fieldsFilled": "2/2",
    "_types": { "quote": "text", "author": "text" },
    "_page": 1
  }
]
```
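
Rows like these can be filtered on the metadata fields before further use. The sketch below assumes `_fieldsFilled` always has the `"filled/total"` form shown in the sample; the `0.8` threshold and the helper name `is_reliable` are arbitrary choices for illustration, and the second row is made up.

```python
def is_reliable(row, min_confidence=0.8):
    """Keep rows that are confident enough and have every field filled."""
    filled, total = (int(part) for part in row["_fieldsFilled"].split("/"))
    return row["_confidence"] >= min_confidence and filled == total


rows = [
    {"author": "Albert Einstein", "_confidence": 0.95, "_fieldsFilled": "2/2"},
    {"author": "", "_confidence": 0.40, "_fieldsFilled": "1/2"},  # made-up weak row
]
trusted = [row for row in rows if is_reliable(row)]
print(len(trusted), "of", len(rows), "rows kept")
```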

The run also saves an `EXTRACTION_SCHEMA` to the key-value store, e.g.:

```json
{
  "learnedRules": [
    { "field": "quote", "selector": "span.text", "type": "text", "matches": 10, "confidence": 0.95 },
    { "field": "author", "selector": "small.author", "type": "text", "matches": 10, "confidence": 0.95 }
  ],
  "itemsExtracted": 10,
  "averageConfidence": 0.95
}
```
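
A schema in this shape can be checked programmatically when debugging a run. The following sketch flags fields whose learned rule has low confidence or whose match count differs from `itemsExtracted`; both heuristics, the threshold, and the name `weak_rules` are assumptions, and the numbers for `author` are invented to show a failing case.

```python
def weak_rules(schema, min_confidence=0.8):
    """Return field names whose learned rule looks unreliable."""
    expected = schema["itemsExtracted"]
    return [
        rule["field"]
        for rule in schema["learnedRules"]
        if rule["confidence"] < min_confidence or rule["matches"] != expected
    ]


# Same structure as the sample above; the "author" numbers are made up.
schema = {
    "learnedRules": [
        {"field": "quote", "selector": "span.text", "type": "text",
         "matches": 10, "confidence": 0.95},
        {"field": "author", "selector": "small.author", "type": "text",
         "matches": 7, "confidence": 0.6},
    ],
    "itemsExtracted": 10,
    "averageConfidence": 0.78,
}
print(weak_rules(schema))
```

A flagged field is a hint to supply a more exact example value, or a second example under the same label.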

Point it at a shop instead and label your examples `title:`, `price:`, and `sku:` — you get one row per product with exactly those columns plus `sourceUrl`.

### Filters & options

- **Scrape multiple pages at once** — add several entries to `startUrls` and the rows are combined into one dataset.
- **Name your own columns** — label every example as `label: value` to control the output schema.
- **Cap your results** — set `maxItems` to limit total rows (handy for quick test runs), or `0` for everything.
- **Mix field types on one page** — give a title example and a price example together and they zip into the same rows.

### Pricing

This Actor uses **pay-per-result**: you are charged once per extracted row via the `item` event, so you only pay for data you actually get. Runs are free while monetization is unconfigured, and you can cap spend with `maxItems`. Check the Actor's Apify Store page for the current per-item rate.

### Use with AI agents & automation

The dataset is plain JSON, so it drops straight into your stack. Call this scraper from an **MCP** server to give AI agents live web-extraction-by-example, or wire it into **Make**, **n8n**, or **Zapier** to trigger runs and route rows to a CRM, database, or **Google Sheets** automatically. Schedule recurring runs to keep a sheet of prices, listings, or leads continuously fresh — no glue code needed.

### Other Flash Scrape scrapers

Need a ready-made scraper for a specific platform? Try the rest of the Flash Scrape suite:

- [Google Maps Leads Scraper](https://apify.com/oriented_wallpaper/google-maps-leads-opener) — Google Maps business leads
- [Yelp Leads Scraper](https://apify.com/oriented_wallpaper/yelp-leads-scraper) — Yelp business leads
- [BBB + Yellow Pages Leads Scraper](https://apify.com/oriented_wallpaper/bbb-yellowpages-leads-scraper) — BBB and Yellow Pages leads
- [Instagram Leads Scraper](https://apify.com/oriented_wallpaper/instagram-leads-scraper) — Instagram profile leads
- [TikTok Leads Scraper](https://apify.com/oriented_wallpaper/tiktok-leads-scraper) — TikTok creator leads
- [YouTube Leads Scraper](https://apify.com/oriented_wallpaper/youtube-leads-scraper) — YouTube creator leads

### FAQ

**Is it legal to scrape websites with this?**
The Actor only reads publicly available web content: the same pages anyone can open in a browser. Scrape responsibly, respect each site's terms of service and robots rules, and avoid collecting personal or copyrighted data you are not entitled to use.

**Do I need an API key or any code?**
No. There is no API key and no coding. You paste a URL and a few example values you can see on the page; the scraper learns the pattern for you.

**How many results can I get?**
As many repeating items as the page contains across all your `startUrls`. Set `maxItems` to cap the total, or leave it at `0` for no limit.

**Can I export to CSV, Excel, or Google Sheets?**
Yes. Every run produces a dataset you can download as CSV, JSON, or Excel, or push to Google Sheets via Make, n8n, or Zapier.

**Why didn't my example match?**
Copy an **exact** value from the page's visible text — not from an image, a tooltip, or a dropdown. It also works best when each value sits in its own element (a `<span>` price, an `<h2>` or `<a>` title).

**Can AI agents call this scraper?**
Yes. It exposes a standard Apify run interface, so MCP servers and agent frameworks can invoke it and read the structured rows directly.

***

> Scrapes public web content. Use responsibly and within each site's terms.

# Actor input Schema

## `startUrls` (type: `array`):

Pages to scrape — typically list/category/search pages with repeating items.

## `examples` (type: `array`):

Values you can SEE on the page, one per line. Optionally label them as 'label: value' to name the columns (e.g. 'author: Albert Einstein'). The actor finds every similar item.

## `maxItems` (type: `integer`):

Stop after this many rows across all URLs. Use 0 for no limit.

## `maxPages` (type: `integer`):

Follow 'Next' / rel="next" pagination links up to this many pages per Start URL. 1 = first page only.
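
As an illustration of what following pagination involves, the sketch below looks for a `rel="next"` link or an anchor whose text starts with "Next", resolves it against the current URL, and stops after `max_pages` pages. It is not the Actor's implementation; `NextLinkFinder`, `next_page_url`, `crawl`, and the injected `fetch` callable are hypothetical names, and the in-memory `SITE` stands in for real HTTP requests.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Remembers the first rel="next" link, or the first <a> labeled "Next"."""

    def __init__(self):
        super().__init__()
        self.next_href = None
        self._open_href = None  # href of the <a> currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if tag not in ("a", "link") or not href or self.next_href:
            return
        if "next" in (attrs.get("rel") or "").lower().split():
            self.next_href = href
        elif tag == "a":
            self._open_href = href

    def handle_data(self, data):
        if self._open_href and not self.next_href:
            if data.strip().lower().startswith("next"):
                self.next_href = self._open_href

    def handle_endtag(self, tag):
        if tag == "a":
            self._open_href = None


def next_page_url(html, base_url):
    finder = NextLinkFinder()
    finder.feed(html)
    finder.close()
    return urljoin(base_url, finder.next_href) if finder.next_href else None


def crawl(start_url, fetch, max_pages=1):
    """Visit up to max_pages pages, following next links; returns visited URLs."""
    url, visited = start_url, []
    while url and len(visited) < max_pages:
        html = fetch(url)
        visited.append(url)
        url = next_page_url(html, url)
    return visited


# In-memory stand-in for a paginated site (made up for illustration).
SITE = {
    "https://example.com/p1": '<a rel="next" href="/p2">More</a>',
    "https://example.com/p2": '<a href="/p3">Next</a>',
    "https://example.com/p3": "<p>Last page</p>",
}
print(crawl("https://example.com/p1", SITE.__getitem__, max_pages=2))
```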

## `normalizeValues` (type: `boolean`):

Convert detected numeric / price / percent values into real numbers (instead of strings) in the output.

## `includeMeta` (type: `boolean`):

Add per-row \_confidence (0-1), \_fieldsFilled, \_types, and \_page so you can trust/filter the extraction.

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://quotes.toscrape.com/"
    }
  ],
  "examples": [
    "quote: process of our thinking",
    "author: Albert Einstein"
  ],
  "maxItems": 0,
  "maxPages": 1,
  "normalizeValues": false,
  "includeMeta": true
}
```
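
Before starting a run, an input object like the one above can be checked locally against the constraints in the OpenAPI specification for this Actor (required `startUrls` and `examples`, `maxItems` of at least 0, `maxPages` between 1 and 100). The helper `validate_input` is a hypothetical sketch; treating empty arrays as errors is a stricter assumption than the schema itself states.

```python
def validate_input(data):
    """Return a list of problems; an empty list means the input looks valid."""
    errors = []

    urls = data.get("startUrls")
    if not isinstance(urls, list) or not urls:
        errors.append("startUrls must be a non-empty array")
    elif not all(isinstance(u, dict) and isinstance(u.get("url"), str) for u in urls):
        errors.append("each startUrls entry needs a string 'url'")

    examples = data.get("examples")
    if not isinstance(examples, list) or not examples:
        errors.append("examples must be a non-empty array")
    elif not all(isinstance(e, str) for e in examples):
        errors.append("examples must contain only strings")

    def is_int(value):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)

    max_items = data.get("maxItems", 0)
    if not is_int(max_items) or max_items < 0:
        errors.append("maxItems must be an integer >= 0")

    max_pages = data.get("maxPages", 1)
    if not is_int(max_pages) or not 1 <= max_pages <= 100:
        errors.append("maxPages must be an integer between 1 and 100")

    for key in ("normalizeValues", "includeMeta"):
        if key in data and not isinstance(data[key], bool):
            errors.append(f"{key} must be a boolean")
    return errors


good = {
    "startUrls": [{"url": "https://quotes.toscrape.com/"}],
    "examples": ["quote: process of our thinking", "author: Albert Einstein"],
    "maxItems": 0,
    "maxPages": 1,
}
print(validate_input(good))
print(validate_input({"startUrls": [], "examples": ["x"], "maxPages": 0}))
```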

# Actor output Schema

## `results` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace '<YOUR_API_TOKEN>' with your token.
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://quotes.toscrape.com/"
        }
    ],
    "examples": [
        "quote: process of our thinking",
        "author: Albert Einstein"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("oriented_wallpaper/ai-web-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": [{ "url": "https://quotes.toscrape.com/" }],
    "examples": [
        "quote: process of our thinking",
        "author: Albert Einstein",
    ],
}

# Run the Actor and wait for it to finish
run = client.actor("oriented_wallpaper/ai-web-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://quotes.toscrape.com/"
    }
  ],
  "examples": [
    "quote: process of our thinking",
    "author: Albert Einstein"
  ]
}' |
apify call oriented_wallpaper/ai-web-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=oriented_wallpaper/ai-web-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "AI Web Scraper - Extract Any Website by Example",
        "description": "AI web scraper that extracts any website by example — paste a URL and a value you see on the page (a price, title, or name) and it learns the HTML pattern and pulls every similar item as structured rows. No CSS selectors, no API key. Export CSV/JSON/Excel.",
        "version": "0.1",
        "x-build-id": "A0rfGLDtBa2t66Nvy"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/oriented_wallpaper~ai-web-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-oriented_wallpaper-ai-web-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/oriented_wallpaper~ai-web-scraper/runs": {
            "post": {
                "operationId": "runs-sync-oriented_wallpaper-ai-web-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/oriented_wallpaper~ai-web-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-oriented_wallpaper-ai-web-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrls",
                    "examples"
                ],
                "properties": {
                    "startUrls": {
                        "title": "Start URLs",
                        "type": "array",
                        "description": "Pages to scrape — typically list/category/search pages with repeating items.",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "examples": {
                        "title": "Example values",
                        "type": "array",
                        "description": "Values you can SEE on the page, one per line. Optionally label them as 'label: value' to name the columns (e.g. 'author: Albert Einstein'). The actor finds every similar item.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "maxItems": {
                        "title": "Max items",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Stop after this many rows across all URLs. Use 0 for no limit.",
                        "default": 0
                    },
                    "maxPages": {
                        "title": "Max pages per URL (pagination)",
                        "minimum": 1,
                        "maximum": 100,
                        "type": "integer",
                        "description": "Follow 'Next' / rel=\"next\" pagination links up to this many pages per Start URL. 1 = first page only.",
                        "default": 1
                    },
                    "normalizeValues": {
                        "title": "Normalize numbers & prices",
                        "type": "boolean",
                        "description": "Convert detected numeric / price / percent values into real numbers (instead of strings) in the output.",
                        "default": false
                    },
                    "includeMeta": {
                        "title": "Include confidence metadata",
                        "type": "boolean",
                        "description": "Add per-row _confidence (0-1), _fieldsFilled, _types, and _page so you can trust/filter the extraction.",
                        "default": true
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
