# AI Website Content Crawler (`ilborso/ai-website-content-crawler`) Actor

A super fast website crawler for AI training

- **URL**: https://apify.com/ilborso/ai-website-content-crawler.md
- **Developed by:** [Fabio Borsotti](https://apify.com/ilborso) (community)
- **Categories:** AI, Agents, Automation
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

from $0.50 / 1,000 results

This Actor is paid per event and usage. You are charged both a fixed price for specific events and for Apify platform usage.
Since this Actor supports Apify Store discounts, the higher your subscription plan, the lower the price.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see the Apify documentation as a [Markdown index](https://docs.apify.com/llms.txt) and as [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## AI Website Content Crawler

This Apify Actor downloads a list of web pages, extracts clean text from each page, and stores one result per URL in the dataset.

The Actor is optimized for plain HTTP fetching with concurrent requests, automatic URL normalization, and language preference headers based on the `langs` input field.

### What the Actor does

For each input URL, the Actor normalizes the address, adding `https://` automatically when the scheme is missing, so values like `example.com` are still processed correctly.

It then performs HTTP requests in parallel using asynchronous execution controlled by `maxConcurrency`, which helps speed up large batches of URLs.

After downloading the page, the Actor parses the HTML with BeautifulSoup, removes non-content elements such as `script`, `style`, `noscript`, `header`, `footer`, `svg`, `img`, `meta`, and `link`, and converts the remaining content into clean plain text.

The Actor also reads the `lang` attribute from the HTML document when available and includes it in the output as `htmlLang`.

### Features

- Accepts a list of page URLs.
- Automatically adds `https://` when missing from input URLs.
- Deduplicates normalized URLs before processing.
- Sends preferred language headers using the `langs` array.
- Extracts cleaned plain text from HTML pages.
- Detects the page language from the HTML `lang` attribute when present.
- Processes requests concurrently using `maxConcurrency`.
- Logs progress for each processed URL and prints a final success/error summary.

### Input

```json
{
  "url": [
    "example.com",
    "https://news.ycombinator.com"
  ],
  "maxConcurrency": 20,
  "langs": ["it", "en"]
}
````
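The input above includes `example.com` without a scheme. As a rough illustration of the normalization and deduplication described earlier, here is a minimal sketch. The function names `normalize_url` and `normalize_and_dedupe` are hypothetical, and the Actor's real rules (for example, how it treats whitespace or trailing slashes) may differ.

```python
from __future__ import annotations


def normalize_url(raw: str) -> str:
    """Prepend https:// when the input has no scheme (assumed behavior)."""
    candidate = raw.strip()
    if not candidate:
        return ""
    if "://" not in candidate:
        candidate = "https://" + candidate
    return candidate


def normalize_and_dedupe(urls: list[str]) -> list[str]:
    """Normalize each URL and drop duplicates while preserving input order."""
    seen: set[str] = set()
    result: list[str] = []
    for raw in urls:
        url = normalize_url(raw)
        # Skip blank entries and anything already queued
        if url and url not in seen:
            seen.add(url)
            result.append(url)
    return result


if __name__ == "__main__":
    print(normalize_and_dedupe(["example.com", "https://example.com"]))
```

Under these assumptions, `example.com` and `https://example.com` collapse into a single entry, so the page is fetched once.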

#### Input fields

| Field | Type | Required | Description |
|---|---|---:|---|
| `url` | array of strings | Yes | List of page URLs to download and convert into clean plain text. |
| `maxConcurrency` | integer | No | Maximum number of parallel HTTP requests. |
| `langs` | array of strings | No | Preferred languages used to build the `Accept-Language` header for HTTP requests. |
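To make the effect of `maxConcurrency` concrete, the following sketch shows one common way to cap parallel requests in asynchronous Python, using `asyncio.Semaphore`. This is an assumption about how such a limit could be enforced, not the Actor's actual code. `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` only simulates a request.

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrency=20):
    """Run `fetch(url)` for every URL with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        # Each task waits here until a slot is free
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(url) for url in urls))


async def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP request
    await asyncio.sleep(0.01)
    return {"url": url, "success": True}


if __name__ == "__main__":
    urls = ["https://example.com", "https://news.ycombinator.com"]
    print(asyncio.run(fetch_all(urls, fake_fetch, max_concurrency=2)))
```

A lower limit is gentler on target sites, while a higher one shortens large batches at the cost of more simultaneous connections.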

### How language handling works

The Actor converts the `langs` array into a standard `Accept-Language` header ordered by priority, so `langs: ["it", "en", "fr"]` becomes a header similar to `it,en;q=0.9,fr;q=0.8`.

This does not guarantee that every website returns content in the requested language, because the final response depends on how each target site handles language negotiation.
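The header construction described above can be sketched as follows. The function name `build_accept_language` is hypothetical, and the 0.1 step and the 0.1 floor for long lists are assumptions chosen to reproduce the example header. The Actor's exact weights may differ.

```python
def build_accept_language(langs):
    """Build an Accept-Language value with descending q-weights."""
    parts = []
    for index, lang in enumerate(langs):
        if index == 0:
            # The first language gets the implicit weight q=1.0
            parts.append(lang)
        else:
            # Assumed scheme: lower the weight by 0.1 per position, never below 0.1
            weight = max(1.0 - 0.1 * index, 0.1)
            parts.append(f"{lang};q={weight:.1f}")
    return ",".join(parts)


if __name__ == "__main__":
    print(build_accept_language(["it", "en", "fr"]))  # it,en;q=0.9,fr;q=0.8
```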

### Output

Each dataset item contains the processing result for a single URL.

#### Successful result

```json
{
  "url": "https://example.com",
  "success": true,
  "statusCode": 200,
  "langs": ["it", "en"],
  "acceptLanguage": "it,en;q=0.9",
  "htmlLang": "en",
  "text": "Example Domain\nThis domain is for use in illustrative examples in documents...",
  "error": null
}
```

#### Error result

```json
{
  "url": "https://example.com/missing-page",
  "success": false,
  "statusCode": null,
  "langs": ["it", "en"],
  "acceptLanguage": "it,en;q=0.9",
  "htmlLang": null,
  "text": null,
  "error": "404 Client Error or another request exception"
}
```
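Because every item carries a `success` flag, downstream code can separate usable text from failures before further processing. The sketch below assumes items shaped like the two examples above. `split_results` is a hypothetical helper, not part of the Actor or the Apify client.

```python
def split_results(items):
    """Separate dataset items into successes and failures using the `success` flag."""
    succeeded = [item for item in items if item.get("success")]
    failed = [item for item in items if not item.get("success")]
    return succeeded, failed


if __name__ == "__main__":
    sample = [
        {"url": "https://example.com", "success": True, "text": "Example Domain", "error": None},
        {"url": "https://example.com/missing-page", "success": False, "text": None, "error": "404 Client Error"},
    ]
    ok, bad = split_results(sample)
    for item in bad:
        # Report failed URLs together with the recorded error message
        print(f"{item['url']}: {item['error']}")
    print(f"{len(ok)} succeeded, {len(bad)} failed")
```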

### Notes and limitations

- The Actor uses plain HTTP requests and does not render JavaScript in a browser, so text that appears only after client-side rendering may be missing.
- The extracted text depends on the HTML returned by the target website and on the site's response to the `Accept-Language` header.
- The Actor is intended for HTML pages; non-HTML resources may fail or return unusable output depending on server behavior.

# Actor input Schema

## `url` (type: `array`):

List of page URLs to download and convert into clean plain text.

## `maxConcurrency` (type: `integer`):

Maximum number of parallel HTTP requests.

## `langs` (type: `array`):

Preferred languages used to build the `Accept-Language` header for HTTP requests.

## Actor input object example

```json
{
  "url": [
    "https://www.ttalbuzzano.it/"
  ],
  "maxConcurrency": 20,
  "langs": [
    "it",
    "en"
  ]
}
```

# Actor output Schema

## `results` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "url": [
        "https://www.ttalbuzzano.it/"
    ],
    "langs": [
        "it",
        "en"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("ilborso/ai-website-content-crawler").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "url": ["https://www.ttalbuzzano.it/"],
    "langs": [
        "it",
        "en",
    ],
}

# Run the Actor and wait for it to finish
run = client.actor("ilborso/ai-website-content-crawler").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "url": [
    "https://www.ttalbuzzano.it/"
  ],
  "langs": [
    "it",
    "en"
  ]
}' |
apify call ilborso/ai-website-content-crawler --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=ilborso/ai-website-content-crawler",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "AI Website Content Crawler",
        "description": "A super fast website crawler for AI training",
        "version": "0.0",
        "x-build-id": "rezWdy2cYAbRVK5ZR"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/ilborso~ai-website-content-crawler/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-ilborso-ai-website-content-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/ilborso~ai-website-content-crawler/runs": {
            "post": {
                "operationId": "runs-sync-ilborso-ai-website-content-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/ilborso~ai-website-content-crawler/run-sync": {
            "post": {
                "operationId": "run-sync-ilborso-ai-website-content-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "url"
                ],
                "properties": {
                    "url": {
                        "title": "URLs",
                        "uniqueItems": true,
                        "type": "array",
                        "description": "List of page URLs to download and convert into clean plain text.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "maxConcurrency": {
                        "title": "Max concurrency",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Maximum number of parallel HTTP requests.",
                        "default": 20
                    },
                    "langs": {
                        "title": "Languages",
                        "uniqueItems": true,
                        "type": "array",
                        "description": "Preferred langs.",
                        "items": {
                            "type": "string",
                            "enum": [
                                "aa",
                                "ab",
                                "ae",
                                "af",
                                "ak",
                                "am",
                                "ar",
                                "as",
                                "av",
                                "ay",
                                "az",
                                "ba",
                                "be",
                                "bg",
                                "bi",
                                "bm",
                                "bn",
                                "bo",
                                "br",
                                "bs",
                                "ca",
                                "ce",
                                "ch",
                                "co",
                                "cr",
                                "cs",
                                "cu",
                                "cv",
                                "cy",
                                "da",
                                "de",
                                "dv",
                                "dz",
                                "ee",
                                "el",
                                "en",
                                "eo",
                                "es",
                                "et",
                                "eu",
                                "fa",
                                "ff",
                                "fi",
                                "fj",
                                "fo",
                                "fr",
                                "fy",
                                "ga",
                                "gd",
                                "gl",
                                "gn",
                                "gu",
                                "gv",
                                "ha",
                                "he",
                                "hi",
                                "ho",
                                "hr",
                                "ht",
                                "hu",
                                "hy",
                                "hz",
                                "ia",
                                "id",
                                "ie",
                                "ig",
                                "ii",
                                "ik",
                                "io",
                                "is",
                                "it",
                                "iu",
                                "ja",
                                "jv",
                                "ka",
                                "kg",
                                "ki",
                                "kj",
                                "kk",
                                "kl",
                                "km",
                                "kn",
                                "ko",
                                "kr",
                                "ks",
                                "ku",
                                "kv",
                                "kw",
                                "ky",
                                "la",
                                "lb",
                                "lg",
                                "li",
                                "ln",
                                "lo",
                                "lt",
                                "lu",
                                "lv",
                                "mg",
                                "mh",
                                "mi",
                                "mk",
                                "ml",
                                "mn",
                                "mr",
                                "ms",
                                "mt",
                                "my",
                                "na",
                                "nb",
                                "nd",
                                "ne",
                                "ng",
                                "nl",
                                "nn",
                                "no",
                                "nr",
                                "nv",
                                "ny",
                                "oc",
                                "oj",
                                "om",
                                "or",
                                "os",
                                "pa",
                                "pi",
                                "pl",
                                "ps",
                                "pt",
                                "qu",
                                "rm",
                                "rn",
                                "ro",
                                "ru",
                                "rw",
                                "sa",
                                "sc",
                                "sd",
                                "se",
                                "sg",
                                "sh",
                                "si",
                                "sk",
                                "sl",
                                "sm",
                                "sn",
                                "so",
                                "sq",
                                "sr",
                                "ss",
                                "st",
                                "su",
                                "sv",
                                "sw",
                                "ta",
                                "te",
                                "tg",
                                "th",
                                "ti",
                                "tk",
                                "tl",
                                "tn",
                                "to",
                                "tr",
                                "ts",
                                "tt",
                                "tw",
                                "ty",
                                "ug",
                                "uk",
                                "ur",
                                "uz",
                                "ve",
                                "vi",
                                "vo",
                                "wa",
                                "wo",
                                "xh",
                                "yi",
                                "yo",
                                "za",
                                "zh",
                                "zu"
                            ]
                        },
                        "default": [
                            "it",
                            "en"
                        ]
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
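For projects that call the REST API directly, the `run-sync-get-dataset-items` path from the specification above can be targeted with the Python standard library alone. This is a minimal sketch: `build_run_sync_request` is a hypothetical helper, and the request is only constructed here, not sent. Passing it to `urllib.request.urlopen` would start a real, billable run.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "ilborso~ai-website-content-crawler"


def build_run_sync_request(token, actor_input):
    """Build a POST request that runs the Actor and returns its dataset items."""
    # The specification defines `token` as a required query parameter
    query = urlencode({"token": token})
    url = f"{API_BASE}/acts/{ACTOR_PATH}/run-sync-get-dataset-items?{query}"
    body = json.dumps(actor_input).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_run_sync_request(
        "<YOUR_API_TOKEN>",
        {"url": ["https://www.ttalbuzzano.it/"], "langs": ["it", "en"]},
    )
    print(request.get_method(), request.full_url)
```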
