# Deep Website Crawler (DEPRECATED) (`santamaria-automations/deep-website-crawler`) Actor

DEPRECATED — use santamaria-automations/website-content-crawler instead. Same crawl behavior, richer output (clean AI/RAG-ready Markdown vs plain text).

- **URL**: https://apify.com/santamaria-automations/deep-website-crawler.md
- **Developed by:** [Ale](https://apify.com/santamaria-automations) (community)
- **Categories:** Automation, Lead generation
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded, bookmark count unavailable
- **User rating**: No ratings yet

## Pricing

Pay per event

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Deep Website Crawler

Crawl any website to configurable depth and extract the title and full text content of every page. Give it a list of start URLs — it follows links level by level and returns one record per page. No API keys or login required.

### How It Works

For each start URL you provide, the crawler:

1. Fetches the start page
2. Extracts all internal links from that page
3. Follows those links to the next depth level
4. Repeats until the configured depth or page limit is reached
5. Returns one record per crawled page with its title, text content, and crawl depth

Challenge pages (bot-protection walls) are skipped automatically so the run keeps going. Pages that return errors are logged and skipped.
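
The level-by-level procedure above is essentially a breadth-first traversal with a depth limit and a page limit. The following is a minimal sketch of that idea, not the Actor's actual implementation. It assumes a hypothetical in-memory link graph (`SITE`) in place of real HTTP fetching, and the function name `crawl` is illustrative only.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: URL -> internal links.
SITE = {
    "https://example.com": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/b": ["https://example.com/a"],
    "https://example.com/c": [],
}


def crawl(start_url, max_depth=2, max_pages=100):
    """Breadth-first crawl that returns one record per visited page."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    records = []
    while queue and len(records) < max_pages:
        url, depth = queue.popleft()
        links = SITE.get(url, [])  # a real crawler would fetch and parse here
        records.append({"url": url, "depth": depth, "links_found": len(links)})
        if depth < max_depth:
            for link in links:
                if link not in seen:  # each unique URL is visited once
                    seen.add(link)
                    queue.append((link, depth + 1))
    return records


for record in crawl("https://example.com", max_depth=1):
    print(record["depth"], record["url"])
```

Because the queue is processed in insertion order, all pages at depth `n` are recorded before any page at depth `n + 1`, which matches the "level by level" description.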

### Use with AI Agents (MCP)

Connect this Actor to any MCP-compatible AI client — Claude Desktop, Claude.ai, Cursor, VS Code, LangChain, LlamaIndex, or custom agents.

**Apify MCP server URL:**

````
https://mcp.apify.com?tools=santamaria-automations/deep-website-crawler
````

**Example prompt once connected:**

> "Use `deep-website-crawler` to crawl https://example.com to depth 2 and return all page titles and text as a table."

Clients that support dynamic tool discovery (Claude.ai, VS Code) will receive the full input schema automatically via `add-actor`.

### Input Example

```json
{
  "startUrls": [
    "https://acme-corp.com",
    "https://www.another-company.de/blog"
  ],
  "maxDepth": 2,
  "maxPagesPerCrawl": 100,
  "maxPagesPerDomain": 50
}
````

Both bare domains (`acme-corp.com`) and full URLs (`https://acme-corp.com/about`) are accepted.
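
How the Actor normalizes bare domains internally is not documented here. If you would rather pass fully qualified URLs, a small client-side helper such as the hypothetical `normalize_start_url` below can add a scheme before you build the input. Defaulting to `https` is an assumption of this sketch.

```python
def normalize_start_url(value, default_scheme="https"):
    """Return the URL unchanged if it has an http(s) scheme, else prefix one."""
    value = value.strip()
    if value.lower().startswith(("http://", "https://")):
        return value
    return f"{default_scheme}://{value}"


print(normalize_start_url("acme-corp.com"))
print(normalize_start_url("https://acme-corp.com/about"))
```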

### Output Example

```json
[
  {
    "url": "https://acme-corp.com",
    "title": "Acme Corp - Industrial Solutions",
    "text": "Acme Corp is a global leader in industrial solutions. Since 1950 we have...",
    "depth": 0,
    "start_url": "https://acme-corp.com",
    "links_found": 14,
    "status_code": 200,
    "content_length": 3842,
    "scraped_at": "2026-04-29T10:00:00Z"
  },
  {
    "url": "https://acme-corp.com/about",
    "title": "About Us - Acme Corp",
    "text": "Founded in 1950, Acme Corp has grown from a small family workshop into...",
    "depth": 1,
    "start_url": "https://acme-corp.com",
    "links_found": 8,
    "status_code": 200,
    "content_length": 2190,
    "scraped_at": "2026-04-29T10:00:01Z"
  }
]
```

### Pricing

Each run is charged a fixed start fee plus a per-page fee. You are only charged for pages you actually receive.

| Event | Price | Description |
|-------|-------|-------------|
| Actor start | $0.25 | Covers container startup |
| Page result | $0.0005 | Per page crawled and returned |

**Example costs:**

| Pages crawled | Cost |
|--------------|------|
| 0 pages | $0.25 |
| 100 pages | $0.30 |
| 1,000 pages | $0.75 |
| 10,000 pages | $5.25 |

No monthly fees. No minimum spend.
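
The example costs follow from the two event prices: one start fee plus the per-page fee times the number of pages. The sketch below reproduces that arithmetic with `Decimal` to avoid floating-point rounding. The `runs` parameter is an assumption of this sketch: the input schema caps `maxPagesPerCrawl` at 500, so larger totals presumably span several runs, each paying its own start fee.

```python
from decimal import Decimal

ACTOR_START_USD = Decimal("0.25")    # per run, from the pricing table
PAGE_RESULT_USD = Decimal("0.0005")  # per page returned, from the pricing table


def estimate_cost(pages, runs=1):
    """Estimated charge in USD for `pages` results spread over `runs` runs."""
    if pages < 0 or runs < 0:
        raise ValueError("pages and runs must be non-negative")
    return ACTOR_START_USD * runs + PAGE_RESULT_USD * pages


for pages in (0, 100, 1_000, 10_000):
    print(pages, estimate_cost(pages))
```

With the default `runs=1` this matches the table above; passing a larger `runs` value shows how repeated start fees change the total.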

### Input Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `startUrls` | string\[] | required | URLs to start crawling from |
| `maxDepth` | integer | 2 | Number of link levels to follow from each start URL (0–5) |
| `maxPagesPerCrawl` | integer | 100 | Max total pages across all start URLs (1–500) |
| `maxPagesPerDomain` | integer | 50 | Max pages per unique domain (1–250) |
| `proxyConfiguration` | object | Apify proxy | Proxy settings |
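
Checking the documented ranges before starting a run can catch mistakes early, since every run incurs the start fee. The helper below is a hypothetical client-side sketch (`build_input` and `LIMITS` are not part of any Apify library). It uses the defaults and ranges from the table above, and treating an empty `startUrls` list as an error is a choice made by this sketch.

```python
# Allowed ranges as listed in the Input Parameters table.
LIMITS = {
    "maxDepth": (0, 5),
    "maxPagesPerCrawl": (1, 500),
    "maxPagesPerDomain": (1, 250),
}


def build_input(start_urls, max_depth=2, max_pages_per_crawl=100,
                max_pages_per_domain=50):
    """Build the Actor input dict, raising ValueError on out-of-range values."""
    if not start_urls:
        raise ValueError("startUrls is required and must not be empty")
    run_input = {
        "startUrls": list(start_urls),
        "maxDepth": max_depth,
        "maxPagesPerCrawl": max_pages_per_crawl,
        "maxPagesPerDomain": max_pages_per_domain,
    }
    for name, (low, high) in LIMITS.items():
        value = run_input[name]
        if not isinstance(value, int) or not low <= value <= high:
            raise ValueError(f"{name} must be an integer in [{low}, {high}], got {value!r}")
    return run_input


print(build_input(["https://example.com"], max_depth=1))
```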

### Output Fields

| Field | Type | Description |
|-------|------|-------------|
| `url` | string | Canonical URL of the crawled page |
| `title` | string | HTML title tag content |
| `text` | string | Visible plain text (truncated at 10,000 characters) |
| `depth` | integer | Crawl depth (0 = start URL, 1 = one link away, etc.) |
| `start_url` | string | The start URL that initiated this crawl path |
| `links_found` | integer | Internal links discovered on this page |
| `status_code` | integer | HTTP status code |
| `content_length` | integer | Characters in extracted text (before truncation) |
| `scraped_at` | string | ISO 8601 UTC timestamp |
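
Since `text` is truncated at 10,000 characters while `content_length` reports the length before truncation, comparing the two lets you spot pages whose text was cut. The sketch below assumes the fields behave exactly as described in the table above. The function name `summarize` and the sample records are illustrative.

```python
from collections import Counter

TEXT_LIMIT = 10_000  # truncation threshold from the Output Fields table


def summarize(records):
    """Count pages per crawl depth and list URLs whose text was truncated."""
    pages_per_depth = Counter(record["depth"] for record in records)
    truncated = [r["url"] for r in records if r["content_length"] > TEXT_LIMIT]
    return {"pages_per_depth": dict(pages_per_depth), "truncated_urls": truncated}


# Made-up sample records using the documented field names.
sample = [
    {"url": "https://example.com", "depth": 0, "content_length": 3842},
    {"url": "https://example.com/long", "depth": 1, "content_length": 25_000},
    {"url": "https://example.com/about", "depth": 1, "content_length": 2190},
]
print(summarize(sample))
```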

### Tips

- **Depth 2 covers most websites** — homepage → section pages → detail pages is typically enough for site audits and content extraction
- **Use maxPagesPerCrawl for budget control** — set this lower than the theoretical maximum to cap spend on large sites
- **Depth 0 is just the start page** — useful when you have a precise list of URLs and only need content extraction without following links
- **One record per page** — each unique URL gets its own row, making it easy to filter, sort, or feed into downstream processing
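
For the budget-control tip, one way to pick `maxPagesPerCrawl` is to work backwards from a spending limit for a single run. The sketch below uses the prices from the Pricing section and the documented upper bound of 500. The helper name `max_pages_for_budget` is hypothetical.

```python
from decimal import Decimal


def max_pages_for_budget(budget_usd, start_fee=Decimal("0.25"),
                         page_fee=Decimal("0.0005"), hard_cap=500):
    """Largest page count one run can return without exceeding the budget."""
    remaining = Decimal(str(budget_usd)) - start_fee
    if remaining < 0:
        return 0  # the budget does not even cover the start fee
    return min(int(remaining / page_fee), hard_cap)


print(max_pages_for_budget("0.30"))
print(max_pages_for_budget("1.00"))
```

A result of 0 means the budget is too small to start a run at all, because the minimum allowed `maxPagesPerCrawl` is 1.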

### Related Actors

- [Free Email Domain Scraper](https://apify.com/santamaria-automations/free-email-domain-scraper) — extract email addresses from any domain
- [Website Contact Extractor](https://apify.com/santamaria-automations/website-contact-extractor) — extract full contact records (email + phone + social + address)
- [SEO Metadata Extractor](https://apify.com/santamaria-automations/seo-metadata-extractor) — extract meta title, description, canonical, and OG tags

### Issues & Feature Requests

If something is not working or you're missing a feature, please [open an issue](https://console.apify.com/actors/deep-website-crawler/issues) and we'll look into it.

# Actor input Schema

## `startUrls` (type: `array`):

List of URLs where the crawler will start. Each URL is crawled to the configured depth. Both bare domains (example.com) and full URLs (https://example.com/blog) are accepted.

## `maxDepth` (type: `integer`):

How many link levels deep to crawl from each start URL. 0 = start page only, 1 = start page + its links, 2 = two levels deep, etc.

## `maxPagesPerCrawl` (type: `integer`):

Maximum total pages to crawl across all start URLs combined. Useful for budget control on large sites.

## `maxPagesPerDomain` (type: `integer`):

Maximum pages to crawl per unique domain. Caps how many pages are crawled on any single domain.

## `proxyConfiguration` (type: `object`):

Optional Apify proxy. Most public websites are reachable without proxy; enable only for sites that block direct traffic.

## Actor input object example

```json
{
  "startUrls": [
    "https://example.com"
  ],
  "maxDepth": 2,
  "maxPagesPerCrawl": 100,
  "maxPagesPerDomain": 50,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}
```

# Actor output Schema

## `defaultDataset` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        "https://example.com"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("santamaria-automations/deep-website-crawler").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "startUrls": ["https://example.com"] }

# Run the Actor and wait for it to finish
run = client.actor("santamaria-automations/deep-website-crawler").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```
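
In a real project you will probably not hard-code the token, and you may want to pass more than `startUrls`. The variant below is a sketch under a few assumptions: the token is read from an environment variable named `APIFY_TOKEN`, the helper names `run_crawl` and `client_from_env` are hypothetical, and the client calls are the same ones used in the example above. The `None` check is a defensive choice of this sketch.

```python
import os

ACTOR_ID = "santamaria-automations/deep-website-crawler"


def client_from_env():
    """Create an ApifyClient from the APIFY_TOKEN environment variable."""
    # Imported lazily so the helpers can be loaded without the package installed.
    from apify_client import ApifyClient
    return ApifyClient(os.environ["APIFY_TOKEN"])


def run_crawl(client, start_urls, max_depth=2, max_pages_per_crawl=100):
    """Run the Actor, wait for it to finish, and return its dataset items."""
    run_input = {
        "startUrls": list(start_urls),
        "maxDepth": max_depth,
        "maxPagesPerCrawl": max_pages_per_crawl,
    }
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    if run is None:
        raise RuntimeError("Actor run did not return run details")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Usage (needs `pip install apify-client` and a valid token):
# pages = run_crawl(client_from_env(), ["https://example.com"], max_depth=1)
```

Passing the client in as an argument keeps `run_crawl` easy to test with a stub object.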

## CLI example

```bash
echo '{
  "startUrls": [
    "https://example.com"
  ]
}' |
apify call santamaria-automations/deep-website-crawler --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=santamaria-automations/deep-website-crawler",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Deep Website Crawler (DEPRECATED)",
        "description": "DEPRECATED — use santamaria-automations/website-content-crawler instead. Same crawl behavior, richer output (clean AI/RAG-ready Markdown vs plain text).",
        "version": "1.0",
        "x-build-id": "uMj4yaNhe5KTpbkQ9"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/santamaria-automations~deep-website-crawler/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-santamaria-automations-deep-website-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/santamaria-automations~deep-website-crawler/runs": {
            "post": {
                "operationId": "runs-sync-santamaria-automations-deep-website-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/santamaria-automations~deep-website-crawler/run-sync": {
            "post": {
                "operationId": "run-sync-santamaria-automations-deep-website-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrls"
                ],
                "properties": {
                    "startUrls": {
                        "title": "Start URLs",
                        "type": "array",
                        "description": "List of URLs where the crawler will start. Each URL is crawled to the configured depth. Both bare domains (example.com) and full URLs (https://example.com/blog) are accepted.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "maxDepth": {
                        "title": "Max Crawl Depth",
                        "minimum": 0,
                        "maximum": 5,
                        "type": "integer",
                        "description": "How many link levels deep to crawl from each start URL. 0 = start page only, 1 = start page + its links, 2 = two levels deep, etc.",
                        "default": 2
                    },
                    "maxPagesPerCrawl": {
                        "title": "Max Pages Per Crawl",
                        "minimum": 1,
                        "maximum": 500,
                        "type": "integer",
                        "description": "Maximum total pages to crawl across all start URLs combined. Useful for budget control on large sites.",
                        "default": 100
                    },
                    "maxPagesPerDomain": {
                        "title": "Max Pages Per Domain",
                        "minimum": 1,
                        "maximum": 250,
                        "type": "integer",
                        "description": "Maximum pages to crawl per unique domain. Caps how deep any single domain can be crawled.",
                        "default": 50
                    },
                    "proxyConfiguration": {
                        "title": "Proxy Configuration",
                        "type": "object",
                        "description": "Optional Apify proxy. Most public websites are reachable without proxy; enable only for sites that block direct traffic.",
                        "default": {
                            "useApifyProxy": false
                        }
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
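
If you cannot use one of the client libraries, the `run-sync-get-dataset-items` path from the specification above can be called directly. The following is a minimal sketch using only the Python standard library. The helper names `build_request` and `run_sync`, the 300-second timeout, and the assumption that the response body is JSON are all choices of this sketch and are not stated in the specification.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
RUN_SYNC_PATH = (
    "/acts/santamaria-automations~deep-website-crawler/run-sync-get-dataset-items"
)


def build_request(token, run_input):
    """Build the POST request; the token goes in the query string per the spec."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        f"{API_BASE}{RUN_SYNC_PATH}?{query}",
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sync(token, run_input, timeout=300):
    """Send the request and decode the response body as JSON."""
    request = build_request(token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (performs a real, billable run):
# items = run_sync("<YOUR_API_TOKEN>", {"startUrls": ["https://example.com"]})
```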
