# Website to Markdown Crawler - Full-Site Text for LLMs & RAG (`entranced_gelato/website-to-markdown-crawler`) Actor

Crawl any website from a start URL and get every page as clean text + Markdown for LLMs, RAG, and AI agents. Follows internal links with depth and page limits, strips nav and ads, and returns one structured record per page. A fast, no-config site-to-Markdown crawler.

- **URL**: https://apify.com/entranced\_gelato/website-to-markdown-crawler.md
- **Developed by:** [AIDevs](https://apify.com/entranced_gelato) (community)
- **Categories:** AI, Developer tools, Agents
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

From $1.00 / 1,000 pages crawled

This Actor is paid per event. You are not charged for the Apify platform usage, but only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows (PowerShell)
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## 🕸️ Website to Markdown Crawler

[![Website to Markdown Crawler](https://apify.com/actor-badge?actor=entranced_gelato/website-to-markdown-crawler)](https://apify.com/entranced_gelato/website-to-markdown-crawler)

**Crawl any website from a single start URL and get every page back as clean text + Markdown — ready for LLMs, RAG pipelines, and AI agents.** No config, no selectors, no headless-browser setup. Give it a URL, it follows the internal links, strips the navigation and ads, and returns one structured record per page.

This is the bulk, whole-site companion to the single-page [AI Web Page Reader](https://apify.com/entranced_gelato/ai-web-page-reader). Point it at a docs site, a blog, a knowledge base, or a marketing site and turn the entire thing into LLM-ready Markdown in one run.

---

### 🤔 What can Website to Markdown Crawler do?

- 🔗 **Follow internal links automatically** — breadth-first crawl from your start URL, with depth and page limits you control.
- 🧹 **Return clean content** — removes nav bars, headers, footers, cookie banners, scripts, and ads, keeping the real page text.
- 📝 **Output Markdown + plain text** — headings, lists, links, and emphasis preserved as Markdown; plain text for simple ingestion.
- 🏠 **Stay on-domain** — same-domain scoping by default so you don't wander off into the wider web.
- 🗂️ **One record per page** — title, description, word count, links found, depth, content, and Markdown for every page.
- ⚙️ **Run with zero configuration** — sensible defaults; the only required field is the start URL.
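
The crawl behavior listed above can be illustrated with a minimal sketch. This is not the Actor's source code: it is an assumed model of a breadth-first crawl over a hypothetical in-memory link graph (`LINKS`), and `crawl_order` is an illustrative helper showing how `maxPages`, `maxDepth`, and `sameDomainOnly` could interact.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real HTTP fetches.
LINKS = {
    "https://example.com/": [
        "https://example.com/a",
        "https://example.com/b",
        "https://other.org/",
    ],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": ["https://example.com/d"],
}


def crawl_order(start_url, max_pages=25, max_depth=3, same_domain_only=True):
    """Return (url, depth) pairs in the order a breadth-first crawl visits them."""
    start_host = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in LINKS.get(url, []):
            if same_domain_only and urlparse(link).netloc != start_host:
                continue  # skip off-domain links
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


for url, depth in crawl_order("https://example.com/", max_depth=1):
    print(depth, url)
```

In this model, `max_depth=0` visits only the start page, and the page limit is checked before each visit, so the crawl never returns more than `max_pages` records.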

### 📊 What data do I get?

For every page crawled, you get one dataset record:

| Field | Description |
|-------|-------------|
| `url` | The page URL that was crawled. |
| `depth` | How many links deep from the start URL (start = 0). |
| `title` | Page title. |
| `description` | Meta description, if present. |
| `wordCount` | Word count of the extracted text. |
| `content` | Clean plain text of the main content. |
| `markdown` | LLM-ready Markdown version (when enabled). |
| `linksFound` | Number of links discovered on the page. |
| `crawledAt` | ISO timestamp of when the page was crawled. |

### 💰 How much will it cost?

This Actor uses **pay-per-event** pricing — you only pay for what you crawl, and platform/compute usage is included (no surprise infrastructure bill):

- **Page crawled — $1.00 per 1,000 pages** (the primary event). One charge per page successfully returned to the dataset.
- **Actor start — $0.00005** (a negligible per-run fee).
- **Platform usage / compute — included.**

A 25-page crawl costs about **$0.025** and a 1,000-page crawl about **$1.00**, plus the Actor start fee. Because there is one charge per page returned, `maxPages` caps your spend: the maximum cost of a run is known before you start it, which is not the case with compute-metered crawlers.
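
As a quick check of the arithmetic, here is a small sketch that computes the upper bound on a run's cost from the prices quoted in this section. The helper name `max_run_cost` is illustrative, and the constants should be verified against the Actor's current pricing before relying on them.

```python
from decimal import Decimal

# Prices as quoted in this README; they may change on the Actor's pricing page.
PRICE_PER_PAGE = Decimal("0.001")     # $1.00 per 1,000 pages
PRICE_PER_START = Decimal("0.00005")  # per-run Actor start fee


def max_run_cost(max_pages: int) -> Decimal:
    """Upper bound on one run's cost in USD, assuming every page up to max_pages is returned."""
    if max_pages < 0:
        raise ValueError("max_pages must be non-negative")
    return PRICE_PER_START + PRICE_PER_PAGE * max_pages


print(max_run_cost(25))    # 0.02505
print(max_run_cost(1000))  # 1.00005
```

`Decimal` is used instead of `float` so that small per-event prices add up exactly.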

### 🚀 How do I use Website to Markdown Crawler?

1. Create a free [Apify account](https://console.apify.com/sign-up) (new accounts get free monthly usage credits).
2. Open the Actor and paste a website into **Start URL** (e.g. your docs or blog homepage).
3. Optionally set **Max pages**, **Max depth**, and whether to **stay on the same domain**.
4. Click **Start** and watch pages stream into the dataset.
5. Export the results as **JSON, CSV, or Excel**, or pull them via the API.

That's it — no proxies, browsers, or selectors to configure.

### ⬇️ Input

Configure the crawl from the Console **Input** tab or via the API. The only required field is `startUrl`.

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `startUrl` | string | **Yes** | — | The page to start crawling from. |
| `maxPages` | integer | No | `25` | Maximum number of pages to crawl (1 to 2000). |
| `maxDepth` | integer | No | `3` | How many links deep to follow from the start URL (0 to 20; `0` = only the start page). |
| `sameDomainOnly` | boolean | No | `true` | Only follow links on the start URL's domain. |
| `includeMarkdown` | boolean | No | `true` | Also return a Markdown version of each page. |
| `maxCharsPerPage` | integer | No | `0` | Cap the text/Markdown length per page (`0` = no limit). |

#### Example input

```json
{
  "startUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
  "maxPages": 25,
  "maxDepth": 3,
  "sameDomainOnly": true,
  "includeMarkdown": true,
  "maxCharsPerPage": 0
}
````
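
When building the input in code, it can help to check values locally before starting a paid run. The sketch below uses the defaults from the table above and the `minimum`/`maximum` bounds from the OpenAPI specification later in this document. `build_input` is a hypothetical helper, and the `http(s)` prefix check is an added assumption (the schema only requires a string).

```python
def build_input(start_url, max_pages=25, max_depth=3, same_domain_only=True,
                include_markdown=True, max_chars_per_page=0):
    """Return an Actor input dict after checking the documented ranges."""
    # Assumption: the start URL should be an absolute http(s) URL.
    if not isinstance(start_url, str) or not start_url.startswith(("http://", "https://")):
        raise ValueError("startUrl must be an absolute http(s) URL")
    if not 1 <= max_pages <= 2000:
        raise ValueError("maxPages must be between 1 and 2000")
    if not 0 <= max_depth <= 20:
        raise ValueError("maxDepth must be between 0 and 20")
    if max_chars_per_page < 0:
        raise ValueError("maxCharsPerPage must be 0 (no limit) or positive")
    return {
        "startUrl": start_url,
        "maxPages": max_pages,
        "maxDepth": max_depth,
        "sameDomainOnly": same_domain_only,
        "includeMarkdown": include_markdown,
        "maxCharsPerPage": max_chars_per_page,
    }


print(build_input("https://example.com/docs", max_pages=50))
```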

### ⬆️ Output

The Actor pushes one record per page to the dataset. In the Console **Output** tab you get a clean table (URL, depth, title, word count, links found, crawled-at); via the API you get JSON, CSV, or Excel. Example record:

```json
{
  "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
  "depth": 0,
  "title": "Web scraping for beginners",
  "description": "Learn the basics of web scraping with a step-by-step course.",
  "wordCount": 642,
  "content": "Web scraping for beginners\n\nThis course teaches you...",
  "markdown": "# Web scraping for beginners\n\nThis course teaches you...",
  "linksFound": 38,
  "crawledAt": "2026-06-30T09:42:11.004Z"
}
```
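
For downstream code, a typed wrapper around each record can make the fields easier to work with. The following is a minimal sketch, not part of the Actor: `PageRecord` is a hypothetical class, and it assumes that optional fields such as `description` or `markdown` may be missing or null.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class PageRecord:
    url: str
    depth: int
    title: str
    description: str
    word_count: int
    content: str
    markdown: Optional[str]
    links_found: int
    crawled_at: datetime

    @classmethod
    def from_item(cls, item: dict) -> "PageRecord":
        """Build a PageRecord from one dataset item (a plain dict)."""
        return cls(
            url=item["url"],
            depth=item.get("depth", 0),
            title=item.get("title") or "",
            description=item.get("description") or "",
            word_count=item.get("wordCount", 0),
            content=item.get("content") or "",
            markdown=item.get("markdown"),  # assumed absent when includeMarkdown is false
            links_found=item.get("linksFound", 0),
            # fromisoformat() accepts a trailing "Z" only on Python 3.11+, so normalize it
            crawled_at=datetime.fromisoformat(item["crawledAt"].replace("Z", "+00:00")),
        )


record = PageRecord.from_item({
    "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
    "depth": 0,
    "title": "Web scraping for beginners",
    "wordCount": 642,
    "content": "Web scraping for beginners\n\nThis course teaches you...",
    "linksFound": 38,
    "crawledAt": "2026-06-30T09:42:11.004Z",
})
print(record.title, record.word_count, record.crawled_at.year)
```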

### 🎯 Use cases

- 📚 **RAG knowledge bases** — ingest an entire docs site or help center into a vector database in one run.
- 🤖 **AI agents** — give an agent a clean Markdown snapshot of a whole site instead of one page at a time.
- 🔁 **Content migrations** — pull a site's pages into Markdown for a new CMS or static-site generator.
- 🧠 **LLM fine-tuning / context** — build a clean text corpus from a domain you control.
- 🔍 **Site audits** — get titles, word counts, and link counts for every page to spot thin or orphaned content.
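
As an illustration of the site-audit use case, the sketch below flags pages whose `wordCount` falls under a threshold. The helper `thin_pages`, the sample records, and the 200-word cutoff are all assumptions chosen for the example, not values defined by the Actor.

```python
def thin_pages(items, min_words=200):
    """Return (url, wordCount) pairs for pages below min_words, shortest first."""
    flagged = [
        (item["url"], item.get("wordCount", 0))
        for item in items
        if item.get("wordCount", 0) < min_words
    ]
    return sorted(flagged, key=lambda pair: pair[1])


# Hypothetical records shaped like the dataset items described above.
sample = [
    {"url": "https://example.com/", "wordCount": 642},
    {"url": "https://example.com/tags", "wordCount": 35},
    {"url": "https://example.com/about", "wordCount": 120},
]
for url, words in thin_pages(sample):
    print(f"{words:>5}  {url}")
```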

### 🔌 Integrations & code examples

#### Call it from the API

```bash
curl "https://api.apify.com/v2/acts/entranced_gelato~website-to-markdown-crawler/run-sync-get-dataset-items?token=<APIFY_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{ "startUrl": "https://example.com", "maxPages": 50 }'
```
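
The same call can be made from Python without extra dependencies. This sketch only builds the request, so it can be inspected without a token or network access. It passes the token in an `Authorization: Bearer` header rather than the query string, which keeps it out of URLs and logs. `build_run_request` is a hypothetical helper name.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_ID = "entranced_gelato~website-to-markdown-crawler"


def build_run_request(token: str, run_input: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for run-sync-get-dataset-items."""
    url = f"{API_BASE}/acts/{ACTOR_ID}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_run_request("<APIFY_TOKEN>", {"startUrl": "https://example.com", "maxPages": 50})
print(request.get_method(), request.full_url)

# Sending it requires a real token and network access:
# with urllib.request.urlopen(request) as response:
#     items = json.load(response)
```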

#### Python (Apify client)

```python
from apify_client import ApifyClient

client = ApifyClient("<APIFY_TOKEN>")
run = client.actor("entranced_gelato/website-to-markdown-crawler").call(
    run_input={"startUrl": "https://docs.example.com", "maxPages": 100}
)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    # "markdown" may be missing or empty when includeMarkdown is false
    print(item["url"], "->", len(item.get("markdown") or ""), "chars")
```
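
If the goal is to save each page as a file, for example in a content migration, each record's URL needs to be mapped to a file path. The sketch below shows one possible mapping. `markdown_path` is a hypothetical helper, and it ignores query strings, so URLs that differ only by query would collide.

```python
from urllib.parse import urlparse


def markdown_path(url: str) -> str:
    """Map a page URL to a relative .md path such as 'host/section/page.md'."""
    parsed = urlparse(url)
    # Drop empty segments and anything that could escape the output folder.
    parts = [p for p in parsed.path.split("/") if p and p not in (".", "..")]
    if parts:
        last = parts[-1]
        for ext in (".html", ".htm"):
            if last.lower().endswith(ext):
                last = last[: -len(ext)]
                break
        parts[-1] = last or "index"
    else:
        parts = ["index"]  # the site root becomes index.md
    return "/".join([parsed.netloc] + parts) + ".md"


print(markdown_path("https://docs.example.com/"))
print(markdown_path("https://docs.example.com/guide/intro.html"))
```

Each record's `markdown` (or `content`) value can then be written to the returned path under an output directory of your choice.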

#### LangChain (RAG ingestion)

```python
from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

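# ApifyWrapper reads the token from the APIFY_API_TOKEN environment variable
apify = ApifyWrapper()
loader = apify.call_actor(
    actor_id="entranced_gelato/website-to-markdown-crawler",
    run_input={"startUrl": "https://docs.example.com", "maxPages": 100},
    dataset_mapping_function=lambda item: Document(
        # Fall back to plain text when "markdown" is missing or empty
        page_content=item.get("markdown") or item.get("content") or "",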
        metadata={"source": item["url"], "title": item.get("title")},
    ),
)
docs = loader.load()  # feed straight into a vector store
```
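
Before embedding, long pages are usually split into smaller chunks. Because the output is Markdown, headings are a natural place to split. The following is a minimal, dependency-free sketch of that idea. `split_by_headings` is a hypothetical helper, and it does not account for `#` lines inside fenced code blocks, which a production splitter would need to handle.

```python
import re
from typing import List

# An ATX heading: one to six '#' characters followed by whitespace.
HEADING = re.compile(r"^#{1,6}\s")


def split_by_headings(markdown: str) -> List[str]:
    """Split Markdown into chunks that each start at a heading line."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if HEADING.match(line) and current:
            # A new heading closes the chunk collected so far.
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


page = "# Web scraping for beginners\n\nThis course teaches you...\n\n## Next steps\nRead on."
for chunk in split_by_headings(page):
    print(repr(chunk))
```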

#### MCP — add it to Claude, Cursor, or any agent

The Actor is exposed over the **Model Context Protocol**, so AI agents can call it as a tool. Point your MCP client at Apify's MCP server:

```json
{
  "mcpServers": {
    "apify": {
      "command": "npx",
      "args": ["-y", "@apify/actors-mcp-server", "--actors", "entranced_gelato/website-to-markdown-crawler"],
      "env": { "APIFY_TOKEN": "<APIFY_TOKEN>" }
    }
  }
}
```

Also integrates with **LlamaIndex**, **Make**, **Zapier**, and **n8n** — start a crawl from any flow and route the clean output anywhere.

### 🧰 Want more? Pair it with the rest of the suite

- 📄 **[AI Web Page Reader](https://apify.com/entranced_gelato/ai-web-page-reader)** — read a single URL into clean text + Markdown (the per-page version of this crawler).
- 📑 **[AI Document Reader](https://apify.com/entranced_gelato/ai-document-reader)** — turn a PDF, DOCX, TXT, or HTML document into LLM-ready text + Markdown.
- 🧭 **[AI Competitive Brief Generator](https://apify.com/entranced_gelato/ai-competitive-brief-generator)** — turn a competitor or prospect URL into a structured competitive, SEO, or sales brief.

### ❓ FAQ

**Is web crawling legal?** Crawling publicly available pages is generally legal, but how you use the data may be subject to the target site's terms and applicable laws (copyright, privacy, etc.). This Actor reads only public pages and respects the limits you set. You are responsible for how you use the output — when in doubt, consult a lawyer.

**Does it run JavaScript-heavy sites?** It fetches server-rendered HTML. Pages that render entirely client-side may return limited content; for those, a headless-browser crawler is a better fit.

**How do I keep costs down?** Set `maxPages` and `maxDepth`. The crawler never returns more than `maxPages` pages and never follows links deeper than `maxDepth`, so your spend is capped by your own configuration.

**Will it leave the site I gave it?** Not by default — `sameDomainOnly` is `true`, so it only follows links on the start URL's domain. Turn it off to follow external links too.

**Markdown or plain text?** Both. `markdown` keeps formatting for LLMs; `content` is clean plain text. Set `includeMarkdown: false` if you only want plain text.

**Can I call it from my AI agent?** Yes — it's exposed over MCP and the Apify API, so agents and automations can invoke it as a tool.

***

*Built for AI engineers, RAG/LLM developers, and automation builders who need a fast, reliable "website → Markdown" primitive.*

# Actor input Schema

## `startUrl` (type: `string`):

The website/section to crawl. The crawler follows internal links from here.

## `maxPages` (type: `integer`):

Maximum number of pages to crawl.

## `maxDepth` (type: `integer`):

How many link-hops from the start URL to follow (0 = only the start page).

## `sameDomainOnly` (type: `boolean`):

Only follow links on the same domain as the start URL.

## `includeMarkdown` (type: `boolean`):

Include a clean Markdown version of each page (ideal for LLMs).

## `maxCharsPerPage` (type: `integer`):

Optionally cap the length of content/markdown per page.

## Actor input object example

```json
{
  "startUrl": "https://example.com/docs",
  "maxPages": 25,
  "maxDepth": 3,
  "sameDomainOnly": true,
  "includeMarkdown": true,
  "maxCharsPerPage": 0
}
```

# Actor output Schema

## `results` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrl": "https://docs.apify.com/academy/web-scraping-for-beginners"
};

// Run the Actor and wait for it to finish
const run = await client.actor("entranced_gelato/website-to-markdown-crawler").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "startUrl": "https://docs.apify.com/academy/web-scraping-for-beginners" }

# Run the Actor and wait for it to finish
run = client.actor("entranced_gelato/website-to-markdown-crawler").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrl": "https://docs.apify.com/academy/web-scraping-for-beginners"
}' |
apify call entranced_gelato/website-to-markdown-crawler --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=entranced_gelato/website-to-markdown-crawler",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Website to Markdown Crawler - Full-Site Text for LLMs & RAG",
        "description": "Crawl any website from a start URL and get every page as clean text + Markdown for LLMs, RAG, and AI agents. Follows internal links with depth and page limits, strips nav and ads, and returns one structured record per page. A fast, no-config site-to-Markdown crawler.",
        "version": "0.0",
        "x-build-id": "nFwS7croASEQLnZEs"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/entranced_gelato~website-to-markdown-crawler/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-entranced_gelato-website-to-markdown-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/entranced_gelato~website-to-markdown-crawler/runs": {
            "post": {
                "operationId": "runs-sync-entranced_gelato-website-to-markdown-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/entranced_gelato~website-to-markdown-crawler/run-sync": {
            "post": {
                "operationId": "run-sync-entranced_gelato-website-to-markdown-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrl"
                ],
                "properties": {
                    "startUrl": {
                        "title": "Start URL",
                        "type": "string",
                        "description": "The website/section to crawl. The crawler follows internal links from here."
                    },
                    "maxPages": {
                        "title": "Max pages",
                        "minimum": 1,
                        "maximum": 2000,
                        "type": "integer",
                        "description": "Maximum number of pages to crawl.",
                        "default": 25
                    },
                    "maxDepth": {
                        "title": "Max depth",
                        "minimum": 0,
                        "maximum": 20,
                        "type": "integer",
                        "description": "How many link-hops from the start URL to follow (0 = only the start page).",
                        "default": 3
                    },
                    "sameDomainOnly": {
                        "title": "Same domain only",
                        "type": "boolean",
                        "description": "Only follow links on the same domain as the start URL.",
                        "default": true
                    },
                    "includeMarkdown": {
                        "title": "Return Markdown",
                        "type": "boolean",
                        "description": "Include a clean Markdown version of each page (ideal for LLMs).",
                        "default": true
                    },
                    "maxCharsPerPage": {
                        "title": "Max characters per page (0 = no limit)",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Optionally cap the length of content/markdown per page.",
                        "default": 0
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```