# Website to Markdown - Clean LLM-Ready Content (`ambitious_door/web-to-markdown`) Actor

Convert any webpage into clean markdown stripped of navigation, ads, and boilerplate. Perfect for RAG pipelines, LLM context, and content extraction. Token counts included.

- **URL**: https://apify.com/ambitious\_door/web-to-markdown.md
- **Developed by:** [C. K.](https://apify.com/ambitious_door) (community)
- **Categories:** AI
- **Stats:** 2 total users, 1 monthly users, 0.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage, which gets cheaper the higher subscription plan you have.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Website to Markdown — Clean, LLM-Ready Content Extraction

Convert any webpage or website into clean markdown, stripped of navigation, ads, sidebars, and boilerplate. Output drops straight into any RAG pipeline, LLM context window, or vector store without cleanup. Token counts included so you can plan your embedding budget.

### What it does

Most web scrapers give you raw HTML or a wall of unstructured text. You then spend hours cleaning, reformatting, and fixing broken context. This Actor eliminates that step.

Give it a URL. It crawls the site, strips all chrome (navigation, sidebars, footers, cookie banners), and converts each page to clean markdown preserving headings, code blocks, tables, lists, and links. Every page includes a token count (cl100k_base encoding) so you know exactly what it costs to embed or send to an LLM.

### Output format

| Field | Type | Description |
|-------|------|-------------|
| `url` | string | Source URL of the page |
| `title` | string | Page title |
| `content` | string | Clean markdown content |
| `token_count` | integer | Token count (cl100k_base encoding) |
| `content_length` | integer | Character count |
| `meta_description` | string | Page meta description (if available) |

### Input parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `startUrl` | string | — | URL to start crawling from |
| `urls` | array | — | List of specific URLs to convert (batch mode) |
| `maxPages` | integer | `50` | Maximum pages to convert |
| `crawlSameDomain` | boolean | `true` | Stay within the start URL's domain |
| `pathPrefix` | string | `""` | Only crawl paths starting with this prefix |
| `outputFormat` | string | `"markdown"` | `"markdown"` or `"plain_text"` |
| `includeMetadata` | boolean | `true` | Include token count and meta description |

### Example usage

#### Single page

```json
{
    "startUrl": "https://docs.python.org/3/library/asyncio.html",
    "maxPages": 1
}
````

#### Batch conversion

```json
{
    "urls": [
        "https://example.com/page-1",
        "https://example.com/page-2",
        "https://example.com/page-3"
    ],
    "maxPages": 3
}
```

#### Full site crawl

```json
{
    "startUrl": "https://fastapi.tiangolo.com/",
    "maxPages": 100,
    "pathPrefix": "/tutorial/"
}
```
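To reason about which pages a crawl like the one above would keep, the sketch below shows one plausible reading of `crawlSameDomain` and `pathPrefix`. The function `in_scope` is hypothetical, and the exact hostname comparison is an assumption; the Actor may treat subdomains or trailing slashes differently.

```python
from urllib.parse import urlparse


def in_scope(url: str, start_url: str, same_domain: bool = True, path_prefix: str = "") -> bool:
    """Return True if `url` would pass a same-domain and path-prefix filter."""
    candidate, start = urlparse(url), urlparse(start_url)
    # Assumption: "same domain" means an identical hostname.
    if same_domain and candidate.hostname != start.hostname:
        return False
    # An empty prefix matches every path.
    return candidate.path.startswith(path_prefix)


start = "https://fastapi.tiangolo.com/"
print(in_scope("https://fastapi.tiangolo.com/tutorial/first-steps/", start, path_prefix="/tutorial/"))  # True
print(in_scope("https://fastapi.tiangolo.com/advanced/", start, path_prefix="/tutorial/"))  # False
print(in_scope("https://example.com/tutorial/", start, path_prefix="/tutorial/"))  # False
```

Note that with a `pathPrefix` of `/tutorial/`, the start URL itself (`/`) falls outside the prefix under this reading, so it may be worth starting the crawl at a URL that already matches the prefix.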

### Pricing

This Actor uses the **pay-per-event** model. You are charged per page successfully converted to markdown. No charge for pages that are skipped (empty, non-content).

### How it works

1. **Crawl** — Crawlee handles the URL queue, deduplication, rate limiting, and robots.txt compliance.
2. **Clean** — Strips navigation, sidebars, footers, cookie banners, and boilerplate using curated selectors. Falls back to `<article>`, `<main>`, or `<body>`.
3. **Convert** — Transforms clean HTML to structured markdown, preserving headings, code blocks, tables, lists, and links.
4. **Count** — Uses `cl100k_base` (GPT-4 / modern embedding encoding) for accurate token counts.
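
The clean and convert steps can be illustrated with a deliberately small sketch. It is not the Actor's code: it uses the standard library's `html.parser` instead of BeautifulSoup, a hypothetical `CHROME_TAGS` set instead of curated selectors, and it handles only headings, paragraphs, and list items (no links, tables, code blocks, or the `<article>`/`<main>`/`<body>` fallback).

```python
from html.parser import HTMLParser

# Hypothetical stand-ins for the Actor's curated selectors.
CHROME_TAGS = {"nav", "aside", "footer", "script", "style"}
BLOCK_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### ", "p": "", "li": "- "}


class MarkdownSketch(HTMLParser):
    """Drop page chrome and emit one markdown line per block element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a chrome element
        self.lines = []      # finished markdown blocks
        self.buffer = []     # raw text of the block being built
        self.prefix = ""     # markdown prefix for the current block

    def handle_starttag(self, tag, attrs):
        if tag in CHROME_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in BLOCK_PREFIX:
            self.flush()
            self.prefix = BLOCK_PREFIX[tag]

    def handle_endtag(self, tag):
        if tag in CHROME_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in BLOCK_PREFIX:
            self.flush()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.buffer.append(data)

    def flush(self):
        # Collapse whitespace so inline tags do not leave stray gaps.
        text = " ".join("".join(self.buffer).split())
        if text:
            self.lines.append(self.prefix + text)
        self.buffer, self.prefix = [], ""


def to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n\n".join(parser.lines)


page = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<main><h1>Title</h1><p>Hello <b>world</b>.</p><ul><li>One</li></ul></main>"
    "<footer>Copyright</footer></body></html>"
)
print(to_markdown(page))
```

For the count step, with `tiktoken` installed, `len(tiktoken.get_encoding("cl100k_base").encode(markdown))` gives a token count in the encoding named above.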

### Responsible use

- This Actor **respects `robots.txt`** by default (enforced by Crawlee).
- Crawlee's built-in autoscaling keeps request rates reasonable.
- **You are responsible** for ensuring your use complies with the target site's Terms of Service.
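
If you want to predict which URLs a `robots.txt` file would exclude before starting a run, Python's standard `urllib.robotparser` can evaluate the rules locally. This is a sketch under assumptions: `is_allowed` is a hypothetical helper, the rules are a made-up example, and Crawlee's own matching may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Evaluate robots.txt rules against a URL without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only.
rules = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(rules, "https://example.com/docs/"))      # True
print(is_allowed(rules, "https://example.com/private/x"))  # False
```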

### Built with

- [Crawlee](https://crawlee.dev/python/) for reliable crawling
- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) for HTML parsing
- [tiktoken](https://github.com/openai/tiktoken) for token counting

# Actor input Schema

## `startUrl` (type: `string`):

The URL to start crawling from.

## `urls` (type: `array`):

List of specific URLs to convert. Use this instead of or alongside startUrl for batch conversion.

## `maxPages` (type: `integer`):

Maximum number of pages to crawl and convert.

## `crawlSameDomain` (type: `boolean`):

Only crawl pages on the same domain as the start URL.

## `pathPrefix` (type: `string`):

Only crawl URLs whose path starts with this prefix (e.g. /docs/).

## `outputFormat` (type: `string`):

Output as markdown (preserves structure) or plain text.

## `includeMetadata` (type: `boolean`):

Include token count, content length, and meta description in output.

## Actor input object example

```json
{
  "startUrl": "https://docs.python.org/3/library/asyncio.html",
  "maxPages": 50,
  "crawlSameDomain": true,
  "outputFormat": "markdown",
  "includeMetadata": true
}
```
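
Before submitting a run, it can help to validate the input locally. The sketch below applies the limits stated in the OpenAPI specification in this document (`maxPages` between 1 and 5000, `outputFormat` either `markdown` or `plain_text`). The helper `build_input` is hypothetical, and requiring at least one URL is an assumption, since the schema marks neither `startUrl` nor `urls` as required.

```python
ALLOWED_FORMATS = ("markdown", "plain_text")


def build_input(start_url=None, urls=None, max_pages=50, crawl_same_domain=True,
                path_prefix="", output_format="markdown", include_metadata=True):
    """Build an Actor input dict, rejecting values the schema would not accept."""
    if not start_url and not urls:
        # Assumption: a run without any URL has nothing to convert.
        raise ValueError("provide start_url, urls, or both")
    if not 1 <= max_pages <= 5000:
        raise ValueError("max_pages must be between 1 and 5000")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"output_format must be one of {ALLOWED_FORMATS}")

    actor_input = {
        "maxPages": max_pages,
        "crawlSameDomain": crawl_same_domain,
        "outputFormat": output_format,
        "includeMetadata": include_metadata,
    }
    # Optional fields are only included when they carry a value.
    if start_url:
        actor_input["startUrl"] = start_url
    if urls:
        actor_input["urls"] = list(urls)
    if path_prefix:
        actor_input["pathPrefix"] = path_prefix
    return actor_input


print(build_input(start_url="https://docs.python.org/3/library/asyncio.html"))
```

The returned dictionary can be passed as `run_input` to the Python client or serialized as the JSON body of a REST call.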

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrl": "https://docs.python.org/3/library/asyncio.html"
};

// Run the Actor and wait for it to finish
const run = await client.actor("ambitious_door/web-to-markdown").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "startUrl": "https://docs.python.org/3/library/asyncio.html" }

# Run the Actor and wait for it to finish
run = client.actor("ambitious_door/web-to-markdown").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```
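
Since each dataset item carries a `token_count`, the results can be grouped into batches that stay under a token budget before embedding. The following is a minimal sketch: `batch_by_tokens`, the sample items, and the 8,000-token budget are hypothetical, and items without `token_count` (for example when `includeMetadata` is `false`) are counted as zero.

```python
from typing import Iterable


def batch_by_tokens(items: Iterable[dict], budget: int) -> list[list[dict]]:
    """Greedily group items so each batch stays within `budget` tokens.

    An item larger than the budget ends up in a batch of its own.
    """
    batches: list[list[dict]] = []
    current: list[dict] = []
    used = 0
    for item in items:
        tokens = item.get("token_count", 0)
        # Start a new batch when adding this item would exceed the budget.
        if current and used + tokens > budget:
            batches.append(current)
            current, used = [], 0
        current.append(item)
        used += tokens
    if current:
        batches.append(current)
    return batches


# Sample items with made-up values, shaped like the Actor's output.
sample = [
    {"url": "https://example.com/a", "token_count": 3000},
    {"url": "https://example.com/b", "token_count": 4000},
    {"url": "https://example.com/c", "token_count": 2500},
]
batches = batch_by_tokens(sample, budget=8000)
print([[item["url"] for item in batch] for batch in batches])
```

In a real run, the `sample` list would be replaced by the items iterated from the run's dataset.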

## CLI example

```bash
echo '{
  "startUrl": "https://docs.python.org/3/library/asyncio.html"
}' |
apify call ambitious_door/web-to-markdown --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=ambitious_door/web-to-markdown",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Website to Markdown - Clean LLM-Ready Content",
        "description": "Convert any webpage into clean markdown stripped of navigation, ads, and boilerplate. Perfect for RAG pipelines, LLM context, and content extraction. Token counts included.",
        "version": "0.1",
        "x-build-id": "MJ5XOeMeKOypXthBm"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/ambitious_door~web-to-markdown/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-ambitious_door-web-to-markdown",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/ambitious_door~web-to-markdown/runs": {
            "post": {
                "operationId": "runs-sync-ambitious_door-web-to-markdown",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/ambitious_door~web-to-markdown/run-sync": {
            "post": {
                "operationId": "run-sync-ambitious_door-web-to-markdown",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "startUrl": {
                        "title": "Start URL",
                        "type": "string",
                        "description": "The URL to start crawling from."
                    },
                    "urls": {
                        "title": "URL List",
                        "type": "array",
                        "description": "List of specific URLs to convert. Use this instead of or alongside startUrl for batch conversion.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "maxPages": {
                        "title": "Max Pages",
                        "minimum": 1,
                        "maximum": 5000,
                        "type": "integer",
                        "description": "Maximum number of pages to crawl and convert.",
                        "default": 50
                    },
                    "crawlSameDomain": {
                        "title": "Stay on Same Domain",
                        "type": "boolean",
                        "description": "Only crawl pages on the same domain as the start URL.",
                        "default": true
                    },
                    "pathPrefix": {
                        "title": "Path Prefix",
                        "type": "string",
                        "description": "Only crawl URLs whose path starts with this prefix (e.g. /docs/)."
                    },
                    "outputFormat": {
                        "title": "Output Format",
                        "enum": [
                            "markdown",
                            "plain_text"
                        ],
                        "type": "string",
                        "description": "Output as markdown (preserves structure) or plain text.",
                        "default": "markdown"
                    },
                    "includeMetadata": {
                        "title": "Include Metadata",
                        "type": "boolean",
                        "description": "Include token count, content length, and meta description in output.",
                        "default": true
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
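
For projects without a client library, the `run-sync-get-dataset-items` path from the specification above can be called directly. The sketch below builds the request with the Python standard library only; the helper names are hypothetical, and the token is passed as a query parameter because that is how the specification declares it.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/ambitious_door~web-to-markdown/run-sync-get-dataset-items"


def build_run_request(token: str, actor_input: dict) -> Request:
    """Build a POST request that runs the Actor and returns its dataset items."""
    url = f"{API_BASE}{ACTOR_PATH}?{urlencode({'token': token})}"
    body = json.dumps(actor_input).encode("utf-8")
    return Request(url, data=body, headers={"Content-Type": "application/json"}, method="POST")


def fetch_items(request: Request):
    """Send the request and decode the JSON response (performs a network call)."""
    with urlopen(request) as response:
        return json.load(response)


request = build_run_request(
    "<YOUR_API_TOKEN>",
    {"startUrl": "https://docs.python.org/3/library/asyncio.html", "maxPages": 1},
)
print(request.get_method(), request.full_url)
# Call fetch_items(request) with a real token to execute the run.
```

Because the synchronous endpoint waits for the run to finish, it suits small jobs; for large crawls the `/runs` path, which returns run information immediately, is likely the better fit.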
