# LLM Markdown Crawler (`sleek_waveform/llm-markdown-crawler`) Actor

Crawl any website and extract clean, boilerplate-free Markdown optimized for LLMs, RAG pipelines, and AI training datasets. Uses Mozilla Readability to strip navigation and ads, then converts to clean Markdown. No browser required — fast and cheap.

- **URL:** https://apify.com/sleek_waveform/llm-markdown-crawler.md
- **Developed by:** [Daniel Dimitrov](https://apify.com/sleek_waveform) (community)
- **Categories:** AI
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

from $5.00 / 1,000 results

This Actor is paid per event. Instead of being charged for Apify platform usage, you pay a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, used for all kinds of web data extraction and automation tasks.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anywhere from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in the key-value store.
In Standby mode, an Actor runs a web server that can serve as a website, API, or MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## LLM Markdown Crawler

Turn any website into clean, structured Markdown ready for Large Language Models, RAG pipelines, and AI training datasets. LLM Markdown Crawler uses Mozilla's Readability algorithm to strip away navigation, ads, and boilerplate — leaving you with pure content that LLMs can consume directly.

### What does LLM Markdown Crawler do?

LLM Markdown Crawler is an Apify Actor that crawls any website and converts its pages into pristine Markdown optimized for LLM and AI workflows. It uses CheerioCrawler (no headless browser) for blazing-fast, low-cost extraction. LLM Markdown Crawler can extract:

- **Clean Markdown content** — navigation, footers, sidebars, and ads are stripped automatically
- **Page metadata** — title, author, excerpt, and word count for each page
- **Deep crawl support** — follow links up to a configurable depth with URL glob filtering
- **Structured output** — each page becomes a single JSON record ready for vector databases, fine-tuning, or RAG ingestion

### Why scrape websites for LLM data?

The web contains billions of pages of high-quality content — documentation, blog posts, research articles, and knowledge bases. Converting this content to clean Markdown is essential for modern AI workflows.

Here are just some of the ways you could use that data:

- **RAG pipelines** — build retrieval-augmented generation systems with clean, chunked content from any domain
- **LLM fine-tuning** — create high-quality training datasets from curated websites and documentation
- **Knowledge base archiving** — capture and preserve website content in a portable, searchable format
- **Content analysis** — analyze writing patterns, topics, and structure across large content collections
- **Documentation extraction** — convert technical docs, wikis, and help centers into Markdown for internal tools
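
For the RAG use case above, the Markdown each page yields usually needs to be split into size-bounded chunks before embedding. The `chunk_markdown` helper below is a hypothetical sketch (not part of the Actor) that breaks on paragraph boundaries while capping chunk size in words; a single oversized paragraph can still exceed the cap:

```python
def chunk_markdown(markdown: str, max_words: int = 200) -> list[str]:
    """Split Markdown into chunks of roughly `max_words` words,
    breaking on paragraph boundaries where possible."""
    chunks: list[str] = []
    current: list[str] = []  # words accumulated for the chunk being built
    count = 0
    for paragraph in markdown.split("\n\n"):
        words = paragraph.split()
        # Flush the current chunk if adding this paragraph would overflow it.
        if count + len(words) > max_words and current:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.extend(words)
        count += len(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each returned chunk can then be embedded and stored in a vector database alongside the page's `url` and `title` for citation.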

If you would like more inspiration on how scraping websites for LLM data could help your business, check out our [industry pages](https://apify.com/industries).

### How to scrape websites for LLM data

1. Click on **Try for free**.
2. Enter one or more website URLs in the `startUrls` field (e.g., `https://docs.example.com/`).
3. Set `maxRequestsPerCrawl`, `maxDepth`, and optionally add URL `globs` to focus the crawl.
4. Click on **Run**.
5. When LLM Markdown Crawler has finished, preview or download your data from the **Dataset** tab.

### How much will it cost to scrape websites for LLM data?

Apify gives you $5 free usage credits every month on the [Apify Free plan](https://apify.com/pricing). Because this Actor uses CheerioCrawler with no browser overhead, those credits cover approximately **10,000 pages per month**, completely free!

But if you need to get more data regularly from websites, you should grab an Apify subscription. We recommend our [$49/month Personal plan](https://apify.com/pricing), which gets you up to **100,000 pages every month**!

Or get **1,000,000+ pages** for $499 with the [Team plan](https://apify.com/pricing) — wow!

### Input parameters for LLM Markdown Crawler

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `startUrls` | Array | ✅ | — | List of URLs to start crawling |
| `maxRequestsPerCrawl` | Integer | ❌ | 100 | Maximum number of pages to process per crawl |
| `maxDepth` | Integer | ❌ | 1 | How many link levels deep to follow from start URLs |
| `globs` | Array | ❌ | `[]` | URL glob patterns to restrict which pages are crawled |
| `includeMetadata` | Boolean | ❌ | `true` | Whether to extract author, excerpt, and word count metadata |

### Output from LLM Markdown Crawler

Each crawled page is stored as a JSON record in the Actor's dataset:

```json
{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started Guide",
  "markdown": "## Getting Started\n\nThis guide walks you through setting up...",
  "excerpt": "This guide walks you through setting up the platform in under 5 minutes.",
  "author": "Jane Smith",
  "wordCount": 1247
}
```

| Field | Type | Description |
|-------|------|-------------|
| `url` | String | The page URL that was crawled |
| `title` | String | Page title extracted from the document |
| `markdown` | String | Clean Markdown content with boilerplate removed |
| `excerpt` | String | Short summary of the page content (when metadata enabled) |
| `author` | String | Author name if detected in the page (when metadata enabled) |
| `wordCount` | Number | Estimated word count of the extracted content |
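
Once a run finishes, the dataset items can be aggregated directly from the fields documented above. The `summarize_items` helper below is an illustrative sketch that assumes items shaped like the example record:

```python
def summarize_items(items: list[dict]) -> dict:
    """Aggregate crawled pages into simple corpus statistics plus a
    title -> url index, using the output fields documented above."""
    total = sum(item.get("wordCount", 0) for item in items)
    index = {item["title"]: item["url"] for item in items if "title" in item}
    return {
        "pages": len(items),
        "totalWords": total,
        "avgWords": total / len(items) if items else 0,
        "index": index,
    }
```

This kind of summary is useful for sanity-checking a crawl (e.g. spotting near-empty pages) before feeding the `markdown` field into downstream pipelines.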

### Tips for scraping websites for LLM data

- **Use URL globs** to restrict crawling to relevant sections (e.g., `["https://docs.example.com/guide/**"]`) and avoid irrelevant pages
- **Start with `maxDepth: 1`** to preview results before running deeper crawls — this saves credits and lets you validate output quality
- **Set `includeMetadata: false`** if you only need the Markdown content, for a slight speed improvement
- **Combine multiple start URLs** to build datasets spanning multiple websites in a single run
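
To preview which URLs a glob would keep before spending credits on a real crawl, a rough local check can use Python's `fnmatch`. Note this is only an approximation: `fnmatch` wildcards match across `/`, so it does not distinguish `*` from `**` the way minimatch-style globs do.

```python
from fnmatch import fnmatch

def preview_globs(urls: list[str], globs: list[str]) -> list[str]:
    """Return the URLs matching at least one glob pattern.
    An empty glob list keeps everything, mirroring the Actor's default."""
    if not globs:
        return list(urls)
    return [u for u in urls if any(fnmatch(u, g) for g in globs)]
```

Running a list of candidate links through this before setting the `globs` input helps confirm the patterns are neither too broad nor too narrow.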

### Is it legal to scrape websites for LLM data?

Note that personal data is protected by GDPR in the European Union and by other regulations around the world. You should not scrape personal data unless you have a legitimate reason to do so. If you're unsure whether your reason is legitimate, consult your lawyers. We also recommend that you read our blog post: [is web scraping legal?](https://blog.apify.com/is-web-scraping-legal/)

### Webhook Integration

Pass an optional `webhookUrl` in the input to receive a POST notification when the run finishes:

```json
{
  "webhookUrl": "https://your-server.com/webhook"
}
```

**Payload sent by Apify:**

```json
{
  "eventType": "ACTOR.RUN.SUCCEEDED",
  "eventData": { "actorId": "...", "actorRunId": "..." },
  "resource": { "id": "...", "status": "SUCCEEDED", "defaultDatasetId": "..." }
}
```

The webhook fires on `SUCCEEDED`, `FAILED`, `TIMED_OUT`, and `ABORTED` events.
Use it to trigger downstream pipelines, Zapier, Make.com, or any HTTP endpoint.

# Actor input Schema

## `startUrls` (type: `array`):

List of URLs to start crawling.

## `maxRequestsPerCrawl` (type: `integer`):

Maximum number of pages to crawl.

## `maxDepth` (type: `integer`):

How deep to follow links (0 for start URLs only).

## `globs` (type: `array`):

Patterns to restrict crawling (e.g. `https://example.com/blog/**`). Use carefully.

## `includeMetadata` (type: `boolean`):

Extract token estimates, author, excerpt, and date.

## `webhookUrl` (type: `string`):

Optional URL to receive a POST notification when the run completes. The payload includes actorRunId, datasetId, status, and a direct link to results.

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://apify.com"
    }
  ],
  "maxRequestsPerCrawl": 50,
  "maxDepth": 2,
  "globs": [],
  "includeMetadata": true
}
```

# Actor output Schema

## `results` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://apify.com"
        }
    ],
    "maxRequestsPerCrawl": 50,
    "maxDepth": 2
};

// Run the Actor and wait for it to finish
const run = await client.actor("sleek_waveform/llm-markdown-crawler").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": [{ "url": "https://apify.com" }],
    "maxRequestsPerCrawl": 50,
    "maxDepth": 2,
}

# Run the Actor and wait for it to finish
run = client.actor("sleek_waveform/llm-markdown-crawler").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://apify.com"
    }
  ],
  "maxRequestsPerCrawl": 50,
  "maxDepth": 2
}' |
apify call sleek_waveform/llm-markdown-crawler --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=sleek_waveform/llm-markdown-crawler",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "LLM Markdown Crawler",
        "description": "Crawl any website and extract clean, boilerplate-free Markdown optimized for LLMs, RAG pipelines, and AI training datasets. Uses Mozilla Readability to strip navigation and ads, then converts to clean Markdown. No browser required — fast and cheap.",
        "version": "1.0",
        "x-build-id": "a58bZVxI9pAXQF8KT"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/sleek_waveform~llm-markdown-crawler/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-sleek_waveform-llm-markdown-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/sleek_waveform~llm-markdown-crawler/runs": {
            "post": {
                "operationId": "runs-sync-sleek_waveform-llm-markdown-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/sleek_waveform~llm-markdown-crawler/run-sync": {
            "post": {
                "operationId": "run-sync-sleek_waveform-llm-markdown-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrls"
                ],
                "properties": {
                    "startUrls": {
                        "title": "Start URLs",
                        "type": "array",
                        "description": "List of URLs to start crawling.",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "maxRequestsPerCrawl": {
                        "title": "Max Requests",
                        "minimum": 1,
                        "maximum": 10000,
                        "type": "integer",
                        "description": "Maximum number of pages to crawl.",
                        "default": 100
                    },
                    "maxDepth": {
                        "title": "Max Crawl Depth",
                        "minimum": 0,
                        "maximum": 10,
                        "type": "integer",
                        "description": "How deep to follow links (0 for start URLs only).",
                        "default": 1
                    },
                    "globs": {
                        "title": "URL Glob Patterns",
                        "type": "array",
                        "description": "Patterns to restrict crawling (e.g. https://example.com/blog/**). Use carefully.",
                        "items": {
                            "type": "string"
                        },
                        "default": []
                    },
                    "includeMetadata": {
                        "title": "Include Metadata",
                        "type": "boolean",
                        "description": "Extract token estimates, author, excerpt, and date.",
                        "default": true
                    },
                    "webhookUrl": {
                        "title": "Webhook URL",
                        "type": "string",
                        "description": "Optional URL to receive a POST notification when the run completes. The payload includes actorRunId, datasetId, status, and a direct link to results."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
