# Website to Markdown & Text Crawler — AI / RAG Data (`logiover/website-text-markdown-crawler`) Actor

Crawl an entire website and extract clean, boilerplate-free main content as Markdown and plain text — ready for LLM training, RAG pipelines, embeddings and AI agents. No login, no browser, one row per page.

- **URL**: https://apify.com/logiover/website-text-markdown-crawler.md
- **Developed by:** [Logiover](https://apify.com/logiover) (community)
- **Categories:** AI, Developer tools
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

from $4.00 / 1,000 results

This Actor is priced per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.
Because this Actor supports Apify Store discounts, the price decreases on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event
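
As a rough worked example of the per-result pricing above, the sketch below estimates the event cost of a run from the listed starting price of $4.00 per 1,000 results. The function name and the assumption that each crawled page produces exactly one billed result are illustrative only; the price you actually pay depends on your subscription plan.

```python
from decimal import Decimal

# Starting price quoted in this section: $4.00 per 1,000 results.
PRICE_PER_1000_RESULTS = Decimal("4.00")


def estimate_run_cost(pages: int, price_per_1000: Decimal = PRICE_PER_1000_RESULTS) -> Decimal:
    """Estimate the event cost in USD, assuming one billed result per page."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return (price_per_1000 * pages / 1000).quantize(Decimal("0.01"))


print(estimate_run_cost(2500))  # 10.00
```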

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Website to Markdown & Text Crawler — AI, RAG & LLM Data 📄

**Turn any website into clean Markdown and plain text for AI.** This **website content crawler** crawls an entire site, strips away navigation, headers, footers, ads and scripts, and exports the **boilerplate-free main content** of every page as **Markdown** and **plain text** — ready to feed straight into **LLM training sets, RAG pipelines, embeddings, vector databases and AI agents**.

Give it one URL — it discovers and extracts every page automatically. **No login, no headless browser, one clean row per page.**

> Looking to **scrape a website for an LLM**, **convert HTML to Markdown**, build **RAG data**, or **extract text from a website** at scale? That's exactly what this Actor does.

---

### ✨ Key features

- 🕷️ **Full-site crawl** — start from one URL and follow internal links across the whole domain.
- 📝 **Clean Markdown + plain text** — main content only, with nav/header/footer/sidebar/scripts removed.
- 🔗 **Absolute links & images** — relative URLs are rewritten to absolute, so the Markdown is portable.
- 🧠 **Built for AI / RAG / LLM** — chunk-ready output for embeddings, fine-tuning and retrieval.
- 🏷️ **Rich page metadata** — title, meta description, H1, language, canonical and word count.
- ⚡ **Fast & cheap** — pure HTTP, no browser, high concurrency.

### 💡 Use cases

- **RAG & knowledge bases** — turn docs, blogs and help centers into clean Markdown chunks for retrieval-augmented generation.
- **LLM fine-tuning datasets** — collect high-quality text at scale from any set of websites.
- **AI agents & chatbots** — feed your agent fresh, structured website content.
- **Content migration & archiving** — export an entire website to Markdown.
- **Semantic search & embeddings** — generate clean text to embed into a vector database (Pinecone, Weaviate, pgvector, …).

### 📦 What you get

One row per crawled page:

| Field | Description |
|-------|-------------|
| `url` | Page URL |
| `title` | Page title |
| `metaDescription` | Meta description |
| `h1` | First H1 heading |
| `lang` | Page language |
| `canonical` | Canonical URL |
| `wordCount` | Word count of the main content |
| `text` | Clean main-content text (boilerplate removed) |
| `markdown` | The same content converted to Markdown |
| `html` | Cleaned main-content HTML (optional) |
| `crawledAt` | ISO 8601 timestamp |

#### Example output

```json
{
  "url": "https://docs.example.com/getting-started",
  "title": "Getting Started",
  "metaDescription": "Set up the SDK in 5 minutes.",
  "h1": "Getting Started",
  "wordCount": 812,
  "text": "Getting Started Install the package...",
  "markdown": "## Getting Started\n\nInstall the package...",
  "crawledAt": "2026-05-25T14:13:00.000Z"
}
````
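
For the archiving use case, a row like the one above can be written out as a standalone Markdown file. The following is a minimal sketch: the path scheme and the comment header are hypothetical choices, not something the Actor produces, and only the `url`, `markdown` and `crawledAt` fields from the output table are assumed.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def markdown_path_for(url: str) -> str:
    """Map a page URL to a relative .md path (hypothetical naming scheme)."""
    parsed = urlparse(url)
    path = parsed.path.strip("/") or "index"
    return str(PurePosixPath(parsed.netloc) / f"{path}.md")


def render_page(item: dict) -> str:
    """Prefix the page's Markdown with a small comment header built from its metadata."""
    header = [
        f"<!-- source: {item['url']} -->",
        f"<!-- crawledAt: {item.get('crawledAt', 'unknown')} -->",
    ]
    return "\n".join(header) + "\n\n" + item.get("markdown", "")


item = {
    "url": "https://docs.example.com/getting-started",
    "markdown": "## Getting Started\n\nInstall the package...",
    "crawledAt": "2026-05-25T14:13:00.000Z",
}
print(markdown_path_for(item["url"]))
print(render_page(item))
```

Writing `render_page(item)` to the path returned by `markdown_path_for(item["url"])` gives one file per crawled page, mirroring the site's URL structure.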

### 🚀 How to use it

1. Click **Try for free / Start**.
2. Paste one or more website URLs into **Start URLs**.
3. (Optional) Set **Max pages to crawl** — use `0` to crawl the whole site.
4. (Optional) Toggle **Save Markdown**, **Save plain text**, **Save HTML**.
5. Click **Save & Start**.
6. Export your dataset as **JSON, CSV, Excel or via API**, or pull it straight into your AI pipeline.
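
For step 6, the dataset can also be fetched over HTTP. The sketch below only builds the export URL; it assumes the Apify API v2 dataset items endpoint (`/datasets/{datasetId}/items` with a `format` query parameter) and uses a hypothetical helper name. Authentication (for example an `Authorization: Bearer <YOUR_API_TOKEN>` header) is left out, so check the REST API reference before relying on it.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"
# Export formats named in this README, as `format` values (Excel is assumed to be "xlsx").
EXPORT_FORMATS = {"json", "csv", "xlsx", "html"}


def dataset_export_url(dataset_id: str, fmt: str = "json") -> str:
    """Build the URL that returns a dataset's items in the requested format."""
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"{API_BASE}/datasets/{dataset_id}/items?{urlencode({'format': fmt})}"


print(dataset_export_url("<DATASET_ID>", "csv"))
```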

### ⚙️ Input

| Option | Description | Default |
|--------|-------------|---------|
| `startUrls` | Websites to crawl | – (required) |
| `maxPagesToCrawl` | Max pages per run (`0` = whole site) | `1000` |
| `saveMarkdown` | Include Markdown output | `true` |
| `saveText` | Include plain-text output | `true` |
| `saveHtml` | Include cleaned main-content HTML | `false` |
| `maxConcurrency` | Parallel requests | `10` |

#### Example input

```json
{
  "startUrls": [{ "url": "https://docs.apify.com" }],
  "maxPagesToCrawl": 2000,
  "saveMarkdown": true,
  "saveText": true
}
```

### 🔍 How it works

The crawler follows internal links within the same domain as your Start URLs. For each page it removes scripts, styles, navigation, headers, footers and sidebars, isolates the main content (`<main>` / `<article>` / body), rewrites relative links and images to absolute URLs, and exports the result as clean **text** and **Markdown**. It's pure HTTP — fast and cheap, with no headless browser.
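
Two of the steps described above, dropping boilerplate elements and rewriting relative links to absolute URLs, can be sketched with the Python standard library. This is an illustration of the idea only, not the Actor's implementation: the tag list, class name and sample HTML are assumptions, and the sketch does not isolate `<main>` / `<article>` or produce Markdown.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tags treated as boilerplate in this sketch (an assumption, not the Actor's exact list).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class MainTextExtractor(HTMLParser):
    """Collect visible text outside boilerplate tags and resolve link targets."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.skip_depth = 0  # greater than 0 while inside a boilerplate element
        self.parts = []      # text fragments from the remaining content
        self.links = []      # absolute URLs of links in the remaining content

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "a" and self.skip_depth == 0:
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


html = """<html><body><nav><a href="/">Home</a></nav>
<main><h1>Guide</h1><p>Read the <a href="../api/">API docs</a> first</p></main>
<footer>Copyright</footer><script>track()</script></body></html>"""

parser = MainTextExtractor("https://docs.example.com/guide/intro")
parser.feed(html)
print(" ".join(parser.parts))
print(parser.links)
```

The navigation link, footer and script are skipped, and the relative `../api/` link is resolved against the page URL.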

### 🧰 Tips & best practices

- Set `maxPagesToCrawl` to `0` to capture an entire site for a knowledge base.
- Keep `saveText` and `saveMarkdown` on for maximum flexibility downstream; turn on `saveHtml` if you also need the cleaned main-content HTML.
- Use the `wordCount` field to filter out thin pages before embedding.
- Lower `maxConcurrency` if a site rate-limits you.
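
The `wordCount` tip can be applied with a small filter like the one below. It is a sketch under assumptions: the helper name and the threshold of 50 words are arbitrary, and `wordCount` is coerced with `int()` because the example output shows a number while the output schema lists the field as a string.

```python
def drop_thin_pages(items, min_words=50):
    """Keep only items whose wordCount is at least min_words."""
    kept = []
    for item in items:
        try:
            words = int(item.get("wordCount", 0))
        except (TypeError, ValueError):
            words = 0  # treat missing or malformed counts as empty pages
        if words >= min_words:
            kept.append(item)
    return kept


pages = [
    {"url": "https://docs.example.com/getting-started", "wordCount": 812},
    {"url": "https://docs.example.com/tags", "wordCount": "12"},
]
print([page["url"] for page in drop_thin_pages(pages)])
```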

### ❓ FAQ

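**Does it render JavaScript?** No — it parses server-rendered HTML, which keeps runs fast and cheap. Content that a page builds client-side with JavaScript is not captured, so the Actor works best on websites and documentation sites that serve their content in the initial HTML.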

**Is the Markdown clean enough for RAG?** Yes — navigation, headers, footers, ads and scripts are stripped, and links/images are absolute, so the output is ready to chunk and embed.
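
One simple way to chunk the `markdown` field before embedding is to split at headings and cap the chunk size. The sketch below is one possible approach, not part of the Actor; the function name and the 1,000-character limit are assumptions. It cuts at a fixed character count, so a chunk can end mid-word, and it treats any line starting with `#` as a heading, including comment lines inside fenced code.

```python
import re

# A Markdown ATX heading: one to six '#' characters followed by a space.
HEADING = re.compile(r"^#{1,6} ", re.MULTILINE)


def chunk_markdown(markdown: str, max_chars: int = 1000) -> list:
    """Split Markdown at headings, then cap each chunk at max_chars characters."""
    starts = [match.start() for match in HEADING.finditer(markdown)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first heading
    ends = starts[1:] + [len(markdown)]
    chunks = []
    for start, end in zip(starts, ends):
        section = markdown[start:end].strip()
        for offset in range(0, len(section), max_chars):
            chunks.append(section[offset:offset + max_chars])
    return chunks


sample = "## Install\n\nRun the installer.\n\n## Configure\n\nEdit the config file."
for chunk in chunk_markdown(sample):
    print(repr(chunk))
```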

**How do I crawl the whole site?** Set `maxPagesToCrawl` to `0`.

**Can I crawl multiple sites at once?** Yes — add several Start URLs.

**What formats can I export?** JSON, CSV, Excel and HTML, or fetch the results through the REST API.

### 🔗 Related Actors by the same author

- **Sitemap to URL Crawler** — extract every URL from a sitemap.xml to feed this crawler.
- **Website SEO Audit Crawler** — on-page SEO audit for every page.
- **Website Image & Media Crawler** — extract all images and media for multimodal datasets.
- **JSON-LD Schema & Meta Tag Extractor** — structured data and meta tags from any page.

***

### Changelog

- **2026-05-25** — Maintenance & reliability pass: pulled the latest source and rebuilt the Actor on the current base image; build verified.

*Last reviewed: 2026-05-25.*

# Actor input Schema

## `startUrls` (type: `array`):

Websites to crawl. The crawler follows internal links from each start URL and extracts the content of every page.

## `maxPagesToCrawl` (type: `integer`):

Maximum pages to crawl per run. Set 0 for no limit (crawl the whole site).

## `saveMarkdown` (type: `boolean`):

Include the page's main content converted to Markdown.

## `saveText` (type: `boolean`):

Include the page's main content as plain text.

## `saveHtml` (type: `boolean`):

Include the cleaned main-content HTML.

## `maxConcurrency` (type: `integer`):

Number of parallel requests. Lower this if the target site rate-limits you.

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://docs.apify.com"
    }
  ],
  "maxPagesToCrawl": 1000,
  "saveMarkdown": true,
  "saveText": true,
  "saveHtml": false,
  "maxConcurrency": 10
}
```
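
When building this input object in code, a small helper can apply the defaults from the README table and the constraints from the input schema (`startUrls` is required, `maxPagesToCrawl` is at least 0, `maxConcurrency` is at least 1). The helper below is a hypothetical convenience wrapper, not part of any Apify client library.

```python
def build_input(start_urls, max_pages=1000, save_markdown=True, save_text=True,
                save_html=False, max_concurrency=10):
    """Build an Actor input dict from plain URL strings, validating basic constraints."""
    if not start_urls:
        raise ValueError("startUrls is required and must not be empty")
    if max_pages < 0:
        raise ValueError("maxPagesToCrawl must be 0 (no limit) or greater")
    if max_concurrency < 1:
        raise ValueError("maxConcurrency must be at least 1")
    return {
        "startUrls": [{"url": url} for url in start_urls],
        "maxPagesToCrawl": max_pages,
        "saveMarkdown": save_markdown,
        "saveText": save_text,
        "saveHtml": save_html,
        "maxConcurrency": max_concurrency,
    }


print(build_input(["https://docs.apify.com"]))
```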

# Actor output Schema

## `url` (type: `string`):

Page URL

## `title` (type: `string`):

Page title

## `wordCount` (type: `string`):

Word count of main content

## `metaDescription` (type: `string`):

Meta description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://docs.apify.com"
        }
    ],
    "maxPagesToCrawl": 1000,
    "maxConcurrency": 10
};

// Run the Actor and wait for it to finish
const run = await client.actor("logiover/website-text-markdown-crawler").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": [{ "url": "https://docs.apify.com" }],
    "maxPagesToCrawl": 1000,
    "maxConcurrency": 10,
}

# Run the Actor and wait for it to finish
run = client.actor("logiover/website-text-markdown-crawler").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://docs.apify.com"
    }
  ],
  "maxPagesToCrawl": 1000,
  "maxConcurrency": 10
}' |
apify call logiover/website-text-markdown-crawler --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=logiover/website-text-markdown-crawler",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Website to Markdown & Text Crawler — AI / RAG Data",
        "description": "Crawl an entire website and extract clean, boilerplate-free main content as Markdown and plain text — ready for LLM training, RAG pipelines, embeddings and AI agents. No login, no browser, one row per page.",
        "version": "1.0",
        "x-build-id": "3v2fPEhxLd1c4swSE"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/logiover~website-text-markdown-crawler/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-logiover-website-text-markdown-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/logiover~website-text-markdown-crawler/runs": {
            "post": {
                "operationId": "runs-sync-logiover-website-text-markdown-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/logiover~website-text-markdown-crawler/run-sync": {
            "post": {
                "operationId": "run-sync-logiover-website-text-markdown-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrls"
                ],
                "properties": {
                    "startUrls": {
                        "title": "Start URLs",
                        "type": "array",
                        "description": "Websites to crawl. The crawler follows internal links from each start URL and extracts the content of every page.",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "maxPagesToCrawl": {
                        "title": "Max pages to crawl",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Maximum pages to crawl per run. Set 0 for no limit (crawl the whole site)."
                    },
                    "saveMarkdown": {
                        "title": "Save Markdown",
                        "type": "boolean",
                        "description": "Include the page's main content converted to Markdown.",
                        "default": true
                    },
                    "saveText": {
                        "title": "Save plain text",
                        "type": "boolean",
                        "description": "Include the page's main content as plain text.",
                        "default": true
                    },
                    "saveHtml": {
                        "title": "Save HTML",
                        "type": "boolean",
                        "description": "Include the cleaned main-content HTML.",
                        "default": false
                    },
                    "maxConcurrency": {
                        "title": "Max concurrency",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Number of parallel requests. Lower this if the target site rate-limits you."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
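
For projects that call the REST API directly, the `run-sync-get-dataset-items` operation from the specification above can be exercised with the Python standard library alone. This is a minimal sketch: the helper names are hypothetical, and the response is assumed to be a JSON array of dataset items. Building the request is kept separate from sending it, so the request can be inspected without a network call.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
RUN_SYNC_PATH = "/acts/logiover~website-text-markdown-crawler/run-sync-get-dataset-items"


def build_run_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Build, without sending, the POST request for the synchronous run endpoint."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{API_BASE}{RUN_SYNC_PATH}?{query}",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_and_fetch_items(token: str, actor_input: dict) -> list:
    """Send the request and decode the body, assumed here to be a JSON array of items."""
    with urllib.request.urlopen(build_run_request(token, actor_input)) as response:
        return json.load(response)


request = build_run_request(
    "<YOUR_API_TOKEN>",
    {"startUrls": [{"url": "https://docs.apify.com"}], "maxPagesToCrawl": 10},
)
print(request.get_method(), request.full_url)
```

Because this endpoint waits for the run to finish, the asynchronous `/runs` operation in the same specification is likely a better fit for large crawls.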
