# Web to Markdown — AI-Ready Text from Any URL (`wsgcjj/web-to-markdown`) Actor

Convert any web page URL to clean Markdown format. Perfect for LLM training data, RAG pipelines, and AI content processing. Extracts main content, strips ads/nav/footers.

- **URL**: https://apify.com/wsgcjj/web-to-markdown.md
- **Developed by:** [陈俊杰](https://apify.com/wsgcjj) (community)
- **Categories:** AI, Developer tools
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage, which gets cheaper the higher subscription plan you have.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
"Actor" is always written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## 🌐 Web to Markdown Converter — Apify Actor

Converts any web page URL into clean Markdown, designed for AI/LLM data-processing workflows.

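### 📋 Features

- **One-step fetch**: enter a URL and the Actor fetches the page HTML automatically
- **Smart extraction**: identifies and extracts the main content (the article or primary content block) and removes ads, navigation bars, footers, sidebars, and other clutter
- **Clean output**: converts HTML to standard Markdown using `markdownify`
- **Optional CSS selector**: restrict extraction to a specific region of the page
- **Robust error handling**: HTTP errors, timeouts, and parsing exceptions are all handled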
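
The extraction step in the list above can be illustrated with the standard library alone. This is a minimal sketch, not the Actor's implementation: `NOISE_TAGS`, `MainTextExtractor`, and `extract_main_text` are hypothetical names, the set of tags treated as clutter is an assumption, and `html.parser` stands in for the BeautifulSoup4 and `markdownify` libraries named in this README's tech stack.

```python
from html.parser import HTMLParser

# Assumption: tags treated as page clutter; the Actor's real list may differ.
NOISE_TAGS = {"nav", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect visible text while skipping anything nested in a noise tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a noise element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    """Return the text outside noise elements, one block per paragraph."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.chunks)


sample = (
    "<html><body><nav>Home | About</nav>"
    "<article><h1>Title</h1><p>Body text.</p></article>"
    "<footer>Copyright</footer></body></html>"
)
print(extract_main_text(sample))
```

The sketch drops the navigation and footer text and keeps only the article content. A real converter would also preserve heading levels, links, and lists, which is what an HTML-to-Markdown library handles.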

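### 🎯 Use cases

| Use case | Description |
|------|------|
| **LLM training data preparation** | Turn web page content into structured text for training large models |
| **RAG pipelines** | Preprocessing step between web documents and a vector database |
| **AI content processing** | Feed LLM workflows such as summarization, translation, and analysis |
| **Data archiving** | Save online articles in a readable plain-text format |
| **Web content comparison** | Extract the text of different page versions for diffing |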
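
For the RAG row above, a common preprocessing step is to split the Markdown into heading-delimited chunks before embedding. The following is a minimal sketch under stated assumptions: `split_by_heading` is a hypothetical helper that is not part of the Actor, and it only recognizes ATX (`#`) headings.

```python
import re


def split_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per ATX heading section."""
    chunks = []
    current = {"heading": "", "lines": []}

    def has_content(section):
        return bool(section["heading"]) or any(l.strip() for l in section["lines"])

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            # A new heading closes the previous section, if it held anything.
            if has_content(current):
                chunks.append(current)
            current = {"heading": match.group(2).strip(), "lines": []}
        else:
            current["lines"].append(line)
    if has_content(current):
        chunks.append(current)

    return [
        {"heading": c["heading"], "text": "\n".join(c["lines"]).strip()}
        for c in chunks
    ]


doc = "# Intro\nHello.\n\n## Details\nMore text.\n"
for chunk in split_by_heading(doc):
    print(chunk)
```

Note that this simple splitter would also break on `#` lines inside fenced code blocks, so a production pipeline would likely need a real Markdown parser.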

### 📥 Input parameters

| Parameter | Type | Required | Default | Description |
|------|------|------|--------|------|
| `url` | string | ✅ | none | URL of the target web page |
| `selector` | string | ❌ | `null` | CSS selector for the region to extract (for example `.article-body`) |
| `include_images` | boolean | ❌ | `false` | Whether to include image links in the Markdown |
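
The defaults in the table above can be applied client-side before a run is started. This is a sketch only: `normalize_input` is a hypothetical helper, and treating an empty `selector` as "auto-detect" follows the input schema description rather than any confirmed Actor code.

```python
from urllib.parse import urlparse


def normalize_input(raw: dict) -> dict:
    """Validate Actor input and fill in the documented defaults."""
    url = raw.get("url")
    if not isinstance(url, str) or not url.strip():
        raise ValueError("'url' is required")
    url = url.strip()
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"'url' must be an absolute http(s) URL: {url!r}")
    return {
        "url": url,
        # An empty selector means "extract the main content automatically".
        "selector": raw.get("selector") or None,
        "include_images": bool(raw.get("include_images", False)),
    }


print(normalize_input({"url": "https://example.com/article", "selector": ""}))
```

Rejecting a bare `https://` early avoids paying for a run that can only fail.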

### 📤 Output fields

| Field | Type | Description |
|------|------|------|
| `url` | string | URL of the source page |
| `title` | string | Page title |
| `markdown` | string | Converted Markdown text |
| `word_count` | integer | Number of words in the Markdown |
| `char_count` | integer | Number of characters in the Markdown |
| `extracted_at` | string | Extraction time (UTC, ISO 8601) |
| `error` | string | Error message when processing fails |
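
To make the derived fields above concrete, here is a sketch of how such a record could be assembled. `build_record` is a hypothetical helper, and counting words by whitespace splitting is an assumption: the README does not state the Actor's counting rule, and whitespace splitting undercounts text in languages written without spaces.

```python
from datetime import datetime, timezone


def build_record(url: str, title: str, markdown: str) -> dict:
    """Assemble an output item with the documented fields."""
    return {
        "url": url,
        "title": title,
        "markdown": markdown,
        # Assumption: words are whitespace-delimited tokens.
        "word_count": len(markdown.split()),
        "char_count": len(markdown),
        # Timezone-aware UTC timestamp in ISO 8601 format.
        "extracted_at": datetime.now(timezone.utc).isoformat(),
    }


record = build_record("https://example.com/article", "Title", "# Title\n\nHello world")
print(record["word_count"], record["char_count"], record["extracted_at"])
```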

### 🚀 Quick start

#### Via the Apify platform

1. Open the [Web to Markdown Converter](https://apify.com/...) Actor page
2. Click **Run**
3. Enter the target URL and click **Start**
4. Collect the Markdown output

#### Via the Apify API

```python
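import requests

# The Apify API requires authentication; replace the placeholder with your token.
response = requests.post(
    "https://api.apify.com/v2/acts/<username>~web-to-markdown/runs",
    params={"token": "<YOUR_API_TOKEN>"},
    json={
        "url": "https://en.wikipedia.org/wiki/Python_(programming_language)",
        "include_images": False,
    },
    timeout=30,
)
response.raise_for_status()
# The response describes the started run; results land in its default dataset.
print(response.json())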
````

#### Via the Apify SDK (Python)

```python
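import asyncio

from apify import Actor


async def main():
    async with Actor:
        run_input = {
            "url": "https://en.wikipedia.org/wiki/Python_(programming_language)",
            "include_images": False,
        }
        # Start the Actor and wait for it to finish
        run = await Actor.call("username/web-to-markdown", run_input=run_input)
        if run is None:
            raise RuntimeError("Actor run failed to start")
        # Read the results from the run's default dataset
        # (the attribute name assumes Apify SDK for Python v2 or later)
        page = await Actor.apify_client.dataset(run.default_dataset_id).list_items()
        print(page.items[0]["markdown"][:500])


asyncio.run(main())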
```

### 🛠 Local development

#### Prerequisites

- Python 3.14+
- Apify CLI (`npm install -g apify-cli`)

#### Running locally

```bash
# Install dependencies
pip install -r requirements.txt

# Run through the Apify CLI
apify run

# Or run the Python module directly
python -m src
```

#### Testing

```bash
# Set the environment variable
export APIFY_LOCAL_STORAGE_DIR=./apify_storage

# Run
apify run
```

### 📦 Tech stack

- **[Apify SDK (Python)](https://docs.apify.com/sdk/python/)**: Actor framework
- **[httpx](https://www.python-httpx.org/)**: asynchronous HTTP client
- **[BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/)**: HTML parsing
- **[markdownify](https://github.com/matthewwithanm/python-markdownify)**: HTML-to-Markdown conversion

### 📄 License

MIT

# Actor input Schema

## `url` (type: `string`):

URL of the web page to fetch and convert to Markdown, for example https://example.com/article

## `selector` (type: `string`):

CSS selector for extracting a specific region of the page, for example .article-content or #main. Leave empty to extract the main content automatically.

## `include_images` (type: `boolean`):

Whether to include images (in `![]()` format) in the Markdown output. Off by default to reduce output size.

## Actor input object example

```json
{
  "url": "https://example.com/article",
  "selector": ".article-content",
  "include_images": false
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "url": "https://example.com/article",
    "selector": ""
};

// Run the Actor and wait for it to finish
const run = await client.actor("wsgcjj/web-to-markdown").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "url": "https://example.com/article",
    "selector": "",
}

# Run the Actor and wait for it to finish
run = client.actor("wsgcjj/web-to-markdown").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```
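
Because each dataset item may carry an `error` field (see the output fields in the README), it can help to separate failed items before using the Markdown. A minimal sketch: `partition_items` is a hypothetical helper, and it assumes `error` is absent, `None`, or empty on success, which the source does not confirm.

```python
def partition_items(items):
    """Separate successful items from those carrying a non-empty `error`."""
    ok, failed = [], []
    for item in items:
        (failed if item.get("error") else ok).append(item)
    return ok, failed


# Hypothetical items shaped like the documented output fields.
items = [
    {"url": "https://example.com/a", "markdown": "# A", "error": None},
    {"url": "https://example.com/b", "error": "HTTP 404"},
]
ok, failed = partition_items(items)
print(len(ok), "succeeded;", len(failed), "failed")
```

In the client example above, the list passed in would be `list(client.dataset(run["defaultDatasetId"]).iterate_items())`.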

## CLI example

```bash
echo '{
  "url": "https://example.com/article",
  "selector": ""
}' |
apify call wsgcjj/web-to-markdown --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=wsgcjj/web-to-markdown",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Web to Markdown — AI-Ready Text from Any URL",
        "description": "Convert any web page URL to clean Markdown format. Perfect for LLM training data, RAG pipelines, and AI content processing. Extracts main content, strips ads/nav/footers.",
        "version": "0.0",
        "x-build-id": "XS6B4WDbEqjM6QWha"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/wsgcjj~web-to-markdown/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-wsgcjj-web-to-markdown",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/wsgcjj~web-to-markdown/runs": {
            "post": {
                "operationId": "runs-sync-wsgcjj-web-to-markdown",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/wsgcjj~web-to-markdown/run-sync": {
            "post": {
                "operationId": "run-sync-wsgcjj-web-to-markdown",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "url"
                ],
                "properties": {
                    "url": {
                        "title": "Target URL",
                        "type": "string",
                        "description": "URL of the web page to fetch and convert to Markdown, for example https://example.com/article"
                    },
                    "selector": {
                        "title": "CSS Selector (optional)",
                        "type": "string",
                        "description": "CSS selector for extracting a specific region of the page, for example .article-content or #main. Leave empty to extract the main content automatically."
                    },
                    "include_images": {
                        "title": "Include Images",
                        "type": "boolean",
                        "description": "Whether to include images (in ![]() format) in the Markdown output. Off by default to reduce output size.",
                        "default": false
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
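
As an illustration of the specification above, the following sketch builds (but does not send) a request to the `run-sync-get-dataset-items` path, with the token passed as the `token` query parameter. `build_sync_request` is a hypothetical helper; the base URL and path are taken from the specification.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/wsgcjj~web-to-markdown/run-sync-get-dataset-items"


def build_sync_request(token: str, actor_input: dict) -> Request:
    """Build a POST request whose JSON body is the Actor input."""
    url = f"{API_BASE}{ACTOR_PATH}?{urlencode({'token': token})}"
    return Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_sync_request("<YOUR_API_TOKEN>", {"url": "https://example.com/article"})
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request), which requires a valid token.
```

Keeping request construction separate from sending makes the URL and body easy to inspect or unit-test without network access.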
