# Webpage To Markdown (`kawsar/webpage-to-markdown`) Actor

Convert any webpage into clean, structured, LLM-ready Markdown. Handles JavaScript-rendered sites, strips ads and navigation clutter, and outputs metadata alongside the content, for use in RAG pipelines, AI training, SEO audits, and content archiving.

- **URL**: https://apify.com/kawsar/webpage-to-markdown.md
- **Developed by:** [Kawsar](https://apify.com/kawsar) (community)
- **Categories:** AI, Developer tools, Automation
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor itself is free to use; you pay only for Apify platform usage, which gets cheaper on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows (PowerShell)
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Webpage to Markdown Converter

Convert any public webpage into clean, structured, LLM-ready Markdown. This Actor fetches fully rendered pages, strips away noise such as ads, navigation, and cookie banners, and outputs Markdown alongside structured metadata, ready for RAG pipelines, AI training, SEO audits, and content archiving.

---

### Why use this Actor?

Most web pages are full of clutter: navigation bars, cookie notices, social share widgets, footer links. When you feed raw HTML into an LLM or a vector database, that noise degrades retrieval quality and inflates token usage. This Actor fetches the page, extracts the meaningful content, and delivers clean Markdown that your pipelines can use directly.

- Works on **JavaScript-rendered pages** (React, Vue, Next.js, Angular, and more)
- Extracts **semantic main content** — isolates articles and body text from site chrome
- Supports **bulk processing** — up to 1,000 URLs per run
- Outputs **structured metadata** — title, description, URL, and timestamp alongside the Markdown
- Fully **configurable** — control what gets included or excluded with CSS selector rules

---

### Use cases

| Use case | How this Actor helps |
|---|---|
| **RAG / vector search** | Feed noise-free page text directly into embedding pipelines for higher retrieval accuracy |
| **LLM fine-tuning** | Compile large, clean web corpora without manual preprocessing |
| **SEO auditing** | Inspect heading structure, body copy, and semantic layout across multiple URLs |
| **Content archiving** | Save readable offline copies of blog posts, documentation, and news articles |
| **AI agent memory** | Convert reference pages into Markdown for use as context in agent workflows |
| **Research automation** | Batch-convert dozens of sources into a uniform format for analysis |

---

### What data does this Actor extract?

Every processed URL yields one structured record in the output dataset:

| Field | Type | Description |
|---|---|---|
| `url` | string | The original URL that was processed |
| `pageTitle` | string | The HTML `<title>` tag content |
| `pageDescription` | string | The `<meta name="description">` or Open Graph description |
| `markdown` | string | Clean, clutter-free Markdown of the page content |
| `scrapedAt` | string | UTC ISO 8601 timestamp of when the page was processed |

---

### Input parameters

| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `urls` | array | `["https://apify.com"]` | Yes | List of webpage URLs to convert. Enter one URL per line. |
| `onlyMainContent` | boolean | `true` | No | Extract only the core article or body, dropping navigation, headers, and footers. |
| `includeImages` | boolean | `true` | No | Keep image references in the Markdown output. |
| `includeLinks` | boolean | `true` | No | Keep hyperlinks in the Markdown output. |
| `removeSelectors` | array | See below | No | CSS selectors to strip from the page before conversion. |
| `maxItems` | integer | `100` | No | Maximum number of URLs to process in this run (cap: 1,000). |
| `requestTimeoutSecs` | integer | `30` | No | Per-request timeout in seconds (range: 5–120). |

**Default `removeSelectors`:**
```
script, style, nav, footer, header, noscript, iframe, aside, .ads, .menu
```

#### Example input

```json
{
    "urls": [
        "https://apify.com",
        "https://docs.apify.com/academy/getting-started"
    ],
    "onlyMainContent": true,
    "includeImages": true,
    "includeLinks": true,
    "removeSelectors": [
        "script",
        "style",
        "nav",
        "footer",
        "header",
        "noscript",
        "iframe",
        "aside",
        ".ads",
        ".cookie-banner"
    ],
    "maxItems": 50,
    "requestTimeoutSecs": 30
}
````
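
The limits in the parameter table can be checked on the client side before a run is started. The following is a minimal sketch, not part of the Actor or of any Apify SDK: `build_input` is a hypothetical helper, and the bounds it enforces (1 to 1,000 for `maxItems`, 5 to 120 for `requestTimeoutSecs`) are copied from the table above.

```python
from typing import Any

# Defaults copied from the input parameter table.
DEFAULT_REMOVE_SELECTORS = [
    "script", "style", "nav", "footer", "header",
    "noscript", "iframe", "aside", ".ads", ".menu",
]


def build_input(
    urls: list[str],
    max_items: int = 100,
    request_timeout_secs: int = 30,
    extra_remove_selectors: tuple[str, ...] = (),
) -> dict[str, Any]:
    """Build an input object for the Actor (hypothetical helper)."""
    if not urls:
        raise ValueError("urls must contain at least one URL")
    if not 1 <= max_items <= 1000:
        raise ValueError("maxItems must be between 1 and 1000")
    if not 5 <= request_timeout_secs <= 120:
        raise ValueError("requestTimeoutSecs must be between 5 and 120")
    return {
        "urls": list(urls),
        "onlyMainContent": True,
        "includeImages": True,
        "includeLinks": True,
        # Keep the documented defaults and append any site-specific selectors.
        "removeSelectors": DEFAULT_REMOVE_SELECTORS + list(extra_remove_selectors),
        "maxItems": max_items,
        "requestTimeoutSecs": request_timeout_secs,
    }
```

The returned dictionary can be passed as the run input, for example as `run_input` in the Python client call shown in the API section.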

***

### Output example

Each converted page is saved as a dataset record. Here is a typical result:

```json
{
    "url": "https://apify.com",
    "pageTitle": "Apify: The web scraping and automation platform",
    "pageDescription": "Apify is the platform where developers build, deploy, and share web scraping, data extraction, and automation tools.",
    "markdown": "## Apify\n\nApify is the platform where developers build, run, and share web scrapers and automation tools.\n\n### Get structured data from any website\n\nWe provide the hosting and infrastructure for scrapers...",
    "scrapedAt": "2026-06-10T04:15:00.000Z"
}
```
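
For the content archiving use case, each record's `markdown` field can be written to its own file. This is a minimal sketch that assumes records shaped like the example above; `slug_for` and `save_record` are hypothetical helpers, and the file naming scheme and comment header are illustrative choices, not something the Actor produces.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def slug_for(url: str) -> str:
    """Derive a filesystem-safe file stem from a URL (naming scheme is an assumption)."""
    parts = urlparse(url)
    raw = f"{parts.netloc}{parts.path}"
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower()
    return slug or "page"


def save_record(record: dict, out_dir: Path) -> Path:
    """Write one dataset record to <out_dir>/<slug>.md with a short metadata header."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{slug_for(record['url'])}.md"
    header = (
        f"<!-- source: {record['url']} -->\n"
        f"<!-- scraped: {record['scrapedAt']} -->\n\n"
    )
    # Failed records have markdown set to null, so fall back to an empty body.
    path.write_text(header + (record.get("markdown") or ""), encoding="utf-8")
    return path
```

Two URLs that differ only in punctuation map to the same slug under this scheme, so a real archive would likely need a collision check or a hash suffix.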

#### Failed records

If a URL cannot be fetched, the record is still saved with `null` content fields and an `error` message so your pipeline knows what to skip or retry:

```json
{
    "url": "https://example.com/404-page",
    "pageTitle": null,
    "pageDescription": null,
    "markdown": null,
    "error": "Page not found: https://example.com/404-page",
    "scrapedAt": "2026-06-10T04:15:05.000Z"
}
```
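
A pipeline can use these fields to separate usable pages from ones to retry. The sketch below assumes a failed record has a non-empty `error` field or a `null` `markdown` field, as in the example above; `split_records` is a hypothetical helper.

```python
from typing import Iterable


def split_records(items: Iterable[dict]) -> tuple[list[dict], list[str]]:
    """Separate converted pages from failed ones.

    Returns the successful records and the URLs of the failed ones.
    """
    converted: list[dict] = []
    retry_urls: list[str] = []
    for item in items:
        if item.get("error") or item.get("markdown") is None:
            retry_urls.append(item["url"])
        else:
            converted.append(item)
    return converted, retry_urls
```

The `retry_urls` list can be supplied as the `urls` input of a follow-up run, possibly with a higher `requestTimeoutSecs`.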

***

### How it works

1. **URL validation** — Each URL is validated for a correct scheme and host before any request is made.
2. **Page retrieval** — Pages are fetched with full JavaScript rendering support, so single-page apps and dynamic sites work out of the box.
3. **HTML cleaning** — Unwanted elements are removed using the configured CSS selector list before any content analysis begins.
4. **Main content extraction** — When enabled, the actor locates semantic content containers (`<main>`, `<article>`, `#content`, `.content`, `[role="main"]`) and discards surrounding site chrome. If no semantic container is found, it falls back to the full page body.
5. **Markdown conversion** — The cleaned HTML is converted to properly structured ATX-style Markdown, with configurable handling for images and links.
6. **Metadata extraction** — The page title and meta description are captured alongside the Markdown.
7. **Dataset output** — Each result is pushed to the Apify dataset immediately, so you can inspect partial results during a long run.
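
Step 1 can be approximated in client code to catch malformed URLs before a run starts. This is a minimal sketch under an assumption: "correct scheme and host" is taken to mean an `http` or `https` scheme with a non-empty hostname. The Actor's actual validation rules are not documented here, and `is_valid_url` is a hypothetical helper.

```python
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Return True for absolute http(s) URLs that include a host."""
    try:
        parts = urlparse(url.strip())
    except ValueError:
        # urlparse raises ValueError for some malformed hosts, e.g. bad IPv6 literals.
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)
```

Filtering the list with this check keeps obviously invalid entries out of a run.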

***

### FAQ

**Does this Actor handle JavaScript-rendered pages?**\
Yes. The Actor retrieves fully rendered page content, so sites built with React, Vue, Next.js, Angular, or any other client-side framework are handled correctly.

**How does main content extraction work?**\
When `onlyMainContent` is enabled, the Actor scans the page for semantic HTML elements (`<main>`, `<article>`) and common class/ID patterns like `#content`, `.content`, and `#main`. If a match is found, only that block is converted. Otherwise, the full page body is used as a fallback.

**Can I target specific sections to remove?**\
Yes. Use the `removeSelectors` input to provide any CSS selectors you want stripped before conversion. This works for custom widgets, related posts lists, tracking banners, comment sections, or any other element you want to exclude.

**What is the URL limit per run?**\
The Actor processes up to 1,000 URLs per run. For larger batches, split your list across multiple runs.
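
Splitting a long list into run-sized batches can be sketched as follows. `chunk_urls` is a hypothetical helper, and the batch size of 1,000 is the per-run cap stated above.

```python
from typing import Iterator, Sequence

MAX_URLS_PER_RUN = 1000  # per-run cap stated in this README


def chunk_urls(urls: Sequence[str], size: int = MAX_URLS_PER_RUN) -> Iterator[list[str]]:
    """Yield consecutive slices of `urls`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield list(urls[start:start + size])
```

Each batch can then be sent as the `urls` input of a separate run. Because `maxItems` defaults to 100, it probably needs to be raised to the batch length for the whole batch to be processed.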

**What happens if a page fails?**\
Failed pages are recorded in the dataset with `null` content and a descriptive error message. The run continues processing the remaining URLs rather than stopping on the first failure.

**What Markdown format is used?**\
Headings use ATX style (`#`, `##`, `###`), lists use hyphens (`-`), and inline formatting uses standard CommonMark conventions. The output is compatible with any Markdown renderer or LLM tokenizer.

**Can I increase the request timeout for slow sites?**\
Yes. Set `requestTimeoutSecs` to up to 120 seconds for sites that take longer to respond.

***

### Integrations and webhooks

Connect this Actor to your existing tools using [Apify integrations](https://apify.com/integrations):

- **Make** (formerly Integromat) — trigger workflows when new results arrive
- **Zapier** — connect to thousands of apps automatically
- **Google Sheets / Google Drive** — export results directly to spreadsheets or Drive
- **Slack** — send notifications when a run finishes
- **Airbyte / GitHub** — sync output to data warehouses or version control
- **Webhooks** — call any HTTP endpoint as soon as results are added to the dataset

***

### Get started

1. Open the Actor on Apify and click **Try for free**
2. Paste one or more URLs into the **Webpage URLs** field
3. Adjust content and selector options as needed
4. Click **Start** and view results in the **Dataset** tab

For programmatic runs and dataset retrieval, see the [Apify API documentation](https://docs.apify.com/api/v2).

# Actor input Schema

## `urls` (type: `array`):

List of webpage URLs to convert into Markdown. Enter one URL per line.

## `onlyMainContent` (type: `boolean`):

If enabled, removes headers, footers, navigation, sidebars, and comments, preserving only the core page body or article content.

## `includeImages` (type: `boolean`):

If enabled, retains image tags and formatting in the Markdown output.

## `includeLinks` (type: `boolean`):

If enabled, retains hyperlinks in the Markdown output.

## `removeSelectors` (type: `array`):

Custom list of CSS selectors to strip from the HTML page before processing.

## `maxItems` (type: `integer`):

Maximum number of pages/URLs to process from the list during this run.

## `requestTimeoutSecs` (type: `integer`):

Per-request timeout in seconds.

## Actor input object example

```json
{
  "urls": [
    "https://apify.com"
  ],
  "onlyMainContent": true,
  "includeImages": true,
  "includeLinks": true,
  "removeSelectors": [
    "script",
    "style",
    "nav",
    "footer",
    "header",
    "noscript",
    "iframe",
    "aside",
    ".ads",
    ".menu"
  ],
  "maxItems": 100,
  "requestTimeoutSecs": 30
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "urls": [
        "https://apify.com"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("kawsar/webpage-to-markdown").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "urls": ["https://apify.com"] }

# Run the Actor and wait for it to finish
run = client.actor("kawsar/webpage-to-markdown").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "urls": [
    "https://apify.com"
  ]
}' |
apify call kawsar/webpage-to-markdown --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=kawsar/webpage-to-markdown",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Webpage To Markdown",
        "description": "Convert any webpage into clean, structured, LLM-ready Markdown. Handles JavaScript-rendered sites, strips ads and navigation clutter, and outputs metadata alongside content built for RAG pipelines, AI training, SEO audits, and content archiving.",
        "version": "0.0",
        "x-build-id": "xx0Op8rCxxabLuHYU"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/kawsar~webpage-to-markdown/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-kawsar-webpage-to-markdown",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/kawsar~webpage-to-markdown/runs": {
            "post": {
                "operationId": "runs-sync-kawsar-webpage-to-markdown",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/kawsar~webpage-to-markdown/run-sync": {
            "post": {
                "operationId": "run-sync-kawsar-webpage-to-markdown",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "urls"
                ],
                "properties": {
                    "urls": {
                        "title": "Webpage URLs",
                        "type": "array",
                        "description": "List of webpage URLs to convert into Markdown. Enter one URL per line.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "onlyMainContent": {
                        "title": "Extract only main content",
                        "type": "boolean",
                        "description": "If enabled, removes headers, footers, navigation, sidebars, and comments, preserving only the core page body or article content.",
                        "default": true
                    },
                    "includeImages": {
                        "title": "Include images",
                        "type": "boolean",
                        "description": "If enabled, retains image tags and formatting in the Markdown output.",
                        "default": true
                    },
                    "includeLinks": {
                        "title": "Include links",
                        "type": "boolean",
                        "description": "If enabled, retains hyperlinks in the Markdown output.",
                        "default": true
                    },
                    "removeSelectors": {
                        "title": "Remove CSS selectors",
                        "type": "array",
                        "description": "Custom list of CSS selectors to strip from the HTML page before processing.",
                        "default": [
                            "script",
                            "style",
                            "nav",
                            "footer",
                            "header",
                            "noscript",
                            "iframe",
                            "aside",
                            ".ads",
                            ".menu"
                        ],
                        "items": {
                            "type": "string"
                        }
                    },
                    "maxItems": {
                        "title": "Max pages to process",
                        "minimum": 1,
                        "maximum": 1000,
                        "type": "integer",
                        "description": "Maximum number of pages/URLs to process from the list during this run.",
                        "default": 100
                    },
                    "requestTimeoutSecs": {
                        "title": "Request timeout (seconds)",
                        "minimum": 5,
                        "maximum": 120,
                        "type": "integer",
                        "description": "Per-request timeout in seconds.",
                        "default": 30
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
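
For projects that do not use the Apify client, the `run-sync-get-dataset-items` path from the specification above can be called directly. The sketch below uses only the Python standard library. `build_request` and `run_and_fetch` are hypothetical helpers, the token is passed as the `token` query parameter as the specification declares, and the response is assumed to be a JSON array of dataset items (the specification only documents a `200 OK` without a schema).

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/kawsar~webpage-to-markdown/run-sync-get-dataset-items"


def build_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Prepare a POST request for the synchronous run endpoint (hypothetical helper)."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{API_BASE}{ACTOR_PATH}?{query}",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_and_fetch(token: str, actor_input: dict, timeout: float = 300.0) -> list:
    """Send the request and decode the response body as JSON (assumed to be an array)."""
    with urllib.request.urlopen(build_request(token, actor_input), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Nothing is sent until `run_and_fetch` is called, so `build_request` can be inspected or logged first. The synchronous endpoint holds the connection open for the whole run, which makes it a better fit for small batches than for long ones.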
