# Pandoc Document Converter - HTML to Markdown, DOCX, EPUB, PPTX (`scrapeworks/pandoc-document-converter`) Actor

Convert documents between formats with Pandoc in the cloud: HTML to Markdown for LLMs and RAG, Markdown to Word DOCX, EPUB e-books, PowerPoint PPTX, LaTeX, reStructuredText and more. Feed it URLs or raw text, get one converted document per input.

- **URL**: https://apify.com/scrapeworks/pandoc-document-converter.md
- **Developed by:** [Nicolas van Arkens](https://apify.com/scrapeworks) (community)
- **Categories:** Developer tools, Automation, Integrations
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

from $1.00 / 1,000 converted documents

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
"Actor" is written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see the Apify documentation as a [Markdown index](https://docs.apify.com/llms.txt) or as [Markdown full text](https://docs.apify.com/llms-full.txt).


# README

## Pandoc Document Converter — HTML to Markdown, Markdown to DOCX, EPUB, PPTX & more

Convert documents between formats in bulk, with no install and no servers — this Actor wraps **[Pandoc](https://pandoc.org)**, the universal document converter, and runs it in the cloud. Feed it **URLs** (it fetches them for you) and/or **raw text**, pick an output format, and get one converted document per input back.

Typical jobs it does in seconds:

- **HTML → Markdown** (turn web pages into clean Markdown for LLMs, RAG pipelines, or docs)
- **Markdown → DOCX** (deliver Word documents from generated text)
- **Markdown → EPUB** (package content as an e-book)
- **Markdown → PPTX** (headings become PowerPoint slides)
- LaTeX, reStructuredText, Org-mode, MediaWiki, Textile, DocBook, OPML, CSV in — Markdown, HTML, plain text, RTF, AsciiDoc, ODT and more out

### What data you get

One dataset row per converted document:

| Field | Description |
|---|---|
| `source` | The URL, or `text #N` for raw-text inputs |
| `ok` | `true` when conversion succeeded |
| `inputFormat` | The detected (or forced) source format |
| `outputFormat` | The format you requested |
| `output` | The converted document, inline — for text formats (Markdown, HTML, plain, LaTeX, …) |
| `outputCharacters` | Length of the inline output |
| `downloadUrl` | Direct download link — for binary formats (DOCX, PPTX, EPUB, ODT), stored in the run's key-value store |
| `outputBytes` | Size of the binary file |

You are only charged for successful conversions — failed fetches or conversions are reported with `ok: false` and never billed.

### Input example

```json
{
    "urls": ["https://example.com/"],
    "texts": ["## Quarterly report\n\nRevenue grew **18%** quarter over quarter.\n\n- New customers: 412\n- Churn: 2.1%"],
    "inputFormat": "auto",
    "outputFormat": "gfm"
}
````

`inputFormat: "auto"` detects HTML vs Markdown per item (Content-Type header, file extension, or content sniffing). Set it explicitly for LaTeX, RST, Org, MediaWiki, Textile, DocBook, OPML or CSV sources.
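
The detection order described above (Content-Type header, then file extension, then content sniffing) could look roughly like the sketch below. This is an illustration only, not the Actor's actual code: the function name `guess_input_format`, the MIME types, the extensions, and the sniffing rules are all assumptions.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def guess_input_format(url=None, content_type=None, body=""):
    """Hypothetical heuristic returning 'html' or 'markdown'."""
    # 1. Trust an explicit Content-Type header first.
    if content_type:
        mime = content_type.split(";")[0].strip().lower()
        if mime in ("text/html", "application/xhtml+xml"):
            return "html"
        if mime in ("text/markdown", "text/x-markdown"):
            return "markdown"

    # 2. Fall back to the file extension in the URL path.
    if url:
        suffix = PurePosixPath(urlparse(url).path).suffix.lower()
        if suffix in (".html", ".htm", ".xhtml"):
            return "html"
        if suffix in (".md", ".markdown"):
            return "markdown"

    # 3. Finally, sniff the first few hundred characters of the content.
    head = body.lstrip()[:512].lower()
    if head.startswith("<!doctype html") or "<html" in head or "<body" in head:
        return "html"

    # Assumed default when nothing looks like HTML.
    return "markdown"
```

If the guess could be wrong for your sources, setting `inputFormat` explicitly sidesteps detection entirely.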

### Output sample (real run)

```json
{
    "source": "https://example.com/",
    "ok": true,
    "inputFormat": "html",
    "outputFormat": "gfm",
    "output": "## Example Domain\n\nThis domain is for use in documentation examples without needing permission. Avoid use in operations.\n\n[Learn more](https://iana.org/domains/example)\n",
    "outputCharacters": 192
}
```

And a binary conversion (Markdown → Word):

```json
{
    "source": "text #1",
    "ok": true,
    "inputFormat": "markdown",
    "outputFormat": "docx",
    "downloadUrl": "https://api.apify.com/v2/key-value-stores/<store-id>/records/converted-1.docx",
    "outputBytes": 10580
}
```
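
The two row shapes shown above (inline `output` for text formats, `downloadUrl` for binary formats) can be separated on the client side. The sketch below is one way to do it; the helper name `split_rows` is hypothetical, and the failed row in the sample data is made up for illustration.

```python
def split_rows(rows):
    """Split dataset rows into text results, binary results, and failures."""
    text_docs, binary_docs, failed = [], [], []
    for row in rows:
        if not row.get("ok"):
            # Failed fetches or conversions are reported with ok: false.
            failed.append(row)
        elif "downloadUrl" in row:
            # Binary formats (DOCX, PPTX, EPUB, ODT) carry a download link.
            binary_docs.append(row)
        else:
            # Text formats carry the converted document inline in `output`.
            text_docs.append(row)
    return text_docs, binary_docs, failed


sample_rows = [
    {"source": "https://example.com/", "ok": True, "outputFormat": "gfm",
     "output": "## Example Domain\n", "outputCharacters": 18},
    {"source": "text #1", "ok": True, "outputFormat": "docx",
     "downloadUrl": "https://api.apify.com/v2/key-value-stores/<store-id>/records/converted-1.docx",
     "outputBytes": 10580},
    {"source": "https://example.invalid/", "ok": False},  # illustrative failure
]

text_docs, binary_docs, failed = split_rows(sample_rows)
print(len(text_docs), len(binary_docs), len(failed))
```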

### Use cases

- **Feed web content to LLMs** — convert pages to GitHub-flavored Markdown (`gfm`) with `--wrap=none` applied automatically, ready for prompts, embeddings, or RAG ingestion.
- **Automated report delivery** — your pipeline produces Markdown; this Actor turns it into DOCX or PPTX your stakeholders actually open. Chain it after any scraper or AI Actor via Apify integrations.
- **Publishing workflows** — convert a batch of Markdown chapters or HTML articles into EPUB e-books, or migrate docs between wikis (MediaWiki ⇄ Markdown ⇄ reStructuredText).

### FAQ

**Which formats are supported?**
Input: HTML, Markdown (Pandoc / GitHub-flavored / CommonMark), LaTeX, reStructuredText, Org, MediaWiki, Textile, DocBook, OPML, CSV — or auto-detect. Output: Markdown (GFM / Pandoc / CommonMark), HTML, plain text, DOCX, PPTX, EPUB, ODT, RTF, reStructuredText, LaTeX, AsciiDoc, Org, MediaWiki, Textile, OPML.

**How do I get the DOCX / EPUB / PPTX files?**
Binary outputs are stored in the run's key-value store; each dataset row contains a direct `downloadUrl`. Text outputs come back inline in the dataset.
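
A minimal sketch of saving those files locally, using only the Python standard library. The helper names `local_name` and `download_binaries` are hypothetical. Whether the link needs authentication depends on the store's access settings, which this README does not specify, so the optional `token` argument is an assumption.

```python
import urllib.request
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def local_name(download_url):
    """Derive a file name from the last path segment of the download URL."""
    return PurePosixPath(urlparse(download_url).path).name


def download_binaries(rows, target_dir="converted", token=None):
    """Save every successful binary result in `rows` into `target_dir`."""
    out_dir = Path(target_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for row in rows:
        url = row.get("downloadUrl")
        if not row.get("ok") or not url:
            continue  # skip failures and inline text results
        request = urllib.request.Request(url)
        if token:
            # Assumption: pass the API token only if the store requires it.
            request.add_header("Authorization", f"Bearer {token}")
        with urllib.request.urlopen(request) as response:
            path = out_dir / local_name(url)
            path.write_bytes(response.read())
        saved.append(path)
    return saved
```

Calling `download_binaries(items)` on the dataset items of a DOCX run would then leave files such as `converted/converted-1.docx` on disk.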

**Does it extract the article from a web page?**
No — it converts the page **verbatim**, exactly like running `pandoc` on the HTML. Navigation and boilerplate present in the HTML will be present in the output. For readability extraction, run a content-extraction Actor first and pipe its HTML here.

**Is PDF output supported?**
Not yet — PDF generation needs a LaTeX engine. Convert to DOCX or HTML and print/export to PDF, or ask for it in the Actor's Issues tab.

**What does it cost?**
A small fee per successfully converted document (pay-per-event). Failed items are never charged.

# Actor input Schema

## `urls` (type: `array`):

Web pages or raw files to download and convert. Each URL becomes one converted document in the dataset. The input format is auto-detected from the response (HTML pages, .md files, etc.) unless you override it with 'Input format'.

## `texts` (type: `array`):

Raw document contents to convert (e.g. Markdown or HTML strings). Each entry becomes one converted document in the dataset. Use this when you already have the content and don't need fetching.

## `inputFormat` (type: `string`):

Format of the source documents. 'auto' detects HTML vs Markdown per item (from the HTTP Content-Type, file extension, or content). Set explicitly when converting LaTeX, reStructuredText, Org, MediaWiki, Textile, DocBook, OPML or CSV.

## `outputFormat` (type: `string`):

Format to convert every document into. Text formats (Markdown, HTML, plain text, LaTeX, ...) are returned inline in the dataset; binary formats (DOCX, PPTX, EPUB, ODT) are stored in the run's key-value store and the dataset row contains a direct download URL.

## `standalone` (type: `boolean`):

Produce a complete document with header and metadata (e.g. a full HTML page with `<head>`, or an RTF/LaTeX document that compiles on its own) instead of a fragment. Binary formats (DOCX, PPTX, EPUB, ODT) are always standalone.

## `documentTitle` (type: `string`):

Title metadata embedded in standalone and binary outputs (shown e.g. as the EPUB book title or DOCX document title). If empty, the source URL or 'Converted document' is used.
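
Taken together, the options above resemble a Pandoc command line. The sketch below shows one plausible mapping for readers who know the `pandoc` CLI; it is not the Actor's actual invocation. The function `pandoc_args` is hypothetical, and applying `--wrap=none` only to `gfm` is an assumption based on the use case described in the README.

```python
BINARY_FORMATS = {"docx", "pptx", "epub", "odt"}


def pandoc_args(input_format, output_format, standalone=False,
                document_title="", source=None):
    """Build an illustrative pandoc argument list from Actor-style options."""
    args = ["pandoc", "--from", input_format, "--to", output_format]

    if output_format == "gfm":
        # The README mentions --wrap=none for GitHub-flavored Markdown output.
        args.append("--wrap=none")

    if standalone or output_format in BINARY_FORMATS:
        # Binary formats are always standalone; the flag is harmless for them.
        args.append("--standalone")
        # Title falls back to the source URL, then to a fixed default.
        title = document_title or source or "Converted document"
        args += ["--metadata", f"title={title}"]

    return args


print(pandoc_args("html", "gfm"))
print(pandoc_args("markdown", "docx", document_title="Quarterly report"))
```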

## Actor input object example

```json
{
  "urls": [
    "https://example.com/"
  ],
  "texts": [
    "# Hello\n\nThis is **Markdown** converted by Pandoc.\n\n- works with lists\n- and [links](https://pandoc.org)"
  ],
  "inputFormat": "auto",
  "outputFormat": "gfm",
  "standalone": false,
  "documentTitle": ""
}
```
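
Before starting a run, it can help to check an input object locally. The sketch below validates the format fields against the enums listed in the OpenAPI specification later in this document. The helper `validate_input` is hypothetical, and the rule that at least one of `urls` or `texts` must be non-empty is an assumption, not something the schema enforces.

```python
INPUT_FORMATS = {
    "auto", "html", "markdown", "gfm", "commonmark", "latex", "rst",
    "org", "mediawiki", "textile", "docbook", "opml", "csv",
}
OUTPUT_FORMATS = {
    "gfm", "markdown", "commonmark", "html", "plain", "docx", "pptx",
    "epub", "odt", "rtf", "rst", "latex", "asciidoc", "org",
    "mediawiki", "textile", "opml",
}


def validate_input(actor_input):
    """Return a list of problems found in an Actor input dict (empty if none)."""
    errors = []

    # Assumption: an input with nothing to convert is treated as a mistake.
    if not actor_input.get("urls") and not actor_input.get("texts"):
        errors.append("provide at least one entry in 'urls' or 'texts'")

    # Defaults follow the schema: 'auto' for input, 'gfm' for output.
    input_format = actor_input.get("inputFormat", "auto")
    if input_format not in INPUT_FORMATS:
        errors.append(f"unsupported inputFormat: {input_format!r}")

    output_format = actor_input.get("outputFormat", "gfm")
    if output_format not in OUTPUT_FORMATS:
        errors.append(f"unsupported outputFormat: {output_format!r}")

    return errors


print(validate_input({"texts": ["# Hello"], "outputFormat": "pdf"}))
```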

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "urls": [
        "https://example.com/"
    ],
    "texts": [
        "# Hello\n\nThis is **Markdown** converted by Pandoc.\n\n- works with lists\n- and [links](https://pandoc.org)"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("scrapeworks/pandoc-document-converter").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "urls": ["https://example.com/"],
    "texts": ["""# Hello

This is **Markdown** converted by Pandoc.

- works with lists
- and [links](https://pandoc.org)"""],
}

# Run the Actor and wait for it to finish
run = client.actor("scrapeworks/pandoc-document-converter").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "urls": [
    "https://example.com/"
  ],
  "texts": [
    "# Hello\\n\\nThis is **Markdown** converted by Pandoc.\\n\\n- works with lists\\n- and [links](https://pandoc.org)"
  ]
}' |
apify call scrapeworks/pandoc-document-converter --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=scrapeworks/pandoc-document-converter",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Pandoc Document Converter - HTML to Markdown, DOCX, EPUB, PPTX",
        "description": "Convert documents between formats with Pandoc in the cloud: HTML to Markdown for LLMs and RAG, Markdown to Word DOCX, EPUB e-books, PowerPoint PPTX, LaTeX, reStructuredText and more. Feed it URLs or raw text, get one converted document per input.",
        "version": "0.1",
        "x-build-id": "0akg9blNPvCAqSLyV"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/scrapeworks~pandoc-document-converter/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-scrapeworks-pandoc-document-converter",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/scrapeworks~pandoc-document-converter/runs": {
            "post": {
                "operationId": "runs-sync-scrapeworks-pandoc-document-converter",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/scrapeworks~pandoc-document-converter/run-sync": {
            "post": {
                "operationId": "run-sync-scrapeworks-pandoc-document-converter",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "urls": {
                        "title": "URLs to fetch and convert",
                        "type": "array",
                        "description": "Web pages or raw files to download and convert. Each URL becomes one converted document in the dataset. The input format is auto-detected from the response (HTML pages, .md files, etc.) unless you override it with 'Input format'.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "texts": {
                        "title": "Raw texts to convert",
                        "type": "array",
                        "description": "Raw document contents to convert (e.g. Markdown or HTML strings). Each entry becomes one converted document in the dataset. Use this when you already have the content and don't need fetching."
                    },
                    "inputFormat": {
                        "title": "Input format",
                        "enum": [
                            "auto",
                            "html",
                            "markdown",
                            "gfm",
                            "commonmark",
                            "latex",
                            "rst",
                            "org",
                            "mediawiki",
                            "textile",
                            "docbook",
                            "opml",
                            "csv"
                        ],
                        "type": "string",
                        "description": "Format of the source documents. 'auto' detects HTML vs Markdown per item (from the HTTP Content-Type, file extension, or content). Set explicitly when converting LaTeX, reStructuredText, Org, MediaWiki, Textile, DocBook, OPML or CSV.",
                        "default": "auto"
                    },
                    "outputFormat": {
                        "title": "Output format",
                        "enum": [
                            "gfm",
                            "markdown",
                            "commonmark",
                            "html",
                            "plain",
                            "docx",
                            "pptx",
                            "epub",
                            "odt",
                            "rtf",
                            "rst",
                            "latex",
                            "asciidoc",
                            "org",
                            "mediawiki",
                            "textile",
                            "opml"
                        ],
                        "type": "string",
                        "description": "Format to convert every document into. Text formats (Markdown, HTML, plain text, LaTeX, ...) are returned inline in the dataset; binary formats (DOCX, PPTX, EPUB, ODT) are stored in the run's key-value store and the dataset row contains a direct download URL.",
                        "default": "gfm"
                    },
                    "standalone": {
                        "title": "Standalone document",
                        "type": "boolean",
                        "description": "Produce a complete document with header and metadata (e.g. a full HTML page with <head>, or an RTF/LaTeX document that compiles on its own) instead of a fragment. Binary formats (DOCX, PPTX, EPUB, ODT) are always standalone.",
                        "default": false
                    },
                    "documentTitle": {
                        "title": "Document title",
                        "type": "string",
                        "description": "Title metadata embedded in standalone and binary outputs (shown e.g. as the EPUB book title or DOCX document title). If empty, the source URL or 'Converted document' is used.",
                        "default": ""
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
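
For projects that call the REST API directly, the `run-sync-get-dataset-items` path from the specification above can be used without a client library. The sketch below builds the request with the Python standard library. The helper names `build_sync_request` and `convert` and the 300-second client timeout are assumptions, and the response is assumed to be a JSON array of dataset items.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "scrapeworks~pandoc-document-converter"


def build_sync_request(actor_input, token):
    """Build a POST request for the run-sync-get-dataset-items endpoint."""
    query = urlencode({"token": token})
    url = f"{API_BASE}/acts/{ACTOR_PATH}/run-sync-get-dataset-items?{query}"
    body = json.dumps(actor_input).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def convert(actor_input, token, timeout=300):
    """Run the Actor synchronously and return the decoded dataset items."""
    request = build_sync_request(actor_input, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (performs a real, billable run, so it is left commented out):
# items = convert({"urls": ["https://example.com/"]}, "<YOUR_API_TOKEN>")
```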
