# PDF to Markdown Converter - Extract & Format Text (`ntriqpro/pdf-to-markdown`) Actor

Convert PDF documents to clean, readable Markdown. Perfect for documentation and knowledge bases.

- **URL**: https://apify.com/ntriqpro/pdf-to-markdown.md
- **Developed by:** [daehwan kim](https://apify.com/ntriqpro) (community)
- **Categories:** AI, Business
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

$50.00 / 1,000 PDFs converted

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
The word "Actor" is always written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see the Apify documentation, available as a [Markdown index](https://docs.apify.com/llms.txt) and as [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## PDF to Markdown Converter

Extract clean, usable text from any PDF — research papers, contracts, reports, manuals — and output structured Markdown ready for LLMs, RAG pipelines, or document analysis.

No external APIs. No proprietary services. Built on open source.

### Why Use This

Most PDFs are locked: the text is there, but it is buried in a binary format that LLMs can't read directly. This Actor extracts the text, cleans it up, and returns it as Markdown you can immediately feed into any AI workflow.

**$0.05 per PDF.** No subscription, no monthly fee, no setup.

### Use Cases

- **RAG pipelines** — Convert research papers, whitepapers, or documentation PDFs into text chunks before embedding
- **Contract analysis** — Extract legal document text for LLM review
- **Report processing** — Batch-process financial reports, audit documents, or regulatory filings
- **Knowledge base ingestion** — Convert PDF manuals and guides into searchable text
- **Academic research** — Process arXiv papers, theses, or journal articles at scale
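
For the RAG use case above, the `markdown` field usually needs to be split into chunks before embedding. The following is a minimal sketch, not part of the Actor: `chunk_markdown` and its `max_chars` parameter are hypothetical names, and splitting on blank lines is only one of several reasonable chunking strategies.

```python
def chunk_markdown(markdown: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Chunks stay within max_chars, except that a single paragraph longer
    than max_chars is kept whole rather than cut mid-sentence.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks: list[str] = []
    current = ""
    for paragraph in (p.strip() for p in markdown.split("\n\n")):
        if not paragraph:
            continue  # skip empty fragments left by repeated blank lines
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be passed to whichever embedding model the pipeline uses.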

### Input

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `pdfUrl` | string | ✅ | Direct URL to a machine-readable PDF file |
| `includePageNumbers` | boolean | ❌ | Insert `--- Page N ---` markers between pages (default: false) |
| `maxPages` | integer | ❌ | Limit pages processed. 0 = all pages (default: 0) |

```json
{
  "pdfUrl": "https://arxiv.org/pdf/2305.10601",
  "includePageNumbers": true,
  "maxPages": 20
}
````
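
The input object can also be assembled and checked in code before a run is started. This is a sketch only: `build_input` is a hypothetical helper, and the checks it performs (http/https URL, non-negative integer `maxPages`) are assumptions about sensible input, not the Actor's documented validation rules.

```python
from urllib.parse import urlparse


def build_input(pdf_url: str, include_page_numbers: bool = False, max_pages: int = 0) -> dict:
    """Return an Actor input dict using the parameter names from the table above."""
    parsed = urlparse(pdf_url)
    # Assumption: only http(s) URLs with a host are usable as a "direct URL".
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"pdfUrl must be an http(s) URL, got {pdf_url!r}")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_pages, bool) or not isinstance(max_pages, int) or max_pages < 0:
        raise ValueError("maxPages must be a non-negative integer (0 = all pages)")
    return {
        "pdfUrl": pdf_url,
        "includePageNumbers": bool(include_page_numbers),
        "maxPages": max_pages,
    }
```

Calling `build_input("https://arxiv.org/pdf/2305.10601", True, 20)` produces the same object as the JSON example above.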

### Output

The Actor pushes one item per PDF to the dataset:

| Field | Type | Description |
|-------|------|-------------|
| `pdfUrl` | string | Source PDF URL |
| `pageCount` | integer | Number of pages processed |
| `wordCount` | integer | Total words extracted |
| `markdown` | string | Extracted text in Markdown format |
| `disclaimer` | string | Accuracy disclaimer |

```json
{
  "pdfUrl": "https://arxiv.org/pdf/2305.10601",
  "pageCount": 15,
  "wordCount": 8432,
  "markdown": "## Tree of Thoughts: Deliberate Problem Solving with Large Language Models\n\n### Abstract\n\nLanguage models are increasingly being deployed for general problem solving..."
}
```

### Pricing

- **$0.05** per PDF converted
- Charged only on successful conversion
- No charge for validation errors or failed runs
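
As a rough budgeting aid, the per-PDF price can be turned into a batch estimate. This sketch hard-codes the $0.05 price stated above, which may change; `estimate_cost` is a hypothetical helper, and only successful conversions are counted because failed runs are not charged.

```python
from decimal import Decimal

# Price per successfully converted PDF, as listed above (subject to change).
PRICE_PER_PDF = Decimal("0.05")


def estimate_cost(successful_conversions: int) -> Decimal:
    """Estimated charge in USD for a number of successful conversions."""
    if successful_conversions < 0:
        raise ValueError("successful_conversions must be non-negative")
    return PRICE_PER_PDF * successful_conversions
```

For example, 1,000 successful conversions come to $50.00, matching the per-1,000 price in the Pricing section.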

### Quick Start

#### curl

```bash
curl -X POST "https://api.apify.com/v2/acts/ntriqpro~pdf-to-markdown/runs" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "pdfUrl": "https://arxiv.org/pdf/2305.10601",
    "includePageNumbers": true,
    "maxPages": 20
  }'
```

#### JavaScript (Apify Client)

```javascript
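// Run this as an ES module (.mjs file or "type": "module") so top-level await works
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Start the Actor and wait for the run to finish
const run = await client.actor('ntriqpro/pdf-to-markdown').call({
  pdfUrl: 'https://arxiv.org/pdf/2305.10601',
  includePageNumbers: true,
});

// Each converted PDF is stored as one item in the run's default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].markdown);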
```

### Limitations

| Limitation | Details |
|-----------|---------|
| **Scanned PDFs** | Not supported — requires machine-readable text layers |
| **Image-only PDFs** | Will return minimal or empty text |
| **Encrypted PDFs** | Password-protected files cannot be parsed |
| **Non-Latin scripts** | Accuracy varies for Arabic, CJK, and other scripts |
| **Complex layouts** | Multi-column or heavily formatted PDFs may have extraction quirks |

Always verify extracted text against the original for critical use cases.
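
Because scanned and image-only PDFs come back with little or no text, a simple words-per-page check on the output item can flag suspicious results for manual review. This is a heuristic sketch: `looks_image_only` is a hypothetical helper, and the default threshold of 20 words per page is an arbitrary assumption that should be tuned for your documents.

```python
def looks_image_only(item: dict, min_words_per_page: float = 20.0) -> bool:
    """Return True when an output item has suspiciously little text per page."""
    pages = item.get("pageCount") or 0
    words = item.get("wordCount") or 0
    if pages <= 0:
        # Nothing was processed, so treat the result as unusable.
        return True
    return words / pages < min_words_per_page
```

The sample output shown earlier (15 pages, 8,432 words) averages about 562 words per page, so it would not be flagged.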

### Technology

- **[pdf-parse](https://github.com/modularcode/pdf-parse)** — MIT License — PDF text extraction
- **Apify SDK** — Apache 2.0 License — Actor runtime and dataset management

### Disclaimer

This tool extracts text from PDF files using open source libraries. Accuracy depends on PDF structure and encoding. Results should be reviewed for critical use cases. Not a substitute for professional document review.

***

### 🔗 Related Actors by ntriqpro

Extend this Actor with the ntriqpro intelligence network:

- [**blueprint-intelligence**](https://apify.com/ntriqpro/blueprint-intelligence) — AI blueprint analyzer for construction & architectural PDFs
- [**invoice-extraction-mcp**](https://apify.com/ntriqpro/invoice-extraction-mcp) — Structured extraction of line items from PDF invoices
- [**content-factory**](https://apify.com/ntriqpro/content-factory) — Turn PDFs into quizzes, flashcards, slide decks, podcast scripts

### ⭐ Love it? Leave a Review

Your rating helps other professionals discover this Actor. [Rate it here](https://apify.com/ntriqpro/pdf-to-markdown/reviews).

# Actor input Schema

## `pdfUrl` (type: `string`):

Direct URL to the PDF file to convert

## `includePageNumbers` (type: `boolean`):

Insert page separator markers between pages

## `maxPages` (type: `integer`):

Maximum number of pages to convert. 0 means all pages.

## Actor input object example

```json
{
  "pdfUrl": "https://www.w3.org/WAI/WCAG21/wcag21.pdf"
}
```

# Actor output Schema

## `results` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "pdfUrl": "https://www.w3.org/WAI/WCAG21/wcag21.pdf",
    "includePageNumbers": false,
    "maxPages": 0
};

// Run the Actor and wait for it to finish
const run = await client.actor("ntriqpro/pdf-to-markdown").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "pdfUrl": "https://www.w3.org/WAI/WCAG21/wcag21.pdf",
    "includePageNumbers": False,
    "maxPages": 0,
}

# Run the Actor and wait for it to finish
run = client.actor("ntriqpro/pdf-to-markdown").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "pdfUrl": "https://www.w3.org/WAI/WCAG21/wcag21.pdf",
  "includePageNumbers": false,
  "maxPages": 0
}' |
apify call ntriqpro/pdf-to-markdown --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=ntriqpro/pdf-to-markdown",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "PDF to Markdown Converter - Extract & Format Text",
        "description": "Convert PDF documents to clean, readable markdown format. Perfect for documentation and knowledge bases.",
        "version": "1.0",
        "x-build-id": "vrY7XMUpO3uwDpOba"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/ntriqpro~pdf-to-markdown/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-ntriqpro-pdf-to-markdown",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/ntriqpro~pdf-to-markdown/runs": {
            "post": {
                "operationId": "runs-sync-ntriqpro-pdf-to-markdown",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/ntriqpro~pdf-to-markdown/run-sync": {
            "post": {
                "operationId": "run-sync-ntriqpro-pdf-to-markdown",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "pdfUrl"
                ],
                "properties": {
                    "pdfUrl": {
                        "title": "PDF URL",
                        "type": "string",
                        "description": "Direct URL to the PDF file to convert"
                    },
                    "includePageNumbers": {
                        "title": "Include Page Numbers",
                        "type": "boolean",
                        "description": "Insert page separator markers between pages"
                    },
                    "maxPages": {
                        "title": "Max Pages (0 = all)",
                        "type": "integer",
                        "description": "Maximum number of pages to convert. 0 means all pages."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
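
For projects without an Apify client library, the `run-sync-get-dataset-items` path from the specification above can be called with the Python standard library alone. The sketch below only builds the request object; `build_sync_request` is a hypothetical helper, and it passes the token in an `Authorization: Bearer` header as in the curl Quick Start example, whereas the specification lists a `token` query parameter.

```python
import json
from urllib.request import Request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "ntriqpro~pdf-to-markdown"  # path form used in the specification


def build_sync_request(token: str, actor_input: dict) -> Request:
    """Build a POST request that runs the Actor and returns its dataset items."""
    url = f"{API_BASE}/acts/{ACTOR_PATH}/run-sync-get-dataset-items"
    return Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends the request; according to the path summary, the response body contains the run's dataset items.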
