# PDF to RAG Markdown Chunks for Embeddings (`awesome_highboy/docforge`) Actor

Convert PDFs into token-bounded Markdown chunks for RAG, embeddings, and vector databases (Pinecone, Chroma, Weaviate, Qdrant). Set `maxTokens` and `overlapTokens`; get clean chunks with a page-number field, an estimated token count, and a SHA-256 content hash for dedup. JSON dataset ready for any LLM pipeline.

- **URL**: https://apify.com/awesome_highboy/docforge.md
- **Developed by:** [Adam](https://apify.com/awesome_highboy) (community)
- **Categories:** AI, Developer tools, Agents
- **Stats:** 0 total users, 0 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

from $3.00 / 1,000 parsed pages

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
The word "Actor" is always written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## DocForge: Documents to AI-Ready Markdown

Turn PDF files you own into deterministic, token-bounded text chunks that are ready for RAG pipelines and embeddings.

### What it does

DocForge takes a list of PDF URLs that **you own or are authorized to process**, downloads each file, extracts its text, and splits that text into deterministic, token-bounded chunks. Each chunk is emitted as a structured dataset record carrying its source document, chunk index, an estimated token count, and a content hash. A final run summary reports how many pages were parsed and how many chunks were emitted.

The chunker is overlap-aware: you control the target chunk size and the overlap between consecutive chunks, and no chunk exceeds your configured `maxTokens`. Each chunk's text is emitted in the `markdown` field as plain extracted text (no layout reconstruction or rich Markdown formatting is applied), so it drops straight into a vector store or embedding job.

Before any work begins, DocForge requires an explicit ownership attestation. If that attestation is not set, the run is rejected with zero billing. Documents that fail to download or parse are caught, logged, and skipped rather than guessed at, so the dataset only contains content that was actually extracted.

### Input

| Field | Type | Required | Description |
|---|---|---|---|
| `pdfUrls` | array of strings | Yes | URLs of PDFs you own or are authorized to process. |
| `chunking` | object | No | Chunking options. Prefilled with `maxTokens: 512` and `overlapTokens: 64`. |
| `ownership_attestation` | boolean | Yes | You confirm you own or are authorized to process these documents. Must be `true` or the run is rejected before any billing. |
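
Since a run without the attestation is rejected, it can help to check the input on the client side before calling the Actor. The following is a minimal sketch, not the Actor's own validation: `validate_input` is a hypothetical helper, and treating an empty `pdfUrls` list as invalid is an assumption.

```python
def validate_input(run_input: dict) -> list:
    """Return a list of problems; an empty list means the input looks valid.

    Hypothetical client-side helper that mirrors the required fields from
    the input table. It is not part of DocForge or the Apify client.
    """
    problems = []

    urls = run_input.get("pdfUrls")
    if not isinstance(urls, list) or not urls:
        # Assumption: an empty list is treated as invalid.
        problems.append("pdfUrls must be a non-empty array of strings")
    elif not all(isinstance(u, str) and u.strip() for u in urls):
        problems.append("every pdfUrls entry must be a non-empty string")

    # The attestation must be the boolean True, not merely a truthy value.
    if run_input.get("ownership_attestation") is not True:
        problems.append("ownership_attestation must be true")

    return problems
```

Calling `validate_input(run_input)` before `client.actor(...).call(...)` lets a pipeline fail fast on a missing attestation instead of starting a run that will be rejected.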

The `chunking` object accepts:

- `maxTokens` (default `512`) — the maximum estimated token size of each chunk; no chunk exceeds this.
- `overlapTokens` (default `64`) — how much each chunk overlaps the previous one, to preserve context across chunk boundaries.

Token counts are word-based estimates (approximately words × 1.3), not exact tokenizer counts.
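
To make the interaction between `maxTokens`, `overlapTokens`, and the word-based estimate concrete, here is a minimal sketch of an overlap-aware chunker. It is an illustration under stated assumptions, not DocForge's actual code: the function names are hypothetical, the estimate is rounded up, and the token budgets are converted to word budgets with the same 1.3 ratio.

```python
def estimate_tokens(text: str) -> int:
    """Estimate tokens as words x 1.3, rounded up (assumed rounding)."""
    words = len(text.split())
    return (words * 13 + 9) // 10  # integer form of ceil(words * 1.3)


def chunk_text(text: str, max_tokens: int = 512, overlap_tokens: int = 64) -> list:
    """Split text into word windows whose estimate never exceeds max_tokens."""
    # Convert the token budgets into word budgets with the same 1.3 ratio.
    words_per_chunk = max_tokens * 10 // 13
    overlap_words = overlap_tokens * 10 // 13
    if words_per_chunk < 1:
        raise ValueError("max_tokens is too small to hold a single word")
    if overlap_words >= words_per_chunk:
        raise ValueError("overlap_tokens must be smaller than max_tokens")

    words = text.split()
    step = words_per_chunk - overlap_words  # how far each new chunk advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + words_per_chunk]))
        if start + words_per_chunk >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, each window holds at most 393 words and consecutive windows share 49 words, so the estimated size of every chunk stays at or below 512.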

### Output

DocForge writes two record types to the dataset, distinguished by `record_type`.

**`chunk`** — one record per emitted text chunk:

| Field | Type | Description |
|---|---|---|
| `record_type` | string | Always `chunk`. |
| `source_doc` | string | The source PDF URL the chunk came from. |
| `page_number` | integer | Present for schema compatibility; currently emitted as `1` for every chunk (DocForge does not map chunks back to their originating page). |
| `chunk_index` | integer | Zero-based index of the chunk within its document. |
| `markdown` | string | The chunk's text (plain extracted text). |
| `token_count` | integer | Estimated token count for the chunk. |
| `content_hash` | string | Deterministic `sha256:<64 hex>` hash of the chunk text. |

**`run_summary`** — one record per run:

| Field | Type | Description |
|---|---|---|
| `record_type` | string | Always `run_summary`. |
| `pages_parsed` | integer | Total document pages parsed in the run. |
| `chunks_emitted` | integer | Total chunks emitted in the run. |
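
Because both record types share one dataset, a consumer usually separates them before embedding. The sketch below shows one way to do that with hypothetical helpers `split_records` and `counts_match`; it assumes that `chunks_emitted` equals the number of `chunk` records in the dataset, which the tables above suggest but do not state outright.

```python
from typing import Optional


def split_records(items: list) -> tuple:
    """Separate dataset items into ordered chunk records and the run summary."""
    chunks = [i for i in items if i.get("record_type") == "chunk"]
    summaries = [i for i in items if i.get("record_type") == "run_summary"]
    # Restore per-document order regardless of how the items were fetched.
    chunks.sort(key=lambda c: (c["source_doc"], c["chunk_index"]))
    summary: Optional[dict] = summaries[0] if summaries else None
    return chunks, summary


def counts_match(chunks: list, summary: Optional[dict]) -> bool:
    """Assumed sanity check: the summary count equals the chunk records seen."""
    return summary is not None and summary.get("chunks_emitted") == len(chunks)
```

The `items` argument is the list returned by the dataset calls in the API examples, for instance `list(client.dataset(run["defaultDatasetId"]).iterate_items())` in Python.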

### Pricing

DocForge uses Apify Pay-Per-Event pricing. You are billed only for what a successful, gated run actually does:

| Event | Price (USD) | When it fires |
|---|---|---|
| `actor_run_start` | $0.02 | Once per run, after the run's gates pass. |
| `page_parsed` | $0.003 | Per document page converted to text. |
| `chunk_emitted` | $0.0005 | Per RAG chunk emitted. |

**Example run cost.** Processing a single 40-page PDF that yields 120 chunks:

- 1 × `actor_run_start` = $0.02
- 40 × `page_parsed` = $0.12
- 120 × `chunk_emitted` = $0.06
- **Total = $0.20**

If the ownership attestation is missing, the run is rejected with **zero** billing.
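
The arithmetic in the example can be reproduced for any run size. This is a small sketch using the event prices from the table; `estimate_run_cost` is a hypothetical helper, and the prices are copied from this page, so they may change on the platform.

```python
from decimal import Decimal

# Event prices in USD, copied from the pricing table above.
PRICES = {
    "actor_run_start": Decimal("0.02"),
    "page_parsed": Decimal("0.003"),
    "chunk_emitted": Decimal("0.0005"),
}


def estimate_run_cost(pages: int, chunks: int) -> Decimal:
    """Estimate the cost of one successful run from page and chunk counts."""
    return (
        PRICES["actor_run_start"]
        + pages * PRICES["page_parsed"]
        + chunks * PRICES["chunk_emitted"]
    )
```

`Decimal` is used instead of `float` so that fractions of a cent add up exactly; `estimate_run_cost(40, 120)` reproduces the $0.20 total from the example.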

### Why this Actor

- **Deterministic, idempotent output.** Every chunk carries a `sha256:` content hash computed directly from its text, so identical input produces identical hashes — ideal for deduplication, change detection, and re-run safety.
- **Ownership-gated by design.** A required attestation must be `true` before any processing or billing happens. DocForge runs on PDFs you provide and are authorized to use; it does not crawl or scrape third-party sites.
- **No invented content.** Text is extracted deterministically with no LLM in the loop. Documents that fail to fetch or parse are caught, logged, and skipped — they are not hallucinated or padded. The run summary reflects only what was genuinely parsed and emitted.
- **Embeddings-ready chunking.** Token-bounded chunks with configurable overlap mean no chunk exceeds your `maxTokens`, and context is preserved across boundaries — output that's ready to embed without further reshaping.
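
The content hash is what makes deduplication and change detection cheap on the consumer side. The following sketch shows both, using hypothetical helper names. The local `content_hash` function assumes the hash covers the UTF-8 bytes of the chunk text with no extra normalization, which this page does not confirm; the other two helpers rely only on the `content_hash` field itself.

```python
import hashlib


def content_hash(text: str) -> str:
    """Local hash in the `sha256:<64 hex>` format (assumed UTF-8 input)."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def dedupe_chunks(chunks: list) -> list:
    """Keep the first chunk seen for each content hash, preserving order."""
    seen = set()
    unique = []
    for chunk in chunks:
        if chunk["content_hash"] not in seen:
            seen.add(chunk["content_hash"])
            unique.append(chunk)
    return unique


def diff_runs(previous: list, current: list) -> tuple:
    """Return (added, removed) hash sets between two runs' chunk records."""
    old = {c["content_hash"] for c in previous}
    new = {c["content_hash"] for c in current}
    return new - old, old - new
```

A re-run over unchanged PDFs should yield empty `added` and `removed` sets, so only chunks in `added` need to be embedded again.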

### About this Actor

This Actor is AI-authored and operated under the publisher's LLC. It uses `Actor.charge()` strictly to bill the customer for the Pay-Per-Event units above; the Actor contains no payout or money-out capability. All claims here reflect behavior present in the Actor's code.

# Actor input Schema

## `pdfUrls` (type: `array`):

URLs of PDFs you OWN or are authorized to process.

## `chunking` (type: `object`):

Chunking options: `maxTokens` (default `512`) and `overlapTokens` (default `64`).

## `ownership_attestation` (type: `boolean`):

I own/am authorized to process these documents (REQUIRED)

## Actor input object example

```json
{
  "pdfUrls": [
    "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
  ],
  "chunking": {
    "maxTokens": 512,
    "overlapTokens": 64
  },
  "ownership_attestation": true
}
````

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "pdfUrls": [
        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
    ],
    "chunking": {
        "maxTokens": 512,
        "overlapTokens": 64
    },
    "ownership_attestation": true
};

// Run the Actor and wait for it to finish
const run = await client.actor("awesome_highboy/docforge").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "pdfUrls": ["https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"],
    "chunking": {
        "maxTokens": 512,
        "overlapTokens": 64,
    },
    "ownership_attestation": True,
}

# Run the Actor and wait for it to finish
run = client.actor("awesome_highboy/docforge").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "pdfUrls": [
    "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
  ],
  "chunking": {
    "maxTokens": 512,
    "overlapTokens": 64
  },
  "ownership_attestation": true
}' |
apify call awesome_highboy/docforge --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=awesome_highboy/docforge",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "PDF to RAG Markdown Chunks for Embeddings",
        "description": "Convert PDFs into token-bounded Markdown chunks for RAG, embeddings, and vector databases (Pinecone, Chroma, Weaviate, Qdrant). Set maxTokens + overlap; get clean chunks with page number, token count, and SHA-256 content hash for dedup. JSON dataset ready for any LLM pipeline.",
        "version": "0.1",
        "x-build-id": "5axsT36Ub8xWMGBt1"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/awesome_highboy~docforge/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-awesome_highboy-docforge",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/awesome_highboy~docforge/runs": {
            "post": {
                "operationId": "runs-sync-awesome_highboy-docforge",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/awesome_highboy~docforge/run-sync": {
            "post": {
                "operationId": "run-sync-awesome_highboy-docforge",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "pdfUrls",
                    "ownership_attestation"
                ],
                "properties": {
                    "pdfUrls": {
                        "title": "PDF URLs (your own / authorized)",
                        "type": "array",
                        "description": "URLs of PDFs you OWN or are authorized to process.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "chunking": {
                        "title": "Chunking",
                        "type": "object",
                        "description": "Chunking"
                    },
                    "ownership_attestation": {
                        "title": "I own/am authorized to process these documents (REQUIRED)",
                        "type": "boolean",
                        "description": "I own/am authorized to process these documents (REQUIRED)",
                        "default": false
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
