# PDF → RAG Chunks (Token-Aware, Vector-Ready) (`gochujang/pdf-rag-chunker`) Actor

Download any PDF and chunk it into semantically coherent segments ready for embedding/RAG. Configurable chunk size + overlap. Returns one row per chunk with page, char count, and token estimate. Feed directly into OpenAI text-embedding-3 / Voyage / Cohere. $0.005 per PDF + $0.0002 per chunk.

- **URL**: https://apify.com/gochujang/pdf-rag-chunker.md
- **Developed by:** [Hojun Lee](https://apify.com/gochujang) (community)
- **Categories:** Developer tools, Automation, Other
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage, which gets cheaper the higher subscription plan you have.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## PDF → RAG Chunks

> Download any PDF and chunk it into semantically coherent segments ready for embedding/RAG. **Configurable chunk size + overlap. No LLM cost** (zero tokens). Vector-ready output. **$0.005 per PDF + $0.0002 per chunk.**

---

### Why this exists

Building a RAG (retrieval-augmented generation) system over a corpus of PDFs takes four steps:
1. Download each PDF and extract its text page by page
2. Chunk the text into semantic segments (1000-2000 chars is typical)
3. Optional: embed each chunk and store it in a vector DB
4. At query time: embed the question, retrieve the top-k chunks, and ask the LLM

This Actor handles steps 1-2 (the most painful boilerplate). The output is shaped so you can pipe each chunk directly into OpenAI's `text-embedding-3-small`, Voyage AI, Cohere Embed, or any other embedding model.

Other chunking SaaS (Unstructured.io API, LangChain Hosted) charge $5-20 per 1K pages. This Actor costs about **$0.50 per 1K pages**.

---

### What you get

#### Summary row (one per PDF)
```json
{
  "_type": "summary",
  "url": "https://www.sec.gov/.../aapl-10k.pdf",
  "ok": true,
  "page_count": 80,
  "title": "Apple Inc. — Annual Report 2024",
  "author": "Apple Inc.",
  "chunk_size_chars": 1500,
  "overlap_chars": 200
}
````

#### Per-chunk row

```json
{
  "_type": "chunk",
  "url": "https://...",
  "page": 12,
  "chunk_index": 0,
  "global_chunk_index": 17,
  "text": "Item 1A. Risk Factors\n\nOur business is...",
  "char_count": 1480,
  "token_estimate": 370
}
```
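
Because summary and chunk rows share one dataset, a consumer usually needs to separate them first. The following is a minimal sketch, assuming rows look like the two samples above and that `global_chunk_index` gives the reading order of chunks within a PDF (the field name suggests this, but it is an assumption). The helper name `split_rows` is hypothetical.

```python
from collections import defaultdict


def split_rows(items):
    """Separate per-PDF summary rows from per-chunk rows using the `_type` field."""
    summaries = {r["url"]: r for r in items if r.get("_type") == "summary"}
    chunks_by_url = defaultdict(list)
    for row in items:
        if row.get("_type") == "chunk":
            chunks_by_url[row["url"]].append(row)
    # Assumption: global_chunk_index reflects reading order within a PDF.
    for rows in chunks_by_url.values():
        rows.sort(key=lambda r: r["global_chunk_index"])
    return summaries, dict(chunks_by_url)


# Hand-written rows shaped like the samples above (not real Actor output)
url = "https://www.example.com/report.pdf"
items = [
    {"_type": "summary", "url": url, "ok": True, "page_count": 2},
    {"_type": "chunk", "url": url, "page": 2, "global_chunk_index": 1, "text": "second"},
    {"_type": "chunk", "url": url, "page": 1, "global_chunk_index": 0, "text": "first"},
]
summaries, chunks_by_url = split_rows(items)
print([c["text"] for c in chunks_by_url[url]])
```

Keying both structures by `url` makes it easy to attach summary metadata such as `title` to each chunk before embedding.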

***

### Quick start

#### Single PDF

```json
{
  "url": "https://www.example.com/report.pdf"
}
```

#### Batch with custom chunk size

```json
{
  "urls": [
    "https://...filing1.pdf",
    "https://...filing2.pdf"
  ],
  "chunkSizeChars": 2000,
  "overlapChars": 300,
  "maxPages": 100
}
```

#### Optimize for OpenAI text-embedding-3-small (8K-token max)

```json
{
  "url": "https://...",
  "chunkSizeChars": 1500,
  "overlapChars": 200
}
```
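
To sanity-check a chunk size against a model's input limit, a rough character-to-token conversion is often enough. This sketch assumes about 4 characters per token, which matches the sample row above (1480 chars, 370 tokens) but is only a heuristic for English text. The limit of 8000 tokens is a conservative reading of the "8K-token max" in the heading, and both constants are assumptions rather than values taken from the Actor.

```python
CHARS_PER_TOKEN = 4       # assumption: rough average for English text
MODEL_TOKEN_LIMIT = 8000  # assumption: conservative reading of "8K-token max"


def rough_tokens(char_count):
    """Approximate the token count for a chunk of `char_count` characters."""
    return char_count // CHARS_PER_TOKEN


def fits_model(chunk_size_chars, limit=MODEL_TOKEN_LIMIT):
    """Return True if a chunk of this size is likely to fit the model's input limit."""
    return rough_tokens(chunk_size_chars) <= limit


for size in (1500, 2000, 8000):
    print(size, rough_tokens(size), fits_model(size))
```

Under these assumptions even the largest allowed `chunkSizeChars` stays well below the limit, so chunk size is better tuned for retrieval quality than for fitting the model.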

***

### Recommended chunk sizes

| Embedding model | chunkSizeChars | Notes |
|---|---|---|
| OpenAI text-embedding-3-small | 1500 | ~375 tokens, fits well |
| OpenAI text-embedding-3-large | 2000 | ~500 tokens |
| Voyage voyage-3-large | 1500 | optimal balance |
| Cohere embed-v3 | 1800 | works with 512-token chunks |

Overlap of 100-300 chars boosts recall by ~5-10% with minimal storage cost.
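
To see how `chunkSizeChars` and `overlapChars` interact, here is a naive fixed-width sliding window. It is only an illustration of the two parameters and not the Actor's algorithm, which is described as producing semantically coherent segments. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, size=1500, overlap=200):
    """Split text into windows of `size` chars, each sharing `overlap` chars with the previous one."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    if not text:
        return []
    step = size - overlap
    # Stop before a window that would contain nothing but already-covered overlap.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]


text = "".join(chr(97 + i % 26) for i in range(4000))
chunks = sliding_chunks(text, size=1500, overlap=200)
print(len(chunks), [len(c) for c in chunks])
```

Each window advances by `size - overlap` characters, so a larger overlap produces more chunks for the same text and therefore a slightly higher per-chunk cost.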

***

### Pricing

**Pay-Per-Event**:

- `$0.005` per PDF processed
- `$0.0002` per chunk emitted

| Run | Chunks | Cost |
|---|---|---|
| One 80-page 10-K | ~200 | $0.045 |
| Batch of 100 papers @ 20 pages | ~6000 | $1.70 |
| Compliance archive 1000 PDFs | ~80000 | $21 |

For comparison, Unstructured.io costs $30+/mo plus per-document fees, and LangChain Hosted costs $500+/mo.
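
The table rows follow directly from the two event prices. A small estimator, assuming only the two listed events are charged (any separate platform usage fees are not modeled here):

```python
PER_PDF_USD = 0.005     # price per PDF processed, from the list above
PER_CHUNK_USD = 0.0002  # price per chunk emitted, from the list above


def estimate_cost(pdf_count, chunk_count):
    """Return the estimated Pay-Per-Event charge in USD."""
    return round(pdf_count * PER_PDF_USD + chunk_count * PER_CHUNK_USD, 4)


for label, pdfs, chunks in [
    ("One 80-page 10-K", 1, 200),
    ("Batch of 100 papers @ 20 pages", 100, 6000),
    ("Compliance archive 1000 PDFs", 1000, 80000),
]:
    print(f"{label}: ${estimate_cost(pdfs, chunks):.3f}")
```

The chunk counts are the approximate figures from the table. Actual counts depend on how much text each page contains and on the chunk size and overlap you choose.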

***

### Pipeline pattern: PDFs → vector DB

```python
import os

from apify_client import ApifyClient
from openai import OpenAI
from pinecone import Pinecone

# 1. Chunk PDFs (env var names below are placeholders; use your own secrets setup)
client = ApifyClient(os.environ["APIFY_TOKEN"])
run = client.actor("gochujang/pdf-rag-chunker").call(run_input={
    "urls": ["https://...filing.pdf"],
    "chunkSizeChars": 1500,
})

# 2. Embed each chunk (skip the per-PDF summary rows)
chunks = [
    c for c in client.dataset(run["defaultDatasetId"]).iterate_items()
    if c.get("_type") == "chunk"
]
embeddings = OpenAI().embeddings.create(  # reads OPENAI_API_KEY from the environment
    model="text-embedding-3-small",
    input=[c["text"] for c in chunks],  # for large runs, send in smaller batches
).data

# 3. Upsert to vector DB
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("rag-docs")
index.upsert(vectors=[
    {"id": f"{c['url']}-{c['global_chunk_index']}",
     "values": embeddings[i].embedding,
     "metadata": {"url": c["url"], "page": c["page"]}}
    for i, c in enumerate(chunks)
])
```
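
The pipeline above stops once the vectors are stored. The query side (embed the question, retrieve the top-k chunks) can be sketched without any vector DB by ranking stored vectors by cosine similarity. This is a toy in-memory illustration with made-up 2-D vectors standing in for real embeddings. A production system would query the vector DB instead, and the helper names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, rows, k=3):
    """rows is a list of (chunk_dict, embedding) pairs; return the k most similar chunks."""
    ranked = sorted(rows, key=lambda r: cosine(query_vec, r[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# Toy vectors; real ones would come from the embedding model
rows = [
    ({"page": 12, "text": "Item 1A. Risk Factors"}, [1.0, 0.0]),
    ({"page": 40, "text": "Financial statements"}, [0.0, 1.0]),
    ({"page": 13, "text": "Supply chain risks"}, [0.9, 0.1]),
]
best = top_k([1.0, 0.05], rows, k=2)
print([c["page"] for c in best])
```

The retrieved chunks, together with their `url` and `page` metadata, are what you would place in the LLM prompt as context and citations.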

***

### Limitations

- **Scanned PDFs (image-only)** — Returns 0 chunks. Use an OCR-equipped Actor instead.
- **Multi-column research papers** — Reading order may be slightly off (pdfplumber respects column layout but isn't perfect).
- **No embedding included** — Embedding requires your own OpenAI/Voyage/Cohere key (different vendor). We focus on chunking only to keep costs predictable.
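
Since scanned PDFs yield no chunks, it can help to flag them after a run so they can be rerouted to OCR. This sketch assumes such a PDF still produces a summary row in the dataset, which is not confirmed above. The function name `pdfs_without_chunks` is hypothetical.

```python
from collections import Counter


def pdfs_without_chunks(items):
    """Return URLs that have a summary row but no chunk rows."""
    chunk_counts = Counter(r["url"] for r in items if r.get("_type") == "chunk")
    return [
        r["url"] for r in items
        if r.get("_type") == "summary" and chunk_counts[r["url"]] == 0
    ]


# Hand-written rows for illustration (not real Actor output)
items = [
    {"_type": "summary", "url": "https://www.example.com/text.pdf", "ok": True},
    {"_type": "chunk", "url": "https://www.example.com/text.pdf", "text": "hello"},
    {"_type": "summary", "url": "https://www.example.com/scanned.pdf", "ok": True},
]
print(pdfs_without_chunks(items))
```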

***

### Related Actors (same author)

- [PDF Text & Table Extractor](https://apify.com/gochujang/pdf-text-extractor) — Same engine, returns full text instead of chunks
- [Web Page → Markdown Converter](https://apify.com/gochujang/web-to-markdown) — HTML equivalent
- [Article Summarizer](https://apify.com/gochujang/article-summarizer) — For one-shot summaries
- [JSON Schema Generator](https://apify.com/gochujang/json-schema-generator)

***

### Feedback

A short review helps RAG engineers find this Actor: [Leave a review on Apify Store](https://apify.com/gochujang/pdf-rag-chunker#reviews)

# Actor input Schema

## `urls` (type: `array`):

PDF URLs to download and chunk.

## `url` (type: `string`):

Single PDF URL (shortcut). Used only when `urls` is empty.

## `chunkSizeChars` (type: `integer`):

Target chars per chunk. ~1500 chars ≈ ~375 tokens (close to embedding-3 sweet spot).

## `overlapChars` (type: `integer`):

Char overlap between consecutive chunks (improves retrieval recall).

## `maxPages` (type: `integer`):

Stop after this many pages per PDF.

## `skipEmpty` (type: `boolean`):

Skip pages with no extractable text (e.g. scanned images).

## `userAgent` (type: `string`):

Custom `User-Agent` string.

## Actor input object example

```json
{
  "urls": [],
  "url": "",
  "chunkSizeChars": 1500,
  "overlapChars": 200,
  "maxPages": 200,
  "skipEmpty": true,
  "userAgent": ""
}
```
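
Validating input on the client side avoids starting a run that the platform would reject. The sketch below uses the minimum and maximum values listed in the OpenAPI specification further down (`chunkSizeChars` 200-8000, `overlapChars` 0-2000, `maxPages` 1-1000). The extra check that overlap is smaller than chunk size is my own assumption and not part of the schema, and the helper name `build_input` is hypothetical.

```python
def build_input(urls, chunk_size_chars=1500, overlap_chars=200,
                max_pages=200, skip_empty=True):
    """Build a run input dict, raising ValueError for out-of-range values."""
    if isinstance(urls, str):
        urls = [urls]
    if not urls:
        raise ValueError("at least one PDF URL is required")
    if not 200 <= chunk_size_chars <= 8000:
        raise ValueError("chunkSizeChars must be between 200 and 8000")
    if not 0 <= overlap_chars <= 2000:
        raise ValueError("overlapChars must be between 0 and 2000")
    if not 1 <= max_pages <= 1000:
        raise ValueError("maxPages must be between 1 and 1000")
    if overlap_chars >= chunk_size_chars:
        # Assumption: not stated in the schema, but a larger overlap makes no sense.
        raise ValueError("overlapChars must be smaller than chunkSizeChars")
    return {
        "urls": list(urls),
        "chunkSizeChars": chunk_size_chars,
        "overlapChars": overlap_chars,
        "maxPages": max_pages,
        "skipEmpty": skip_empty,
    }


run_input = build_input("https://www.example.com/report.pdf", chunk_size_chars=2000)
print(run_input)
```

The resulting dict can be passed as `run_input` to the Python client call shown in the API section.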

# Actor output Schema

## `dataset` (type: `string`):

No description

## `summary` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

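// Prepare Actor input (example values; replace the URL with your own PDF)
const input = {
    url: 'https://www.example.com/report.pdf',
    chunkSizeChars: 1500,
    overlapChars: 200,
};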

// Run the Actor and wait for it to finish
const run = await client.actor("gochujang/pdf-rag-chunker").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

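# Prepare the Actor input (example values; replace the URL with your own PDF)
run_input = {
    "url": "https://www.example.com/report.pdf",
    "chunkSizeChars": 1500,
    "overlapChars": 200,
}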

# Run the Actor and wait for it to finish
run = client.actor("gochujang/pdf-rag-chunker").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{"url": "https://www.example.com/report.pdf"}' |
apify call gochujang/pdf-rag-chunker --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=gochujang/pdf-rag-chunker",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "PDF → RAG Chunks (Token-Aware, Vector-Ready)",
        "description": "Download any PDF and chunk into semantically coherent segments ready for embedding/RAG. Configurable chunk size + overlap. Returns one row per chunk with page, char count, token estimate. Feed directly into OpenAI text-embedding-3 / Voyage / Cohere. $0.005 per PDF + $0.0002 per chunk.",
        "version": "0.1",
        "x-build-id": "OqixONeVnvAZaj7wF"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/gochujang~pdf-rag-chunker/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-gochujang-pdf-rag-chunker",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/gochujang~pdf-rag-chunker/runs": {
            "post": {
                "operationId": "runs-sync-gochujang-pdf-rag-chunker",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/gochujang~pdf-rag-chunker/run-sync": {
            "post": {
                "operationId": "run-sync-gochujang-pdf-rag-chunker",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "urls": {
                        "title": "PDF URLs",
                        "type": "array",
                        "description": "PDF URLs to download and chunk.",
                        "default": [],
                        "items": {
                            "type": "string"
                        }
                    },
                    "url": {
                        "title": "Single PDF URL (shortcut)",
                        "type": "string",
                        "description": "Used when 'urls' is empty.",
                        "default": ""
                    },
                    "chunkSizeChars": {
                        "title": "Chunk size (chars)",
                        "minimum": 200,
                        "maximum": 8000,
                        "type": "integer",
                        "description": "Target chars per chunk. ~1500 chars ≈ ~375 tokens (close to embedding-3 sweet spot).",
                        "default": 1500
                    },
                    "overlapChars": {
                        "title": "Overlap (chars)",
                        "minimum": 0,
                        "maximum": 2000,
                        "type": "integer",
                        "description": "Char overlap between consecutive chunks (improves retrieval recall).",
                        "default": 200
                    },
                    "maxPages": {
                        "title": "Max pages per PDF",
                        "minimum": 1,
                        "maximum": 1000,
                        "type": "integer",
                        "description": "Stop after this many pages.",
                        "default": 200
                    },
                    "skipEmpty": {
                        "title": "Skip empty pages",
                        "type": "boolean",
                        "description": "Skip pages with no extractable text (e.g. scanned images).",
                        "default": true
                    },
                    "userAgent": {
                        "title": "User-Agent",
                        "type": "string",
                        "description": "Custom UA.",
                        "default": ""
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
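
For projects without an Apify client library, the `run-sync-get-dataset-items` endpoint from the specification above can be called with the standard library alone. This sketch only builds the request; the network call is left commented out, and the assumption that the response body is a JSON array of dataset rows should be checked against the Apify API documentation. The helper name `build_sync_request` is hypothetical.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.apify.com/v2"  # from the "servers" entry in the spec
ACTOR_ID = "gochujang~pdf-rag-chunker"


def build_sync_request(token, run_input):
    """Build a POST request that runs the Actor and returns its dataset items."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{BASE_URL}/acts/{ACTOR_ID}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_sync_request("abc", {"url": "https://www.example.com/report.pdf"})
print(req.get_method(), req.full_url)

# To actually run the Actor (requires a valid token and network access):
# with urllib.request.urlopen(req) as resp:
#     items = json.load(resp)  # assumption: a JSON array of dataset rows
```

The synchronous endpoint keeps the HTTP connection open until the run finishes, so for large batches the asynchronous `/runs` endpoint is probably the safer choice.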
