# PDF Text Extractor API - URL to Text, Per-Page, Batch (`gratifying_graph/pdf-extract-api`) Actor

Turn any public PDF URL into clean text and metadata. Per-page output, batch processing, and a synchronous API mode for AI agents. Pay per page extracted, cheaper than the alternatives.

- **URL**: https://apify.com/gratifying_graph/pdf-extract-api.md
- **Developed by:** [Jimmy A](https://apify.com/gratifying_graph) (community)
- **Categories:** Developer tools, AI
- **Stats:** 1 total users, 0 monthly users, 0.0% runs succeeded, NaN bookmarks
- **User rating**: No ratings yet

## Pricing

From $2.00 / 1,000 pages extracted

This Actor is paid per event and usage. You are charged both the fixed price for specific events and for Apify platform usage.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## PDF Text Extractor API - URL to text, per-page, batch

Give it public PDF URLs, get back **clean text and document metadata**. One block per page or per document, batch-capable, and callable as a **synchronous API** so AI agents and automations can extract PDFs on demand.

No OCR needed for digital PDFs, no upload step, no key. Pay per page extracted - cheaper than comparable Actors charging $0.022-0.04 per page.

### What it does

1. Fetches each PDF URL (redirects followed, 60s timeout)
2. Extracts text page by page with line reconstruction (not one giant word soup)
3. Reads the document's own metadata (title, author, producer, dates) as published in the file
4. Outputs one structured record per document, with per-page text blocks if you want them

### Use cases

- **RAG / AI pipelines**: turn report URLs into chunks for embedding, page-aligned
- **Agents**: call the standby endpoint as a tool - "read this PDF and answer"
- **Document monitoring**: pair with a scheduler to extract recurring reports (filings, government publications, price lists)
- **Data entry automation**: pull text from invoices, spec sheets, catalogs you have rights to process
- **Research**: batch-extract paper PDFs into searchable text

### Input

```json
{
  "pdfUrls": [
    "https://arxiv.org/pdf/1706.03762",
    "https://example.com/annual-report.pdf"
  ],
  "perPage": true,
  "maxPages": 500
}
````

### Output

```json
{
  "url": "https://arxiv.org/pdf/1706.03762",
  "pageCount": 15,
  "pagesExtracted": 15,
  "truncated": false,
  "metadata": { "title": null, "author": null, "producer": "pdfTeX", "creationDate": "..." },
  "pages": [
    { "page": 1, "text": "Attention Is All You Need\n..." }
  ]
}
```

Set `perPage: false` for a single `text` field per document. Failed URLs produce a record with an `error` field instead of killing the run.
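Downstream code usually has to handle all three record shapes described above: per-page records, single-`text` records, and failed URLs. The following is a minimal sketch of turning dataset items into page-aligned chunks for an embedding pipeline. It assumes that error records still carry a `url` field and that `error` holds a message string; neither is confirmed by the README, and `iter_chunks` is a hypothetical helper name.

```python
from typing import Iterator


def iter_chunks(records: list[dict]) -> Iterator[dict]:
    """Yield one chunk per page (perPage=true) or per document (perPage=false)."""
    for record in records:
        if record.get("error"):
            # Failed URLs carry an `error` field instead of text; skip them here.
            continue
        if "pages" in record:
            for page in record["pages"]:
                yield {"url": record["url"], "page": page["page"], "text": page["text"]}
        else:
            # perPage=false: a single `text` field per document, so no page number.
            yield {"url": record["url"], "page": None, "text": record.get("text", "")}


# Illustrative records only; real items come from the run's dataset.
sample_records = [
    {
        "url": "https://example.com/a.pdf",
        "pages": [{"page": 1, "text": "First page"}, {"page": 2, "text": "Second page"}],
    },
    {"url": "https://example.com/b.pdf", "text": "Whole document"},
    {"url": "https://example.com/missing.pdf", "error": "fetch failed"},
]

chunks = list(iter_chunks(sample_records))
for chunk in chunks:
    print(chunk["url"], chunk["page"], len(chunk["text"]))
```

Keeping the page number on each chunk lets a retrieval step cite the exact page it drew from.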

### API / Standby mode for AI agents

```
GET /?url=https://example.com/file.pdf&perPage=true&maxPages=50
```

Returns the full extraction JSON synchronously. Works as a tool for agent frameworks that support Apify Actors.
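Because the PDF address is itself passed as a query parameter, it should be percent-encoded so that its own `?` and `&` characters are not read as parameters of the standby request. The sketch below only builds the request URL. `STANDBY_BASE_URL` is a hypothetical placeholder (the README does not state the host), and `build_standby_url` is an illustrative helper, not part of the Actor.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: copy the real Standby URL from the Actor's page in Apify Console.
STANDBY_BASE_URL = "https://<ACTOR_STANDBY_HOST>"


def build_standby_url(pdf_url: str, per_page: bool = True, max_pages: int = 50,
                      base_url: str = STANDBY_BASE_URL) -> str:
    """Return the GET URL for one synchronous extraction."""
    query = urlencode({
        "url": pdf_url,  # percent-encoded by urlencode
        "perPage": "true" if per_page else "false",
        "maxPages": max_pages,
    })
    return f"{base_url}/?{query}"


print(build_standby_url("https://example.com/file.pdf?version=2&lang=en"))
```

Standby requests normally also need your Apify API token, for example in an `Authorization: Bearer <YOUR_API_TOKEN>` header; treat that as an assumption and check the Apify Standby documentation for the accepted forms.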

### Pricing

| Event | Price |
|---|---|
| Actor start | $0.0005 |
| Per page extracted | $0.002 |
| API call (standby) | $0.02 |

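A 40-page report costs about $0.08 in page events (40 × $0.002), plus the $0.0005 Actor start event and Apify platform usage. Comparable Actors charge $0.022-0.04 per page, roughly 11-20x more.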
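For budgeting, the event prices in the table can be summed directly. This is a small sketch of that arithmetic using `Decimal` to avoid floating-point rounding. It covers event charges only; platform usage is billed on top, and whether a standby call also triggers per-page events is not stated here, so the `standby_calls` parameter is an assumption about how the events combine.

```python
from decimal import Decimal

# Event prices copied from the pricing table (USD).
PRICE_ACTOR_START = Decimal("0.0005")
PRICE_PER_PAGE = Decimal("0.002")
PRICE_STANDBY_CALL = Decimal("0.02")


def estimate_event_cost(pages: int, actor_starts: int = 1, standby_calls: int = 0) -> Decimal:
    """Sum the pay-per-event charges; Apify platform usage is billed separately."""
    return (PRICE_ACTOR_START * actor_starts
            + PRICE_PER_PAGE * pages
            + PRICE_STANDBY_CALL * standby_calls)


print(estimate_event_cost(40))  # 0.0805
```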

### FAQ

**Does it do OCR on scanned PDFs?**
Not in this version. It extracts the text layer of digital PDFs (the overwhelming majority of reports, papers, and filings). Scanned-image PDFs return empty pages; an OCR tier is planned - ask in Issues if you need it.
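Since scanned PDFs come back with empty pages, a caller can flag documents that probably need OCR elsewhere. The sketch below assumes per-page output (`perPage: true`); the 0.9 threshold and both helper names are hypothetical choices, not Actor behavior.

```python
def empty_page_ratio(record: dict) -> float:
    """Fraction of extracted pages whose text is empty or whitespace-only."""
    pages = record.get("pages") or []
    if not pages:
        return 0.0
    empty = sum(1 for page in pages if not (page.get("text") or "").strip())
    return empty / len(pages)


def probably_scanned(record: dict, threshold: float = 0.9) -> bool:
    """Heuristic: treat a document as scanned when nearly all pages are empty."""
    return empty_page_ratio(record) >= threshold


# Illustrative records only.
scanned = {"url": "https://example.com/scan.pdf", "pages": [{"page": 1, "text": ""}, {"page": 2, "text": " \n"}]}
digital = {"url": "https://example.com/report.pdf", "pages": [{"page": 1, "text": "Annual report"}, {"page": 2, "text": ""}]}

print(probably_scanned(scanned), probably_scanned(digital))
```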

**How are lines handled?**
Text items are regrouped by their position on the page, so paragraphs read naturally instead of being one long line.
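To illustrate the general idea of position-based regrouping, here is a simplified sketch. It is not the Actor's implementation: the `TextItem` shape, the `y_tolerance` value, and the single-space joining are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    x: float  # horizontal position of the fragment
    y: float  # baseline position; PDF coordinates grow upward
    text: str


def group_into_lines(items: list[TextItem], y_tolerance: float = 2.0) -> list[str]:
    """Group fragments with nearly equal y into lines, ordered top to bottom."""
    lines: list[list[TextItem]] = []
    # Sort by descending y so the topmost fragments come first.
    for item in sorted(items, key=lambda it: -it.y):
        if lines and abs(lines[-1][0].y - item.y) <= y_tolerance:
            lines[-1].append(item)
        else:
            lines.append([item])
    # Within each line, order fragments left to right.
    return [" ".join(it.text for it in sorted(line, key=lambda it: it.x)) for line in lines]


items = [
    TextItem(120.0, 700.0, "Is"),
    TextItem(72.0, 700.4, "Attention"),
    TextItem(72.0, 684.0, "You Need"),
    TextItem(140.0, 699.8, "All"),
]
print(group_into_lines(items))  # ['Attention Is All', 'You Need']
```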

**Maximum size?**
Default cap is 500 pages per document (configurable). Very large files are limited by fetch timeout (60s).

**Password-protected PDFs?**
Not supported. Public, unencrypted documents only.

**CSV/Excel export?**
Every Apify dataset exports as JSON, CSV, or Excel via the platform.

# Actor input Schema

## `pdfUrls` (type: `array`):

Direct URLs to PDF files. Each is fetched and extracted.

## `perPage` (type: `boolean`):

Output one text block per page instead of one block per document.

## `maxPages` (type: `integer`):

Stop extracting after this many pages per document.

## `includeMetadata` (type: `boolean`):

Include the PDF's own metadata fields (title, author, producer, dates) as published in the file.

## Actor input object example

```json
{
  "pdfUrls": [
    "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
  ],
  "perPage": true,
  "maxPages": 500,
  "includeMetadata": true
}
```
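When the input is assembled in code, it can help to apply the schema's defaults and constraints before starting a run. The following sketch mirrors the defaults and the `maxPages` minimum of 1 from the OpenAPI specification further down. The non-empty list check and the `http(s)` prefix check are extra assumptions of this example, and `build_input` is a hypothetical helper.

```python
def build_input(pdf_urls: list[str], per_page: bool = True, max_pages: int = 500,
                include_metadata: bool = True) -> dict:
    """Build an Actor input object, failing early on obviously invalid values."""
    if not pdf_urls:
        raise ValueError("pdfUrls is required and should contain at least one URL")
    for url in pdf_urls:
        # Assumption: only http(s) URLs make sense for a public PDF fetch.
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"not an http(s) URL: {url!r}")
    if max_pages < 1:
        raise ValueError("maxPages must be at least 1")
    return {
        "pdfUrls": list(pdf_urls),
        "perPage": per_page,
        "maxPages": max_pages,
        "includeMetadata": include_metadata,
    }


actor_input = build_input(["https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"])
print(actor_input)
```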

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "pdfUrls": [
        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("gratifying_graph/pdf-extract-api").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "pdfUrls": ["https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"] }

# Run the Actor and wait for it to finish
run = client.actor("gratifying_graph/pdf-extract-api").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "pdfUrls": [
    "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
  ]
}' |
apify call gratifying_graph/pdf-extract-api --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=gratifying_graph/pdf-extract-api",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "PDF Text Extractor API - URL to Text, Per-Page, Batch",
        "description": "Turn any public PDF URL into clean text and metadata. Per-page output, batch processing, and a synchronous API mode for AI agents. Pay per page extracted, cheaper than the alternatives.",
        "version": "1.0",
        "x-build-id": "gEXGTyvpJYZgSNU07"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/gratifying_graph~pdf-extract-api/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-gratifying_graph-pdf-extract-api",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/gratifying_graph~pdf-extract-api/runs": {
            "post": {
                "operationId": "runs-sync-gratifying_graph-pdf-extract-api",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/gratifying_graph~pdf-extract-api/run-sync": {
            "post": {
                "operationId": "run-sync-gratifying_graph-pdf-extract-api",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "pdfUrls"
                ],
                "properties": {
                    "pdfUrls": {
                        "title": "PDF URLs",
                        "type": "array",
                        "description": "Direct URLs to PDF files. Each is fetched and extracted.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "perPage": {
                        "title": "Per-page output",
                        "type": "boolean",
                        "description": "Output one text block per page instead of one block per document.",
                        "default": true
                    },
                    "maxPages": {
                        "title": "Max pages per PDF",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Stop extracting after this many pages per document.",
                        "default": 500
                    },
                    "includeMetadata": {
                        "title": "Include document metadata",
                        "type": "boolean",
                        "description": "Include the PDF's own metadata fields (title, author, producer, dates) as published in the file.",
                        "default": true
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
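For projects that cannot use a client library, the `run-sync-get-dataset-items` path from the specification can be called with the Python standard library alone. This is a minimal sketch: the helper names and the 300-second timeout are assumptions, and the response is assumed to be a JSON array of dataset items, which the specification above only describes as "OK".

```python
import json
import urllib.request
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/gratifying_graph~pdf-extract-api/run-sync-get-dataset-items"


def build_run_sync_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Prepare the POST request: token in the query string, input as the JSON body."""
    url = f"{API_BASE}{ACTOR_PATH}?{urlencode({'token': token})}"
    body = json.dumps(actor_input).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST", headers={"Content-Type": "application/json"}
    )


def run_sync(token: str, actor_input: dict, timeout: float = 300.0) -> list[dict]:
    """Send the request and parse the response (assumed to be a JSON array of items)."""
    request = build_run_sync_request(token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the request makes no network call; run_sync() does.
request = build_run_sync_request("<YOUR_API_TOKEN>", {"pdfUrls": ["https://example.com/file.pdf"]})
print(request.get_method(), request.full_url)
```

Calling `run_sync("<YOUR_API_TOKEN>", {...})` with a real token starts a billable run, so the example above stops at constructing the request.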
