# PDF to JSON Parser (`jungle_synthesizer/pdf-to-json-parser`) Actor

Convert PDF documents into structured JSON. Extracts text, tables, and fields from any PDF URL. Optional AI structuring pass (BYO OpenAI key) turns raw text into clean, organized JSON ready for automation or analysis.

- **URL**: https://apify.com/jungle_synthesizer/pdf-to-json-parser.md
- **Developed by:** [BowTiedRaccoon](https://apify.com/jungle_synthesizer) (community)
- **Categories:** AI, Automation, Developer tools
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded
- **User rating**: No ratings yet

## Pricing

Pay per event

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows (PowerShell)
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## PDF to JSON Parser

Convert PDF documents into structured JSON. Supply a list of public PDF URLs; the Actor downloads each file, extracts text from every page, and returns clean, organized output. Add your OpenAI API key to get an AI-powered structuring pass that turns raw text into categorized JSON fields.

### What it does

- Accepts a list of public PDF URLs (up to 50 MB per file)
- Downloads each PDF to temporary storage and extracts text per page using native PDF parsing
- Processes every page for complete coverage — no pages skipped
- Optionally runs an AI structuring pass (OpenAI GPT-4o-mini or GPT-4o) that organizes the raw text into titled sections, tables, key fields, and metadata
- Returns one dataset record per PDF with the full extracted text, per-page breakdown, and AI output
- Saves error records for PDFs that fail to download or parse — the run continues

### Use cases

- Invoice and receipt extraction for accounting automation
- Contract and legal document analysis
- Academic paper indexing and summarization
- Form data extraction from government or regulatory PDFs
- Report parsing for data pipelines
- Bulk document conversion for RAG / LLM pipelines

### Input

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `pdfUrls` | Array | Yes | Public PDF URLs to process. Must be directly downloadable. |
| `openaiApiKey` | String | No | Your OpenAI API key (`sk-...`). Enables AI structuring. Not stored. |
| `extractionPrompt` | String | No | Custom prompt for the AI structuring pass. Leave blank to use the default (extracts title, author, summary, sections, tables, key fields). |
| `model` | Select | No | OpenAI model: `gpt-4o-mini` (default, fast) or `gpt-4o` (most capable). |
| `maxItems` | Integer | No | Maximum PDFs to process per run. Default: 15. |

### Output

One dataset record per PDF:

| Field | Type | Description |
|-------|------|-------------|
| `sourceUrl` | String | Original PDF URL |
| `pageCount` | Number | Number of pages in the PDF |
| `rawText` | String | Full extracted text (all pages concatenated) |
| `pages` | String | JSON array of per-page text: `[{"page": 1, "text": "..."}]` |
| `structuredJson` | String | AI-structured output as a JSON string (null if no API key is supplied or the OpenAI call fails) |
| `model` | String | OpenAI model used (null if AI pass skipped) |
| `processedAt` | String | ISO timestamp when processing completed |
| `status` | String | `success` or `error` |
| `errorMsg` | String | Error message on failure, null on success |

#### Example record (native extraction only)

```json
{
  "sourceUrl": "https://example.com/invoice-2024-01.pdf",
  "pageCount": 2,
  "rawText": "Invoice #INV-2024-001\nDate: January 15, 2024\n...",
  "pages": "[{\"page\":1,\"text\":\"Invoice #INV-2024-001...\"},{\"page\":2,\"text\":\"Payment terms...\"}]",
  "structuredJson": null,
  "model": null,
  "processedAt": "2026-06-07T12:00:00.000Z",
  "status": "success",
  "errorMsg": null
}
````

#### Example record (with AI structuring)

```json
{
  "sourceUrl": "https://example.com/invoice-2024-01.pdf",
  "pageCount": 2,
  "rawText": "Invoice #INV-2024-001\nDate: January 15, 2024\n...",
  "pages": "[{\"page\":1,\"text\":\"Invoice #INV-2024-001...\"}]",
  "structuredJson": "{\"title\":\"Invoice #INV-2024-001\",\"date\":\"January 15, 2024\",\"key_fields\":{\"invoice_number\":\"INV-2024-001\",\"amount\":\"$1,250.00\"}}",
  "model": "gpt-4o-mini",
  "processedAt": "2026-06-07T12:00:00.000Z",
  "status": "success",
  "errorMsg": null
}
```
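
Because `pages` and `structuredJson` are delivered as JSON strings rather than nested objects, a consumer usually needs a second parsing step. The following is a minimal Python sketch, assuming records shaped like the examples above; `decode_record` is a hypothetical helper name and is not part of the Actor or the Apify client.

```python
import json


def decode_record(record: dict) -> dict:
    """Return a copy of a dataset record with its JSON-string fields parsed."""
    decoded = dict(record)
    # `pages` is documented as a JSON array serialized into a string.
    pages = record.get("pages")
    decoded["pages"] = json.loads(pages) if pages else []
    # `structuredJson` is null (None in Python) when there is no AI output.
    structured = record.get("structuredJson")
    decoded["structuredJson"] = json.loads(structured) if structured else None
    return decoded


# Sample record trimmed from the documented example (not real Actor output).
record = {
    "sourceUrl": "https://example.com/invoice-2024-01.pdf",
    "pageCount": 2,
    "pages": '[{"page":1,"text":"Invoice #INV-2024-001..."},'
             '{"page":2,"text":"Payment terms..."}]',
    "structuredJson": '{"title":"Invoice #INV-2024-001",'
                      '"key_fields":{"invoice_number":"INV-2024-001"}}',
    "status": "success",
    "errorMsg": None,
}

decoded = decode_record(record)
print(decoded["pages"][1]["text"])
print(decoded["structuredJson"]["key_fields"]["invoice_number"])
```

Keeping the original record untouched and returning a decoded copy makes it easy to store the raw dataset item alongside the parsed view.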

### Notes

- **Native extraction** works on any text-based PDF (invoices, reports, forms, contracts). Scanned image-only PDFs return empty text — OCR for image PDFs is not currently supported.
- **AI structuring** is additive. Even when the OpenAI call fails (rate limit, invalid key, network error), the Actor returns the native extraction record with `structuredJson: null` rather than failing the run.
- **Custom prompts** let you tailor the structuring output for a specific document type. For example: `"Extract all line items as an array of {description, quantity, unit_price, total}"`.
- **File size limit**: 50 MB per PDF. Larger files are rejected with an error record.
- **OpenAI costs** are billed to your API key separately from Actor usage.
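
The notes above imply three outcomes a downstream pipeline may want to separate: records with usable text, records that succeeded but contain no text (possibly scanned, image-only PDFs), and error records. A rough Python sketch, assuming only the documented `status` and `rawText` fields; the `triage` function and the sample records are illustrative, not Actor output.

```python
def triage(records):
    """Split dataset records into (usable, empty_text, failed) lists."""
    usable, empty_text, failed = [], [], []
    for record in records:
        if record.get("status") == "error":
            # Download or parse failure; `errorMsg` carries the reason.
            failed.append(record)
        elif not (record.get("rawText") or "").strip():
            # Succeeded but produced no text, e.g. a scanned image-only PDF.
            empty_text.append(record)
        else:
            usable.append(record)
    return usable, empty_text, failed


# Hypothetical records for illustration only.
sample = [
    {"sourceUrl": "https://example.com/a.pdf", "status": "success", "rawText": "Invoice ..."},
    {"sourceUrl": "https://example.com/scan.pdf", "status": "success", "rawText": ""},
    {"sourceUrl": "https://example.com/big.pdf", "status": "error", "rawText": None,
     "errorMsg": "File exceeds 50 MB"},
]

usable, empty_text, failed = triage(sample)
print(len(usable), len(empty_text), len(failed))
```

Records in the empty-text group could then be routed to a separate OCR tool, since this Actor does not provide OCR.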

# Actor input Schema

## `sp_intended_usage` (type: `string`):

Please describe how you plan to use the data extracted by this crawler.

## `sp_improvement_suggestions` (type: `string`):

Provide any feedback or suggestions for improvements.

## `sp_contact` (type: `string`):

Provide your email address so we can get in touch with you.

## `pdfUrls` (type: `array`):

List of public PDF URLs to process. Each URL must be directly downloadable (no login required). Max file size: 50 MB per PDF.

## `openaiApiKey` (type: `string`):

Your OpenAI API key (sk-...). Optional. When provided, the actor runs an AI structuring pass on the extracted text and returns structured JSON in the structuredJson field. Key is not stored or logged.

## `extractionPrompt` (type: `string`):

Custom prompt for the AI structuring pass. Tells the model what fields to extract and how to structure the output JSON. Leave blank to use the default prompt (general-purpose document extraction: title, author, summary, `key_fields`, tables).

## `model` (type: `string`):

OpenAI model for the AI structuring pass.

## `maxItems` (type: `integer`):

Maximum number of PDFs to process per run.

## Actor input object example

```json
{
  "sp_intended_usage": "Describe your intended use...",
  "sp_improvement_suggestions": "Share your suggestions here...",
  "sp_contact": "Share your email here...",
  "pdfUrls": [
    "https://www.w3.org/WAI/WCAG21/wcag21.pdf"
  ],
  "extractionPrompt": "",
  "model": "gpt-4o-mini",
  "maxItems": 3
}
```
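
The OpenAPI specification later in this document lists `sp_intended_usage`, `sp_improvement_suggestions`, and `pdfUrls` as required, and restricts `model` to two values. The sketch below assembles an input object and checks it against those constraints before a run is started. `build_input` is a hypothetical helper; the non-empty `pdfUrls` check and the positive `maxItems` check are assumptions, since the schema does not state them.

```python
ALLOWED_MODELS = ("gpt-4o-mini", "gpt-4o")  # from the `model` enum in the schema


def build_input(pdf_urls, intended_usage, improvement_suggestions, *,
                openai_api_key=None, extraction_prompt="",
                model="gpt-4o-mini", max_items=15):
    """Assemble an Actor input dict and validate it locally."""
    # Assumption: an empty URL list is treated as an error here.
    if not pdf_urls or not all(isinstance(u, str) and u for u in pdf_urls):
        raise ValueError("pdfUrls must be a non-empty list of URL strings")
    # The schema marks both fields as required with minLength 1.
    if not intended_usage or not improvement_suggestions:
        raise ValueError("sp_intended_usage and sp_improvement_suggestions are required")
    if model not in ALLOWED_MODELS:
        raise ValueError(f"model must be one of {ALLOWED_MODELS}")
    # Assumption: maxItems should be at least 1.
    if max_items < 1:
        raise ValueError("maxItems must be a positive integer")

    run_input = {
        "sp_intended_usage": intended_usage,
        "sp_improvement_suggestions": improvement_suggestions,
        "pdfUrls": list(pdf_urls),
        "extractionPrompt": extraction_prompt,
        "model": model,
        "maxItems": max_items,
    }
    # Only include the key when supplied, so native-only runs omit it entirely.
    if openai_api_key:
        run_input["openaiApiKey"] = openai_api_key
    return run_input


run_input = build_input(
    ["https://www.w3.org/WAI/WCAG21/wcag21.pdf"],
    "Describe your intended use...",
    "Share your suggestions here...",
    max_items=3,
)
print(sorted(run_input))
```

The resulting dict can be passed as `run_input` to the Python client call shown in the API section.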

# Actor output Schema

## `results` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "sp_intended_usage": "Describe your intended use...",
    "sp_improvement_suggestions": "Share your suggestions here...",
    "sp_contact": "Share your email here...",
    "pdfUrls": [
        "https://www.w3.org/WAI/WCAG21/wcag21.pdf"
    ],
    "maxItems": 3
};

// Run the Actor and wait for it to finish
const run = await client.actor("jungle_synthesizer/pdf-to-json-parser").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "sp_intended_usage": "Describe your intended use...",
    "sp_improvement_suggestions": "Share your suggestions here...",
    "sp_contact": "Share your email here...",
    "pdfUrls": ["https://www.w3.org/WAI/WCAG21/wcag21.pdf"],
    "maxItems": 3,
}

# Run the Actor and wait for it to finish
run = client.actor("jungle_synthesizer/pdf-to-json-parser").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "sp_intended_usage": "Describe your intended use...",
  "sp_improvement_suggestions": "Share your suggestions here...",
  "sp_contact": "Share your email here...",
  "pdfUrls": [
    "https://www.w3.org/WAI/WCAG21/wcag21.pdf"
  ],
  "maxItems": 3
}' |
apify call jungle_synthesizer/pdf-to-json-parser --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=jungle_synthesizer/pdf-to-json-parser",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "PDF to JSON Parser",
        "description": "Convert PDF documents into structured JSON. Extracts text, tables, and fields from any PDF URL. Optional AI structuring pass (BYO OpenAI key) turns raw text into clean, organized JSON ready for automation or analysis.",
        "version": "0.1",
        "x-build-id": "ElOcBo6GYZdWUsYba"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/jungle_synthesizer~pdf-to-json-parser/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-jungle_synthesizer-pdf-to-json-parser",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/jungle_synthesizer~pdf-to-json-parser/runs": {
            "post": {
                "operationId": "runs-sync-jungle_synthesizer-pdf-to-json-parser",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/jungle_synthesizer~pdf-to-json-parser/run-sync": {
            "post": {
                "operationId": "run-sync-jungle_synthesizer-pdf-to-json-parser",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "sp_intended_usage",
                    "sp_improvement_suggestions",
                    "pdfUrls"
                ],
                "properties": {
                    "sp_intended_usage": {
                        "title": "What is the intended usage of this data?",
                        "minLength": 1,
                        "type": "string",
                        "description": "Please describe how you plan to use the data extracted by this crawler."
                    },
                    "sp_improvement_suggestions": {
                        "title": "How can we improve this crawler for you?",
                        "minLength": 1,
                        "type": "string",
                        "description": "Provide any feedback or suggestions for improvements."
                    },
                    "sp_contact": {
                        "title": "Contact Email",
                        "minLength": 1,
                        "type": "string",
                        "description": "Provide your email address so we can get in touch with you."
                    },
                    "pdfUrls": {
                        "title": "PDF URLs",
                        "type": "array",
                        "description": "List of public PDF URLs to process. Each URL must be directly downloadable (no login required). Max file size: 50 MB per PDF.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "openaiApiKey": {
                        "title": "OpenAI API Key (Optional)",
                        "type": "string",
                        "description": "Your OpenAI API key (sk-...). Optional. When provided, the actor runs an AI structuring pass on the extracted text and returns structured JSON in the structuredJson field. Key is not stored or logged."
                    },
                    "extractionPrompt": {
                        "title": "Extraction Prompt",
                        "type": "string",
                        "description": "Custom prompt for the AI structuring pass. Tells the model what fields to extract and how to structure the output JSON. Leave blank to use the default prompt (general-purpose document extraction: title, author, summary, key_fields, tables).",
                        "default": ""
                    },
                    "model": {
                        "title": "OpenAI Model",
                        "enum": [
                            "gpt-4o-mini",
                            "gpt-4o"
                        ],
                        "type": "string",
                        "description": "OpenAI model for the AI structuring pass.",
                        "default": "gpt-4o-mini"
                    },
                    "maxItems": {
                        "title": "Max Items",
                        "type": "integer",
                        "description": "Maximum number of PDFs to process per run.",
                        "default": 15
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
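
For projects that do not use the Apify client, the `run-sync-get-dataset-items` path from the specification above can be called directly. The following is a standard-library Python sketch built from the server URL, path, and `token` query parameter in the specification. The helper names, the 300-second timeout, and the assumption that the response body is a JSON array of dataset items are illustrative and should be checked against the Apify API reference.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"  # `servers[0].url` in the specification
ACTOR_PATH = "/acts/jungle_synthesizer~pdf-to-json-parser/run-sync-get-dataset-items"


def build_request(token: str, run_input: dict) -> urllib.request.Request:
    """Build the POST request; the token goes in the query string per the spec."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{API_BASE}{ACTOR_PATH}?{query}",
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sync(token: str, run_input: dict, timeout: float = 300.0):
    """Send the request and parse the body (assumed to be a JSON array)."""
    request = build_request(token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request performs no network I/O; call run_sync() to execute it.
request = build_request(
    "<YOUR_API_TOKEN>",
    {
        "sp_intended_usage": "Describe your intended use...",
        "sp_improvement_suggestions": "Share your suggestions here...",
        "pdfUrls": ["https://www.w3.org/WAI/WCAG21/wcag21.pdf"],
        "maxItems": 3,
    },
)
print(request.get_method(), request.full_url.split("?")[0])
```

A synchronous call holds the HTTP connection open for the whole run, so for large batches the asynchronous `/runs` endpoint followed by a dataset fetch may be the better fit.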
