# Pdf to json (`shahabuddin38/pdf-to-json`) Actor

Convert PDF files into structured JSON with optional OCR, table extraction, key-value detection, and metadata parsing. Ideal for invoices, receipts, contracts, statements, forms, and document automation workflows. Supports digital and scanned PDFs for API-ready data extraction.

- **URL**: https://apify.com/shahabuddin38/pdf-to-json.md
- **Developed by:** [Shahab Uddin](https://apify.com/shahabuddin38) (community)
- **Categories:** AI, Developer tools, Automation
- **Stats:** 1 total users, 0 monthly users, 0.0% runs succeeded, NaN bookmarks
- **User rating**: No ratings yet

## Pricing

from $3.50 / 1,000 results

This Actor is paid per event. You are not charged for the Apify platform usage, but only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event
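
For rough budgeting, the listed starting rate can be turned into a per-run estimate. The sketch below assumes every result is billed as one event at exactly $3.50 per 1,000; the Actor's actual pay-per-event configuration may define other events or tiers, so treat the output as an approximation. The function name `estimate_cost_usd` is illustrative.

```python
from decimal import Decimal

# Assumption: the "from $3.50 / 1,000 results" rate applies to every result.
PRICE_PER_1000_RESULTS = Decimal("3.50")


def estimate_cost_usd(result_count: int) -> Decimal:
    """Estimate the charge for a number of results at the listed starting rate."""
    if result_count < 0:
        raise ValueError("result_count must be non-negative")
    cost = PRICE_PER_1000_RESULTS * result_count / 1000
    # Round to a hundredth of a cent so small runs do not show as zero.
    return cost.quantize(Decimal("0.0001"))


if __name__ == "__main__":
    for count in (1, 250, 1000):
        print(f"{count} results -> ${estimate_cost_usd(count)}")
```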

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## PDF to JSON API — Convert PDF Files to Structured JSON (TypeScript, Advanced OCR, Table & Field Extraction)

**PDF to JSON API** is a production-ready Apify Actor, written in TypeScript, that converts PDF files into clean, structured JSON. It supports advanced OCR for scanned PDFs, table extraction, key-value field extraction, and is designed for easy extension and integration.

Website: [caastleaapk.com](https://caastleaapk.com)

---

### Why use this PDF to JSON API?

- **TypeScript-powered** for safety, maintainability, and developer experience
- **Advanced OCR**: Extract text from scanned/image-based PDFs (extendable)
- **Table extraction**: Detect and extract tables from PDFs (customizable logic)
- **Key-value field extraction**: Extract structured fields for invoices, receipts, contracts, and more
- **Metadata extraction**: Capture PDF metadata for compliance and search
- **API-ready**: Use as a PDF parser API or document parsing API in your workflows
- **Dataset output**: Results are saved to an Apify dataset for easy integration with Make, Zapier, n8n, and custom apps
- **Input schema**: UI for manual runs, input validation, and API consistency

---

### Features

- Convert digital and scanned PDFs to normalized JSON
- Optional advanced OCR mode (extendable with pdf-lib/pdfjs-dist + tesseract.js)
- Table and key-value extraction (custom logic supported)
- Modular, clean TypeScript codebase for easy extension
- Handles multiple PDF URLs per run
- Robust error handling and input validation
- Commercial-quality, ready for production and API use

---

### Use Cases

- Invoice, receipt, and bank statement parsing (with custom field extraction)
- Contract and compliance document analysis
- Resume and form extraction
- Research paper and report ingestion
- AI and LLM document preprocessing
- Internal knowledge base building

---

### Input Example

```json
{
  "pdfUrls": [
    "https://example.com/sample.pdf"
  ],
  "useOcr": true,
  "extractTables": true,
  "extractKeyValuePairs": true,
  "includeMetadata": true,
  "outputFormat": "json",
  "maxPages": 25,
  "timeoutSecs": 120
}
````

***

### Output Example

```json
{
  "sourceUrl": "https://example.com/sample.pdf",
  "fileName": "sample.pdf",
  "pageCount": 12,
  "metadata": {
    "title": "Sample PDF",
    "author": "Unknown"
  },
  "text": "Full extracted text goes here...",
  "tables": [
    {
      "page": 2,
      "rows": [
        ["Name", "Amount"],
        ["Invoice A", "1200"]
      ]
    }
  ],
  "keyValuePairs": {
    "invoice_number": "INV-1001",
    "total": "1200"
  },
  "success": true,
  "processedAt": "2026-04-12T10:00:00.000Z"
}
```

If processing fails:

```json
{
  "sourceUrl": "https://example.com/broken.pdf",
  "success": false,
  "error": "Unable to parse PDF"
}
```
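
Because successful and failed records share one dataset, downstream code usually needs to separate them first. The following is a minimal sketch that relies only on the `success`, `sourceUrl`, and `error` fields shown above; the helper name `split_results` is an assumption for illustration and is not part of the Actor.

```python
from typing import Any, Dict, Iterable, List, Tuple

Item = Dict[str, Any]


def split_results(items: Iterable[Item]) -> Tuple[List[Item], List[Item]]:
    """Separate successful records from failed ones using the `success` flag."""
    succeeded: List[Item] = []
    failed: List[Item] = []
    for item in items:
        # A missing flag is treated as a failure so incomplete records are not trusted.
        if item.get("success") is True:
            succeeded.append(item)
        else:
            failed.append(item)
    return succeeded, failed


if __name__ == "__main__":
    sample = [
        {"sourceUrl": "https://example.com/sample.pdf", "success": True, "pageCount": 12},
        {"sourceUrl": "https://example.com/broken.pdf", "success": False,
         "error": "Unable to parse PDF"},
    ]
    ok, bad = split_results(sample)
    print(f"{len(ok)} succeeded, {len(bad)} failed")
    for item in bad:
        print(f"{item['sourceUrl']}: {item.get('error', 'unknown error')}")
```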

***

### How it Works

1. Accepts one or more PDF URLs (or uploaded files if supported)
2. Downloads and inspects each PDF
3. Extracts text from digital PDFs
4. Uses advanced OCR for scanned/image-based PDFs (extendable in TypeScript)
5. Detects tables and key-value fields (custom logic possible)
6. Normalizes everything into a stable JSON schema
7. Saves results to the Apify dataset for API access and integrations

**TypeScript support:** The Actor is written in TypeScript for maintainability, type safety, and easy extension. Add your own advanced OCR, table, or field extraction logic in `main.ts`.
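
The per-URL flow above can be pictured as a loop that isolates failures, so one bad PDF does not abort the whole batch. This is a sketch of that pattern only, written in Python to match the client examples in this document; it is not the Actor's TypeScript source. `process_all` and `demo_parse` are hypothetical names, and the real download, OCR, and extraction steps are stood in for by the injected `parse_pdf` callable.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def utc_timestamp() -> str:
    """Return the current UTC time in the `2026-04-12T10:00:00.000Z` style."""
    now = datetime.now(timezone.utc)
    return now.isoformat(timespec="milliseconds").replace("+00:00", "Z")


def process_all(urls: List[str], parse_pdf: Callable[[str], Record]) -> List[Record]:
    """Run `parse_pdf` for each URL and turn exceptions into failure records."""
    records: List[Record] = []
    for url in urls:
        try:
            fields = parse_pdf(url)  # hypothetical: download, extract, normalize
            record = {"sourceUrl": url, **fields,
                      "success": True, "processedAt": utc_timestamp()}
        except Exception as exc:  # one bad PDF should not stop the batch
            record = {"sourceUrl": url, "success": False, "error": str(exc)}
        records.append(record)
    return records


def demo_parse(url: str) -> Record:
    """Stand-in parser used only for this demonstration."""
    if "broken" in url:
        raise ValueError("Unable to parse PDF")
    return {"fileName": url.rsplit("/", 1)[-1], "pageCount": 1, "text": ""}


if __name__ == "__main__":
    demo_urls = ["https://example.com/sample.pdf", "https://example.com/broken.pdf"]
    for rec in process_all(demo_urls, demo_parse):
        print(rec)
```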

***

### API Usage Example

Run the Actor with the Apify API:

```bash
curl -X POST "https://api.apify.com/v2/acts/shahabuddin38~pdf-to-json/runs?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "pdfUrls": ["https://example.com/sample.pdf"],
    "useOcr": true,
    "extractTables": true,
    "extractKeyValuePairs": true,
    "includeMetadata": true,
    "outputFormat": "json"
  }'
```

After the run finishes, fetch results from the dataset:

```bash
curl "https://api.apify.com/v2/datasets/YOUR_DATASET_ID/items?clean=true&format=json"
```
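
If you call the REST API from code instead of `curl`, the two request URLs can be assembled with the standard library. The sketch below only builds the URLs and sends nothing; `run_url` and `dataset_items_url` are illustrative helper names, and the token and dataset ID are placeholders you supply.

```python
from urllib.parse import quote, urlencode

API_BASE = "https://api.apify.com/v2"


def run_url(actor_id: str, token: str) -> str:
    """Build the URL that starts a run for an Actor given as `username/actor-name`."""
    # In REST paths the username and Actor name are joined with `~`, not `/`.
    path_id = quote(actor_id.replace("/", "~"), safe="~")
    return f"{API_BASE}/acts/{path_id}/runs?{urlencode({'token': token})}"


def dataset_items_url(dataset_id: str) -> str:
    """Build the URL that returns a dataset's items as clean JSON."""
    query = urlencode({"clean": "true", "format": "json"})
    return f"{API_BASE}/datasets/{quote(dataset_id, safe='')}/items?{query}"


if __name__ == "__main__":
    print(run_url("shahabuddin38/pdf-to-json", "YOUR_APIFY_TOKEN"))
    print(dataset_items_url("YOUR_DATASET_ID"))
```

The run response includes a `defaultDatasetId` field (see `runsResponseSchema` in the OpenAPI specification), which is the value to pass to `dataset_items_url`.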

***

### Extending the Actor (TypeScript)

- **Advanced OCR**: Integrate `pdf-lib` or `pdfjs-dist` to render PDF pages as images, then use `tesseract.js` for OCR. See the `extractTextFromPdf` function in `main.ts` for extension points.
- **Table Extraction**: Replace or enhance the default table extraction logic with ML models or custom heuristics in `extractTablesFromPdf`.
- **Key-Value Extraction**: Add regex, ML, or domain-specific logic in `extractKeyValuePairs` for invoices, receipts, contracts, etc.
- **API/Integration**: Use the Apify dataset output for downstream automation, RPA, or AI workflows.
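
To make the regex and heuristic options above concrete, here is a minimal sketch of label-colon-value matching and whitespace-based row splitting. It is shown in Python to match the other examples in this document; the Actor's own `extractKeyValuePairs` and `extractTablesFromPdf` live in `main.ts` and may work differently. The function names, the regular expression, and the two-space column rule are all assumptions, and the heuristics are deliberately crude (for example, a bare URL line would be misread as a key-value pair).

```python
import re
from typing import Dict, List

# Matches lines such as "Invoice Number: INV-1001" (label, colon, value).
KEY_VALUE_LINE = re.compile(r"^\s*([A-Za-z][A-Za-z0-9 _-]{0,40}?)\s*:\s*(\S.*?)\s*$")


def extract_key_value_pairs(text: str) -> Dict[str, str]:
    """Collect `label: value` lines into a dict with snake_case keys."""
    pairs: Dict[str, str] = {}
    for line in text.splitlines():
        match = KEY_VALUE_LINE.match(line)
        if match:
            key = re.sub(r"[\s-]+", "_", match.group(1).strip().lower())
            pairs.setdefault(key, match.group(2))  # keep the first occurrence
    return pairs


def extract_table_rows(text: str) -> List[List[str]]:
    """Treat lines whose columns are separated by tabs or 2+ spaces as rows."""
    rows: List[List[str]] = []
    for line in text.splitlines():
        cells = [cell for cell in re.split(r"\t+|\s{2,}", line.strip()) if cell]
        if len(cells) >= 2:
            rows.append(cells)
    return rows


if __name__ == "__main__":
    sample = "Invoice Number: INV-1001\nTotal: 1200\n\nName        Amount\nInvoice A   1200\n"
    print(extract_key_value_pairs(sample))
    print(extract_table_rows(sample))
```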

***

### SEO-Friendly FAQ

**Is this Actor written in TypeScript?**\
Yes, the Actor is TypeScript-based for better code quality, maintainability, and extensibility.

**Can I add my own advanced OCR, table, or field extraction logic?**\
Absolutely. The codebase is modular and ready for you to plug in custom logic for invoices, receipts, contracts, and more. See `main.ts` for extension points.

**What is a PDF to JSON API?**\
A PDF to JSON API converts PDF documents into machine-readable JSON so the data can be searched, automated, and integrated into software systems.

**Can I convert PDF to JSON automatically?**\
Yes. This Actor is designed to convert PDF to JSON automatically using API input, optional OCR, and structured output saved to an Apify dataset.

**Does this support scanned PDFs?**\
Yes. With OCR enabled, you can process scanned or image-based documents for OCR PDF to JSON workflows.

**Is this a PDF parser API or document parsing API?**\
Both. It can be used as a PDF parser API and more generally as a document parsing API for structured extraction.

**Can I extract tables from PDF files?**\
Yes. The Actor can extract table-like structures and include them in the returned JSON.

**What types of documents work best?**\
Invoices, receipts, contracts, forms, reports, statements, resumes, and research PDFs are common use cases.

**Can I use this for AI workflows?**\
Yes. Structured JSON output is helpful for embeddings, retrieval pipelines, document classification, and LLM-based automation.
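
As one possible preprocessing step for embeddings or retrieval, the `text` field of an output item can be split into smaller chunks that keep a reference to their source. The sketch below is an assumption about how you might do this, not a feature of the Actor; `chunk_item`, `chunkIndex`, and the 1,000-character default are illustrative choices.

```python
from typing import Any, Dict, List


def chunk_item(item: Dict[str, Any], max_chars: int = 1000) -> List[Dict[str, Any]]:
    """Split an item's `text` on word boundaries into chunks of about `max_chars`.

    A single word longer than `max_chars` is kept whole rather than cut.
    """
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    chunks: List[str] = []
    current = ""
    for word in str(item.get("text", "")).split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)  # close the chunk before it overflows
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return [
        {"sourceUrl": item.get("sourceUrl"), "chunkIndex": index, "text": chunk}
        for index, chunk in enumerate(chunks)
    ]


if __name__ == "__main__":
    demo = {"sourceUrl": "https://example.com/sample.pdf", "text": "aaa bbb ccc"}
    for chunk in chunk_item(demo, max_chars=7):
        print(chunk)
```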

***

### Support

For custom integrations, advanced extraction, TypeScript consulting, or branded implementations, visit:

[caastleaapk.com](https://caastleaapk.com)

***

### License

MIT

# Actor input Schema

## `pdfUrls` (type: `array`):

List of direct PDF file URLs to process.

## `useOcr` (type: `boolean`):

Enable OCR for scanned/image-based PDFs.

## `extractTables` (type: `boolean`):

Detect and extract tables from PDFs.

## `extractKeyValuePairs` (type: `boolean`):

Extract key-value fields from documents.

## `includeMetadata` (type: `boolean`):

Extract and include PDF metadata.

## `outputFormat` (type: `string`):

Choose output format.

## `webhookUrl` (type: `string`):

Optional webhook to notify when processing is done.

## `proxyConfiguration` (type: `string`):

Proxy settings for downloading files (use 'apifyProxy' or leave blank).

## `maxPages` (type: `integer`):

Maximum number of pages to process per PDF.

## `timeoutSecs` (type: `integer`):

Maximum time to spend processing each PDF.

## Actor input object example

```json
{
  "pdfUrls": [
    "https://example.com/sample.pdf"
  ],
  "useOcr": false,
  "extractTables": false,
  "extractKeyValuePairs": false,
  "includeMetadata": true,
  "outputFormat": "json"
}
```
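
Checking an input object locally before starting a run can catch mistakes early. This sketch mirrors the constraints listed in the OpenAPI specification further below (`pdfUrls` required with at least one item, `outputFormat` limited to `json` or `pretty-json`, `maxPages` at least 1, `timeoutSecs` at least 10). The `validate_input` helper is an assumption for illustration; the platform performs its own validation against the Actor's input schema.

```python
from typing import Any, Dict, List

ALLOWED_OUTPUT_FORMATS = ("json", "pretty-json")
INTEGER_MINIMUMS = (("maxPages", 1), ("timeoutSecs", 10))


def validate_input(actor_input: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the input looks valid."""
    problems: List[str] = []

    urls = actor_input.get("pdfUrls")
    if not isinstance(urls, list) or len(urls) < 1:
        problems.append("pdfUrls must be a non-empty array")
    elif not all(isinstance(url, str) for url in urls):
        problems.append("pdfUrls items must be strings")

    if actor_input.get("outputFormat", "json") not in ALLOWED_OUTPUT_FORMATS:
        problems.append("outputFormat must be 'json' or 'pretty-json'")

    for name, minimum in INTEGER_MINIMUMS:
        value = actor_input.get(name)
        if value is None:
            continue  # optional field
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < minimum:
            problems.append(f"{name} must be an integer >= {minimum}")

    return problems


if __name__ == "__main__":
    print(validate_input({"useOcr": False, "outputFormat": "json"}))
    print(validate_input({"pdfUrls": ["https://example.com/sample.pdf"], "maxPages": 25}))
```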

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

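// Prepare Actor input (`pdfUrls` is required; the URL below is a placeholder)
const input = {
    pdfUrls: ['https://example.com/sample.pdf'],
};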

// Run the Actor and wait for it to finish
const run = await client.actor("shahabuddin38/pdf-to-json").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

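# Prepare the Actor input (`pdfUrls` is required; the URL below is a placeholder)
run_input = {"pdfUrls": ["https://example.com/sample.pdf"]}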

# Run the Actor and wait for it to finish
run = client.actor("shahabuddin38/pdf-to-json").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{"pdfUrls": ["https://example.com/sample.pdf"]}' |
apify call shahabuddin38/pdf-to-json --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=shahabuddin38/pdf-to-json",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Pdf to json",
        "description": "Convert PDF files into structured JSON with optional OCR, table extraction, key-value detection, and metadata parsing. Ideal for invoices, receipts, contracts, statements, forms, and document automation workflows. Supports digital and scanned PDFs for API-ready data extraction.",
        "version": "0.0",
        "x-build-id": "FNX2HUrBUZAYd9gm1"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/shahabuddin38~pdf-to-json/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-shahabuddin38-pdf-to-json",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/shahabuddin38~pdf-to-json/runs": {
            "post": {
                "operationId": "runs-sync-shahabuddin38-pdf-to-json",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/shahabuddin38~pdf-to-json/run-sync": {
            "post": {
                "operationId": "run-sync-shahabuddin38-pdf-to-json",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "pdfUrls"
                ],
                "properties": {
                    "pdfUrls": {
                        "title": "PDF URLs",
                        "minItems": 1,
                        "type": "array",
                        "description": "List of direct PDF file URLs to process.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "useOcr": {
                        "title": "Use OCR for scanned PDFs",
                        "type": "boolean",
                        "description": "Enable OCR for scanned/image-based PDFs.",
                        "default": false
                    },
                    "extractTables": {
                        "title": "Extract Tables",
                        "type": "boolean",
                        "description": "Detect and extract tables from PDFs.",
                        "default": false
                    },
                    "extractKeyValuePairs": {
                        "title": "Extract Key-Value Pairs",
                        "type": "boolean",
                        "description": "Extract key-value fields from documents.",
                        "default": false
                    },
                    "includeMetadata": {
                        "title": "Include Metadata",
                        "type": "boolean",
                        "description": "Extract and include PDF metadata.",
                        "default": true
                    },
                    "outputFormat": {
                        "title": "Output Format",
                        "enum": [
                            "json",
                            "pretty-json"
                        ],
                        "type": "string",
                        "description": "Choose output format.",
                        "default": "json"
                    },
                    "webhookUrl": {
                        "title": "Webhook URL",
                        "type": "string",
                        "description": "Optional webhook to notify when processing is done."
                    },
                    "proxyConfiguration": {
                        "title": "Proxy Configuration",
                        "type": "string",
                        "description": "Proxy settings for downloading files (use 'apifyProxy' or leave blank)."
                    },
                    "maxPages": {
                        "title": "Max Pages",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Maximum number of pages to process per PDF."
                    },
                    "timeoutSecs": {
                        "title": "Timeout (seconds)",
                        "minimum": 10,
                        "type": "integer",
                        "description": "Maximum time to spend processing each PDF."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
