# AI Data Extraction from PDF (`actor4you/ai-data-extraction-from-pdf`) Actor

Extract text data from PDF files using AI. Upload PDFs directly or provide URLs. Supports text chunking for LLM workflows.

- **URL:** https://apify.com/actor4you/ai-data-extraction-from-pdf.md
- **Developed by:** [Actor4you](https://apify.com/actor4you) (community)
- **Categories:** AI, Automation, Developer tools
- **Stats:** 2 total users, 1 monthly user, 0.0% runs succeeded, bookmark count unavailable
- **User rating:** No ratings yet

## Pricing

Pay per usage

This Actor is billed by platform usage. The Actor itself is free to use; you pay only for the Apify platform usage it consumes, which gets cheaper on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
The word "Actor" is always written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows (PowerShell)
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see the Apify documentation as a [Markdown index](https://docs.apify.com/llms.txt) or as [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

### What is AI Data Extraction from PDF?

**AI Data Extraction from PDF** is a cloud-based tool that lets you **extract text from PDF** documents at scale. **Upload PDF files directly** in the Apify Console or provide **URLs to PDF files** hosted online - no coding required. This powerful **PDF text extractor** supports **text chunking** for seamless integration with LLM and RAG pipelines, making it the go-to **PDF scraper** for batch processing.

### What can AI Data Extraction from PDF do?

- **Dual input method** - Upload PDFs directly or paste URLs to online PDF files. Both methods can be combined in the same run.

- **Smart text chunking** - Split extracted text into configurable chunks with customizable overlap, purpose-built for RAG and AI workflows.

- **Batch PDF processing** - Process hundreds of PDF documents in a single run. Convert PDF to text format at scale.

- **REST API access** - Call the text extraction API programmatically from any language or platform using the Apify API.

- **Scheduling** - Set up recurring runs to process new PDFs automatically on a schedule.

- **Webhooks & integrations** - Connect to Slack, Google Sheets, Zapier, Make, or your own endpoints. Get notified when PDF data extraction completes.

- **Cloud-based** - No local installation, no dependencies. Runs on Apify's infrastructure with automatic scaling.

- **Export anywhere** - Download results as JSON, CSV, XML, or Excel. Push data directly to databases or APIs.

### What data can you extract from PDF?

| Field | Type | Description |
|-------|------|-------------|
| `url` | String | Source URL of the processed PDF file |
| `index` | Number | Page or chunk number (starting from 0) |
| `text` | String | Extracted text content - clean, structured, and ready for processing |

Each PDF produces one or more dataset items depending on the number of pages and your chunking configuration. The output is structured for immediate use in data pipelines, spreadsheets, or AI applications.

### How to use AI Data Extraction from PDF to extract text

1. **Go to the Actor page** - Navigate to [AI Data Extraction from PDF](https://apify.com/actor4you/ai-data-extraction-from-pdf) on Apify Store and click **Try for free**.
2. **Upload your PDFs or add URLs** - Use the **Upload PDF Files** field to drag and drop your documents, or paste direct links into the **PDF URLs** field. You can use both methods simultaneously.
3. **Configure chunking (optional)** - Toggle **Perform Chunking** if you need the text split into smaller segments. Set your preferred **Chunk Size** (characters per chunk) and **Chunk Overlap** (characters shared between consecutive chunks).
4. **Start the extraction** - Click **Start** and wait for the run to complete. The Actor processes each PDF and pushes extracted text to the dataset.
5. **Download your data** - Open the **Dataset** tab to preview results. Export as JSON, CSV, XML, or Excel, or access results via the API.

### How much does it cost to extract data from PDF?

AI Data Extraction from PDF runs on the [Apify Free plan](https://apify.com/pricing), which gives you $5 of free platform credits every month. A typical PDF extraction run costs well under $0.01 per document, meaning you can process **hundreds of PDFs for free** each month.

For higher volumes, paid plans offer more compute and storage. Platform usage is billed per compute unit consumed - there is no per-document fee. Check the [Apify pricing page](https://apify.com/pricing) for current rates.

### Input - configuration options

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `pdfFiles` | File Upload (array) | - | Upload one or more PDF files directly in the Apify Console. Files are stored in a key-value store and processed automatically. |
| `urls` | String List (array) | - | URLs of PDF files hosted online. Paste direct links to `.pdf` files. |
| `performChunking` | Boolean | `false` | Enable text chunking to split extracted content into smaller segments. Essential for LLM and RAG workflows. |
| `chunkSize` | Integer | `1000` | Maximum number of characters per chunk. Only applies when chunking is enabled. |
| `chunkOverlap` | Integer | `0` | Number of overlapping characters between consecutive chunks. Helps preserve context at chunk boundaries. |

You must provide at least one PDF - either via upload or URL. Both input methods can be used together in the same run.

### Output example - extracted text from PDF

```json
[
    {
        "url": "https://example.com/report-2024.pdf",
        "index": 0,
        "text": "Annual Report 2024. Executive Summary. This report presents the financial results and strategic initiatives undertaken during the fiscal year 2024. Total revenue increased by 12% year-over-year, driven primarily by growth in digital services..."
    },
    {
        "url": "https://example.com/report-2024.pdf",
        "index": 1,
        "text": "...driven primarily by growth in digital services and international expansion. Operating margins improved to 18.3%, reflecting cost optimization measures implemented in Q2. The company invested $45M in research and development..."
    },
    {
        "url": "https://example.com/invoice-march.pdf",
        "index": 0,
        "text": "Invoice #INV-2024-0342. Date: March 15, 2024. Bill To: Acme Corporation. Description: Cloud infrastructure services - March 2024. Amount: $12,500.00. Payment Terms: Net 30."
    }
]
````
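
Because every item carries its source `url` and an `index`, results from several PDFs can be regrouped per document after a run. The following is a minimal Python sketch, assuming items shaped like the output example; `group_by_url` is a hypothetical helper and not part of the Actor or the Apify client.

```python
from collections import defaultdict


def group_by_url(items):
    """Group dataset items by source URL, ordered by their `index` field."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["url"]].append(item)
    # Sort each document's items so pages or chunks come back in order.
    return {
        url: [entry["text"] for entry in sorted(entries, key=lambda e: e["index"])]
        for url, entries in grouped.items()
    }


# Sample items using the documented field names (texts shortened for illustration).
items = [
    {"url": "https://example.com/report-2024.pdf", "index": 1, "text": "second"},
    {"url": "https://example.com/invoice-march.pdf", "index": 0, "text": "invoice"},
    {"url": "https://example.com/report-2024.pdf", "index": 0, "text": "first"},
]

texts_by_url = group_by_url(items)
print(texts_by_url["https://example.com/report-2024.pdf"])
# → ['first', 'second']
```

If chunking with a non-zero overlap was enabled, simply joining the texts would repeat the overlapping characters, so the sketch keeps them as a list.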

### Use cases - who should use this PDF data extraction tool?

- **Finance & accounting** - Extract data from invoices, receipts, and financial statements. Automate document-to-text conversion for bookkeeping workflows.

- **Research & academia** - Pull text from research papers, journals, and academic PDFs. Build searchable databases of scientific literature.

- **Business intelligence** - Convert PDF reports into structured data for analysis. Feed quarterly reports, market research, and white papers into your data pipeline.

- **AI & LLM pipelines** - Use the built-in chunking feature to prepare PDF content for retrieval-augmented generation (RAG). Feed properly sized text chunks directly into vector databases or language models.

- **Legal document processing** - Extract text from contracts, court filings, and regulatory documents. Process large volumes of legal PDFs for review and analysis.

- **Enterprise batch processing** - Process hundreds of PDFs in a single run. Schedule daily or weekly extractions for incoming document streams using Apify's scheduling and webhook features.

### FAQ - PDF data extraction questions

#### Is it legal to extract text from PDF files?

Yes. Extracting text from PDF files you own or have permission to access is perfectly legal. This tool processes the documents you provide - it does not scrape third-party websites. Always ensure you have the right to process the PDFs you upload or link to.

#### Can this tool handle scanned PDFs or images inside PDFs?

This Actor works best with **text-based PDFs** - documents where the text is embedded as selectable content. Scanned PDFs that contain only images may return limited or no text. For scanned documents, consider using an OCR-capable tool first, then processing the output with this Actor.

#### How does text chunking work, and when should I use it?

When **Perform Chunking** is enabled, the extracted text is split into segments of up to `chunkSize` characters. The `chunkOverlap` parameter controls how many characters are shared between consecutive chunks, which helps preserve context at boundaries. Use chunking when you plan to feed the text into a large language model, vector database, or any system with token or character limits.
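
The effect of `chunkSize` and `chunkOverlap` can be illustrated with a short Python sketch. It assumes a plain sliding window over characters that advances by `chunkSize - chunkOverlap`; the Actor's actual splitting logic is not documented here and may differ, for example at page or sentence boundaries. `chunk_text` is a hypothetical helper.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=0):
    """Split text into chunks of up to chunk_size characters.

    Consecutive chunks share chunk_overlap characters. This is an assumed
    sliding-window scheme, not the Actor's confirmed implementation.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

In the printed result, each chunk starts with the last character of the previous one, which is the one-character overlap.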

#### Is there a limit on the number or size of PDFs I can process?

There is no hard limit on the number of PDFs per run. Processing time and cost scale with the total volume of data. Very large PDFs (hundreds of pages) will produce more dataset items and use more compute time. For extremely large batches, consider splitting your input across multiple runs.
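
Splitting a large list of URLs across several runs could look like the Python sketch below. The batch size of 2 and the `doc-N.pdf` URLs are placeholders chosen for illustration, and `split_into_batches` is a hypothetical helper; each resulting dictionary would be passed as the input of a separate run.

```python
def split_into_batches(urls, batch_size):
    """Yield successive lists of at most batch_size URLs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(urls), batch_size):
        yield urls[start:start + batch_size]


# Hypothetical input: 5 PDF URLs split into runs of at most 2 URLs each.
pdf_urls = [f"https://example.com/doc-{n}.pdf" for n in range(5)]
run_inputs = [{"urls": batch} for batch in split_into_batches(pdf_urls, 2)]
print(len(run_inputs))  # → 3
```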

#### What output formats are available?

The Actor outputs structured data to an Apify Dataset. You can export results as **JSON**, **CSV**, **XML**, **Excel**, or **RSS**. You can also access the data programmatically via the Apify API, or push it directly to external services using integrations and webhooks.
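
For programmatic export, a download URL can be built per format. The Python sketch below assumes the Apify API v2 dataset items endpoint (`/datasets/{datasetId}/items`) and its `format` query parameter; both come from the general Apify API rather than from this README, so check them against the API reference before relying on them. `dataset_export_url` and the format mapping are hypothetical.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"
# Formats named in this README, mapped to assumed `format` query values.
EXPORT_FORMATS = {"JSON": "json", "CSV": "csv", "XML": "xml", "Excel": "xlsx", "RSS": "rss"}


def dataset_export_url(dataset_id, export_format):
    """Build a dataset items URL for the requested export format."""
    if export_format not in EXPORT_FORMATS:
        raise ValueError(f"unsupported format: {export_format}")
    query = urlencode({"format": EXPORT_FORMATS[export_format]})
    return f"{API_BASE}/datasets/{dataset_id}/items?{query}"


# Replace the placeholder with the `defaultDatasetId` of your run.
print(dataset_export_url("<DATASET_ID>", "CSV"))
```

Datasets that are not public also need your API token, which this sketch leaves out.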

# Actor input Schema

## `pdfFiles` (type: `array`):

Upload PDF files directly. Files will be stored in a key-value store and processed automatically.

## `urls` (type: `array`):

URLs of PDF files to extract data from. Use this if your PDFs are already hosted online.

## `performChunking` (type: `boolean`):

Whether to split the extracted text into smaller chunks. Useful for preparing data for large language models.

## `chunkSize` (type: `integer`):

Maximum number of characters per chunk.

## `chunkOverlap` (type: `integer`):

Number of overlapping characters between consecutive chunks.

## Actor input object example

```json
{
  "performChunking": false,
  "chunkSize": 1000,
  "chunkOverlap": 0
}
```
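
The example object above contains only the chunking defaults, while a run needs at least one PDF. A minimal Python sketch of assembling a complete input from URLs is shown below; it checks the minimums stated in the input schema (`chunkSize` of at least 1, `chunkOverlap` of at least 0). `build_input` is a hypothetical helper, the URL is a placeholder, and uploads through `pdfFiles` are not covered.

```python
def build_input(urls=None, perform_chunking=False, chunk_size=1000, chunk_overlap=0):
    """Return an input object for the Actor, checking the documented constraints."""
    urls = list(urls or [])
    if not urls:
        # Uploads via `pdfFiles` are outside the scope of this sketch.
        raise ValueError("provide at least one PDF URL")
    if chunk_size < 1:
        raise ValueError("chunkSize must be at least 1")
    if chunk_overlap < 0:
        raise ValueError("chunkOverlap must be at least 0")
    return {
        "urls": urls,
        "performChunking": perform_chunking,
        "chunkSize": chunk_size,
        "chunkOverlap": chunk_overlap,
    }


run_input = build_input(
    ["https://example.com/report-2024.pdf"],  # placeholder URL
    perform_chunking=True,
    chunk_size=500,
    chunk_overlap=50,
)
print(run_input)
```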

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

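// Prepare Actor input (at least one PDF is required; this URL is a placeholder)
const input = {
    urls: ['https://example.com/report-2024.pdf'],
};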

// Run the Actor and wait for it to finish
const run = await client.actor("actor4you/ai-data-extraction-from-pdf").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

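# Prepare the Actor input (at least one PDF is required; this URL is a placeholder)
run_input = {
    "urls": ["https://example.com/report-2024.pdf"],
}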

# Run the Actor and wait for it to finish
run = client.actor("actor4you/ai-data-extraction-from-pdf").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{"urls": ["https://example.com/report-2024.pdf"]}' |
apify call actor4you/ai-data-extraction-from-pdf --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=actor4you/ai-data-extraction-from-pdf",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "AI Data Extraction from PDF",
        "description": "Extract text data from PDF files using AI. Upload PDFs directly or provide URLs. Supports text chunking for LLM workflows.",
        "version": "0.0",
        "x-build-id": "zTccSDXDoOj7MAMjB"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/actor4you~ai-data-extraction-from-pdf/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-actor4you-ai-data-extraction-from-pdf",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/actor4you~ai-data-extraction-from-pdf/runs": {
            "post": {
                "operationId": "runs-sync-actor4you-ai-data-extraction-from-pdf",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/actor4you~ai-data-extraction-from-pdf/run-sync": {
            "post": {
                "operationId": "run-sync-actor4you-ai-data-extraction-from-pdf",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "pdfFiles": {
                        "title": "Upload PDF Files",
                        "type": "array",
                        "description": "Upload PDF files directly. Files will be stored in a key-value store and processed automatically."
                    },
                    "urls": {
                        "title": "PDF URLs",
                        "type": "array",
                        "description": "URLs of PDF files to extract data from. Use this if your PDFs are already hosted online.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "performChunking": {
                        "title": "Perform Chunking",
                        "type": "boolean",
                        "description": "Whether to split the extracted text into smaller chunks. Useful for preparing data for large language models.",
                        "default": false
                    },
                    "chunkSize": {
                        "title": "Chunk Size",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Maximum number of characters per chunk.",
                        "default": 1000
                    },
                    "chunkOverlap": {
                        "title": "Chunk Overlap",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Number of overlapping characters between consecutive chunks.",
                        "default": 0
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
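
Without a client library, the `run-sync-get-dataset-items` path from the specification can be called with plain HTTP. The Python sketch below only prepares the request using the standard library and does not send it; `build_sync_request` is a hypothetical helper, and the token and PDF URL are placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/actor4you~ai-data-extraction-from-pdf/run-sync-get-dataset-items"


def build_sync_request(token, actor_input):
    """Prepare (but do not send) a POST request for the synchronous run endpoint."""
    query = urlencode({"token": token})  # the spec passes the token as a query parameter
    body = json.dumps(actor_input).encode("utf-8")
    return Request(
        f"{API_BASE}{ACTOR_PATH}?{query}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_sync_request(
    "<YOUR_API_TOKEN>",
    {"urls": ["https://example.com/report-2024.pdf"]},
)
print(request.get_method(), request.full_url)
# To send it, pass `request` to urllib.request.urlopen and read the response body.
```

Because this endpoint waits for the run to finish, long extractions may be better served by the asynchronous `/runs` path listed in the same specification.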
