# OCR & Document Extractor – PDF & Image to Text, JSON, Word (`lofomachines/ocr-document-extractor`) Actor

Convert scanned PDFs and images into clean, structured text in bulk. Export to JSON, Markdown, DOCX, TXT or HTML with tables and layout preserved.

- **URL**: https://apify.com/lofomachines/ocr-document-extractor.md
- **Developed by:** [Lofomachines](https://apify.com/lofomachines) (community)
- **Categories:** Agents, Developer tools, AI
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

from $0.03 / file processed

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event
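
As a rough illustration of pay-per-event pricing, the sketch below estimates a lower bound for the cost of a run. It assumes one "file processed" event per file, charged at the listed starting price of $0.03. The function name `estimate_min_cost_usd` and the flat rate are assumptions for illustration; the price actually charged may be higher.

```python
from decimal import Decimal

# Assumption: the listed "from $0.03 / file processed" starting price,
# applied as a flat rate. Real pricing may differ by plan or event type.
PRICE_PER_FILE_USD = Decimal("0.03")


def estimate_min_cost_usd(file_count: int, price_per_file: Decimal = PRICE_PER_FILE_USD) -> Decimal:
    """Return a lower-bound cost estimate for processing `file_count` files."""
    if file_count < 0:
        raise ValueError("file_count must be non-negative")
    return price_per_file * file_count


if __name__ == "__main__":
    # Lower-bound estimate for a batch of 250 files at the starting price.
    print(estimate_min_cost_usd(250))
```

`Decimal` is used instead of `float` so that currency amounts do not pick up binary rounding errors.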

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows (PowerShell)
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## OCR & Document Extractor – PDF & Image to JSON, Markdown, Word, Text & HTML

**Turn scanned PDFs and images into clean, structured, searchable text — in bulk.** Upload your files (or paste links), pick your output formats, and get back ready-to-use **JSON, Markdown, Word (DOCX), plain text, and HTML** with tables, headings, and reading order preserved.

Fast, accurate, multilingual OCR for invoices, contracts, books, forms, receipts, research papers, ID documents, handwritten notes, and any scanned document — no setup, no code required.

---

### ✨ Why choose this OCR Actor?

- 📚 **Bulk processing** – Convert hundreds of PDFs and images in a single run.
- 🧠 **Layout-aware extraction** – Keeps titles, paragraphs, reading order, and **tables** intact instead of dumping jumbled text.
- 🗂️ **Five output formats** – Export to **JSON, Markdown, DOCX, TXT, and HTML** — choose one or all.
- 🌍 **Multilingual** – Recognizes 80+ languages including English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, and more.
- 📄 **Any document** – Scanned PDFs, photos of documents, screenshots, multi-page TIFFs, invoices, receipts, and forms.
- 🔍 **Rich structured data** – Page-by-page text, word and character counts, table detection, confidence scores, and downloadable files.
- ⚡ **Fast & cost-efficient** – Tuned for high throughput so you pay less for more pages.
- 🔌 **API & integrations ready** – Use it from the Apify API, JavaScript/Python SDKs, Zapier, Make, n8n, or any no-code tool.

---

### 🎯 Who is it for?

| You are a... | Use it to... |
|---|---|
| **Developer / Startup** | Add OCR to your product without managing infrastructure or models. |
| **Finance / Accounting team** | Extract data from invoices, receipts, and statements into structured records. |
| **Legal / Compliance** | Make contracts and scanned filings searchable and editable. |
| **Researcher / Academic** | Digitize papers, books, and archives into Markdown or Word. |
| **Data / Automation team** | Feed clean text into RAG pipelines, LLMs, databases, and spreadsheets. |
| **Operations / Back office** | Convert paper-based forms into digital workflows. |

---

### 🚀 How to use it

1. **Upload your files** using the *Upload files* button — or paste direct **document links** (URLs).
2. **Choose your output formats** — JSON, Markdown, DOCX, TXT, HTML (pick any combination).
3. **Select the language** (or leave on *Automatic*).
4. Click **Start** ▶️

That's it. When the run finishes, you'll find every document as a structured record in the **dataset**, plus downloadable files in the **storage** tab.

> 💡 No technical configuration needed. Performance, accuracy, and reliability are optimized for you out of the box.

---

### 📥 Input

| Field | Description |
|---|---|
| **Upload files** | The PDFs or images you want to convert. Upload many at once. |
| **Document links (URLs)** | Optional. Direct links to PDFs or images to process. |
| **Output formats** | One or more of: JSON, Markdown, DOCX, TXT, HTML. |
| **Document language** | Main language of your documents, or *Automatic*. |
| **Detect tables & layout** | Keep on to preserve structure and tables; turn off for fastest plain-text extraction. |

#### Supported file types

`PDF` · `PNG` · `JPG / JPEG` · `WEBP` · `BMP` · `TIFF` · `GIF`

#### Example input

```json
{
  "documentUrls": [
    "https://example.com/invoice.pdf",
    "https://example.com/scanned-contract.png"
  ],
  "outputFormats": ["json", "markdown", "docx"],
  "language": "auto",
  "extractTables": true
}
````

***

### 📤 Output

Each processed document becomes **one clean, structured record** in the dataset. Generated files (Markdown, Word, TXT, HTML) are saved to storage and linked directly in each record.

#### Example output record

```json
{
  "fileName": "invoice.pdf",
  "status": "succeeded",
  "language": "auto",
  "pageCount": 2,
  "wordCount": 384,
  "characterCount": 2197,
  "tableCount": 1,
  "averageConfidence": 0.985,
  "text": "INVOICE\nAcme Corp\n...",
  "markdown": "# INVOICE\n\n**Acme Corp**\n\n| Item | Qty | Price |\n|---|---|---|\n...",
  "pages": [
    {
      "pageNumber": 1,
      "width": 1654,
      "height": 2339,
      "text": "INVOICE ...",
      "markdown": "# INVOICE ...",
      "confidence": 0.987,
      "tableCount": 1
    }
  ],
  "tables": [
    { "page": 1, "rows": 5, "columns": 3, "html": "<table>...</table>" }
  ],
  "outputFiles": {
    "markdown": "https://api.apify.com/v2/key-value-stores/.../records/0001-invoice.md",
    "docx": "https://api.apify.com/v2/key-value-stores/.../records/0001-invoice.docx"
  },
  "processedAt": "2026-06-17T10:00:00.000Z"
}
```
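
A downstream script usually needs only a few fields from each record. The following is a minimal sketch of how a record shaped like the example above might be summarized. The helper name `summarize_record` and the 0.9 confidence cut-off are assumptions for illustration, not part of the Actor.

```python
from typing import Any

# Assumption: records follow the shape of the example output record.
# The 0.9 threshold is an arbitrary illustrative cut-off, not an Actor setting.
LOW_CONFIDENCE_THRESHOLD = 0.9


def summarize_record(record: dict[str, Any], threshold: float = LOW_CONFIDENCE_THRESHOLD) -> dict[str, Any]:
    """Collect the fields most pipelines need from one dataset record."""
    pages = record.get("pages", [])
    # Pages whose OCR confidence falls below the threshold may need manual review.
    low_confidence_pages = [
        page["pageNumber"]
        for page in pages
        if page.get("confidence", 0.0) < threshold
    ]
    return {
        "fileName": record.get("fileName"),
        "ok": record.get("status") == "succeeded",
        "pageCount": record.get("pageCount", len(pages)),
        "lowConfidencePages": low_confidence_pages,
        "downloads": dict(record.get("outputFiles", {})),
    }
```

Flagging low-confidence pages separately makes it easy to route only those pages to a human reviewer instead of re-checking whole documents.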

You can export the full dataset to **JSON, CSV, Excel, or XML** with one click, or fetch it via the API.

***

### 💡 Popular use cases

- **PDF to Word converter** – Turn scanned PDFs into editable DOCX files.
- **Invoice & receipt OCR** – Extract totals, line items, and tables into structured data.
- **Image to text** – Pull text from photos and screenshots.
- **Document digitization** – Convert paper archives into searchable Markdown or HTML.
- **RAG & AI pipelines** – Produce clean, LLM-ready Markdown for chatbots and knowledge bases.
- **Data entry automation** – Replace manual typing with automated extraction.
- **Accessibility** – Make scanned documents readable by screen readers.
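
For the invoice and table use cases, the `html` field of each entry in `tables` can be turned into plain rows with the Python standard library. This is a sketch only: the example record elides the table markup, so the code assumes conventional `<tr>`, `<th>`, and `<td>` elements, and `table_html_to_rows` is a hypothetical helper name.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableRowParser(HTMLParser):
    """Collect cell text from <tr>/<td>/<th> markup into a list of rows."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None   # cells of the row being read
        self._cell: Optional[List[str]] = None  # text chunks of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (for example whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_html_to_rows(html: str) -> List[List[str]]:
    """Parse one table's HTML (assumed <tr>/<td>/<th> markup) into rows of text."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

The resulting list of rows can be written with `csv.writer` or loaded into a spreadsheet library. Nested tables and merged cells are not handled by this sketch.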

***

### 🔗 Integrations & automation

Run this Actor on a schedule, trigger it from your app, or connect it to **Zapier, Make, n8n**, Google Sheets, Airtable, and more. Call it programmatically with the **Apify API** or the official **JavaScript** and **Python** clients, and pull results straight into your workflow.

***

### ❓ FAQ

**What is OCR?**
OCR (Optical Character Recognition) converts text inside images and scanned PDFs into real, machine-readable, searchable text.

**Can it handle multi-page PDFs?**
Yes. Every page is processed and returned individually and as a combined document.

**Does it keep tables?**
Yes. Tables are detected and preserved in Markdown, HTML, Word, and the structured data — keep *Detect tables & layout* enabled.

**Which languages are supported?**
80+ languages, including all major European and Asian scripts. Leave the language on *Automatic*, or select a specific language when you know it.

**What formats can I export?**
JSON, Markdown, Word (DOCX), plain text (TXT), and HTML — any combination.

**How is it priced?**
The Actor is paid per event, starting at $0.03 per file processed. You are not charged separately for Apify platform usage.

**Is my data private?**
Your files and results stay within your own Apify account storage and are not shared.

***

### 📈 Tips for best results

- Use clear, high-resolution scans for the highest accuracy.
- Keep *Detect tables & layout* on for documents with tables or complex structure.
- For the fastest, cheapest plain-text extraction, turn layout detection off.
- Select the exact document language when you know it.

***

#### Keywords

OCR, OCR API, PDF OCR, image to text, PDF to text, PDF to Word, scanned PDF to Word, document extraction, bulk OCR, invoice OCR, receipt OCR, text recognition, document parsing, PDF to Markdown, PDF to JSON, handwriting OCR, multilingual OCR, document to JSON, table extraction, PDF data extraction.

# Actor input Schema

## `files` (type: `array`):

Upload the PDFs or images you want to convert. Supported types include PDF, PNG, JPG, JPEG, WEBP, BMP, TIFF and GIF. You can upload many files at once.

## `documentUrls` (type: `array`):

Optionally paste direct links to PDFs or images instead of (or in addition to) uploading. One link per row.

## `outputFormats` (type: `array`):

Choose one or more formats to receive. Every result is always available as structured data; pick any additional downloadable files you need.

## `language` (type: `string`):

The main language of your documents. Leave on Automatic for best results.

## `extractTables` (type: `boolean`):

Detect tables in documents and include them in the results. Leave on unless you only need plain text.

## Actor input object example

```json
{
  "files": [],
  "documentUrls": [],
  "outputFormats": [
    "json",
    "markdown"
  ],
  "language": "auto",
  "extractTables": true
}
```
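
Catching a mistyped format or language code before starting a run avoids a failed or wasted run. The sketch below builds an input object and checks it against the `enum` values listed in the OpenAPI specification in this document. The helper `build_input` and its checks are assumptions for illustration; the Actor may accept or reject inputs differently.

```python
from typing import Any, Iterable

# Allowed values copied from the `enum` lists in the OpenAPI input schema.
OUTPUT_FORMATS = ("json", "markdown", "docx", "txt", "html")
LANGUAGES = ("auto", "en", "it", "es", "fr", "de", "pt", "ch")


def build_input(
    document_urls: Iterable[str],
    output_formats: Iterable[str] = ("json", "markdown"),
    language: str = "auto",
    extract_tables: bool = True,
) -> dict[str, Any]:
    """Validate arguments and return an Actor input dict (hypothetical helper)."""
    urls = list(document_urls)
    formats = list(output_formats)
    # Assumption: "direct links" means plain http(s) URLs.
    bad_urls = [u for u in urls if not u.startswith(("http://", "https://"))]
    if bad_urls:
        raise ValueError(f"not direct http(s) links: {bad_urls}")
    unknown = [f for f in formats if f not in OUTPUT_FORMATS]
    if unknown:
        raise ValueError(f"unsupported output formats: {unknown}")
    if language not in LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    return {
        "files": [],
        "documentUrls": urls,
        "outputFormats": formats,
        "language": language,
        "extractTables": extract_tables,
    }
```

The returned dict can be passed as `run_input` to the Python client's `call()` method.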

# Actor output Schema

## `results` (type: `string`):

Structured results, one record per processed document.

## `files` (type: `string`):

Downloadable Markdown, DOCX, TXT and HTML files generated from your documents.

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "files": [],
    "documentUrls": [],
    "outputFormats": [
        "json",
        "markdown"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("lofomachines/ocr-document-extractor").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "files": [],
    "documentUrls": [],
    "outputFormats": [
        "json",
        "markdown",
    ],
}

# Run the Actor and wait for it to finish
run = client.actor("lofomachines/ocr-document-extractor").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "files": [],
  "documentUrls": [],
  "outputFormats": [
    "json",
    "markdown"
  ]
}' |
apify call lofomachines/ocr-document-extractor --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=lofomachines/ocr-document-extractor",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "OCR & Document Extractor – PDF & Image to Text, JSON, Word",
        "description": "Convert scanned PDFs and images into clean, structured text in bulk. Export to JSON, Markdown, DOCX, TXT or HTML with tables and layout preserved.",
        "version": "1.0",
        "x-build-id": "X44ZfNSkFlmeMJ1wC"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/lofomachines~ocr-document-extractor/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-lofomachines-ocr-document-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/lofomachines~ocr-document-extractor/runs": {
            "post": {
                "operationId": "runs-sync-lofomachines-ocr-document-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/lofomachines~ocr-document-extractor/run-sync": {
            "post": {
                "operationId": "run-sync-lofomachines-ocr-document-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "files": {
                        "title": "Upload files",
                        "type": "array",
                        "description": "Upload the PDFs or images you want to convert. Supported types include PDF, PNG, JPG, JPEG, WEBP, BMP, TIFF and GIF. You can upload many files at once."
                    },
                    "documentUrls": {
                        "title": "Document links (URLs)",
                        "type": "array",
                        "description": "Optionally paste direct links to PDFs or images instead of (or in addition to) uploading. One link per row.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "outputFormats": {
                        "title": "Output formats",
                        "type": "array",
                        "description": "Choose one or more formats to receive. Every result is always available as structured data; pick any additional downloadable files you need.",
                        "items": {
                            "type": "string",
                            "enum": [
                                "json",
                                "markdown",
                                "docx",
                                "txt",
                                "html"
                            ],
                            "enumTitles": [
                                "JSON (structured data)",
                                "Markdown (.md)",
                                "Word document (.docx)",
                                "Plain text (.txt)",
                                "Web page (.html)"
                            ]
                        },
                        "default": [
                            "json",
                            "markdown"
                        ]
                    },
                    "language": {
                        "title": "Document language",
                        "enum": [
                            "auto",
                            "en",
                            "it",
                            "es",
                            "fr",
                            "de",
                            "pt",
                            "ch"
                        ],
                        "type": "string",
                        "description": "The main language of your documents. Leave on Automatic for best results.",
                        "default": "auto"
                    },
                    "extractTables": {
                        "title": "Extract tables",
                        "type": "boolean",
                        "description": "Detect tables in documents and include them in the results. Leave on unless you only need plain text.",
                        "default": true
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
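
For projects that cannot use the client libraries, the `run-sync-get-dataset-items` path from the specification above can be called directly. The sketch below only builds the request with the Python standard library and does not send it. `build_sync_request` is a hypothetical helper, and the token is passed as the `token` query parameter, as the specification describes.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "lofomachines~ocr-document-extractor"


def build_sync_request(token: str, actor_input: dict) -> Request:
    """Build (but do not send) a POST to the run-sync-get-dataset-items endpoint."""
    query = urlencode({"token": token})
    url = f"{API_BASE}/acts/{ACTOR_PATH}/run-sync-get-dataset-items?{query}"
    body = json.dumps(actor_input).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To run the Actor, pass the request to `urllib.request.urlopen()` and decode the response body, which is expected to contain the dataset items. A synchronous call keeps the HTTP connection open until the run finishes, so for large batches the asynchronous `/runs` path may be the safer choice.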
