OCR & Document Extractor – PDF & Image to Text, JSON, Word

Pricing: from $0.03 / file processed

Convert scanned PDFs and images into clean, structured text in bulk. Export to JSON, Markdown, DOCX, TXT or HTML with tables and layout preserved.

Rating: 0.0 (0)

Developer: Lofomachines (maintained by Community)

Actor stats: 0 bookmarked · 2 total users · 1 monthly active user · last modified 2 days ago

OCR & Document Extractor – PDF & Image to JSON, Markdown, Word, Text & HTML

Turn scanned PDFs and images into clean, structured, searchable text — in bulk. Upload your files (or paste links), pick your output formats, and get back ready-to-use JSON, Markdown, Word (DOCX), plain text, and HTML with tables, headings, and reading order preserved.

Fast, accurate, multilingual OCR for invoices, contracts, books, forms, receipts, research papers, ID documents, handwritten notes, and any scanned document — no setup, no code required.


✨ Why choose this OCR Actor?

  • 📚 Bulk processing – Convert hundreds of PDFs and images in a single run.
  • 🧠 Layout-aware extraction – Keeps titles, paragraphs, reading order, and tables intact instead of dumping jumbled text.
  • 🗂️ Five output formats – Export to JSON, Markdown, DOCX, TXT, and HTML — choose one or all.
  • 🌍 Multilingual – Recognizes 80+ languages including English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, and more.
  • 📄 Any document – Scanned PDFs, photos of documents, screenshots, multi-page TIFFs, invoices, receipts, and forms.
  • 🔍 Rich structured data – Page-by-page text, word and character counts, table detection, confidence scores, and downloadable files.
  • ⚡ Fast & cost-efficient – Tuned for high throughput so you pay less for more pages.
  • 🔌 API & integrations ready – Use it from the Apify API, JavaScript/Python SDKs, Zapier, Make, n8n, or any no-code tool.

🎯 Who is it for?

| You are a... | Use it to... |
|---|---|
| Developer / Startup | Add OCR to your product without managing infrastructure or models. |
| Finance / Accounting team | Extract data from invoices, receipts, and statements into structured records. |
| Legal / Compliance | Make contracts and scanned filings searchable and editable. |
| Researcher / Academic | Digitize papers, books, and archives into Markdown or Word. |
| Data / Automation team | Feed clean text into RAG pipelines, LLMs, databases, and spreadsheets. |
| Operations / Back office | Convert paper-based forms into digital workflows. |

🚀 How to use it

  1. Upload your files using the Upload files button — or paste direct document links (URLs).
  2. Choose your output formats — JSON, Markdown, DOCX, TXT, HTML (pick any combination).
  3. Select the language (or leave on Automatic).
  4. Click Start ▶️

That's it. When the run finishes, you'll find every document as a structured record in the dataset, plus downloadable files in the storage tab.

💡 No technical configuration needed. Performance, accuracy, and reliability are optimized for you out of the box.


📥 Input

| Field | Description |
|---|---|
| Upload files | The PDFs or images you want to convert. Upload many at once. |
| Document links (URLs) | Optional. Direct links to PDFs or images to process. |
| Output formats | One or more of: JSON, Markdown, DOCX, TXT, HTML. |
| Document language | Main language of your documents, or Automatic. |
| Detect tables & layout | Keep on to preserve structure and tables; turn off for fastest plain-text extraction. |

Supported file types

PDF · PNG · JPG / JPEG · WEBP · BMP · TIFF · GIF
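
Before submitting a long list of document links, it can help to check them against the supported file types above. The following is a minimal sketch, not part of the Actor itself: the helper name `split_by_support` is hypothetical, and it judges a link only by the extension in its URL path. Whether the Actor accepts links without a file extension is not stated here, so such links are simply reported as unsupported for manual review.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions derived from the supported file types listed above.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".webp", ".bmp", ".tiff", ".gif"}


def split_by_support(urls):
    """Return (supported, unsupported) lists based on each URL path's extension.

    Hypothetical helper: it inspects the link text only and never downloads anything.
    """
    supported, unsupported = [], []
    for url in urls:
        # Use only the path component so query strings such as "?v=2" are ignored.
        extension = PurePosixPath(urlparse(url).path).suffix.lower()
        if extension in SUPPORTED_EXTENSIONS:
            supported.append(url)
        else:
            unsupported.append(url)
    return supported, unsupported
```

The comparison is case-insensitive, so `scan.PNG` and `scan.png` are treated the same way.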

Example input

{
  "documentUrls": [
    "https://example.com/invoice.pdf",
    "https://example.com/scanned-contract.png"
  ],
  "outputFormats": ["json", "markdown", "docx"],
  "language": "auto",
  "detectTablesAndLayout": true
}
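
When the input is assembled in code rather than in the form, a small builder can catch typos before a run starts. This is a sketch under stated assumptions: the field names come from the example input above, but the identifiers `"txt"` and `"html"` are assumed by analogy with `"json"`, `"markdown"`, and `"docx"`, and the function name `build_run_input` is hypothetical.

```python
# "json", "markdown", and "docx" appear in the example input; "txt" and "html"
# are ASSUMED identifiers for the other two formats named in the input table.
ALLOWED_FORMATS = ("json", "markdown", "docx", "txt", "html")


def build_run_input(document_urls, output_formats=("json",), language="auto",
                    detect_tables_and_layout=True):
    """Build an input object shaped like the example input (hypothetical helper)."""
    if not document_urls:
        raise ValueError("at least one document URL is required")
    formats = [fmt.lower() for fmt in output_formats]
    unknown = [fmt for fmt in formats if fmt not in ALLOWED_FORMATS]
    if unknown:
        raise ValueError(f"unknown output formats: {unknown}")
    return {
        "documentUrls": list(document_urls),
        "outputFormats": formats,
        "language": language,
        "detectTablesAndLayout": bool(detect_tables_and_layout),
    }
```

The builder only covers link-based input; files added through the Upload files button are handled by the form and are outside this sketch.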

📤 Output

Each processed document becomes one clean, structured record in the dataset. Generated files (Markdown, Word, TXT, HTML) are saved to storage and linked directly in each record.

Example output record

{
  "fileName": "invoice.pdf",
  "status": "succeeded",
  "language": "auto",
  "pageCount": 2,
  "wordCount": 384,
  "characterCount": 2197,
  "tableCount": 1,
  "averageConfidence": 0.985,
  "text": "INVOICE\nAcme Corp\n...",
  "markdown": "# INVOICE\n\n**Acme Corp**\n\n| Item | Qty | Price |\n|---|---|---|\n...",
  "pages": [
    {
      "pageNumber": 1,
      "width": 1654,
      "height": 2339,
      "text": "INVOICE ...",
      "markdown": "# INVOICE ...",
      "confidence": 0.987,
      "tableCount": 1
    }
  ],
  "tables": [
    { "page": 1, "rows": 5, "columns": 3, "html": "<table>...</table>" }
  ],
  "outputFiles": {
    "markdown": "https://api.apify.com/v2/key-value-stores/.../records/0001-invoice.md",
    "docx": "https://api.apify.com/v2/key-value-stores/.../records/0001-invoice.docx"
  },
  "processedAt": "2026-06-17T10:00:00.000Z"
}
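
A record shaped like the example above can be triaged with a few lines of code, for instance to flag pages that may need a manual check. The sketch below is illustrative only: the function name `review_record` is hypothetical, the 0.9 threshold is an arbitrary assumption rather than a documented recommendation, and `"succeeded"` is the only status value shown in the example, so any other value is treated as not successful.

```python
def review_record(record, min_confidence=0.9):
    """Summarize one output record (hypothetical helper).

    min_confidence is an arbitrary, assumed threshold; tune it to your documents.
    """
    pages = record.get("pages", [])
    # Pages without a confidence value are treated as low confidence.
    low_confidence_pages = [
        page["pageNumber"]
        for page in pages
        if page.get("confidence", 0.0) < min_confidence
    ]
    return {
        "fileName": record.get("fileName"),
        "ok": record.get("status") == "succeeded",
        "lowConfidencePages": low_confidence_pages,
        # Links to the generated files, keyed by format as in "outputFiles".
        "files": dict(record.get("outputFiles", {})),
    }
```

Applied to every item in the dataset, this gives a quick list of documents and pages worth re-scanning at higher resolution.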

You can export the full dataset to JSON, CSV, Excel, or XML with one click, or fetch it via the API.
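
The dataset export covers whole records. When only one detected table needs to land in a spreadsheet, the `html` string of an entry in the `tables` array can be converted locally. The following sketch uses only the Python standard library and makes an assumption about the markup: that the `html` field holds plain `<tr>`, `<th>`, and `<td>` elements without nested tables or merged cells. The example record elides the actual markup, so check a real record first. The names `TableRows`, `table_html_to_rows`, and `rows_to_csv` are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of a simple HTML table, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_html_to_rows(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Serialize rows to CSV text with newline-terminated lines."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Cell text is kept as strings; converting quantities or prices to numbers is left to the caller, since number formats vary by document.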


Use cases

  • PDF to Word converter – Turn scanned PDFs into editable DOCX files.
  • Invoice & receipt OCR – Extract totals, line items, and tables into structured data.
  • Image to text – Pull text from photos and screenshots.
  • Document digitization – Convert paper archives into searchable Markdown or HTML.
  • RAG & AI pipelines – Produce clean, LLM-ready Markdown for chatbots and knowledge bases.
  • Data entry automation – Replace manual typing with automated extraction.
  • Accessibility – Make scanned documents readable by screen readers.

🔗 Integrations & automation

Run this Actor on a schedule, trigger it from your app, or connect it to Zapier, Make, n8n, Google Sheets, Airtable, and more. Call it programmatically with the Apify API or the official JavaScript and Python clients, and pull results straight into your workflow.
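
As an illustration of a programmatic call, the sketch below prepares a request for the Apify API endpoint that runs an Actor synchronously and returns its dataset items. It uses only the Python standard library. The Actor ID is not given on this page, so the `actor_id` argument is a placeholder you must fill in (Apify accepts the `username~actor-name` form), and the function names are hypothetical. The synchronous endpoint is only suited to short runs; large batches are better started asynchronously or through the official clients.

```python
import json
import urllib.request
from urllib.parse import quote

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id, token, run_input):
    """Prepare (but do not send) a run-sync-get-dataset-items request.

    actor_id is a placeholder such as "<username>~<actor-name>"; the real
    value for this Actor is not stated in this document.
    """
    url = f"{API_BASE}/acts/{quote(actor_id, safe='~')}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_and_fetch(actor_id, token, run_input, timeout=300):
    """Send the request and return the dataset items as a list of records."""
    request = build_run_request(actor_id, token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (performs a real, billable run, so it is left as a comment):
# items = run_and_fetch("<username>~<actor-name>", "<APIFY_TOKEN>",
#                       {"documentUrls": ["https://example.com/invoice.pdf"],
#                        "outputFormats": ["json"]})
```

Keeping request construction separate from sending makes it possible to inspect the URL and payload before spending any credits.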


❓ FAQ

What is OCR? OCR (Optical Character Recognition) converts text inside images and scanned PDFs into real, machine-readable, searchable text.

Can it handle multi-page PDFs? Yes. Every page is processed and returned both individually (in the pages array of each record) and as part of the combined document text.

Does it keep tables? Yes. Tables are detected and preserved in Markdown, HTML, Word, and the structured data — keep Detect tables & layout enabled.

Which languages are supported? 80+ languages, including all major European and Asian scripts. Leave the language on Automatic or select a specific one for best results.

What formats can I export? JSON, Markdown, Word (DOCX), plain text (TXT), and HTML — any combination.

How is it priced? The store listing shows pricing from $0.03 per file processed. The Actor is tuned to be fast and cost-efficient.

Is my data private? Your files and results stay within your own Apify account storage and are not shared.


📈 Tips for best results

  • Use clear, high-resolution scans for the highest accuracy.
  • Keep Detect tables & layout on for documents with tables or complex structure.
  • For the fastest, cheapest plain-text extraction, turn layout detection off.
  • Select the exact document language when you know it.

Keywords

OCR, OCR API, PDF OCR, image to text, PDF to text, PDF to Word, scanned PDF to Word, document extraction, bulk OCR, invoice OCR, receipt OCR, text recognition, document parsing, PDF to Markdown, PDF to JSON, handwriting OCR, multilingual OCR, document to JSON, table extraction, PDF data extraction.