Pricing

from $0.03 / file processed

OCR & Document Extractor – PDF & Image to Text, JSON, Word

Convert scanned PDFs and images into clean, structured text in bulk. Export to JSON, Markdown, DOCX, TXT or HTML with tables and layout preserved.

Rating

0.0

(0)

Developer

Lofomachines

Last modified

a month ago

OCR & Document Extractor – PDF & Image to JSON, Markdown, Word, Text & HTML

Turn scanned PDFs and images into clean, structured, searchable text — in bulk. Upload your files (or paste links), pick your output formats, and get back ready-to-use JSON, Markdown, Word (DOCX), plain text, and HTML with tables, headings, and reading order preserved.

Fast, accurate, multilingual OCR for invoices, contracts, books, forms, receipts, research papers, ID documents, handwritten notes, and any scanned document — no setup, no code required.

✨ Why choose this OCR Actor?

📚 Bulk processing – Convert hundreds of PDFs and images in a single run.
🧠 Layout-aware extraction – Keeps titles, paragraphs, reading order, and tables intact instead of dumping jumbled text.
🗂️ Five output formats – Export to JSON, Markdown, DOCX, TXT, and HTML — choose one or all.
🌍 Multilingual – Recognizes 80+ languages including English, Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, and more.
📄 Any document – Scanned PDFs, photos of documents, screenshots, multi-page TIFFs, invoices, receipts, and forms.
🔍 Rich structured data – Page-by-page text, word and character counts, table detection, confidence scores, and downloadable files.
⚡ Fast & cost-efficient – Tuned for high throughput so you pay less for more pages.
🔌 API & integrations ready – Use it from the Apify API, JavaScript/Python SDKs, Zapier, Make, n8n, or any no-code tool.

🎯 Who is it for?

You are a...	Use it to...
Developer / Startup	Add OCR to your product without managing infrastructure or models.
Finance / Accounting team	Extract data from invoices, receipts, and statements into structured records.
Legal / Compliance	Make contracts and scanned filings searchable and editable.
Researcher / Academic	Digitize papers, books, and archives into Markdown or Word.
Data / Automation team	Feed clean text into RAG pipelines, LLMs, databases, and spreadsheets.
Operations / Back office	Convert paper-based forms into digital workflows.

🚀 How to use it

1. Upload your files using the Upload files button, or paste direct document links (URLs).
2. Choose your output formats: JSON, Markdown, DOCX, TXT, HTML (pick any combination).
3. Select the language (or leave on Automatic).
4. Click Start ▶️

That's it. When the run finishes, you'll find every document as a structured record in the dataset, plus downloadable files in the storage tab.

💡 No technical configuration needed. Performance, accuracy, and reliability are optimized for you out of the box.

📥 Input

Field	Description
Upload files	The PDFs or images you want to convert. Upload many at once.
Document links (URLs)	Optional. Direct links to PDFs or images to process.
Output formats	One or more of: JSON, Markdown, DOCX, TXT, HTML.
Document language	Main language of your documents, or Automatic.
Detect tables & layout	Keep on to preserve structure and tables; turn off for fastest plain-text extraction.

Supported file types

PDF · PNG · JPG / JPEG · WEBP · BMP · TIFF · GIF

Example input

{
  "documentUrls": [
    "https://example.com/invoice.pdf",
    "https://example.com/scanned-contract.png"
  ],
  "outputFormats": ["json", "markdown", "docx"],
  "language": "auto",
  "detectTablesAndLayout": true
}
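
A minimal Python sketch of assembling and checking this input before starting a run. The field names come from the example above; the `build_input` helper, the lowercase identifiers `txt` and `html`, and the local extension check are assumptions for illustration, not part of the Actor.

```python
import os
from urllib.parse import urlparse

# File types listed under "Supported file types".
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".webp", ".bmp", ".tiff", ".gif"}
# "json", "markdown" and "docx" appear in the example input; "txt" and "html"
# are assumed to follow the same lowercase convention.
KNOWN_FORMATS = {"json", "markdown", "docx", "txt", "html"}


def build_input(urls, formats, language="auto", detect_tables=True):
    """Return an input dict shaped like the example, or raise ValueError."""
    unknown = set(formats) - KNOWN_FORMATS
    if unknown:
        raise ValueError(f"Unknown output formats: {sorted(unknown)}")
    for url in urls:
        # Look only at the path so query strings do not hide the extension.
        ext = os.path.splitext(urlparse(url).path)[1].lower()
        if ext not in SUPPORTED_EXTENSIONS:
            raise ValueError(f"Unsupported file type in URL: {url}")
    return {
        "documentUrls": list(urls),
        "outputFormats": list(formats),
        "language": language,
        "detectTablesAndLayout": detect_tables,
    }
```

This check is purely local: a link without a file extension is rejected here even though the Actor itself might accept it.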

📤 Output

Each processed document becomes one clean, structured record in the dataset. Generated files (Markdown, Word, TXT, HTML) are saved to storage and linked directly in each record.

Example output record

{
  "fileName": "invoice.pdf",
  "status": "succeeded",
  "language": "auto",
  "pageCount": 2,
  "wordCount": 384,
  "characterCount": 2197,
  "tableCount": 1,
  "averageConfidence": 0.985,
  "text": "INVOICE\nAcme Corp\n...",
  "markdown": "# INVOICE\n\n**Acme Corp**\n\n| Item | Qty | Price |\n|---|---|---|\n...",
  "pages": [
    {
      "pageNumber": 1,
      "width": 1654,
      "height": 2339,
      "text": "INVOICE ...",
      "markdown": "# INVOICE ...",
      "confidence": 0.987,
      "tableCount": 1
    }
  ],
  "tables": [
    { "page": 1, "rows": 5, "columns": 3, "html": "<table>...</table>" }
  ],
  "outputFiles": {
    "markdown": "https://api.apify.com/v2/key-value-stores/.../records/0001-invoice.md",
    "docx": "https://api.apify.com/v2/key-value-stores/.../records/0001-invoice.docx"
  },
  "processedAt": "2026-06-17T10:00:00.000Z"
}

You can export the full dataset to JSON, CSV, Excel, or XML with one click, or fetch it via the API.
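
Once records are exported or fetched, they can be post-processed like any JSON. The sketch below assumes records shaped like the example output; the helper names and the 0.9 review threshold are illustrative choices, not something the Actor defines.

```python
def pages_needing_review(record, threshold=0.9):
    """Return page numbers whose OCR confidence is below the threshold."""
    return [
        page["pageNumber"]
        for page in record.get("pages", [])
        if page.get("confidence", 0.0) < threshold
    ]


def summarize(record):
    """Build a one-line summary from the top-level fields of a record."""
    files = ", ".join(sorted(record.get("outputFiles", {})))
    return (
        f'{record["fileName"]}: {record["status"]}, '
        f'{record["pageCount"]} pages, {record["wordCount"]} words, '
        f'files: {files or "none"}'
    )


# A trimmed copy of the example record, with file links left elided.
example = {
    "fileName": "invoice.pdf",
    "status": "succeeded",
    "pageCount": 2,
    "wordCount": 384,
    "pages": [{"pageNumber": 1, "confidence": 0.987}],
    "outputFiles": {"markdown": "...", "docx": "..."},
}
print(summarize(example))
print(pages_needing_review(example, threshold=0.99))
```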

💡 Popular use cases

PDF to Word converter – Turn scanned PDFs into editable DOCX files.
Invoice & receipt OCR – Extract totals, line items, and tables into structured data.
Image to text – Pull text from photos and screenshots.
Document digitization – Convert paper archives into searchable Markdown or HTML.
RAG & AI pipelines – Produce clean, LLM-ready Markdown for chatbots and knowledge bases.
Data entry automation – Replace manual typing with automated extraction.
Accessibility – Make scanned documents readable by screen readers.
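
For the RAG use case in the list above, one simple approach is to turn each page into a chunk that carries its source metadata. This is a sketch under the assumption that records look like the example output; the chunk layout and the `page_chunks` name are hypothetical, and a real pipeline may need finer splitting.

```python
def page_chunks(record):
    """Yield one chunk per non-empty page, with metadata for citations."""
    for page in record.get("pages", []):
        # Prefer the Markdown rendering and fall back to plain text.
        text = page.get("markdown") or page.get("text", "")
        if not text.strip():
            continue  # skip blank pages
        yield {
            "id": f'{record["fileName"]}#page-{page["pageNumber"]}',
            "text": text,
            "metadata": {
                "fileName": record["fileName"],
                "pageNumber": page["pageNumber"],
                "confidence": page.get("confidence"),
            },
        }
```

Keeping the file name and page number on every chunk lets a chatbot cite where an answer came from.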

🔗 Integrations & automation

Run this Actor on a schedule, trigger it from your app, or connect it to Zapier, Make, n8n, Google Sheets, Airtable, and more. Call it programmatically with the Apify API or the official JavaScript and Python clients, and pull results straight into your workflow.
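
As a rough illustration of a programmatic call, the sketch below builds a request for the Apify API's run-Actor endpoint (`POST /v2/acts/{actorId}/runs`) using only the Python standard library. The Actor ID and token are placeholders, since this listing does not state the Actor's ID; the input fields are taken from the example input. The official clients wrap the same call.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id, token, run_input):
    """Build (but do not send) a POST request that starts an Actor run."""
    return urllib.request.Request(
        f"{API_BASE}/acts/{actor_id}/runs",
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def start_run(actor_id, token, run_input):
    """Send the request and return the parsed JSON response."""
    request = build_run_request(actor_id, token, run_input)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical placeholders: substitute the real Actor ID and your API token.
request = build_run_request(
    "<username>~<actor-name>",
    "<APIFY_TOKEN>",
    {"documentUrls": ["https://example.com/invoice.pdf"], "outputFormats": ["json"]},
)
print(request.full_url)
```

Only `build_run_request` runs here; `start_run` performs the network call and needs a valid token.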

❓ FAQ

What is OCR? OCR (Optical Character Recognition) converts text inside images and scanned PDFs into real, machine-readable, searchable text.

Can it handle multi-page PDFs? Yes. Every page is processed. Results are returned per page in the pages list and combined in the document-level text and markdown fields.

Does it keep tables? Yes. Tables are detected and preserved in Markdown, HTML, Word, and the structured data — keep Detect tables & layout enabled.
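
In the structured data, each detected table carries an `html` field (see the example output record). A minimal sketch of turning that HTML into rows with Python's standard `html.parser` follows; the `TableRows` class and `table_to_rows` helper are illustrative, and the sketch assumes simple tables without nesting.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of a simple <table> as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text fragments while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def table_to_rows(html):
    """Parse one table's HTML and return its rows as lists of strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

The resulting rows can be written to CSV or a spreadsheet with the usual tools.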

Which languages are supported? 80+ languages, including major European and Asian languages as well as Arabic and Hindi. Leave the language on Automatic, or select the specific language when you know it for best results.

What formats can I export? JSON, Markdown, Word (DOCX), plain text (TXT), and HTML — any combination.

How is it priced? Pricing starts from $0.03 per file processed, as shown in the listing. The Actor is tuned to be fast and cost-efficient.

Is my data private? Your files and results stay within your own Apify account storage and are not shared.

📈 Tips for best results

Use clear, high-resolution scans for the highest accuracy.
Keep Detect tables & layout on for documents with tables or complex structure.
For the fastest, cheapest plain-text extraction, turn layout detection off.
Select the exact document language when you know it.

Used with ❤️ at TUTTOTRASCRITTO.COM

Keywords

OCR, OCR API, PDF OCR, image to text, PDF to text, PDF to Word, scanned PDF to Word, document extraction, bulk OCR, invoice OCR, receipt OCR, text recognition, document parsing, PDF to Markdown, PDF to JSON, handwriting OCR, multilingual OCR, document to JSON, table extraction, PDF data extraction.

PDF to Structured Data (Excel/JSON) with OCR

nibble/pdf-to-structured-data

Convert PDF invoices, forms, statements and reports into clean structured JSON (text, tables, key/values) with an OCR fallback for scanned pages.

Simon Fletcher

PDF to Markdown — Tables + OCR, for RAG & AI Agents

lizaraco/pdf-to-markdown

Convert PDFs to clean markdown at scale: layout-aware text extraction, table handling, and a vision-model OCR tier for scanned or broken pages. Per-page transparency, never-fail runs.

Shawn Downs

PDF OCR Text Extractor — PDFs & Images to Text, 12+ Languages

vivid_astronaut/ocr-pdf-extractor

Extract text from PDFs and images with OCR in 12+ languages, including word-level detail, form fields, and tables. Send a file, get clean structured text — built for document digitization and data-entry automation.

BRAINIALL Team

PDF Text Extractor - Bulk PDF to Text & Metadata

santamaria-automations/pdf-extractor

Extract text and metadata from any PDF URL in bulk. Get page content, author, title, creation date, and more. Detects scanned PDFs that need OCR. Perfect for document analysis, research, and compliance.

NanoScrape

PDF & Document to Markdown - PDF, DOCX & HTML for LLMs

entranced_gelato/ai-document-reader

Turn any PDF, DOCX, TXT, or HTML document into clean, LLM-ready text + Markdown with metadata (title, pages, word count) and an optional AI summary. The document counterpart to a web reader — built for RAG ingestion, document Q&A, and AI agents (LangChain, LlamaIndex). Fast, structured, single-call.

AIDevs

PDF OCR Tool — Scanned PDF Text Extraction

junipr/pdf-ocr-tool

Run OCR on scanned PDFs and image-based documents. Extract text by page with language options, confidence scores, and searchable text exports.

junipr

PDF to Markdown + OCR — Structured Document Extraction

kaz_kakyo/pdf-structured-ocr

Convert public PDF URLs into clean structured Markdown. Born-digital pages extract natively; scanned or image-only pages are recovered by a built-in OCR engine — no key or setup required. Per-page mode, confidence and partialFailures for PDF to Markdown and scanned PDF extraction pipelines.

Heim AI

PDF Extract — Text, Tables & Metadata (OCR-ready)

sathvic_kollu/techtenstein-pdf-extract

Extract clean text, structured tables, and metadata from any PDF URL. Supports OCR for scanned documents. Ideal for building document pipelines, financial data extraction, invoice processing, and research automation.

Techtenstein Services Private Limited

PDF & DOCX to Markdown — Document Extractor for LLM/RAG

fetchbase/document-to-markdown

Convert PDF and Word (DOCX) documents into clean Markdown, text, or JSON. Smart PDF paragraph reflow, page markers for RAG citations, full DOCX structure (headings, lists, tables), custom auth headers. No browser — parses in seconds. Charged per page processed — no startup fee.

Fetchbase

PDF Text Extractor – PDF to Text, Metadata & Pages

haketa/pdf-text-extractor

Extract clean text and metadata from any PDF by URL: full text, per-page text, page count, title, author, dates and producer. No browser, no OCR needed for text PDFs. Ideal for AI/RAG, search and document data extraction. Export to JSON, CSV or Excel.