Pricing

from $3.50 / 1,000 results

Pdf to json

Convert PDF files into structured JSON with optional OCR, table extraction, key-value detection, and metadata parsing. Ideal for invoices, receipts, contracts, statements, forms, and document automation workflows. Supports digital and scanned PDFs for API-ready data extraction.

Rating

0.0

(0)

Developer

Shahab Uddin

Actor stats

Last modified

a month ago

PDF to JSON API

This Apify Actor converts PDF files into normalized JSON. For real workloads it accepts direct http or https PDF URLs. It also ships with a bundled smoke-test input, builtin://sample.pdf, so Apify Store QA does not depend on a third-party sample URL staying online.

What it supports

Text extraction from standard text-based PDFs
Optional table extraction
Optional metadata output
Multiple PDF URLs per run
Apify dataset views, key-value store summaries, and live status page

Current limitations

OCR for image-only or scanned PDFs is not included in this version
Production URLs must be directly downloadable over http or https
maxPages limits how many pages are parsed for text extraction, but pageCount always gives the document's full page count as reported by the PDF parser
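
The URL rule above can be illustrated with a small classifier. This is a minimal sketch, not the Actor's actual code: the names classifySource, SourceKind, and BUILTIN_SAMPLE are hypothetical, and the only facts taken from this README are that http and https URLs are accepted and that builtin://sample.pdf is the bundled sample.

```typescript
// Hypothetical helper, not taken from the Actor's source.
type SourceKind = "builtin" | "remote" | "unsupported";

// The bundled smoke-test input named in this README.
const BUILTIN_SAMPLE = "builtin://sample.pdf";

function classifySource(url: string): SourceKind {
  const trimmed = url.trim();
  if (trimmed === BUILTIN_SAMPLE) return "builtin";
  // Only plain http and https URLs count as production inputs.
  if (/^https?:\/\/\S+$/i.test(trimmed)) return "remote";
  return "unsupported";
}

console.log(classifySource("builtin://sample.pdf")); // builtin
console.log(classifySource("ftp://example.com/a.pdf")); // unsupported
```

A check like this only validates the scheme. Whether the URL is directly downloadable can only be confirmed by actually requesting it.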

Input example

{
  "pdfUrls": [
    "builtin://sample.pdf"
  ],
  "extractTables": false,
  "includeMetadata": true,
  "outputFormat": "json",
  "maxDownloadRetries": 4,
  "requestTimeoutSecs": 30,
  "saveDebugSnapshots": false,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}

Apify QA compatibility

Apify Store's automated health check runs the Actor with its default input and expects a succeeded run with a non-empty default dataset within 5 minutes. To keep this Actor healthy:

The input schema now sets both prefill and default for pdfUrls, which keeps older tasks and integrations from failing when the field is omitted.
The schema also pre-fills the lightweight default options, matching Apify's daily default-input health check path.
The default sample is bundled into the Actor as builtin://sample.pdf, so daily checks do not rely on a third-party PDF host.
The custom Dockerfile builds main.ts during the Apify image build and runs the generated dist/main.js, so production cannot drift from the TypeScript source.
The runtime falls back to the bundled sample when pdfUrls is omitted, which protects legacy runs that were created before the field existed.
The deprecated legacy inputs extractKeyValuePairs and useOcr are still accepted as hidden no-op fields so older saved Apify inputs do not fail validation.
JSON records in the default key-value store are written with an explicit application/json content type so they satisfy Apify's key-value-store schema validation rules for collections that use jsonSchema.
Real URL downloads now use retryable browser-like requests, optional Apify proxy support, and optional debug snapshots when a target serves HTML or an anti-bot page instead of a PDF.
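
The fallback and legacy-field behaviour in the list above can be sketched as an input-normalization step. This is an illustration under stated assumptions, not the code in main.ts: the names ActorInput, RawInput, DEFAULTS, and normalizeInput are hypothetical, the default values are copied from the input example rather than from the real schema, and treating an empty pdfUrls array like an omitted one is an assumption.

```typescript
// Field names come from the input example; the types are inferred from it.
interface ActorInput {
  pdfUrls: string[];
  extractTables: boolean;
  includeMetadata: boolean;
  outputFormat: string;
  maxDownloadRetries: number;
  requestTimeoutSecs: number;
  saveDebugSnapshots: boolean;
  proxyConfiguration: { useApifyProxy: boolean };
}

// Deprecated fields are accepted so old saved inputs still type-check.
type RawInput = Partial<ActorInput> & {
  extractKeyValuePairs?: boolean;
  useOcr?: boolean;
};

// Assumption: defaults mirror the input example in this README.
const DEFAULTS: ActorInput = {
  pdfUrls: ["builtin://sample.pdf"],
  extractTables: false,
  includeMetadata: true,
  outputFormat: "json",
  maxDownloadRetries: 4,
  requestTimeoutSecs: 30,
  saveDebugSnapshots: false,
  proxyConfiguration: { useApifyProxy: false },
};

function normalizeInput(raw: RawInput | null | undefined): ActorInput {
  const input = raw ?? {};
  const urls = (input.pdfUrls ?? []).filter((u) => u.trim() !== "");
  return {
    // Fall back to the bundled sample when no URL was supplied.
    pdfUrls: urls.length > 0 ? urls : [...DEFAULTS.pdfUrls],
    extractTables: input.extractTables ?? DEFAULTS.extractTables,
    includeMetadata: input.includeMetadata ?? DEFAULTS.includeMetadata,
    outputFormat: input.outputFormat ?? DEFAULTS.outputFormat,
    maxDownloadRetries: input.maxDownloadRetries ?? DEFAULTS.maxDownloadRetries,
    requestTimeoutSecs: input.requestTimeoutSecs ?? DEFAULTS.requestTimeoutSecs,
    saveDebugSnapshots: input.saveDebugSnapshots ?? DEFAULTS.saveDebugSnapshots,
    proxyConfiguration: input.proxyConfiguration ?? { ...DEFAULTS.proxyConfiguration },
    // extractKeyValuePairs and useOcr are never read: they are no-ops.
  };
}
```

Building the result field by field means the deprecated keys never reach the normalized input, so later code cannot depend on them by accident.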

If every PDF fails, the Actor now ends in a failed status instead of silently reporting a successful run with only error items.
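
The rule in the previous sentence reduces to a small decision function. The sketch below is hypothetical (resolveRunStatus and RunStatus are not names from the Actor), and treating a run with no processed items as failed is an assumption.

```typescript
// Hypothetical names; only the "all failed means failed" rule is from the README.
type RunStatus = "SUCCEEDED" | "FAILED";

function resolveRunStatus(results: ReadonlyArray<{ success: boolean }>): RunStatus {
  // Assumption: an empty result list counts as a failed run as well.
  const anySuccess = results.some((r) => r.success);
  return anySuccess ? "SUCCEEDED" : "FAILED";
}

console.log(resolveRunStatus([{ success: false }, { success: true }])); // SUCCEEDED
console.log(resolveRunStatus([{ success: false }])); // FAILED
```

A run with partial failures still succeeds under this rule, and the failed PDFs remain visible as error items in the dataset.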

Output example

{
  "sourceUrl": "builtin://sample.pdf",
  "fileName": "dummy.pdf",
  "pageCount": 1,
  "metadata": {
    "PDFFormatVersion": "1.4"
  },
  "text": "Dummy PDF file",
  "tables": [],
  "success": true,
  "processedAt": "2026-04-18T10:00:00.000Z"
}
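
A consumer that reads dataset items may want a type for this record. The following sketch infers the PdfRecord interface from the example above. The interface name, the isPdfRecord guard, and the choice to treat metadata as optional (because includeMetadata can be turned off) are assumptions, and error items may have a different shape that this guard would reject.

```typescript
// Inferred from the output example; not an official schema.
interface PdfRecord {
  sourceUrl: string;
  fileName: string;
  pageCount: number;
  metadata?: Record<string, unknown>;
  text: string;
  tables: unknown[];
  success: boolean;
  processedAt: string;
}

function isPdfRecord(value: unknown): value is PdfRecord {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const metadata = v["metadata"];
  const processedAt = v["processedAt"];
  return (
    typeof v["sourceUrl"] === "string" &&
    typeof v["fileName"] === "string" &&
    typeof v["pageCount"] === "number" &&
    (metadata === undefined || (typeof metadata === "object" && metadata !== null)) &&
    typeof v["text"] === "string" &&
    Array.isArray(v["tables"]) &&
    typeof v["success"] === "boolean" &&
    // processedAt must be a parseable timestamp string.
    typeof processedAt === "string" &&
    !Number.isNaN(Date.parse(processedAt))
  );
}
```

Items that fail the guard can be routed to separate error handling instead of being cast blindly.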

Apify outputs

Dataset items: one normalized record per processed PDF
RUN_SUMMARY: compact run summary in the default key-value store
RESULTS.json or RESULTS.pretty.json: aggregated export of all dataset items
DEBUG_*: optional diagnostic metadata and HTML/text previews for blocked downloads when saveDebugSnapshots is enabled
Live view:
- /: HTML dashboard
- /health: compact JSON counters
- /status: full in-memory run state
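
The DEBUG_* snapshots cover the case where a target serves HTML or an anti-bot page instead of a PDF. One common way to detect that case is to check for the "%PDF-" header that PDF files start with. The sketch below shows that check only. It is not taken from the Actor, the names looksLikePdf and asciiBytes are hypothetical, and it is deliberately strict in requiring the header at byte 0.

```typescript
// "%PDF-" as ASCII byte values.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Returns true when the payload begins with the PDF header.
function looksLikePdf(bytes: Uint8Array): boolean {
  if (bytes.length < PDF_MAGIC.length) return false;
  return PDF_MAGIC.every((expected, i) => bytes[i] === expected);
}

// Small helper for the demo: encode an ASCII string as bytes.
function asciiBytes(text: string): Uint8Array {
  return Uint8Array.from(text.split("").map((ch) => ch.charCodeAt(0)));
}

console.log(looksLikePdf(asciiBytes("%PDF-1.4\n"))); // true
console.log(looksLikePdf(asciiBytes("<!DOCTYPE html>"))); // false
```

Checking the payload bytes is more reliable than trusting the Content-Type response header, which blocked or misconfigured hosts often set incorrectly.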

Local development

npm install
npm run build
npm start

To run locally the same smoke path that Apify's health check exercises, use:

npm run smoke

For local TypeScript changes, rebuild before running or use:

npm run dev

The runtime start command intentionally launches dist/main.js; the Docker build runs npm run build first and then prunes development dependencies before startup.

The Actor's source of truth is main.ts, and the Apify Console input UI is defined in .actor/INPUT_SCHEMA.json.

PDF OCR API - Document Extraction

alizarin_refrigerator-owner/pdf-ocr-api

Extract text from PDFs including scanned documents. OCR processing, table extraction & structured data output. Process invoices, contracts & forms at scale.

The Howlers

Bulk Pdf To Json OCR

gagandeo/bulk-pdf-to-json-ocr

Convert PDF invoices, menus, images with text and documents into structured JSON. Features hybrid Digital+OCR parsing and AI-powered data extraction.

Kumar Gagandeo

Pdf Json Extractor

p6t_p10n/pdf-json-extractor

Convert any PDF into structured JSON using AI and OCR (Tesseract or Google Vision). Supports custom schemas, validation, and auto-repair. Ideal for invoices, contracts, receipts, and automation workflows. Fast, accurate, and easy to integrate.

Peerapat Pongnipakorn

OCR Structured Extractor (AI) — Image/PDF → OCR Text + JSON

macheta/ocr-structured-extractor

Extract OCR text and structured JSON from an image or PDF URL. Great for invoices, receipts, forms, IDs, and tables. Powered by Gemini 3 Pro.

Anass

PDF Text Extractor - Bulk PDF to Text & Metadata

santamaria-automations/pdf-extractor

Extract text and metadata from any PDF URL in bulk. Get page content, author, title, creation date, and more. Detects scanned PDFs that need OCR. Perfect for document analysis, research, and compliance.

Ale

Elite Document Ocr Lite

thepattyroller/elite-document-ocr-lite

Basic document text extraction and processing. Extract text from documents, analyze document structure, and extract structured data from invoices and receipts. Perfect for document automation workflows.

Logan Kiser

PDF to Markdown Converter

web.harvester/pdf-to-markdown-converter

Convert PDFs to clean Markdown with optional OCR for scanned documents. Uses PDF.js for text extraction and Tesseract.js for optical character recognition.

Web Harvester

PDF To JSON Parser

parseforge/pdf-to-json-parser

Convert PDF documents into structured JSON using AI-powered OCR and smart data extraction. The Actor processes every page to ensure complete coverage, then identifies text, fields, tables, and key details, delivering clean, organized JSON ready for automation or analysis.

ParseForge

5.0

PDF Table Extractor - Convert to JSON & CSV

ntriqpro/table-extract-pdf-mcp

Pull tables out of PDF documents automatically. Convert to JSON or CSV for data analysis.

daehwan kim

Convert Image to PDF and PDF to Image

akash9078/image-pdf-converter

Convert images (JPG, PNG, BMP, and more) into high-quality PDFs, or extract images from PDF files in seconds. Image–PDF Converter Pro delivers fast, reliable, and professional results for all your document and image conversion needs.