Pdf to json

Pricing

from $3.50 / 1,000 results

Convert PDF files into structured JSON with optional table extraction and metadata parsing. Suited to invoices, receipts, contracts, statements, forms, and document automation workflows. The current version handles text-based (digital) PDFs; OCR for scanned PDFs and key-value detection are not included.

Developer

Shahab Uddin

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 6
  • Monthly active users: 2
  • Last modified: 16 days ago

PDF to JSON API

This Apify Actor converts PDF files into normalized JSON. For real workloads it accepts direct http or https PDF URLs. It also ships with a bundled smoke-test input, builtin://sample.pdf, so that Apify Store QA does not depend on a third-party sample URL staying online.

What it supports

  • Text extraction from standard text-based PDFs
  • Optional table extraction
  • Optional metadata output
  • Multiple PDF URLs per run
  • Apify dataset views, key-value store summaries, and live status page

Current limitations

  • OCR for image-only or scanned PDFs is not included in this version
  • Production URLs must be directly downloadable over http or https
  • maxPages limits how many pages are parsed for text extraction, but pageCount still reports the full page count of the document as returned by the PDF parser
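
The URL and page-limit rules above can be sketched in TypeScript. This is a minimal illustration, not code from the Actor: the helper names isSupportedPdfUrl and limitPages are hypothetical, and treating each page's text as one string is an assumption made for the example.

```typescript
// Illustrative helpers only: names and behavior are assumptions,
// not code taken from the Actor's main.ts.
const BUILTIN_SAMPLE_URL = "builtin://sample.pdf";

// Accepts the bundled sample or any absolute http(s) URL.
function isSupportedPdfUrl(raw: string): boolean {
  if (raw === BUILTIN_SAMPLE_URL) return true;
  try {
    const { protocol } = new URL(raw);
    return protocol === "http:" || protocol === "https:";
  } catch {
    return false; // not a parseable absolute URL
  }
}

// maxPages caps the pages whose text is kept; pageCount stays the full total.
function limitPages(
  pageTexts: string[],
  maxPages?: number,
): { pageCount: number; text: string } {
  const kept = maxPages && maxPages > 0 ? pageTexts.slice(0, maxPages) : pageTexts;
  return { pageCount: pageTexts.length, text: kept.join("\n") };
}

const limited = limitPages(["page one", "page two", "page three"], 2);
console.log(isSupportedPdfUrl("ftp://example.com/a.pdf")); // false
console.log(limited.pageCount); // 3
```

With maxPages set to 2, only the first two pages contribute to text, while pageCount still reports 3.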

Input example

{
  "pdfUrls": [
    "builtin://sample.pdf"
  ],
  "extractTables": false,
  "includeMetadata": true,
  "outputFormat": "json",
  "maxDownloadRetries": 4,
  "requestTimeoutSecs": 30,
  "saveDebugSnapshots": false,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}
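
For callers who build this input in code, a typed view can help catch misspelled fields. The sketch below is an assumption-laden illustration: the field names come from the example above, but the types are inferred from it rather than taken from .actor/INPUT_SCHEMA.json, and buildInput is a hypothetical helper that reuses the example's values as its starting point.

```typescript
// Field names come from the input example; the types are inferred from it
// and are an assumption, not the published input schema.
interface PdfToJsonInput {
  pdfUrls: string[];
  extractTables?: boolean;
  includeMetadata?: boolean;
  outputFormat?: string;
  maxDownloadRetries?: number;
  requestTimeoutSecs?: number;
  saveDebugSnapshots?: boolean;
  proxyConfiguration?: { useApifyProxy: boolean };
}

// Hypothetical helper: starts from the values shown in the example and
// lets the caller override any of them.
function buildInput(
  pdfUrls: string[],
  overrides: Partial<PdfToJsonInput> = {},
): PdfToJsonInput {
  return {
    extractTables: false,
    includeMetadata: true,
    outputFormat: "json",
    maxDownloadRetries: 4,
    requestTimeoutSecs: 30,
    saveDebugSnapshots: false,
    proxyConfiguration: { useApifyProxy: false },
    ...overrides,
    pdfUrls, // always taken from the first argument
  };
}

const input = buildInput(["https://example.com/invoice.pdf"], { extractTables: true });
console.log(JSON.stringify(input, null, 2));
```

The URL https://example.com/invoice.pdf is a placeholder; any directly downloadable http or https PDF URL would take its place.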

Apify QA compatibility

Apify Store's automated health check runs the Actor with its default input and expects a succeeded run with a non-empty default dataset within 5 minutes. To keep this Actor healthy:

  • The input schema now uses both prefill and default for pdfUrls, which avoids older tasks or integrations failing when the field is omitted.
  • The schema also pre-fills the lightweight default options, matching Apify's daily default-input health check path.
  • The default sample is bundled into the Actor as builtin://sample.pdf, so daily checks do not rely on a third-party PDF host.
  • The custom Dockerfile builds main.ts during the Apify image build and runs the generated dist/main.js, so production cannot drift from the TypeScript source.
  • The runtime falls back to the bundled sample when pdfUrls is omitted, which protects legacy runs that were created before the field existed.
  • The deprecated legacy inputs extractKeyValuePairs and useOcr are still accepted as hidden no-op fields so older saved Apify inputs do not fail validation.
  • JSON records in the default key-value store are written with an explicit application/json content type so they satisfy Apify's key-value-store schema validation rules for collections that use jsonSchema.
  • Real URL downloads now use retryable browser-like requests, optional Apify proxy support, and optional debug snapshots when a target serves HTML or an anti-bot page instead of a PDF.
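
The download behavior in the last bullet can be approximated as follows. This is a sketch under stated assumptions, not the Actor's implementation: looksLikePdf, withRetries, and downloadPdf are hypothetical names, the linear backoff is an arbitrary choice, and proxy support, browser-like headers, request timeouts, and debug snapshots are left out. It relies on the fact that a PDF file begins with the bytes "%PDF-", and assumes that header sits at byte 0.

```typescript
// Illustrative sketch: helper names are hypothetical, not the Actor's own.

// A PDF file starts with the ASCII bytes "%PDF-"; an HTML error page or
// anti-bot page does not.
function looksLikePdf(body: Uint8Array): boolean {
  const magic = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  return body.length >= magic.length && magic.every((byte, i) => body[i] === byte);
}

// Retries an async operation, waiting a little longer after each failure.
async function withRetries<T>(
  attempt: () => Promise<T>,
  maxRetries: number,
  delayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i <= maxRetries; i++) {
    try {
      return await attempt();
    } catch (err) {
      lastError = err;
      if (i < maxRetries) {
        await new Promise((resolve) => setTimeout(resolve, delayMs * (i + 1)));
      }
    }
  }
  throw lastError;
}

// Downloads a URL and rejects responses whose body is not PDF bytes.
async function downloadPdf(url: string, maxRetries = 4): Promise<Uint8Array> {
  return withRetries(async () => {
    const response = await fetch(url);
    if (!response.ok) throw new Error(`HTTP ${response.status} for ${url}`);
    const body = new Uint8Array(await response.arrayBuffer());
    if (!looksLikePdf(body)) {
      throw new Error(`Not a PDF (possible HTML or anti-bot page): ${url}`);
    }
    return body;
  }, maxRetries);
}
```

Checking the leading bytes rather than the Content-Type header is a deliberate choice in this sketch, since a blocked request can still be served with a misleading header.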

If every PDF fails, the Actor now ends in a failed status instead of silently reporting a successful run with only error items.
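
Two of the rules above, the fallback for an omitted pdfUrls and the failed status when every PDF fails, might look like this in outline. The names resolvePdfUrls and shouldFailRun are hypothetical, and the sketch only applies the fallback when the field is missing, as described; how the Actor treats an empty array is not stated in this document.

```typescript
// Illustrative only: the names below are assumptions, not the Actor's source.
const DEFAULT_SAMPLE = "builtin://sample.pdf";

interface RawInput {
  pdfUrls?: string[];
  extractKeyValuePairs?: boolean; // deprecated, accepted but ignored
  useOcr?: boolean; // deprecated, accepted but ignored
}

// Falls back to the bundled sample when pdfUrls is missing from the input.
function resolvePdfUrls(input: RawInput | null | undefined): string[] {
  const urls = input?.pdfUrls;
  return Array.isArray(urls) ? urls : [DEFAULT_SAMPLE];
}

interface ItemResult {
  sourceUrl: string;
  success: boolean;
}

// True only when at least one PDF was attempted and none succeeded.
function shouldFailRun(results: ItemResult[]): boolean {
  return results.length > 0 && results.every((r) => !r.success);
}

console.log(resolvePdfUrls({ useOcr: true })); // [ 'builtin://sample.pdf' ]
```

A run with mixed results would still finish as succeeded under this rule; only a run where every item has success set to false is marked as failed.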

Output example

{
  "sourceUrl": "builtin://sample.pdf",
  "fileName": "dummy.pdf",
  "pageCount": 1,
  "metadata": {
    "PDFFormatVersion": "1.4"
  },
  "text": "Dummy PDF file",
  "tables": [],
  "success": true,
  "processedAt": "2026-04-18T10:00:00.000Z"
}
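
Consumers reading dataset items may want to check a record's shape before using it. The following is a rough sketch: the PdfRecord type is inferred from the example above and is an assumption, metadata is treated as optional because metadata output is optional, and the shape of error items is not documented here, so the guard simply rejects anything that does not match.

```typescript
// Field names follow the output example; the types are inferred assumptions.
interface PdfRecord {
  sourceUrl: string;
  fileName: string;
  pageCount: number;
  metadata?: Record<string, unknown>;
  text: string;
  tables: unknown[];
  success: boolean;
  processedAt: string;
}

// Hypothetical type guard that checks the fields a consumer is most
// likely to rely on.
function isPdfRecord(value: unknown): value is PdfRecord {
  if (typeof value !== "object" || value === null) return false;
  const r = value as Record<string, unknown>;
  return (
    typeof r.sourceUrl === "string" &&
    typeof r.fileName === "string" &&
    typeof r.pageCount === "number" &&
    typeof r.text === "string" &&
    typeof r.success === "boolean" &&
    Array.isArray(r.tables)
  );
}

const item: unknown = JSON.parse(
  '{"sourceUrl":"builtin://sample.pdf","fileName":"dummy.pdf","pageCount":1,' +
    '"metadata":{"PDFFormatVersion":"1.4"},"text":"Dummy PDF file","tables":[],' +
    '"success":true,"processedAt":"2026-04-18T10:00:00.000Z"}',
);

if (isPdfRecord(item)) {
  console.log(`${item.fileName}: ${item.pageCount} page(s)`); // dummy.pdf: 1 page(s)
}
```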

Apify outputs

  • Dataset items: one normalized record per processed PDF
  • RUN_SUMMARY: compact run summary in the default key-value store
  • RESULTS.json or RESULTS.pretty.json: aggregated export of all dataset items
  • DEBUG_*: optional diagnostic metadata and HTML/text previews for blocked downloads when saveDebugSnapshots is enabled
  • Live view:
    • /: HTML dashboard
    • /health: compact JSON counters
    • /status: full in-memory run state
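
The three live-view routes can be pictured as a small routing function. This is a hypothetical sketch rather than the Actor's code: the RunState fields, the HTML body, and the 404 fallback are all assumptions, since the real payloads are not shown in this document. Keeping the router free of any HTTP server makes it easy to test, and it could be wired into a server such as Node's built-in http module.

```typescript
// Hypothetical run state; the real /status payload is not documented here.
interface RunState {
  total: number;
  succeeded: number;
  failed: number;
  items: { sourceUrl: string; success: boolean }[];
}

interface LiveResponse {
  status: number;
  contentType: string;
  body: string;
}

function jsonResponse(value: unknown): LiveResponse {
  return { status: 200, contentType: "application/json", body: JSON.stringify(value) };
}

// Maps a request path to the dashboard, the counters, or the full state.
function routeLiveView(path: string, state: RunState): LiveResponse {
  switch (path) {
    case "/":
      return {
        status: 200,
        contentType: "text/html; charset=utf-8",
        body: `<h1>PDF to JSON</h1><p>${state.succeeded}/${state.total} succeeded</p>`,
      };
    case "/health": {
      const { total, succeeded, failed } = state;
      return jsonResponse({ total, succeeded, failed });
    }
    case "/status":
      return jsonResponse(state);
    default:
      return { status: 404, contentType: "text/plain; charset=utf-8", body: "Not found" };
  }
}

const state: RunState = {
  total: 2,
  succeeded: 1,
  failed: 1,
  items: [
    { sourceUrl: "builtin://sample.pdf", success: true },
    { sourceUrl: "https://example.com/missing.pdf", success: false },
  ],
};
console.log(routeLiveView("/health", state).body); // {"total":2,"succeeded":1,"failed":1}
```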

Local development

npm install
npm run build
npm start

To run the same smoke test locally that Apify's health check exercises:

npm run smoke

For local TypeScript changes, rebuild before running or use:

npm run dev

The runtime start command intentionally launches dist/main.js; the Docker build runs npm run build first and then prunes development dependencies before startup.

The Actor's source of truth is main.ts, and the Apify Console input UI is defined in .actor/INPUT_SCHEMA.json.