Pdf to json
Pricing
from $3.50 / 1,000 results
Convert PDF files into structured JSON with text extraction, optional table extraction, and metadata parsing. Suited to invoices, receipts, contracts, statements, forms, and document automation workflows. This version handles text-based (digital) PDFs; OCR for scanned PDFs and key-value detection are not included.
Developer
Shahab Uddin
PDF to JSON API
This Apify Actor converts PDF files into normalized JSON. It accepts direct http / https PDF URLs for real workloads and ships with a bundled builtin://sample.pdf smoke-test input so Apify Store QA does not depend on a third-party sample URL staying online.
What it supports
- Text extraction from standard text-based PDFs
- Optional table extraction
- Optional metadata output
- Multiple PDF URLs per run
- Apify dataset views, key-value store summaries, and live status page
Current limitations
- OCR for image-only or scanned PDFs is not included in this version
- Production URLs must be directly downloadable over http or https
- maxPages limits parsed pages for text extraction, but pageCount still reports the full document page count reported by the PDF parser
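The URL and page-limit rules above can be sketched as two small checks. This is a minimal illustration, not the Actor's code: the helper names `isSupportedPdfUrl` and `pagesToParse` are hypothetical, and treating a missing or non-positive `maxPages` as "no cap" is an assumption.

```typescript
// Hypothetical helpers; names and edge-case handling are illustrative only.

/** True for direct http/https links and for the bundled smoke-test sample. */
function isSupportedPdfUrl(url: string): boolean {
  return url === "builtin://sample.pdf" || /^https?:\/\/\S+$/i.test(url);
}

/**
 * Number of pages parsed for text. The reported pageCount is not affected:
 * it stays at the document's full page count.
 * Assumption: an absent or non-positive maxPages means "no cap".
 */
function pagesToParse(pageCount: number, maxPages?: number): number {
  if (maxPages === undefined || maxPages <= 0) return pageCount;
  return Math.min(pageCount, maxPages);
}
```

For a 10-page document with `maxPages` set to 3, text would come from 3 pages while `pageCount` would still read 10.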
Input example
    {
      "pdfUrls": ["builtin://sample.pdf"],
      "extractTables": false,
      "includeMetadata": true,
      "outputFormat": "json",
      "maxDownloadRetries": 4,
      "requestTimeoutSecs": 30,
      "saveDebugSnapshots": false,
      "proxyConfiguration": {
        "useApifyProxy": false
      }
    }
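For callers building this input in code, the example maps onto a typed shape like the one below. This is a sketch under assumptions: the field names come from the example, but the `ActorInput` type, the `withExampleDefaults` helper, and the idea that the defaults equal the example values are illustrative rather than taken from the Actor's input schema.

```typescript
// Field names follow the input example; the types are inferred from its values.
interface ActorInput {
  pdfUrls?: string[];
  extractTables?: boolean;
  includeMetadata?: boolean;
  outputFormat?: string;
  maxDownloadRetries?: number;
  requestTimeoutSecs?: number;
  saveDebugSnapshots?: boolean;
  proxyConfiguration?: { useApifyProxy: boolean };
}

// Assumption: defaults mirror the example input shown above.
const EXAMPLE_DEFAULTS: Required<ActorInput> = {
  pdfUrls: ["builtin://sample.pdf"],
  extractTables: false,
  includeMetadata: true,
  outputFormat: "json",
  maxDownloadRetries: 4,
  requestTimeoutSecs: 30,
  saveDebugSnapshots: false,
  proxyConfiguration: { useApifyProxy: false },
};

/** Fills every omitted field with the example value (hypothetical helper). */
function withExampleDefaults(input: ActorInput = {}): Required<ActorInput> {
  const d = EXAMPLE_DEFAULTS;
  return {
    pdfUrls: input.pdfUrls ?? d.pdfUrls.slice(),
    extractTables: input.extractTables ?? d.extractTables,
    includeMetadata: input.includeMetadata ?? d.includeMetadata,
    outputFormat: input.outputFormat ?? d.outputFormat,
    maxDownloadRetries: input.maxDownloadRetries ?? d.maxDownloadRetries,
    requestTimeoutSecs: input.requestTimeoutSecs ?? d.requestTimeoutSecs,
    saveDebugSnapshots: input.saveDebugSnapshots ?? d.saveDebugSnapshots,
    proxyConfiguration: input.proxyConfiguration ?? { ...d.proxyConfiguration },
  };
}
```

Calling `withExampleDefaults({ pdfUrls: ["https://example.com/report.pdf"], extractTables: true })` would keep the two supplied fields and fill in the rest; the URL here is a placeholder.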
Apify QA compatibility
Apify Store's automated health check runs the Actor with its default input and expects a succeeded run with a non-empty default dataset within 5 minutes. To keep this Actor healthy:
- The input schema now uses both prefill and default for pdfUrls, which avoids older tasks or integrations failing when the field is omitted.
- The schema also pre-fills the lightweight default options, matching Apify's daily default-input health check path.
- The default sample is bundled into the Actor as builtin://sample.pdf, so daily checks do not rely on a third-party PDF host.
- The custom Dockerfile builds main.ts during the Apify image build and runs the generated dist/main.js, so production cannot drift from the TypeScript source.
- The runtime falls back to the bundled sample when pdfUrls is omitted, which protects legacy runs that were created before the field existed.
- The deprecated legacy inputs extractKeyValuePairs and useOcr are still accepted as hidden no-op fields so older saved Apify inputs do not fail validation.
- JSON records in the default key-value store are written with an explicit application/json content type so they satisfy Apify's key-value-store schema validation rules for collections that use jsonSchema.
- Real URL downloads now use retryable browser-like requests, optional Apify proxy support, and optional debug snapshots when a target serves HTML or an anti-bot page instead of a PDF.
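The last point, telling a real PDF apart from an HTML or anti-bot page, can be illustrated with a content sniff. The sketch below is an assumption about how such a check might look, not the Actor's implementation: `looksLikePdf`, `classifyDownload`, and `shouldSaveDebugSnapshot` are hypothetical names. It relies only on the fact that PDF files begin with the ASCII bytes `%PDF-`.

```typescript
type DownloadKind = "pdf" | "html" | "unknown";

/** PDF files start with "%PDF-". This strict check looks at offset 0 only. */
function looksLikePdf(bytes: Uint8Array): boolean {
  const magic = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  return bytes.length >= magic.length && magic.every((b, i) => bytes[i] === b);
}

/** Decodes the first bytes as single-byte characters for a cheap text sniff. */
function asciiHead(bytes: Uint8Array, limit: number = 256): string {
  let out = "";
  for (let i = 0; i < bytes.length && i < limit; i++) {
    out += String.fromCharCode(bytes[i]);
  }
  return out;
}

/** The body bytes win over the Content-Type header, which servers often get wrong. */
function classifyDownload(contentType: string | undefined, bytes: Uint8Array): DownloadKind {
  if (looksLikePdf(bytes)) return "pdf";
  const head = asciiHead(bytes).trim().toLowerCase();
  const headerSaysHtml = (contentType ?? "").toLowerCase().indexOf("text/html") !== -1;
  const bodyLooksHtml = head.indexOf("<!doctype html") === 0 || head.indexOf("<html") === 0;
  return headerSaysHtml || bodyLooksHtml ? "html" : "unknown";
}

/** Snapshots are only useful when enabled and the response was not a PDF. */
function shouldSaveDebugSnapshot(kind: DownloadKind, saveDebugSnapshots: boolean): boolean {
  return saveDebugSnapshots && kind !== "pdf";
}
```

Checking the bytes first means a PDF served with a wrong `Content-Type` is still accepted, while an HTML block page labelled `application/pdf` is still caught.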
If every PDF fails, the Actor now ends in a failed status instead of silently reporting a successful run with only error items.
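The failure rule can be expressed as a small decision over per-PDF results. This is a sketch: `ItemResult`, `RunOutcome`, and `decideRunOutcome` are hypothetical names, and treating an empty result list as a failure is an assumption the source does not address.

```typescript
interface ItemResult {
  sourceUrl: string;
  success: boolean;
}

type RunOutcome = "succeeded" | "failed";

/**
 * A run succeeds when at least one PDF was processed successfully.
 * Assumption: a run with no results at all also counts as failed.
 */
function decideRunOutcome(results: ItemResult[]): RunOutcome {
  return results.some((r) => r.success) ? "succeeded" : "failed";
}
```

A mixed run, with some PDFs failing and at least one succeeding, still ends as succeeded under this rule, with the failures visible as error items in the dataset.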
Output example
    {
      "sourceUrl": "builtin://sample.pdf",
      "fileName": "dummy.pdf",
      "pageCount": 1,
      "metadata": {
        "PDFFormatVersion": "1.4"
      },
      "text": "Dummy PDF file",
      "tables": [],
      "success": true,
      "processedAt": "2026-04-18T10:00:00.000Z"
    }
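Consumers of the dataset may want to validate records before using them. The type and guard below are a sketch inferred from the output example only: `PdfRecord` and `isPdfRecord` are hypothetical names, `metadata` is treated as optional because `includeMetadata` is an option, and the shape of error items is not shown in the source, so it is not modelled here.

```typescript
// Shape inferred from the output example; not an official schema.
interface PdfRecord {
  sourceUrl: string;
  fileName: string;
  pageCount: number;
  metadata?: Record<string, unknown>;
  text: string;
  tables: unknown[];
  success: boolean;
  processedAt: string; // ISO 8601 timestamp
}

/** Runtime check that an unknown dataset item matches the example's shape. */
function isPdfRecord(value: unknown): value is PdfRecord {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const metadataOk =
    v.metadata === undefined || (typeof v.metadata === "object" && v.metadata !== null);
  return (
    typeof v.sourceUrl === "string" &&
    typeof v.fileName === "string" &&
    typeof v.pageCount === "number" &&
    metadataOk &&
    typeof v.text === "string" &&
    Array.isArray(v.tables) &&
    typeof v.success === "boolean" &&
    typeof v.processedAt === "string" &&
    !isNaN(Date.parse(v.processedAt))
  );
}
```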
Apify outputs
- Dataset items: one normalized record per processed PDF
- RUN_SUMMARY: compact run summary in the default key-value store
- RESULTS.json or RESULTS.pretty.json: aggregated export of all dataset items
- DEBUG_*: optional diagnostic metadata and HTML/text previews for blocked downloads when saveDebugSnapshots is enabled
- Live view:
  - / HTML dashboard
  - /health compact JSON counters
  - /status full in-memory run state
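The three live-view routes can be pictured as a pure function from a path and the in-memory state to a response. Everything here beyond the three paths is an assumption: `RunState`, its counter fields, `LiveViewResponse`, and `renderLiveView` are hypothetical, and the real dashboard markup and counter names are not shown in the source.

```typescript
// Hypothetical in-memory state; the real field names are not documented.
interface RunState {
  total: number;
  succeeded: number;
  failed: number;
  items: { sourceUrl: string; success: boolean }[];
}

interface LiveViewResponse {
  status: number;
  contentType: string;
  body: string;
}

function jsonResponse(payload: unknown): LiveViewResponse {
  return { status: 200, contentType: "application/json", body: JSON.stringify(payload) };
}

/** Maps a request path to a response without touching any network APIs. */
function renderLiveView(path: string, state: RunState): LiveViewResponse {
  switch (path) {
    case "/health": // compact counters only
      return jsonResponse({ total: state.total, succeeded: state.succeeded, failed: state.failed });
    case "/status": // full in-memory run state
      return jsonResponse(state);
    case "/": // minimal HTML dashboard
      return {
        status: 200,
        contentType: "text/html; charset=utf-8",
        body: `<h1>PDF to JSON</h1><p>${state.succeeded} of ${state.total} succeeded</p>`,
      };
    default:
      return { status: 404, contentType: "text/plain; charset=utf-8", body: "Not found" };
  }
}
```

Keeping the routing pure makes it easy to test without starting a server; an HTTP handler would only need to copy `status`, `contentType`, and `body` onto the response.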
Local development
    npm install
    npm run build
    npm start
Run the same smoke path Apify cares about locally with:
$ npm run smoke
For local TypeScript changes, rebuild before running or use:
$ npm run dev
The runtime start command intentionally launches dist/main.js; the Docker build runs npm run build first and then prunes development dependencies before startup.
The actor source of truth is main.ts, and the Apify Console input UI is defined in .actor/INPUT_SCHEMA.json.