Pdf to json

Pricing

from $3.50 / 1,000 results

Convert PDF files into structured JSON with optional OCR, table extraction, key-value detection, and metadata parsing. Ideal for invoices, receipts, contracts, statements, forms, and document automation workflows. Supports digital and scanned PDFs for API-ready data extraction.

Rating

0.0

(0)

Developer

Shahab Uddin

Maintained by Community

Actor stats

Bookmarked: 0

Total users: 1

Monthly active users: 0

Last modified: 2 days ago

PDF to JSON API — Convert PDF Files to Structured JSON (TypeScript, Advanced OCR, Table & Field Extraction)

PDF to JSON API is a production-ready Apify Actor, written in TypeScript, that converts PDF files into clean, structured JSON. It supports advanced OCR for scanned PDFs, table extraction, key-value field extraction, and is designed for easy extension and integration.

Website: caastleaapk.com


Why use this PDF to JSON API?

  • TypeScript-powered for safety, maintainability, and developer experience
  • Advanced OCR: Extract text from scanned/image-based PDFs (extendable)
  • Table extraction: Detect and extract tables from PDFs (customizable logic)
  • Key-value field extraction: Extract structured fields for invoices, receipts, contracts, and more
  • Metadata extraction: Capture PDF metadata for compliance and search
  • API-ready: Use as a PDF parser API or document parsing API in your workflows
  • Dataset output: Results are saved to Apify dataset for easy integration with Make, Zapier, n8n, and custom apps
  • Input schema: UI for manual runs, input validation, and API consistency

Features

  • Convert digital and scanned PDFs to normalized JSON
  • Optional advanced OCR mode (extendable with pdf-lib/pdfjs-dist + tesseract.js)
  • Table and key-value extraction (custom logic supported)
  • Modular, clean TypeScript codebase for easy extension
  • Handles multiple PDF URLs per run
  • Robust error handling and input validation
  • Commercial-quality, ready for production and API use

Use Cases

  • Invoice, receipt, and bank statement parsing (with custom field extraction)
  • Contract and compliance document analysis
  • Resume and form extraction
  • Research paper and report ingestion
  • AI and LLM document preprocessing
  • Internal knowledge base building

Input Example

{
  "pdfUrls": [
    "https://example.com/sample.pdf"
  ],
  "useOcr": true,
  "extractTables": true,
  "extractKeyValuePairs": true,
  "includeMetadata": true,
  "outputFormat": "json",
  "maxPages": 25,
  "timeoutSecs": 120
}
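
The input above can be typed and checked before it is sent to the Actor. The following is a minimal sketch, not the Actor's own validation code: the `PdfToJsonInput` interface, the `normalizeInput` helper, and the default values are assumptions inferred from the example, and the Actor's real defaults may differ.

```typescript
// Hypothetical input shape, inferred from the input example above.
interface PdfToJsonInput {
  pdfUrls: string[];
  useOcr?: boolean;
  extractTables?: boolean;
  extractKeyValuePairs?: boolean;
  includeMetadata?: boolean;
  outputFormat?: "json";
  maxPages?: number;
  timeoutSecs?: number;
}

// Assumed defaults for illustration only; the Actor's real defaults are not documented here.
const DEFAULTS = {
  useOcr: false,
  extractTables: false,
  extractKeyValuePairs: false,
  includeMetadata: true,
  maxPages: 25,
  timeoutSecs: 120,
};

function assertPositiveInt(name: string, value: number): void {
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`${name} must be a positive integer`);
  }
}

// Validates the raw input and fills in the assumed defaults.
function normalizeInput(raw: unknown): Required<PdfToJsonInput> {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("Input must be a JSON object");
  }
  const input = raw as Partial<PdfToJsonInput>;
  if (!Array.isArray(input.pdfUrls) || input.pdfUrls.length === 0) {
    throw new Error("pdfUrls must be a non-empty array");
  }
  for (const url of input.pdfUrls) {
    let parsed: URL;
    try {
      parsed = new URL(String(url));
    } catch {
      throw new Error(`Invalid PDF URL: ${String(url)}`);
    }
    if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
      throw new Error(`Unsupported URL protocol: ${parsed.protocol}`);
    }
  }
  const maxPages = input.maxPages ?? DEFAULTS.maxPages;
  const timeoutSecs = input.timeoutSecs ?? DEFAULTS.timeoutSecs;
  assertPositiveInt("maxPages", maxPages);
  assertPositiveInt("timeoutSecs", timeoutSecs);
  return {
    pdfUrls: input.pdfUrls.map(String),
    useOcr: input.useOcr ?? DEFAULTS.useOcr,
    extractTables: input.extractTables ?? DEFAULTS.extractTables,
    extractKeyValuePairs: input.extractKeyValuePairs ?? DEFAULTS.extractKeyValuePairs,
    includeMetadata: input.includeMetadata ?? DEFAULTS.includeMetadata,
    outputFormat: input.outputFormat ?? "json",
    maxPages,
    timeoutSecs,
  };
}
```

Calling `normalizeInput` on the example input returns the same values with every optional field present, which keeps client code free of `undefined` checks.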

Output Example

{
  "sourceUrl": "https://example.com/sample.pdf",
  "fileName": "sample.pdf",
  "pageCount": 12,
  "metadata": {
    "title": "Sample PDF",
    "author": "Unknown"
  },
  "text": "Full extracted text goes here...",
  "tables": [
    {
      "page": 2,
      "rows": [
        ["Name", "Amount"],
        ["Invoice A", "1200"]
      ]
    }
  ],
  "keyValuePairs": {
    "invoice_number": "INV-1001",
    "total": "1200"
  },
  "success": true,
  "processedAt": "2026-04-12T10:00:00.000Z"
}

If processing fails:

{
  "sourceUrl": "https://example.com/broken.pdf",
  "success": false,
  "error": "Unable to parse PDF"
}
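
Because every dataset item carries a `success` flag, consumers can model the two shapes above as a discriminated union. This is a sketch under assumptions: the type names and the `partitionResults` helper are hypothetical, and which fields are optional is a guess based on the input flags rather than a documented schema.

```typescript
// Hypothetical types, inferred from the two output examples above.
interface PdfTable {
  page: number;
  rows: string[][];
}

interface PdfSuccess {
  sourceUrl: string;
  fileName: string;
  pageCount: number;
  metadata?: Record<string, string>; // assumed to depend on includeMetadata
  text: string;
  tables?: PdfTable[]; // assumed to depend on extractTables
  keyValuePairs?: Record<string, string>; // assumed to depend on extractKeyValuePairs
  success: true;
  processedAt: string;
}

interface PdfFailure {
  sourceUrl: string;
  success: false;
  error: string;
}

type PdfResult = PdfSuccess | PdfFailure;

// Splits dataset items into successful and failed results.
function partitionResults(items: PdfResult[]): { ok: PdfSuccess[]; failed: PdfFailure[] } {
  const ok: PdfSuccess[] = [];
  const failed: PdfFailure[] = [];
  for (const item of items) {
    if (item.success) {
      ok.push(item); // narrowed to PdfSuccess by the success flag
    } else {
      failed.push(item); // narrowed to PdfFailure
    }
  }
  return { ok, failed };
}
```

Checking `success` first means the compiler refuses access to `text` or `tables` on a failed item, so a broken PDF cannot silently flow into downstream processing.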

How it Works

  1. Accepts one or more PDF URLs (or uploaded files if supported)
  2. Downloads and inspects each PDF
  3. Extracts text from digital PDFs
  4. Uses advanced OCR for scanned/image-based PDFs (extendable in TypeScript)
  5. Detects tables and key-value fields (custom logic possible)
  6. Normalizes everything into a stable JSON schema
  7. Saves results to the Apify dataset for API access and integrations

TypeScript support: The Actor is written in TypeScript for maintainability, type safety, and easy extension. Add your own advanced OCR, table, or field extraction logic in main.ts.
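
The steps above can be sketched as a small per-URL pipeline. This is an illustration, not the code in main.ts: the `Extractors` interface, `looksScanned`, `processPdf`, and `processAll` are hypothetical names, the "nearly empty text layer" check is an assumed heuristic for spotting scanned PDFs, and table and key-value extraction are left out for brevity.

```typescript
// Hypothetical extractor interface; the real helpers in main.ts may differ.
interface Extractors {
  download(url: string): Promise<Uint8Array>;
  extractText(pdf: Uint8Array): Promise<string>;
  ocr(pdf: Uint8Array): Promise<string>;
}

type PipelineResult =
  | { sourceUrl: string; success: true; text: string; usedOcr: boolean; processedAt: string }
  | { sourceUrl: string; success: false; error: string };

// Assumed heuristic: a nearly empty text layer suggests a scanned PDF.
function looksScanned(text: string, minChars = 20): boolean {
  return text.trim().length < minChars;
}

// Processes one PDF and never throws: failures become error records.
async function processPdf(url: string, useOcr: boolean, ex: Extractors): Promise<PipelineResult> {
  try {
    const pdf = await ex.download(url);
    let text = await ex.extractText(pdf);
    let usedOcr = false;
    if (useOcr && looksScanned(text)) {
      text = await ex.ocr(pdf);
      usedOcr = true;
    }
    return { sourceUrl: url, success: true, text, usedOcr, processedAt: new Date().toISOString() };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { sourceUrl: url, success: false, error: message };
  }
}

// Processes URLs one at a time so a single bad PDF does not abort the run.
async function processAll(urls: string[], useOcr: boolean, ex: Extractors): Promise<PipelineResult[]> {
  const results: PipelineResult[] = [];
  for (const url of urls) {
    results.push(await processPdf(url, useOcr, ex));
  }
  return results;
}
```

Passing the extractors in as an argument keeps the control flow testable with fakes and leaves room to swap in a different OCR or text extraction backend.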


API Usage with Apify Example

Run the Actor with the Apify API:

curl -X POST "https://api.apify.com/v2/acts/YOUR_USERNAME~pdf-to-json-api/runs?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "pdfUrls": ["https://example.com/sample.pdf"],
    "useOcr": true,
    "extractTables": true,
    "extractKeyValuePairs": true,
    "includeMetadata": true,
    "outputFormat": "json"
  }'

After the run finishes, fetch results from the dataset:

curl "https://api.apify.com/v2/datasets/YOUR_DATASET_ID/items?clean=true&format=json"
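
The same two calls can be made from TypeScript. The sketch below assumes Node.js 18 or later (for the global `fetch`) and uses hypothetical helper names (`buildRunUrl`, `buildDatasetItemsUrl`, `startRun`, `fetchItems`). It sends the token in an `Authorization: Bearer` header instead of the query string, and it assumes the run response exposes the dataset ID as `data.defaultDatasetId`. Waiting for the run to finish is left out.

```typescript
const API_BASE = "https://api.apify.com/v2";

// Builds the "run Actor" endpoint, e.g. for "YOUR_USERNAME~pdf-to-json-api".
function buildRunUrl(actorId: string): string {
  return `${API_BASE}/acts/${encodeURIComponent(actorId)}/runs`;
}

// Builds the dataset items endpoint with the same query string as the curl example.
function buildDatasetItemsUrl(datasetId: string): string {
  const params = new URLSearchParams({ clean: "true", format: "json" });
  return `${API_BASE}/datasets/${encodeURIComponent(datasetId)}/items?${params.toString()}`;
}

// Starts a run and returns the ID of its default dataset.
// Assumption: the response body has the shape { data: { defaultDatasetId } }.
async function startRun(actorId: string, token: string, input: object): Promise<string> {
  const res = await fetch(buildRunUrl(actorId), {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify(input),
  });
  if (!res.ok) {
    throw new Error(`Run request failed with HTTP ${res.status}`);
  }
  const body = (await res.json()) as { data: { defaultDatasetId: string } };
  return body.data.defaultDatasetId;
}

// Fetches dataset items; call this only after the run has finished.
async function fetchItems(datasetId: string, token: string): Promise<unknown[]> {
  const res = await fetch(buildDatasetItemsUrl(datasetId), {
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!res.ok) {
    throw new Error(`Dataset request failed with HTTP ${res.status}`);
  }
  return (await res.json()) as unknown[];
}
```

`startRun` returns as soon as the run is created, so a real client would still need to wait for the run to complete before calling `fetchItems`, otherwise the dataset may be empty or partial.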

Extending the Actor (TypeScript)

  • Advanced OCR: Use pdfjs-dist to render PDF pages as images, then run tesseract.js on those images for OCR. (pdf-lib can split or modify PDFs but does not rasterize pages.) See the extractTextFromPdf function in main.ts for extension points.
  • Table Extraction: Replace or enhance the default table extraction logic with ML models or custom heuristics in extractTablesFromPdf.
  • Key-Value Extraction: Add regex, ML, or domain-specific logic in extractKeyValuePairs for invoices, receipts, contracts, etc.
  • API/Integration: Use the Apify dataset output for downstream automation, RPA, or AI workflows.
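
As an illustration of the regex route for key-value extraction, the sketch below pulls two invoice fields out of extracted text. It is a hypothetical helper, not the `extractKeyValuePairs` function in main.ts, whose signature is not shown here; the `FIELD_PATTERNS` table and the patterns themselves are assumptions that would need tuning for real documents.

```typescript
// Hypothetical field patterns; each must have exactly one capture group for the value.
const FIELD_PATTERNS: Record<string, RegExp> = {
  // Matches "Invoice No: X", "Invoice Number X", "Invoice # X".
  invoice_number: /invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-]*)/i,
  // \b keeps "Subtotal" from matching as "total".
  total: /\btotal\s*[:\-]?\s*\$?\s*([0-9][0-9,]*(?:\.[0-9]{2})?)/i,
};

// Returns the first match for each known field; fields with no match are omitted.
function extractFields(text: string): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const key of Object.keys(FIELD_PATTERNS)) {
    const match = FIELD_PATTERNS[key].exec(text);
    if (match && match[1]) {
      fields[key] = match[1];
    }
  }
  return fields;
}
```

Keeping the patterns in a lookup table means a new document type only needs new entries, and the output keys line up with the `keyValuePairs` object in the output example.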

FAQ

Is this Actor written in TypeScript?
Yes, the Actor is TypeScript-based for better code quality, maintainability, and extensibility.

Can I add my own advanced OCR, table, or field extraction logic?
Absolutely. The codebase is modular and ready for you to plug in custom logic for invoices, receipts, contracts, and more. See main.ts for extension points.

What is a PDF to JSON API?
A PDF to JSON API converts PDF documents into machine-readable JSON so the data can be searched, automated, and integrated into software systems.

Can I convert PDF to JSON automatically?
Yes. This Actor is designed to convert PDF to JSON automatically using API input, optional OCR, and structured output saved to an Apify dataset.

Does this support scanned PDFs?
Yes. With OCR enabled (useOcr set to true), the Actor can process scanned or image-based PDFs and return their text as JSON.

Is this a PDF parser API or document parsing API?
Both. It can be used as a PDF parser API and more generally as a document parsing API for structured extraction.

Can I extract tables from PDF files?
Yes. The Actor can extract table-like structures and include them in the returned JSON.

What types of documents work best?
Invoices, receipts, contracts, forms, reports, statements, resumes, and research PDFs are common use cases.

Can I use this for AI workflows?
Yes. Structured JSON output is helpful for embeddings, retrieval pipelines, document classification, and LLM-based automation.


Support

For custom integrations, advanced extraction, TypeScript consulting, or branded implementations, visit:

caastleaapk.com


License

MIT