Pdf Json Extractor
Pricing
from $50.00 / 1,000 results
Convert any PDF into structured JSON using AI and OCR (Tesseract or Google Vision). Supports custom schemas, validation, and auto-repair. Ideal for invoices, contracts, receipts, and automation workflows. Fast, accurate, and easy to integrate.
Developer

Peerapat Pongnipakorn
PDF → Structured JSON Extractor (Apify Actor)
This Apify Actor extracts structured JSON from PDF files using PDF parsing, optional OCR, and LLM-based schema extraction.
Features
- Accepts a pdfUrl (HTTP) or pdfBase64 (string) as input
- Extracts raw text using pdf-parse and, optionally, OCR (stub)
- Sends the text and a user-provided schema to an LLM to return strict JSON
- Pushes the extraction result to the Dataset
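The first feature, accepting either a URL or a base64 string, can be sketched as below. This is a minimal illustration and not the Actor's actual code: `loadPdfBuffer` and `looksLikePdf` are hypothetical names, and the sketch assumes Node 18+ so that a global `fetch` is available.

```javascript
// Hypothetical helper: turn the Actor input into a Buffer of PDF bytes.
// `fetchImpl` is injectable so the function can be exercised without network access.
async function loadPdfBuffer(input, fetchImpl = globalThis.fetch) {
  const { pdfUrl, pdfBase64 } = input ?? {};

  if (pdfBase64) {
    // Tolerate an optional data-URI prefix such as "data:application/pdf;base64,".
    const raw = pdfBase64.replace(/^data:[^,]*,/, '');
    return Buffer.from(raw, 'base64');
  }

  if (pdfUrl) {
    const res = await fetchImpl(pdfUrl);
    if (!res.ok) {
      throw new Error(`Failed to download PDF: HTTP ${res.status}`);
    }
    return Buffer.from(await res.arrayBuffer());
  }

  throw new Error('Provide either "pdfUrl" or "pdfBase64" in the input.');
}

// PDF files start with the "%PDF-" magic bytes; a cheap sanity check before parsing.
function looksLikePdf(buffer) {
  return buffer.subarray(0, 5).toString('latin1') === '%PDF-';
}
```

Checking the magic bytes before handing the buffer to the parser gives a clearer error when the URL returns an HTML error page instead of a PDF.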
Quick start
- Update the callLLM function in main.js to call your chosen LLM provider (OpenAI, Anthropic, or Google)
- (Optional) Implement runOCR using Tesseract or a cloud OCR API
- Run apify push to deploy to your Apify account, then run the Actor with input.json
Example input.json
{
  "pdfUrl": "https://example.com/invoice123.pdf",
  "schema": {
    "invoice_number": "string",
    "invoice_date": "date",
    "total_amount": "number",
    "items": [{ "name": "string", "qty": "number", "price": "number" }]
  },
  "aiModel": "gpt-4o-mini",
  "ocr": false,
  "returnFormat": "json"
}
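To show how the schema from an input like this might reach the model, here is a minimal sketch of a prompt builder and a tolerant reply parser. The prompt wording and the names `buildExtractionPrompt` and `parseModelJson` are assumptions for illustration, not the Actor's actual prompt or API.

```javascript
// Hypothetical prompt builder: the wording is illustrative only.
function buildExtractionPrompt(text, schema) {
  return [
    'Extract the fields described by the schema from the document text.',
    'Respond with a single JSON object and nothing else.',
    'Use null for any field that is not present in the document.',
    '',
    'Schema:',
    JSON.stringify(schema, null, 2),
    '',
    'Document text:',
    text,
  ].join('\n');
}

// Models sometimes wrap JSON in a Markdown fence or add stray prose around it,
// so keep only the outermost {...} span before parsing.
function parseModelJson(reply) {
  const start = reply.indexOf('{');
  const end = reply.lastIndexOf('}');
  if (start === -1 || end <= start) {
    throw new SyntaxError('No JSON object found in the model reply.');
  }
  return JSON.parse(reply.slice(start, end + 1));
}
```

A `callLLM` replacement could send the output of `buildExtractionPrompt` to the provider and pass the raw reply through `parseModelJson` before validation.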
Notes
- The starter callLLM function is a stub for testing and must be replaced with an actual LLM API call before production use.
- Consider rate limits and cost of LLM calls. Offer batching or model selection in your product.
Suggested pricing
- Free: 20 PDFs / month
- Starter: $19 / month (200 PDFs)
- Pro: $49 / month (1000 PDFs)
- Business: $149 / month (10k PDFs)
Validation & LLM retry behavior
This Actor now validates the extracted JSON using ajv when you provide a JSON Schema as the schema input. If the JSON does not validate, the Actor will automatically attempt to repair it by sending a targeted prompt to the LLM (up to 2 repair attempts).
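The validate-then-repair behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the Actor's source: `callLLM(prompt)` and `validate(data)` are injected, hypothetical functions, and `validate` is assumed to return an array of error messages (empty when the data is valid). With ajv, it could wrap a validator compiled from the schema and map its errors to strings.

```javascript
// Sketch of a validate-then-repair loop with at most `maxRepairs` repair calls.
async function extractWithRepair({ prompt, callLLM, validate, maxRepairs = 2 }) {
  let reply = await callLLM(prompt);

  for (let attempt = 0; ; attempt += 1) {
    let data;
    let errors;
    try {
      data = JSON.parse(reply);
    } catch (err) {
      errors = [`Reply is not valid JSON: ${err.message}`];
    }
    if (!errors) {
      errors = validate(data);
    }

    if (errors.length === 0) {
      // `repairs` is the number of repair prompts that were needed.
      return { data, repairs: attempt };
    }
    if (attempt >= maxRepairs) {
      throw new Error(
        `Validation failed after ${maxRepairs} repair attempts: ${errors.join('; ')}`,
      );
    }

    // Targeted repair prompt: show the model its own output and the exact errors.
    reply = await callLLM(
      [
        'The JSON below does not satisfy the schema.',
        `Errors: ${errors.join('; ')}`,
        'Return a corrected JSON object only.',
        '',
        reply,
      ].join('\n'),
    );
  }
}
```

Passing the concrete validation errors back to the model usually keeps the repair prompt short and focused on the failing fields.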
LLM calls use p-retry with exponential backoff for transient failures (retries on 5xx and rate-limit responses). You can control retry counts and model via the input parameters.
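For readers unfamiliar with the pattern, here is a hand-rolled sketch of retrying with exponential backoff. It is a stand-in for what p-retry does, not p-retry itself, and it assumes that errors thrown by the LLM client carry a numeric `status` property. The names `isRetryable`, `backoffDelay`, and `withRetry`, and the default delays, are hypothetical.

```javascript
// Retry only on transient failures: HTTP 429 (rate limit) and 5xx responses.
// Assumption: the thrown error exposes the HTTP status as `err.status`.
function isRetryable(err) {
  const status = err && err.status;
  return status === 429 || (status >= 500 && status <= 599);
}

// Delay before retry number `retry` (1-based): base, 2x base, 4x base, ... capped at maxMs.
function backoffDelay(retry, baseMs = 500, maxMs = 8000) {
  return Math.min(maxMs, baseMs * 2 ** (retry - 1));
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls `fn` once, then up to `retries` more times while the failure is retryable.
async function withRetry(fn, { retries = 3, baseMs = 500, maxMs = 8000 } = {}) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries || !isRetryable(err)) throw err;
      await sleep(backoffDelay(attempt + 1, baseMs, maxMs));
    }
  }
}
```

Client errors such as 400 or 401 are rethrown immediately, since repeating the same request would not change the outcome.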
OCR Options (Tesseract or Google Vision)
This Actor supports optional OCR when ocr is enabled in the input. You can select the OCR engine via the input ocrOptions.engine field.
ocrOptions example
"ocr": true,"ocrOptions": { "engine": "tesseract" }
or for Google Vision:
"ocr": true,"ocrOptions": { "engine": "google" }
Tesseract (offline)
- Uses tesseract.js (Node). This allows OCR without external APIs but adds a larger dependency.
- No env vars needed. Install dependencies and run the Actor as usual.
Google Vision (cloud OCR)
- Uses the Google Vision DOCUMENT_TEXT_DETECTION feature. Requires the GOOGLE_API_KEY env var, set to an API key that has the Vision API enabled.
- Set the key in the environment before running:
$ export GOOGLE_API_KEY="YOUR_GOOGLE_VISION_API_KEY"
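As a rough sketch of what a Google Vision call with this key could look like, the helper below builds a request for the Vision REST `images:annotate` method with the `DOCUMENT_TEXT_DETECTION` feature type. The helper names are hypothetical, and the sketch assumes `imageBuffer` holds one page already rendered as an image. How the Actor turns PDF pages into images is not described here.

```javascript
// Build the REST call for Vision's images:annotate method.
// Assumption: `imageBuffer` is a single page rendered as PNG or JPEG bytes.
function buildVisionRequest(imageBuffer, apiKey = process.env.GOOGLE_API_KEY) {
  if (!apiKey) {
    throw new Error('GOOGLE_API_KEY is not set.');
  }
  return {
    url: `https://vision.googleapis.com/v1/images:annotate?key=${encodeURIComponent(apiKey)}`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        requests: [
          {
            image: { content: imageBuffer.toString('base64') },
            features: [{ type: 'DOCUMENT_TEXT_DETECTION' }],
          },
        ],
      }),
    },
  };
}

// The recognised text, if any, is returned in fullTextAnnotation.text.
function readVisionText(responseJson) {
  return responseJson?.responses?.[0]?.fullTextAnnotation?.text ?? '';
}
```

With Node 18+, the request could then be sent with `fetch(url, options)` and the parsed JSON response passed to `readVisionText`.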
Behavior notes
- The Actor will attempt pdf-parse extraction first. If ocr is true and the extracted text is short or empty, the configured OCR engine will be invoked.
- OCR can be slower and more expensive (Google Vision costs), so use it only for scanned PDFs.
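The fallback logic in the first note can be sketched as below. Several details are assumptions: the 50-character cut-off stands in for "short or empty", `tesseract` is used as the default engine when none is given, and `parsePdf` and the `engines` map are injected stand-ins for the pdf-parse call and the OCR wrappers.

```javascript
// Hypothetical threshold: "short or empty" is not quantified in this README,
// so this cut-off is an assumption to tune per document type.
const MIN_TEXT_CHARS = 50;

function needsOcr(parsedText, ocrEnabled, minChars = MIN_TEXT_CHARS) {
  if (!ocrEnabled) return false;
  return (parsedText ?? '').trim().length < minChars;
}

// `parsePdf(buffer)` and each `engines[name](buffer)` return a Promise of text.
async function extractText(pdfBuffer, input, { parsePdf, engines }) {
  const parsedText = await parsePdf(pdfBuffer);
  if (!needsOcr(parsedText, input.ocr)) {
    return { text: parsedText, source: 'pdf-parse' };
  }

  // Assumed default engine when ocrOptions.engine is omitted.
  const engineName = input.ocrOptions?.engine ?? 'tesseract';
  if (!Object.hasOwn(engines, engineName)) {
    throw new Error(`Unknown OCR engine: ${engineName}`);
  }
  return { text: await engines[engineName](pdfBuffer), source: engineName };
}
```

Returning the `source` alongside the text makes it easy to see in the Dataset which documents actually incurred OCR cost.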