PDF to JSON Parser

Pricing: Pay per event

Convert PDF documents into structured JSON. Extracts text page by page from public PDF URLs. An optional AI structuring pass (bring your own OpenAI key) turns the raw text into organized JSON with sections, tables, and key fields, ready for automation or analysis.

Developer: BowTiedRaccoon (maintained by Community)

Convert PDF documents into structured JSON. Supply a list of public PDF URLs; the actor downloads each file, extracts text from every page, and returns one dataset record per PDF. Add your OpenAI API key to enable an AI structuring pass that turns the raw text into categorized JSON fields.

What it does

  • Accepts a list of public PDF URLs (up to 50 MB per file)
  • Downloads each PDF to temporary storage and extracts text per page using native PDF parsing
  • Processes every page of each PDF; none are skipped
  • Optionally runs an AI structuring pass (OpenAI GPT-4o-mini or GPT-4o) that organizes the raw text into titled sections, tables, key fields, and metadata
  • Returns one dataset record per PDF with the full extracted text, a per-page breakdown, and (when the AI pass is enabled) the structured output
  • Saves an error record for each PDF that fails to download or parse, then continues with the remaining PDFs
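
The flow described in the list above can be sketched as a loop that turns every URL into either a success record or an error record. This is a minimal illustration under stated assumptions, not the actor's actual code: `process_urls`, `fetch`, and `extract_pages` are hypothetical names, downloading and parsing are passed in as callables, and the contents of an error record beyond `status` and `errorMsg` are guesses.

```python
import json
from datetime import datetime, timezone
from typing import Callable

MAX_BYTES = 50 * 1024 * 1024  # 50 MB per-file limit stated in this README


def process_urls(
    urls: list[str],
    fetch: Callable[[str], bytes],                # hypothetical downloader
    extract_pages: Callable[[bytes], list[str]],  # hypothetical PDF text extractor
    max_items: int = 15,
) -> list[dict]:
    """Return one record per URL; a failure yields an error record, not an exception."""
    records = []
    for url in urls[:max_items]:  # assumption: maxItems truncates the list
        record = {
            "sourceUrl": url, "pageCount": 0, "rawText": "", "pages": "[]",
            "structuredJson": None, "model": None,
            "processedAt": None, "status": "success", "errorMsg": None,
        }
        try:
            data = fetch(url)
            if len(data) > MAX_BYTES:
                raise ValueError("file exceeds the 50 MB limit")
            texts = extract_pages(data)
            record["pageCount"] = len(texts)
            # Assumption: pages are joined with newlines to form rawText.
            record["rawText"] = "\n".join(texts)
            record["pages"] = json.dumps(
                [{"page": i, "text": t} for i, t in enumerate(texts, start=1)]
            )
        except Exception as exc:
            # Record the failure and move on to the next PDF.
            record["status"] = "error"
            record["errorMsg"] = str(exc)
        record["processedAt"] = datetime.now(timezone.utc).isoformat()
        records.append(record)
    return records
```

Catching the exception inside the loop is what keeps one bad URL from ending the whole run.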

Use cases

  • Invoice and receipt extraction for accounting automation
  • Contract and legal document analysis
  • Academic paper indexing and summarization
  • Form data extraction from government or regulatory PDFs
  • Report parsing for data pipelines
  • Bulk document conversion for RAG / LLM pipelines

Input

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| pdfUrls | Array | Yes | Public PDF URLs to process. Must be directly downloadable. |
| openaiApiKey | String | No | Your OpenAI API key (sk-...). Enables AI structuring. Not stored. |
| extractionPrompt | String | No | Custom prompt for the AI structuring pass. Leave blank to use the default (extracts title, author, summary, sections, tables, key fields). |
| model | Select | No | OpenAI model: gpt-4o-mini (default, fast) or gpt-4o (most capable). |
| maxItems | Integer | No | Maximum PDFs to process per run. Default: 15. |
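
As an illustration of how these fields fit together, the sketch below assembles a run input and, optionally, starts a run with the `apify-client` Python package. The helper names `build_run_input` and `run_actor` are hypothetical, `actor_id` is a placeholder you would replace with this actor's ID from the Apify Store, and treating `pdfUrls` as a plain list of strings is an assumption based on the table.

```python
ALLOWED_MODELS = ("gpt-4o-mini", "gpt-4o")


def build_run_input(pdf_urls, openai_api_key=None, extraction_prompt=None,
                    model="gpt-4o-mini", max_items=15):
    """Assemble the actor input using the field names from the input table."""
    urls = [u for u in pdf_urls if u]
    if not urls:
        raise ValueError("pdfUrls is required and must contain at least one URL")
    if model not in ALLOWED_MODELS:
        raise ValueError(f"model must be one of {ALLOWED_MODELS}")
    run_input = {"pdfUrls": urls, "model": model, "maxItems": max_items}
    # Optional fields are left out entirely when they are not set.
    if openai_api_key:
        run_input["openaiApiKey"] = openai_api_key
    if extraction_prompt:
        run_input["extractionPrompt"] = extraction_prompt
    return run_input


def run_actor(apify_token, actor_id, run_input):
    """Start a run and return its dataset items (needs the apify-client package)."""
    # Imported lazily so build_run_input works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(apify_token)
    run = client.actor(actor_id).call(run_input=run_input)
    if run is None:
        raise RuntimeError("actor run did not return a result")
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Leaving `openaiApiKey` out of the input gives native extraction only.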

Output

One dataset record per PDF:

| Field | Type | Description |
| --- | --- | --- |
| sourceUrl | String | Original PDF URL |
| pageCount | Number | Number of pages in the PDF |
| rawText | String | Full extracted text (all pages concatenated) |
| pages | String | JSON-encoded array of per-page text: [{"page": 1, "text": "..."}] |
| structuredJson | String | AI-structured output as a JSON string (null if no API key was supplied or the OpenAI call failed) |
| model | String | OpenAI model used (null if AI pass skipped) |
| processedAt | String | ISO timestamp when processing completed |
| status | String | success or error |
| errorMsg | String | Error message on failure, null on success |

Example record (native extraction only)

{
  "sourceUrl": "https://example.com/invoice-2024-01.pdf",
  "pageCount": 2,
  "rawText": "Invoice #INV-2024-001\nDate: January 15, 2024\n...",
  "pages": "[{\"page\":1,\"text\":\"Invoice #INV-2024-001...\"},{\"page\":2,\"text\":\"Payment terms...\"}]",
  "structuredJson": null,
  "model": null,
  "processedAt": "2026-06-07T12:00:00.000Z",
  "status": "success",
  "errorMsg": null
}

Example record (with AI structuring)

{
  "sourceUrl": "https://example.com/invoice-2024-01.pdf",
  "pageCount": 2,
  "rawText": "Invoice #INV-2024-001\nDate: January 15, 2024\n...",
  "pages": "[{\"page\":1,\"text\":\"Invoice #INV-2024-001...\"}]",
  "structuredJson": "{\"title\":\"Invoice #INV-2024-001\",\"date\":\"January 15, 2024\",\"key_fields\":{\"invoice_number\":\"INV-2024-001\",\"amount\":\"$1,250.00\"}}",
  "model": "gpt-4o-mini",
  "processedAt": "2026-06-07T12:00:00.000Z",
  "status": "success",
  "errorMsg": null
}
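
Because `pages` and `structuredJson` arrive as JSON strings rather than nested objects, a consumer has to decode them a second time. A minimal sketch, assuming records shaped like the examples above (`decode_record` is a hypothetical helper name):

```python
import json


def decode_record(record: dict) -> dict:
    """Return a copy of the record with its JSON-string fields parsed."""
    decoded = dict(record)
    # "pages" is a JSON-encoded array; treat a missing or empty value as no pages.
    decoded["pages"] = json.loads(record["pages"]) if record.get("pages") else []
    # "structuredJson" is null when the AI pass was skipped or failed.
    structured = record.get("structuredJson")
    decoded["structuredJson"] = json.loads(structured) if structured else None
    return decoded
```

With the AI-structured example record, `decode_record(record)["structuredJson"]["key_fields"]["invoice_number"]` evaluates to `"INV-2024-001"`.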

Notes

  • Native extraction works on any text-based PDF (invoices, reports, forms, contracts). Scanned, image-only PDFs return empty text because OCR is not currently supported.
  • AI structuring is additive. Even when the OpenAI call fails (rate limit, invalid key, network error), the actor returns the native extraction record with structuredJson: null rather than failing the run.
  • Custom prompts let you tailor the structuring output for a specific document type. For example: "Extract all line items as an array of {description, quantity, unit_price, total}".
  • File size limit: 50 MB per PDF. Larger files are rejected with an error record.
  • OpenAI costs are billed to your API key separately from actor usage.
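
The notes above imply four outcomes a downstream pipeline may want to tell apart: a failed PDF, a successful record with empty text (possibly a scanned PDF), a native-only record, and a fully structured record. The sketch below shows one way to sort them; `triage` and its labels are hypothetical and not part of the actor's output.

```python
from collections import Counter


def triage(record: dict) -> str:
    """Classify a dataset record by what a consumer can do with it."""
    if record.get("status") == "error":
        return "error"        # download, parse, or size-limit failure
    if not (record.get("rawText") or "").strip():
        return "empty_text"   # possibly a scanned, image-only PDF
    if record.get("structuredJson") is None:
        return "native_only"  # AI pass skipped or failed
    return "structured"


def summarize(records: list[dict]) -> Counter:
    """Count records per triage label."""
    return Counter(triage(r) for r in records)
```

Records labeled `empty_text` are candidates for an external OCR step, and `native_only` records can be retried with a valid OpenAI key.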