PDF to JSON Parser
Pricing
Pay per event
Convert PDF documents into structured JSON. Extracts text, tables, and fields from any PDF URL. Optional AI structuring pass (BYO OpenAI key) turns raw text into clean, organized JSON ready for automation or analysis.
Developer
BowTiedRaccoon
Maintained by Community
Convert PDF documents into structured JSON. Supply a list of public PDF URLs — the actor downloads each file, extracts text from every page, and returns clean, organized output. Add your OpenAI API key to get an AI-powered structuring pass that turns raw text into categorized JSON fields.
What it does
- Accepts a list of public PDF URLs (up to 50 MB per file)
- Downloads each PDF to temporary storage and extracts text per page using native PDF parsing
- Processes every page for complete coverage — no pages skipped
- Optionally runs an AI structuring pass (OpenAI GPT-4o-mini or GPT-4o) that organizes the raw text into titled sections, tables, key fields, and metadata
- Returns one dataset record per PDF with the full extracted text, per-page breakdown, and AI output
- Saves error records for PDFs that fail to download or parse — the run continues
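The continue-on-failure behavior in the last bullet can be sketched as below. This only illustrates the pattern and is not the actor's actual code: `process_all` and the injected `extract` callable are hypothetical names, and the record fields follow the output schema documented in this README.

```python
from datetime import datetime, timezone
from typing import Callable


def process_all(urls: list[str], extract: Callable[[str], list[str]]) -> list[dict]:
    """Run `extract` on each URL; failures become error records, not exceptions.

    `extract` is a hypothetical callable that returns one text string per page
    and raises on a download or parse failure.
    """
    records = []
    for url in urls:
        record = {"sourceUrl": url, "status": "success", "errorMsg": None}
        try:
            page_texts = extract(url)
            record["pageCount"] = len(page_texts)
            record["rawText"] = "\n".join(page_texts)
        except Exception as exc:
            # One bad PDF must not stop the run: record the error and move on.
            record["status"] = "error"
            record["errorMsg"] = str(exc)
        record["processedAt"] = datetime.now(timezone.utc).isoformat()
        records.append(record)
    return records
```

Passing the extractor in as a parameter keeps the loop independent of any particular PDF library, so the error handling can be exercised with a stub.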
Use cases
- Invoice and receipt extraction for accounting automation
- Contract and legal document analysis
- Academic paper indexing and summarization
- Form data extraction from government or regulatory PDFs
- Report parsing for data pipelines
- Bulk document conversion for RAG / LLM pipelines
Input
| Field | Type | Required | Description |
|---|---|---|---|
| `pdfUrls` | Array | Yes | Public PDF URLs to process. Must be directly downloadable. |
| `openaiApiKey` | String | No | Your OpenAI API key (`sk-...`). Enables AI structuring. Not stored. |
| `extractionPrompt` | String | No | Custom prompt for the AI structuring pass. Leave blank to use the default (extracts title, author, summary, sections, tables, key fields). |
| `model` | Select | No | OpenAI model: `gpt-4o-mini` (default, fast) or `gpt-4o` (most capable). |
| `maxItems` | Integer | No | Maximum PDFs to process per run. Default: 15. |
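As a minimal sketch, an input object with these fields could be assembled and sanity-checked before starting a run. The helper name `build_input` and its checks are illustrative assumptions; only the field names, model options, and the default of 15 come from the table above.

```python
from urllib.parse import urlparse

ALLOWED_MODELS = {"gpt-4o-mini", "gpt-4o"}


def build_input(pdf_urls, openai_api_key=None, extraction_prompt=None,
                model="gpt-4o-mini", max_items=15):
    """Assemble an input object using the field names from the input table."""
    if not pdf_urls:
        raise ValueError("pdfUrls is required and must not be empty")
    for url in pdf_urls:
        if urlparse(url).scheme not in ("http", "https"):
            raise ValueError(f"not an http(s) URL: {url}")
    if model not in ALLOWED_MODELS:
        raise ValueError(f"unsupported model: {model}")

    run_input = {"pdfUrls": list(pdf_urls), "model": model, "maxItems": max_items}
    # The API key and custom prompt are optional, so add them only when set.
    if openai_api_key:
        run_input["openaiApiKey"] = openai_api_key
    if extraction_prompt:
        run_input["extractionPrompt"] = extraction_prompt
    return run_input
```

The returned dictionary can then be serialized to JSON and supplied as the run input.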
Output
One dataset record per PDF:
| Field | Type | Description |
|---|---|---|
| `sourceUrl` | String | Original PDF URL |
| `pageCount` | Number | Number of pages in the PDF |
| `rawText` | String | Full extracted text (all pages concatenated) |
| `pages` | String | JSON array of per-page text: `[{"page": 1, "text": "..."}]` |
| `structuredJson` | String | AI-structured output as JSON string (`null` if no API key supplied) |
| `model` | String | OpenAI model used (`null` if AI pass skipped) |
| `processedAt` | String | ISO timestamp when processing completed |
| `status` | String | `success` or `error` |
| `errorMsg` | String | Error message on failure, `null` on success |
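Because `pages` and `structuredJson` are delivered as JSON strings rather than nested objects, a consumer has to decode them a second time. The sketch below shows one way to do that; `decode_record` is a hypothetical helper, and it assumes that a non-null `structuredJson` always contains valid JSON.

```python
import json


def decode_record(record: dict) -> dict:
    """Return a copy of a dataset record with its JSON-string fields parsed."""
    decoded = dict(record)
    # `pages` is documented as a JSON array serialized to a string.
    pages = record.get("pages")
    decoded["pages"] = json.loads(pages) if pages else []
    # `structuredJson` is null when the AI pass was skipped or did not succeed.
    structured = record.get("structuredJson")
    decoded["structuredJson"] = json.loads(structured) if structured else None
    return decoded
```

After decoding, per-page text is available as `decoded["pages"][0]["text"]`, and any AI-extracted fields can be read as ordinary dictionary keys.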
Example record (native extraction only)
```json
{
  "sourceUrl": "https://example.com/invoice-2024-01.pdf",
  "pageCount": 2,
  "rawText": "Invoice #INV-2024-001\nDate: January 15, 2024\n...",
  "pages": "[{\"page\":1,\"text\":\"Invoice #INV-2024-001...\"},{\"page\":2,\"text\":\"Payment terms...\"}]",
  "structuredJson": null,
  "model": null,
  "processedAt": "2026-06-07T12:00:00.000Z",
  "status": "success",
  "errorMsg": null
}
```
Example record (with AI structuring)
```json
{
  "sourceUrl": "https://example.com/invoice-2024-01.pdf",
  "pageCount": 2,
  "rawText": "Invoice #INV-2024-001\nDate: January 15, 2024\n...",
  "pages": "[{\"page\":1,\"text\":\"Invoice #INV-2024-001...\"}]",
  "structuredJson": "{\"title\":\"Invoice #INV-2024-001\",\"date\":\"January 15, 2024\",\"key_fields\":{\"invoice_number\":\"INV-2024-001\",\"amount\":\"$1,250.00\"}}",
  "model": "gpt-4o-mini",
  "processedAt": "2026-06-07T12:00:00.000Z",
  "status": "success",
  "errorMsg": null
}
```
Notes
- Native extraction works on any text-based PDF (invoices, reports, forms, contracts). Scanned image-only PDFs return empty text — OCR for image PDFs is not currently supported.
- AI structuring is additive. Even when the OpenAI call fails (rate limit, invalid key, network error), the actor returns the native extraction record with `structuredJson: null` rather than failing the run.
- Custom prompts let you tailor the structuring output for a specific document type. For example: `"Extract all line items as an array of {description, quantity, unit_price, total}"`.
- File size limit: 50 MB per PDF. Larger files are rejected with an error record.
- OpenAI costs are billed to your API key separately from actor usage.
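The notes above imply four outcomes a downstream consumer may want to tell apart. The following sketch sorts records into those buckets; the function name `classify` and the bucket labels are illustrative assumptions, not part of the actor's output.

```python
def classify(record: dict) -> str:
    """Label a dataset record by how far processing got."""
    if record.get("status") == "error":
        # Download or parse failure, or a file over the 50 MB limit.
        return "failed"
    if not (record.get("rawText") or "").strip():
        # Success with empty text: possibly a scanned, image-only PDF.
        return "no_text"
    if record.get("structuredJson") is None:
        # AI pass skipped (no key) or failed; native extraction still usable.
        return "native_only"
    return "structured"
```

Records labeled `no_text` are candidates for an external OCR step, and `native_only` records can still be used through `rawText` and `pages`.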