PDF to Structured Data (JSON/CSV)
Pricing
from $10.00 / 1,000 PDFs processed
Convert PDF files into clean structured JSON or CSV: text per page, reconstructed lines, optional table detection, and document metadata.
Developer
Rosario Vitale
Maintained by Community
Turn any PDF into clean, structured data your code can actually use. Give the Actor one or more PDF URLs and get back text per page, reconstructed lines in reading order, optional table detection, and document metadata, as JSON or CSV.
No more copy-pasting from PDFs by hand or fighting with brittle local libraries. Send a URL, get structured output.
What it does
- Text extraction: full text of every page, in natural reading order.
- Line reconstruction: text items are grouped by position into real lines, not a jumbled blob.
- Table detection (optional): heuristically splits rows into cells so you can rebuild tables.
- Metadata (optional): title, author, producer and creation date when present.
- Batch: pass many PDF URLs in a single run.
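The line reconstruction step can be pictured with a minimal sketch. This is not the Actor's actual implementation, which this page does not show. The `TextItem` type, the `tolerance` value, and the assumption that `y` grows downward on the page are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    """Hypothetical positioned text fragment, as a PDF parser might emit it."""
    text: str
    x: float  # left edge of the fragment
    y: float  # baseline, assumed to grow downward on the page


def group_into_lines(items: list[TextItem], tolerance: float = 2.0) -> list[str]:
    """Group items whose baseline lies within `tolerance` of the first item
    of the current line, then order each line left to right."""
    lines: list[list[TextItem]] = []
    for item in sorted(items, key=lambda it: (it.y, it.x)):
        if lines and abs(lines[-1][0].y - item.y) <= tolerance:
            lines[-1].append(item)
        else:
            lines.append([item])
    return [
        " ".join(it.text for it in sorted(line, key=lambda it: it.x))
        for line in lines
    ]


items = [
    TextItem("Specialization", 120, 10.4),
    TextItem("Trace-based", 10, 10.0),
    TextItem("Languages", 10, 30.0),
    TextItem("Just-in-Time", 60, 9.8),
]
print(group_into_lines(items))
# ['Trace-based Just-in-Time Specialization', 'Languages']
```

A fixed tolerance is the simplest choice. Real documents with mixed font sizes or multiple columns would likely need something more careful.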
Input
| Field | Type | Description |
|---|---|---|
| `pdfUrls` | array of strings | Direct links to the PDF files (required). |
| `extractTables` | boolean | Detect tables and return rows of cells. Default `false`. |
| `extractMetadata` | boolean | Include document metadata. Default `true`. |
| `maxPages` | integer | Max pages to read per PDF. `0` = all. Default `0`. |
Example input
{
  "pdfUrls": [
    "https://raw.githubusercontent.com/mozilla/pdf.js/master/web/compressed.tracemonkey-pldi-09.pdf"
  ],
  "extractTables": false,
  "extractMetadata": true,
  "maxPages": 0
}
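If you build the input programmatically, a small helper can apply the documented defaults and catch obvious mistakes before a run starts. This is a sketch only: `build_input` is a hypothetical helper, not part of the Actor. Rejecting an empty URL list, non-HTTP(S) URLs, and negative `maxPages` values are assumptions about what counts as sensible input, not documented Actor behavior.

```python
from urllib.parse import urlparse


def build_input(pdf_urls, extract_tables=False, extract_metadata=True, max_pages=0):
    """Build the Actor input dict using the defaults from the Input table."""
    if not pdf_urls:
        raise ValueError("pdfUrls is required and must contain at least one URL")
    for url in pdf_urls:
        parsed = urlparse(url)
        # Assumption: only absolute http(s) URLs are useful as direct links.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
    if max_pages < 0:
        # Assumption: negative page limits are treated as an error here.
        raise ValueError("maxPages must be 0 (all pages) or a positive integer")
    return {
        "pdfUrls": list(pdf_urls),
        "extractTables": bool(extract_tables),
        "extractMetadata": bool(extract_metadata),
        "maxPages": int(max_pages),
    }


run_input = build_input(["https://example.com/report.pdf"], extract_tables=True)
print(run_input)
```

The resulting dict can be pasted into the Actor's input editor as JSON or passed as the run input through the Apify API.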
Output
One dataset item per PDF:
{
  "url": "https://.../document.pdf",
  "success": true,
  "numPages": 14,
  "pagesExtracted": 14,
  "metadata": { "Producer": "pdfeTeX-1.21a", "Creator": "TeX", "CreationDate": "..." },
  "pages": [
    {
      "pageNumber": 1,
      "text": "Trace-based Just-in-Time Type Specialization ...",
      "lines": ["Trace-based Just-in-Time Type Specialization ...", "Languages"],
      "tables": [["Cell A", "Cell B"], ["1", "2"]]
    }
  ],
  "fullText": "Trace-based Just-in-Time Type Specialization ..."
}
Export the dataset as JSON, CSV, Excel, or HTML straight from the run, or pull it through the Apify API.
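Because each dataset item nests pages and lines, you may prefer a flat, one-row-per-line CSV for spreadsheets. The sketch below flattens items shaped like the example output. `lines_to_csv` and its column names are hypothetical, and it assumes that failed items (`success: false`) carry no `pages` field.

```python
import csv
import io


def lines_to_csv(items):
    """Write one CSV row per reconstructed line: url, pageNumber, lineNumber, line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["url", "pageNumber", "lineNumber", "line"])
    for item in items:
        if not item.get("success"):
            continue  # assumption: failed items have nothing to flatten
        for page in item.get("pages", []):
            for line_number, line in enumerate(page.get("lines", []), start=1):
                writer.writerow([item["url"], page["pageNumber"], line_number, line])
    return buffer.getvalue()


sample = [
    {
        "url": "https://example.com/document.pdf",
        "success": True,
        "pages": [{"pageNumber": 1, "text": "Title\nLanguages", "lines": ["Title", "Languages"]}],
    },
    {"url": "https://example.com/missing.pdf", "success": False},
]
print(lines_to_csv(sample))
```

Using the `csv` module rather than string joins keeps lines that contain commas or quotes correctly escaped.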
Common use cases
- Extract data from invoices, receipts, price lists, and bank statements.
- Feed PDF text into search, RAG pipelines, or LLMs.
- Turn reports and catalogs into spreadsheets.
- Archive and index document text at scale.
Notes & limits
- Works on text-based PDFs. Scanned/image-only PDFs contain no selectable text, so they need OCR (not included in this version).
- Table detection is a position-based heuristic: great for clean, grid-like tables, approximate for complex layouts.
- `pdfUrls` must be direct links to the PDF file (not a viewer page).
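To see why a position-based heuristic suits grid-like tables but struggles with complex layouts, consider a minimal sketch that splits one line into cells wherever the horizontal gap between words is wide. This is an illustration under stated assumptions, not the Actor's code: the `(text, x_start, x_end)` word tuples, the function name, and the `gap_threshold` value are all hypothetical.

```python
def split_row_into_cells(words, gap_threshold=15.0):
    """Split one line of (text, x_start, x_end) words into cells at wide gaps."""
    cells = []
    current = []
    previous_end = None
    for text, x_start, x_end in sorted(words, key=lambda w: w[1]):
        # A gap wider than the threshold is taken as a column boundary.
        if previous_end is not None and x_start - previous_end > gap_threshold:
            cells.append(" ".join(current))
            current = []
        current.append(text)
        previous_end = x_end
    if current:
        cells.append(" ".join(current))
    return cells


row = [("Unit", 10, 30), ("price", 33, 60), ("Qty", 120, 140), ("Total", 200, 230)]
print(split_row_into_cells(row))
# ['Unit price', 'Qty', 'Total']
```

With evenly spaced columns the gaps are unambiguous. Tightly packed columns, merged cells, or wrapped cell text blur the gaps, which is where a heuristic of this kind becomes approximate.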
Pricing
Pay-per-result: you are billed per PDF successfully processed. Failed downloads or parses are returned with `success: false` and are not charged.
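As a rough way to estimate the charge for a run, you can count the items with `success: true` and apply the per-1,000 rate. The sketch below uses the listed "from $10.00 / 1,000" figure as a default. The actual rate for your account may differ, and `estimate_cost` is a hypothetical helper, not an Apify API.

```python
from decimal import Decimal

# Listed "from" price per 1,000 PDFs; treat it as an assumption, not a quote.
PRICE_PER_1000 = Decimal("10.00")


def estimate_cost(items, price_per_1000=PRICE_PER_1000):
    """Return (billable_count, estimated_cost); only successful items are billed."""
    billable = sum(1 for item in items if item.get("success") is True)
    return billable, billable * price_per_1000 / 1000


# 240 successful PDFs and 10 failures: only the 240 count toward the charge.
items = [{"success": True}] * 240 + [{"success": False}] * 10
billable, cost = estimate_cost(items)
print(f"{billable} billable PDFs, estimated ${cost:.2f}")
```

At the listed rate each successful PDF works out to $0.01, so 240 successful PDFs come to about $2.40.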