PDF to Structured Data (JSON/CSV) avatar

PDF to Structured Data (JSON/CSV)

Pricing

from $10.00 / 1,000 pdf processeds

Go to Apify Store
PDF to Structured Data (JSON/CSV)

PDF to Structured Data (JSON/CSV)

Convert PDF files into clean structured JSON or CSV: text per page, reconstructed lines, optional table detection, and document metadata.

Pricing

from $10.00 / 1,000 pdf processeds

Rating

0.0

(0)

Developer

Rosario Vitale

Rosario Vitale

Maintained by Community

Actor stats

0

Bookmarked

2

Total users

1

Monthly active users

2 days ago

Last modified

Share

Turn any PDF into clean, structured data your code can actually use. Give the Actor one or more PDF URLs and get back text per page, reconstructed lines in reading order, optional table detection, and document metadata β€” as JSON or CSV.

No more copy-pasting from PDFs by hand or fighting with brittle local libraries. Send a URL, get structured output.

What it does

  • πŸ“„ Text extraction β€” full text of every page, in natural reading order.
  • πŸ“ Line reconstruction β€” text items are grouped by position into real lines, not a jumbled blob.
  • πŸ“Š Table detection (optional) β€” heuristically splits rows into cells so you can rebuild tables.
  • 🏷️ Metadata (optional) β€” title, author, producer and creation date when present.
  • πŸ” Batch β€” pass many PDF URLs in a single run.

Input

FieldTypeDescription
pdfUrlsarray of stringsDirect links to the PDF files (required).
extractTablesbooleanDetect tables and return rows of cells. Default false.
extractMetadatabooleanInclude document metadata. Default true.
maxPagesintegerMax pages to read per PDF. 0 = all. Default 0.

Example input

{
"pdfUrls": [
"https://raw.githubusercontent.com/mozilla/pdf.js/master/web/compressed.tracemonkey-pldi-09.pdf"
],
"extractTables": false,
"extractMetadata": true,
"maxPages": 0
}

Output

One dataset item per PDF:

{
"url": "https://.../document.pdf",
"success": true,
"numPages": 14,
"pagesExtracted": 14,
"metadata": { "Producer": "pdfeTeX-1.21a", "Creator": "TeX", "CreationDate": "..." },
"pages": [
{
"pageNumber": 1,
"text": "Trace-based Just-in-Time Type Specialization ...",
"lines": ["Trace-based Just-in-Time Type Specialization ...", "Languages"],
"tables": [["Cell A", "Cell B"], ["1", "2"]]
}
],
"fullText": "Trace-based Just-in-Time Type Specialization ..."
}

Export the dataset as JSON, CSV, Excel, or HTML straight from the run, or pull it through the Apify API.

Common use cases

  • Extract data from invoices, receipts, price lists, and bank statements.
  • Feed PDF text into search, RAG pipelines, or LLMs.
  • Turn reports and catalogs into spreadsheets.
  • Archive and index document text at scale.

Notes & limits

  • Works on text-based PDFs. Scanned/image-only PDFs contain no selectable text, so they need OCR (not included in this version).
  • Table detection is a position-based heuristic β€” great for clean, grid-like tables, approximate for complex layouts.
  • pdfUrls must be direct links to the PDF file (not a viewer page).

Pricing

Pay-per-result: you are billed per PDF successfully processed. Failed downloads/parses are returned with success: false and are not charged.