Pricing

Pay per usage

Financial Table Extractor for PDFs

Extract annual-report and 10-K table rows from PDF URLs into typed JSON with page, quote, and cell bbox evidence. Runs self-contained on Apify; no Okra API key required.

Developer

Steven

Last modified

25 days ago

How It Parses PDFs

The actor:

Downloads each public PDF URL into temporary actor storage.
Opens the PDF locally with pdfplumber, which uses pdfminer.six for born-digital PDF text and geometry.
Extracts words with coordinates, then groups them into lines by page position.
Finds the requested table captions from tableHints.
Collects nearby numeric rows and handles currency symbols, percentages, parenthesized negatives, dashes as nulls, and note columns.
Infers columns from nearby headers and emits normalized rows plus page, quote, row bbox, and cell bbox evidence.
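
The line-grouping and caption-matching steps above can be sketched in plain Python. This is a minimal illustration, not the actor's actual code: the function names, the 3-point vertical tolerance, and the word-boundary matching rule are assumptions. The word dicts use the keys that pdfplumber's Page.extract_words() returns (text, x0, x1, top, bottom), so the sample data stands in for a real page.

```python
import re


def group_words_into_lines(words, y_tolerance=3.0):
    """Group word dicts into text lines by vertical position.

    y_tolerance is a hypothetical threshold: a word joins the current
    line if its top is within this distance of the line's first word.
    """
    groups = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if groups and abs(word["top"] - groups[-1]["top"]) <= y_tolerance:
            groups[-1]["words"].append(word)
        else:
            groups.append({"top": word["top"], "words": [word]})

    lines = []
    for group in groups:
        ordered = sorted(group["words"], key=lambda w: w["x0"])
        lines.append({
            "text": " ".join(w["text"] for w in ordered),
            # Line bbox is the union of its word boxes: (x0, top, x1, bottom).
            "bbox": (
                min(w["x0"] for w in ordered),
                min(w["top"] for w in ordered),
                max(w["x1"] for w in ordered),
                max(w["bottom"] for w in ordered),
            ),
        })
    return lines


def find_captions(lines, table_hints):
    """Return (hint, line_index) pairs where a hint appears as whole words."""
    matches = []
    for index, line in enumerate(lines):
        normalized = " ".join(line["text"].lower().split())
        for hint in table_hints:
            needle = " ".join(hint.lower().split())
            # Word boundaries keep "Table 1" from matching "Table 12".
            if re.search(r"\b" + re.escape(needle) + r"\b", normalized):
                matches.append((hint, index))
    return matches


# Sample words shaped like pdfplumber output; coordinates are made up.
words = [
    {"text": "Balance", "x0": 72.0, "x1": 110.0, "top": 90.0, "bottom": 100.0},
    {"text": "sheet", "x0": 113.0, "x1": 138.0, "top": 90.2, "bottom": 100.2},
    {"text": "Total", "x0": 72.0, "x1": 95.0, "top": 120.0, "bottom": 130.0},
    {"text": "assets", "x0": 98.0, "x1": 126.0, "top": 120.1, "bottom": 130.1},
    {"text": "379,155.4", "x0": 300.0, "x1": 345.0, "top": 119.8, "bottom": 129.8},
    {"text": "350,309.6", "x0": 400.0, "x1": 445.0, "top": 120.0, "bottom": 130.0},
]

lines = group_words_into_lines(words)
captions = find_captions(lines, ["Balance sheet"])
for line in lines:
    print(line["text"], line["bbox"])
print(captions)
```

With a real document, the words list would come from calling extract_words() on each pdfplumber page. The union bbox per line is one plausible way to arrive at the row bbox evidence the actor reports.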

No document bytes are sent to an Okra backend.

Telemetry

The actor sends anonymous run telemetry to okraPDF analytics (PostHog) on each run: run status, document/table/row counts, duration, and actor version — never PDF content, URLs, file names, or extracted values. This lets okraPDF understand how the actor is used. Opt out by setting the environment variable OKRA_TELEMETRY=0.

Best Use Cases

Annual-report financial statements
10-K segment revenue and operating-income tables
Balance sheet, profit and loss, and cash flow tables
Investor-presentation KPI tables
Benchmark tables where each row needs cell-level evidence

Validated table hints include:

Revenue by Reportable Segments
Operating Income by Reportable Segments
Balance sheet
Profit and loss account
Statement of cash flows
Table 1, Table 2, etc.

Dense scientific tables with multi-band headers are supported on a best-effort basis. The strongest validated path is annual-report and financial-statement extraction.

Input

Provide direct PDF URLs and one or more table titles/captions to extract.

{
  "pdfUrls": ["https://annualreports.ai/wp-content/uploads/10k-form-nvidia-2024.pdf"],
  "tableHints": ["Revenue by Reportable Segments"],
  "maxPages": 120,
  "output": {
    "dataset": true,
    "storeJsonKey": "financial-tables.json"
  }
}

Annual-report statement example:

{
  "pdfUrls": ["https://www.bis.org/about/areport/areport2024.pdf"],
  "tableHints": ["Balance sheet", "Profit and loss account", "Statement of cash flows"],
  "maxPages": 181
}

Academic benchmark table example:

{
  "pdfUrls": ["https://arxiv.org/pdf/2509.18965"],
  "tableHints": ["Table 3", "Table 4"],
  "maxPages": 12
}
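
The three inputs above share one shape, so they can be assembled programmatically. The sketch below is a hypothetical helper, not part of the actor: build_run_input and its validation rules (http(s) URLs only, at least one non-empty hint, positive maxPages) are assumptions about what a sensible client-side check might look like. Only the field names come from the examples above.

```python
import json
from urllib.parse import urlparse


def build_run_input(pdf_urls, table_hints, max_pages=None, store_json_key=None):
    """Assemble an input dict using the field names shown in the examples."""
    if not pdf_urls:
        raise ValueError("pdfUrls must contain at least one URL")
    for url in pdf_urls:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not a direct http(s) URL: {url!r}")

    # Drop blank hints; the actor needs at least one caption to look for.
    hints = [hint.strip() for hint in table_hints if hint.strip()]
    if not hints:
        raise ValueError("tableHints must contain at least one caption")

    run_input = {"pdfUrls": list(pdf_urls), "tableHints": hints}
    if max_pages is not None:
        if max_pages < 1:
            raise ValueError("maxPages must be a positive integer")
        run_input["maxPages"] = max_pages
    if store_json_key is not None:
        run_input["output"] = {"dataset": True, "storeJsonKey": store_json_key}
    return run_input


run_input = build_run_input(
    ["https://www.bis.org/about/areport/areport2024.pdf"],
    ["Balance sheet", "Profit and loss account", "Statement of cash flows"],
    max_pages=181,
)
print(json.dumps(run_input, indent=2))
```

The resulting dict serializes to the same JSON as the annual-report example and can be pasted into the actor's input editor or passed to whichever client is used to start the run.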

Output

Each result contains:

document: source metadata.
tables: normalized table objects.
rows: one row per table line item.
values: typed numeric values keyed by inferred column names.
evidence: page number, table bbox, row bbox, original quote, and cell bboxes.
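
For downstream analysis it is often convenient to flatten a result into one record per numeric cell. The sketch below assumes that rows are nested under each entry of tables and that every row carries label, values, and evidence; the exact nesting is not spelled out here, so treat that layout and the iter_cells helper as assumptions to adjust against real output.

```python
def iter_cells(result):
    """Yield one flat record per numeric cell in a result dict.

    Assumed layout: result["tables"][i]["rows"][j] holds "label",
    "values" (column name -> number), and "evidence".
    """
    for table in result.get("tables", []):
        for row in table.get("rows", []):
            evidence = row.get("evidence", {})
            for column, value in row.get("values", {}).items():
                yield {
                    "table_title": evidence.get("table_title"),
                    "label": row.get("label"),
                    "column": column,
                    "value": value,
                    "page": evidence.get("page"),
                }


# Sample result with a single balance-sheet row.
result = {
    "document": {},
    "tables": [
        {
            "rows": [
                {
                    "label": "Total assets",
                    "values": {"2024": 379155.4, "2023": 350309.6},
                    "evidence": {
                        "page": 177,
                        "table_title": "Balance sheet",
                        "quote": "Total assets 379,155.4 350,309.6",
                    },
                }
            ]
        }
    ],
}

cells = list(iter_cells(result))
for cell in cells:
    print(cell)
```

Each record keeps the page number next to the value, so a spreadsheet or database built from these records can still point back to the source page.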

Example row:

{
  "label": "Total assets",
  "values": {
    "2024": 379155.4,
    "2023": 350309.6
  },
  "evidence": {
    "page": 177,
    "table_title": "Balance sheet",
    "quote": "Total assets 379,155.4 350,309.6"
  }
}
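
Because each row carries its original quote, the typed values can be re-derived from the evidence and compared. The following is a minimal sketch of that check, assuming simple token rules (thousands separators, currency symbols, a trailing percent sign, parentheses for negatives, a lone dash for null); parse_cell and split_row are hypothetical helpers, and the actor's own parsing rules may differ.

```python
import re

# A lone hyphen, en dash, or em dash is treated as a missing value.
NULL_TOKENS = {"-", "\u2013", "\u2014"}
NUMERIC = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def parse_cell(token):
    """Convert one table token to a float, or None for a null marker.

    Raises ValueError when the token is not numeric. Percentages are
    returned as plain numbers, so "12.5%" becomes 12.5.
    """
    text = token.strip()
    if text in NULL_TOKENS:
        return None
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    text = text.lstrip("$\u20ac\u00a3").rstrip("%")
    if not NUMERIC.fullmatch(text):
        raise ValueError(f"not a numeric cell: {token!r}")
    value = float(text.replace(",", ""))
    return -value if negative else value


def split_row(quote):
    """Split a row quote into (label, values) by peeling numeric tokens off the right."""
    tokens = quote.split()
    values = []
    while tokens:
        try:
            values.append(parse_cell(tokens[-1]))
        except ValueError:
            break
        tokens.pop()
    values.reverse()
    return " ".join(tokens), values


row = {
    "label": "Total assets",
    "values": {"2024": 379155.4, "2023": 350309.6},
    "evidence": {"quote": "Total assets 379,155.4 350,309.6"},
}

label, values = split_row(row["evidence"]["quote"])
consistent = label == row["label"] and values == list(row["values"].values())
print(label, values, consistent)
```

A check like this only works when the label does not itself end in a number (for example a trailing note reference), which is one reason note columns need separate handling.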

For NVIDIA's 2024 10-K reportable-segment table, the actor also emits compatibility fields such as jan_28_2024_millions, jan_29_2023_millions, dollar_change_millions, and percent_change.

Validation

The actor was benchmarked against existing Apify PDF actors and source-specific disclosure actors. Plain PDF text actors are useful for RAG chunks, but they do not return typed financial rows with page/cell evidence. General PDF-to-markdown actors can expose table text but often lose row/column alignment on financial tables.

Remote Apify validation includes:

Source	Output
NVIDIA 2024 10-K	1 table, 3 rows, segment revenue values validated
BIS Annual Report 2023/24	3 tables, 59 rows, balance sheet/profit/cash flow values validated
arXiv benchmark paper	2 tables, 16 rows, validated against arXiv HTML
Federal Register negative control	0 tables, expected warning

Known limitations:

Scanned PDFs without embedded text are not OCR'd by this actor.
Very dense tables with multi-band scientific headers may need post-processing.
Table extraction is guided by tableHints; this actor does not yet discover every table automatically.

Development Verification

The actor is tested with focused regression cases for financial-statement parsing, caption false positives, note-column stripping, missing values, and wrapped labels.
