Financial Table Extractor for PDFs

Pricing

Pay per usage

Extract annual-report and 10-K table rows from PDF URLs into typed JSON with page, quote, and cell bbox evidence. Runs self-contained on Apify; no Okra API key required.

Developer

Steven

Maintained by Community

Extract finance and annual-report table rows from PDFs into normalized JSON with typed values, page references, source quotes, and bounding-box evidence.

This actor is for workflows where plain PDF text is not enough. It targets tables where exact numeric values, row labels, units, missing values, and citation evidence matter.

It runs fully inside the Apify actor container. It does not require an okraPDF account, Okra API key, external OCR service, or LLM API.

How It Parses PDFs

The actor:

  1. Downloads each public PDF URL into temporary actor storage.
  2. Opens the PDF locally with pdfplumber, which uses pdfminer.six for born-digital PDF text and geometry.
  3. Extracts words with coordinates, then groups them into lines by page position.
  4. Finds the requested table captions from tableHints.
  5. Collects nearby numeric rows and normalizes currency symbols, percentages, parenthesized negatives, dashes (treated as nulls), and note columns.
  6. Infers columns from nearby headers and emits normalized rows plus page, quote, row bbox, and cell bbox evidence.
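
Steps 3 and 5 above can be sketched in a few lines. This is an illustrative approximation, not the actor's actual code: the function names `group_words_into_lines` and `parse_cell`, the `y_tolerance` default, and the exact cleaning rules are assumptions. The input is a list of word dicts with `text`, `x0`, and `top` keys, which is the shape pdfplumber's `extract_words()` returns.

```python
import re
from typing import Optional

def group_words_into_lines(words, y_tolerance=3.0):
    """Group word dicts (keys: text, x0, top) into lines by vertical position."""
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        # Same line if the word's top is close to the top of the line's first word.
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order words left to right.
    return [sorted(line, key=lambda w: w["x0"]) for line in lines]

# Hyphen, en dash, and em dash are treated as "no value" placeholders (assumption).
_DASHES = {"-", "\u2013", "\u2014"}

def parse_cell(token: str) -> Optional[float]:
    """Convert one raw cell token to a float, or None for a dash placeholder.

    Raises ValueError if the token is not numeric after cleaning.
    """
    text = token.strip()
    if text in _DASHES:
        return None
    # Accounting convention: (1,234) means -1234.
    negative = text.startswith("(") and text.endswith(")")
    cleaned = re.sub(r"[()$€£%,\s]", "", text)
    value = float(cleaned)
    return -value if negative else value
```

With real input, the `words` list could come from `page.extract_words()` on a pdfplumber page. A production parser would also need to track units and distinguish note references from values, which this sketch does not attempt.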

No document bytes are sent to an Okra backend.

Telemetry

The actor sends anonymous run telemetry to okraPDF analytics (PostHog) on each run: run status, document/table/row counts, duration, and actor version. It never sends PDF content, URLs, file names, or extracted values. This lets okraPDF understand how the actor is used. To opt out, set the environment variable OKRA_TELEMETRY=0.

Best Use Cases

  • Annual-report financial statements
  • 10-K segment revenue and operating-income tables
  • Balance sheet, profit and loss, and cash flow tables
  • Investor-presentation KPI tables
  • Benchmark tables where each row needs cell-level evidence

Validated table hints include:

  • Revenue by Reportable Segments
  • Operating Income by Reportable Segments
  • Balance sheet
  • Profit and loss account
  • Statement of cash flows
  • Table 1, Table 2, etc.

Dense scientific tables with multi-band headers are supported on a best-effort basis. The strongest validated path is annual-report and financial-statement extraction.

Input

Provide direct PDF URLs and one or more table titles/captions to extract.

{
  "pdfUrls": ["https://annualreports.ai/wp-content/uploads/10k-form-nvidia-2024.pdf"],
  "tableHints": ["Revenue by Reportable Segments"],
  "maxPages": 120,
  "output": {
    "dataset": true,
    "storeJsonKey": "financial-tables.json"
  }
}

Annual-report statement example:

{
  "pdfUrls": ["https://www.bis.org/about/areport/areport2024.pdf"],
  "tableHints": ["Balance sheet", "Profit and loss account", "Statement of cash flows"],
  "maxPages": 181
}

Academic benchmark table example:

{
  "pdfUrls": ["https://arxiv.org/pdf/2509.18965"],
  "tableHints": ["Table 3", "Table 4"],
  "maxPages": 12
}
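
Inputs like the ones above can also be submitted programmatically. The following is a minimal sketch, assuming the third-party `apify-client` Python package and an API token in an `APIFY_TOKEN` environment variable. The actor ID is a placeholder because this page does not state it, and `build_run_input` and `run_actor` are hypothetical helper names. Reading results from the run's default dataset is likewise an assumption based on the `"dataset": true` output option.

```python
import os

# Placeholder: substitute the actor's real ID from its Apify Store page.
ACTOR_ID = "<username>/<actor-name>"

def build_run_input(pdf_urls, table_hints, max_pages=120):
    """Assemble the input object using the field names documented above."""
    return {
        "pdfUrls": list(pdf_urls),
        "tableHints": list(table_hints),
        "maxPages": max_pages,
    }

def run_actor(run_input):
    """Start the actor, wait for it to finish, and return its dataset items."""
    # Imported here so build_run_input works without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())

run_input = build_run_input(
    ["https://www.bis.org/about/areport/areport2024.pdf"],
    ["Balance sheet", "Profit and loss account", "Statement of cash flows"],
    max_pages=181,
)
# results = run_actor(run_input)  # requires network access and a valid token
```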

Output

Each result contains:

  • document: source metadata.
  • tables: normalized table objects.
  • rows: one row per table line item.
  • values: typed numeric values keyed by inferred column names.
  • evidence: page number, table bbox, row bbox, original quote, and cell bboxes.

Example row:

{
  "label": "Total assets",
  "values": {
    "2024": 379155.4,
    "2023": 350309.6
  },
  "evidence": {
    "page": 177,
    "table_title": "Balance sheet",
    "quote": "Total assets 379,155.4 350,309.6"
  }
}
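
Because each row carries its source quote, a consumer can cross-check typed values against the evidence before trusting them. The sketch below is one possible check, not part of the actor: `numbers_in_quote` and `unsupported_values` are hypothetical helpers, and the number pattern assumes comma thousands separators and parenthesized negatives.

```python
import re

# Matches tokens such as 379,155.4, -12.5, or (1,234); always contains a digit.
NUMBER = re.compile(r"\(?-?\d[\d,]*(?:\.\d+)?\)?")

def numbers_in_quote(quote):
    """Parse every numeric token in an evidence quote into a float."""
    values = []
    for token in NUMBER.findall(quote):
        negative = token.startswith("(") and token.endswith(")")
        number = float(token.strip("()").replace(",", ""))
        values.append(-number if negative else number)
    return values

def unsupported_values(row):
    """Return the column names whose value does not appear in the quote."""
    quoted = numbers_in_quote(row["evidence"]["quote"])
    return [
        column
        for column, value in row["values"].items()
        if value is not None and not any(abs(value - q) < 1e-9 for q in quoted)
    ]

row = {
    "label": "Total assets",
    "values": {"2024": 379155.4, "2023": 350309.6},
    "evidence": {
        "page": 177,
        "table_title": "Balance sheet",
        "quote": "Total assets 379,155.4 350,309.6",
    },
}
print(unsupported_values(row))  # an empty list means every value is backed by the quote
```

A non-empty result flags rows worth inspecting by hand, for example when a unit scale was applied to the value but not reflected in the quoted text.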

For NVIDIA's 2024 10-K reportable-segment table, the actor also emits compatibility fields such as jan_28_2024_millions, jan_29_2023_millions, dollar_change_millions, and percent_change.

Validation

The actor was benchmarked against existing Apify PDF actors and source-specific disclosure actors. Plain PDF text actors are useful for RAG chunks, but they do not return typed financial rows with page/cell evidence. General PDF-to-markdown actors can expose table text but often lose row/column alignment on financial tables.

Remote Apify validation includes:

Source | Output
NVIDIA 2024 10-K | 1 table, 3 rows, segment revenue values validated
BIS Annual Report 2023/24 | 3 tables, 59 rows, balance sheet/profit/cash flow values validated
arXiv benchmark paper | 2 tables, 16 rows, validated against arXiv HTML
Federal Register negative control | 0 tables, expected warning

Known limitations:

  • Scanned PDFs without embedded text are not OCR'd by this actor.
  • Very dense tables with multi-band scientific headers may need post-processing.
  • Table extraction is guided by tableHints; this actor does not yet discover every table automatically.

Development Verification

The actor is tested with focused regression cases for financial-statement parsing, caption false positives, note-column stripping, missing values, and wrapped labels.