Financial Table Extractor for PDFs
Pricing
Pay per usage
Extract annual-report and 10-K table rows from PDF URLs into typed JSON with page, quote, and cell bbox evidence. Runs self-contained on Apify; no Okra API key required.
Developer
Steven
Maintained by Community
Extract finance and annual-report table rows from PDFs into normalized JSON with typed values, page references, source quotes, and bounding-box evidence.
This actor is for workflows where plain PDF text is not enough. It targets tables where exact numeric values, row labels, units, missing values, and citation evidence matter.
It runs fully inside the Apify actor container. It does not require an okraPDF account, Okra API key, external OCR service, or LLM API.
How It Parses PDFs
The actor:
- Downloads each public PDF URL into temporary actor storage.
- Opens the PDF locally with `pdfplumber`, which uses `pdfminer.six` for born-digital PDF text and geometry.
- Extracts words with coordinates, then groups them into lines by page position.
- Finds the requested table captions from `tableHints`.
- Collects nearby numeric rows, handling currency, percentages, parenthesized negatives, dashes as nulls, and note columns.
- Infers columns from nearby headers and emits normalized rows plus page, quote, row bbox, and cell bbox evidence.
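The line-grouping and numeric-normalization steps above can be sketched roughly as follows. This is an illustration only, not the actor's code: it assumes word dicts shaped like the output of pdfplumber's `extract_words()` (keys `text`, `x0`, `top`), and the function names, the `y_tolerance` default, and the choice to keep percentages as percentage points are all hypothetical.

```python
import re
from typing import Optional

# Cell texts treated as missing values (hyphen, en dash, em dash, "n/a").
NULL_MARKERS = {"", "-", "\u2013", "\u2014", "n/a"}


def group_words_into_lines(words, y_tolerance=3.0):
    """Group word dicts into lines by vertical position (assumed heuristic)."""
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        # Same line if the word's top is close to the line's first word.
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Order each line left to right.
    return [sorted(line, key=lambda w: w["x0"]) for line in lines]


def parse_cell(text: str) -> Optional[float]:
    """Convert one cell to a float; return None for dashes or non-numeric text."""
    raw = text.strip()
    if raw.lower() in NULL_MARKERS:
        return None
    # Accounting style: (1,234.5) means -1234.5.
    negative = raw.startswith("(") and raw.endswith(")")
    cleaned = re.sub(r"[()$€£,%\s]", "", raw)
    if cleaned.startswith("-"):
        negative, cleaned = True, cleaned[1:]
    try:
        value = float(cleaned)
    except ValueError:
        return None
    return -value if negative else value
```

With these helpers, a line's trailing tokens can be passed through `parse_cell` one by one, while the leading non-numeric tokens form the row label.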
No document bytes are sent to an Okra backend.
Telemetry
The actor sends anonymous run telemetry to okraPDF analytics (PostHog) on each run: run status, document/table/row counts, duration, and actor version. It never sends PDF content, URLs, file names, or extracted values. This lets okraPDF understand how the actor is used. Opt out by setting the environment variable `OKRA_TELEMETRY=0`.
Best Use Cases
- Annual-report financial statements
- 10-K segment revenue and operating-income tables
- Balance sheet, profit and loss, and cash flow tables
- Investor-presentation KPI tables
- Benchmark tables where each row needs cell-level evidence
Validated table hints include:
- `Revenue by Reportable Segments`
- `Operating Income by Reportable Segments`
- `Balance sheet`
- `Profit and loss account`
- `Statement of cash flows`
- `Table 1`, `Table 2`, etc.
Dense scientific tables with multi-band headers are supported best-effort. The strongest validated path is annual-report and financial-statement extraction.
Input
Provide direct PDF URLs and one or more table titles/captions to extract.
{
  "pdfUrls": ["https://annualreports.ai/wp-content/uploads/10k-form-nvidia-2024.pdf"],
  "tableHints": ["Revenue by Reportable Segments"],
  "maxPages": 120,
  "output": {
    "dataset": true,
    "storeJsonKey": "financial-tables.json"
  }
}
Annual-report statement example:
{
  "pdfUrls": ["https://www.bis.org/about/areport/areport2024.pdf"],
  "tableHints": ["Balance sheet", "Profit and loss account", "Statement of cash flows"],
  "maxPages": 181
}
Academic benchmark table example:
{
  "pdfUrls": ["https://arxiv.org/pdf/2509.18965"],
  "tableHints": ["Table 3", "Table 4"],
  "maxPages": 12
}
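For programmatic runs, the input objects above can be assembled and submitted from Python. The sketch below is hedged: `build_run_input` and `run_actor` are hypothetical helper names, the treatment of `pdfUrls` and `tableHints` as required is an assumption drawn from the description above, and the actor ID must be supplied by the caller because this page does not state it. The client calls come from the `apify-client` package, which is imported lazily so the input builder works without it.

```python
def build_run_input(pdf_urls, table_hints, max_pages=None, store_json_key=None):
    """Assemble an input dict using the field names shown in the examples."""
    if not pdf_urls:
        raise ValueError("at least one PDF URL is needed")
    if not table_hints:
        raise ValueError("at least one table hint is needed")
    run_input = {"pdfUrls": list(pdf_urls), "tableHints": list(table_hints)}
    if max_pages is not None:
        run_input["maxPages"] = int(max_pages)
    if store_json_key:
        # Mirrors the "output" object from the first example.
        run_input["output"] = {"dataset": True, "storeJsonKey": store_json_key}
    return run_input


def run_actor(token, actor_id, run_input):
    """Start the actor and return its dataset items (needs `pip install apify-client`)."""
    from apify_client import ApifyClient

    client = ApifyClient(token)
    # actor_id is a placeholder supplied by the caller.
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

For example, `build_run_input(["https://www.bis.org/about/areport/areport2024.pdf"], ["Balance sheet"], max_pages=181)` yields an input equivalent to a single-hint version of the annual-report example.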
Output
Each result contains:
- `document`: source metadata.
- `tables`: normalized table objects.
- `rows`: one row per table line item.
- `values`: typed numeric values keyed by inferred column names.
- `evidence`: page number, table bbox, row bbox, original quote, and cell bboxes.
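Downstream code often wants one flat record per cell rather than nested objects. The sketch below assumes the fields nest as result → `tables` → `rows` → `values`/`evidence`, which the field list suggests but does not state outright; `flatten_result` is a hypothetical helper, and the evidence key names `page` and `table_title` are taken from the example row in this section.

```python
def flatten_result(result):
    """Turn one result object into a list of per-cell records."""
    records = []
    for table in result.get("tables", []):
        for row in table.get("rows", []):
            evidence = row.get("evidence", {})
            for column, value in row.get("values", {}).items():
                records.append(
                    {
                        "table_title": evidence.get("table_title"),
                        "label": row.get("label"),
                        "column": column,
                        "value": value,  # may be None for missing cells
                        "page": evidence.get("page"),
                    }
                )
    return records
```

Each record keeps its page reference, so a flattened export (for example to CSV) still points back to the source PDF.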
Example row:
{
  "label": "Total assets",
  "values": {
    "2024": 379155.4,
    "2023": 350309.6
  },
  "evidence": {
    "page": 177,
    "table_title": "Balance sheet",
    "quote": "Total assets 379,155.4 350,309.6"
  }
}
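Because each row carries its source quote, a consumer can cross-check the typed values against the quoted text. The following is one possible check, not part of the actor: `numbers_in_quote` and `row_matches_quote` are hypothetical names, and the regular expression handles only plain, comma-grouped, and parenthesized numbers.

```python
import re

# Optional "(" or "-", digits with comma grouping, optional decimals, optional ")".
NUMBER = re.compile(r"\(?-?\d[\d,]*(?:\.\d+)?\)?")


def numbers_in_quote(quote):
    """Extract every number that appears in an evidence quote as a float."""
    values = []
    for token in NUMBER.findall(quote):
        negative = token.startswith("(") or token.lstrip("(").startswith("-")
        digits = token.strip("()").lstrip("-").replace(",", "")
        values.append(-float(digits) if negative else float(digits))
    return values


def row_matches_quote(row, tolerance=1e-6):
    """Return True if every non-null value in the row appears in its quote."""
    quoted = numbers_in_quote(row["evidence"]["quote"])
    return all(
        any(abs(value - q) <= tolerance for q in quoted)
        for value in row["values"].values()
        if value is not None
    )
```

Applied to the example row, the quote yields 379155.4 and 350309.6, which match both entries in `values`.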
For NVIDIA's 2024 10-K reportable-segment table, the actor also emits compatibility fields such as jan_28_2024_millions, jan_29_2023_millions, dollar_change_millions, and percent_change.
Validation
The actor was benchmarked against existing Apify PDF actors and source-specific disclosure actors. Plain PDF text actors are useful for RAG chunks, but they do not return typed financial rows with page/cell evidence. General PDF-to-markdown actors can expose table text but often lose row/column alignment on financial tables.
Remote Apify validation includes:
| Source | Output |
|---|---|
| NVIDIA 2024 10-K | 1 table, 3 rows, segment revenue values validated |
| BIS Annual Report 2023/24 | 3 tables, 59 rows, balance sheet/profit/cash flow values validated |
| arXiv benchmark paper | 2 tables, 16 rows, validated against arXiv HTML |
| Federal Register negative control | 0 tables, expected warning |
Known limitations:
- Scanned PDFs without embedded text are not OCR'd by this actor.
- Very dense tables with multi-band scientific headers may need post-processing.
- Table extraction is guided by `tableHints`; this actor does not yet discover every table automatically.
Development Verification
The actor is tested with focused regression cases for financial-statement parsing, caption false positives, note-column stripping, missing values, and wrapped labels.


