# Financial Table Extractor for PDFs (`dainty_dogfish/okra-financial-table-extractor`) Actor

Extract annual-report and 10-K table rows from PDF URLs into typed JSON with page, quote, and cell bbox evidence. Runs self-contained on Apify; no Okra API key required.

- **URL**: https://apify.com/dainty_dogfish/okra-financial-table-extractor.md
- **Developed by:** [Steven](https://apify.com/dainty_dogfish) (community)
- **Categories:** AI, Integrations, Automation
- **Stats:** 2 total users, 1 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage, which gets cheaper the higher subscription plan you have.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Financial Table Extractor for PDFs

Extract finance and annual-report table rows from PDFs into normalized JSON with typed values, page references, source quotes, and bounding-box evidence.

This Actor is for workflows where plain PDF text is not enough. It targets tables where exact numeric values, row labels, units, missing values, and citation evidence matter.

It runs fully inside the Apify Actor container. It does not require an okraPDF account, Okra API key, external OCR service, or LLM API.

### How It Parses PDFs

The actor:

1. Downloads each public PDF URL into temporary actor storage.
2. Opens the PDF locally with `pdfplumber`, which uses `pdfminer.six` for born-digital PDF text and geometry.
3. Extracts words with coordinates, then groups them into lines by page position.
4. Finds the requested table captions from `tableHints`.
5. Collects nearby numeric rows and handles currency symbols, percentages, parenthesized negatives, dashes as nulls, and note columns.
6. Infers columns from nearby headers and emits normalized rows plus page, quote, row bbox, and cell bbox evidence.
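
Step 5 above (currency, percentages, parenthesized negatives, dashes as nulls) can be illustrated with a small cell parser. This is a minimal sketch, not the Actor's actual code: the function name `parse_cell`, the set of null markers, and the choice to keep percentages as plain numbers (`12.5%` becomes `12.5`) are all assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical helper; the Actor's real parsing rules may differ.
# Cell texts treated as "missing value" (hyphen, en dash, em dash, n/a).
_NULL_MARKERS = {"", "-", "\u2013", "\u2014", "n/a"}


def parse_cell(text: str) -> Optional[float]:
    """Convert one table cell to a float, or None for a missing value.

    Raises ValueError if the cell is neither a null marker nor a number.
    """
    s = text.strip()
    if s.lower() in _NULL_MARKERS:
        return None
    # Accounting style: "(1,234.5)" means -1234.5.
    negative = s.startswith("(") and s.endswith(")")
    if negative:
        s = s[1:-1]
    # Drop currency symbols, thousands separators, spaces, and a trailing "%".
    s = re.sub(r"[$\u20ac\u00a3,\s]", "", s).rstrip("%")
    value = float(s)
    return -value if negative else value
```

Returning `None` rather than `0.0` for dashes keeps a missing value distinguishable from a reported zero, which matters when downstream code sums or compares columns.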

No document bytes are sent to an Okra backend.

#### Telemetry

The Actor sends **anonymous run telemetry** to okraPDF analytics (PostHog) on each run: run status, document/table/row counts, duration, and Actor version. It **never** sends PDF content, URLs, file names, or extracted values. This lets okraPDF understand how the Actor is used. Opt out by setting the environment variable `OKRA_TELEMETRY=0`.

### Best Use Cases

- Annual-report financial statements
- 10-K segment revenue and operating-income tables
- Balance sheet, profit and loss, and cash flow tables
- Investor-presentation KPI tables
- Benchmark tables where each row needs cell-level evidence

Validated table hints include:

- `Revenue by Reportable Segments`
- `Operating Income by Reportable Segments`
- `Balance sheet`
- `Profit and loss account`
- `Statement of cash flows`
- `Table 1`, `Table 2`, etc.

Dense scientific tables with multi-band headers are supported best-effort. The strongest validated path is annual-report and financial-statement extraction.
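
Because extraction is driven by `tableHints`, it helps to see how a hint might be matched against a line of page text without also matching running prose such as "as shown in Table 3". The heuristic below is only a sketch under assumptions: `is_caption_line` and its `max_extra_words` threshold are hypothetical and are not taken from the Actor's source.

```python
import re


def is_caption_line(line: str, hint: str, max_extra_words: int = 12) -> bool:
    """Heuristic: the line starts with the hint and is short enough to be a title."""
    norm_line = re.sub(r"\s+", " ", line).strip().lower()
    norm_hint = re.sub(r"\s+", " ", hint).strip().lower()
    if not norm_line.startswith(norm_hint):
        return False
    rest = norm_line[len(norm_hint):]
    # "Table 3" must not match "Table 31": require a non-alphanumeric boundary.
    if rest and rest[0].isalnum():
        return False
    # Long lines are more likely body text than a caption.
    return len(rest.split()) <= max_extra_words
```

Anchoring the match at the start of the line is what rejects in-text references; the word limit is a second, weaker guard against sentences that happen to begin with the hint.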

### Input

Provide direct PDF URLs and one or more table titles/captions to extract.

```json
{
  "pdfUrls": ["https://annualreports.ai/wp-content/uploads/10k-form-nvidia-2024.pdf"],
  "tableHints": ["Revenue by Reportable Segments"],
  "maxPages": 120,
  "output": {
    "dataset": true,
    "storeJsonKey": "financial-tables.json"
  }
}
````

Annual-report statement example:

```json
{
  "pdfUrls": ["https://www.bis.org/about/areport/areport2024.pdf"],
  "tableHints": ["Balance sheet", "Profit and loss account", "Statement of cash flows"],
  "maxPages": 181
}
```

Academic benchmark table example:

```json
{
  "pdfUrls": ["https://arxiv.org/pdf/2509.18965"],
  "tableHints": ["Table 3", "Table 4"],
  "maxPages": 12
}
```
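
When inputs like the ones above are assembled in code, a small builder can apply the defaults from the input schema (`maxPages` 120 with a minimum of 1, `tableHints` defaulting to `Revenue by Reportable Segments`) and reject obviously bad URLs before a run is started. The helper name `build_input` and the rule that `pdfUrls` must be non-empty are assumptions for this sketch; the schema itself does not state that rule.

```python
from urllib.parse import urlparse


def build_input(pdf_urls, table_hints=None, max_pages=120):
    """Build an Actor input dict, applying schema defaults and basic checks."""
    if not pdf_urls:
        # Assumption: an empty URL list is treated as a caller error.
        raise ValueError("pdfUrls must contain at least one URL")
    for url in pdf_urls:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an HTTP(S) URL: {url!r}")
    if not isinstance(max_pages, int) or max_pages < 1:
        raise ValueError("maxPages must be an integer >= 1")
    return {
        "pdfUrls": list(pdf_urls),
        "tableHints": list(table_hints or ["Revenue by Reportable Segments"]),
        "maxPages": max_pages,
        "output": {"dataset": True, "storeJsonKey": "financial-tables.json"},
    }
```

For example, `build_input(["https://arxiv.org/pdf/2509.18965"], ["Table 3", "Table 4"], 12)` reproduces the academic benchmark input with the default `output` block filled in.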

### Output

Each result contains:

- `document`: source metadata.
- `tables`: normalized table objects.
- `rows`: one row per table line item.
- `values`: typed numeric values keyed by inferred column names.
- `evidence`: page number, table bbox, row bbox, original quote, and cell bboxes.

Example row:

```json
{
  "label": "Total assets",
  "values": {
    "2024": 379155.4,
    "2023": 350309.6
  },
  "evidence": {
    "page": 177,
    "table_title": "Balance sheet",
    "quote": "Total assets 379,155.4 350,309.6"
  }
}
```
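
The `quote` field makes it possible to cross-check each typed value against the text it came from. The sketch below assumes a row shaped exactly like the example above (`values` plus `evidence.quote`); the function names `quote_numbers` and `unsupported_values` are hypothetical, and the number pattern is a simplification that ignores units and scaling.

```python
import re


def quote_numbers(quote: str) -> list[float]:
    """Extract numeric tokens from an evidence quote, honoring (negatives)."""
    numbers = []
    for match in re.finditer(r"\(?-?\d[\d,]*(?:\.\d+)?\)?", quote):
        token = match.group()
        negative = token.startswith("(") and token.endswith(")")
        value = float(token.strip("()").replace(",", ""))
        numbers.append(-value if negative else value)
    return numbers


def unsupported_values(row: dict) -> dict:
    """Return the entries of row['values'] that do not appear in the quote."""
    found = quote_numbers(row["evidence"]["quote"])
    return {
        column: value
        for column, value in row["values"].items()
        if value is not None and not any(abs(value - n) < 1e-9 for n in found)
    }
```

An empty result means every non-null value is backed by a number in the quote; anything returned is a candidate for manual review against the cited page.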

For NVIDIA's 2024 10-K reportable-segment table, the actor also emits compatibility fields such as `jan_28_2024_millions`, `jan_29_2023_millions`, `dollar_change_millions`, and `percent_change`.

### Validation

The actor was benchmarked against existing Apify PDF actors and source-specific disclosure actors. Plain PDF text actors are useful for RAG chunks, but they do not return typed financial rows with page/cell evidence. General PDF-to-markdown actors can expose table text but often lose row/column alignment on financial tables.

Remote Apify validation includes:

| Source | Output |
|---|---|
| NVIDIA 2024 10-K | 1 table, 3 rows, segment revenue values validated |
| BIS Annual Report 2023/24 | 3 tables, 59 rows, balance sheet/profit/cash flow values validated |
| arXiv benchmark paper | 2 tables, 16 rows, validated against arXiv HTML |
| Federal Register negative control | 0 tables, expected warning |

Known limitations:

- Scanned PDFs without embedded text are not OCR'd by this actor.
- Very dense tables with multi-band scientific headers may need post-processing.
- Table extraction is guided by `tableHints`; this actor does not yet discover every table automatically.
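
Since scanned PDFs are not OCR'd, it can be worth checking locally whether a file has embedded text before spending a run on it. The sketch below uses `pdfplumber` (the same library named in the parsing steps) to sample the first pages. The function names and both thresholds (20 characters per page, more than half of the sampled pages) are arbitrary assumptions, not values used by the Actor.

```python
def looks_scanned(page_texts, min_chars_per_page: int = 20) -> bool:
    """True when most sampled pages carry almost no embedded text."""
    texts = [(t or "").strip() for t in page_texts]
    if not texts:
        return True
    sparse = sum(1 for t in texts if len(t) < min_chars_per_page)
    return sparse / len(texts) > 0.5


def pdf_looks_scanned(path: str, max_pages: int = 10) -> bool:
    """Sample the first pages of a local PDF file with pdfplumber."""
    import pdfplumber  # third-party; imported lazily so looks_scanned has no dependency

    with pdfplumber.open(path) as pdf:
        return looks_scanned(page.extract_text() for page in pdf.pages[:max_pages])
```

A file flagged by `pdf_looks_scanned` would need OCR from another tool before this Actor can extract anything from it.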

### Development Verification

The actor is tested with focused regression cases for financial-statement parsing, caption false positives, note-column stripping, missing values, and wrapped labels.

# Actor input Schema

## `pdfUrls` (type: `array`):

Direct URLs to born-digital PDF files. Public HTTP(S) URLs work best.

## `tableHints` (type: `array`):

Table titles or captions to extract, such as Revenue by Reportable Segments, Balance sheet, or Table 3.

## `maxPages` (type: `integer`):

Maximum pages to scan per PDF.

## `output` (type: `object`):

Control dataset and key-value store outputs.

## Actor input object example

```json
{
  "pdfUrls": [
    "https://annualreports.ai/wp-content/uploads/10k-form-nvidia-2024.pdf"
  ],
  "tableHints": [
    "Revenue by Reportable Segments"
  ],
  "maxPages": 120
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "pdfUrls": [
        "https://annualreports.ai/wp-content/uploads/10k-form-nvidia-2024.pdf"
    ],
    "tableHints": [
        "Revenue by Reportable Segments"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("dainty_dogfish/okra-financial-table-extractor").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "pdfUrls": ["https://annualreports.ai/wp-content/uploads/10k-form-nvidia-2024.pdf"],
    "tableHints": ["Revenue by Reportable Segments"],
}

# Run the Actor and wait for it to finish
run = client.actor("dainty_dogfish/okra-financial-table-extractor").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```
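
Besides the dataset, the `output.storeJsonKey` option writes the full JSON result to the run's default key-value store. A minimal sketch for reading it back is shown below, assuming `client` and `run` are the objects from the Python example above and that the default key `financial-tables.json` was used. The helper name `load_full_result` is hypothetical; `key_value_store(...)` and `get_record(...)` are the Python client's key-value store calls.

```python
def load_full_result(client, run, key="financial-tables.json"):
    """Fetch the full JSON result stored under `key`, or None if it is absent."""
    store = client.key_value_store(run["defaultKeyValueStoreId"])
    record = store.get_record(key)
    # get_record returns None when the key does not exist in the store.
    return None if record is None else record["value"]


# Usage, continuing from the example above:
#   result = load_full_result(client, run)
#   if result is not None:
#       print(result)
```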

## CLI example

```bash
echo '{
  "pdfUrls": [
    "https://annualreports.ai/wp-content/uploads/10k-form-nvidia-2024.pdf"
  ],
  "tableHints": [
    "Revenue by Reportable Segments"
  ]
}' |
apify call dainty_dogfish/okra-financial-table-extractor --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=dainty_dogfish/okra-financial-table-extractor",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Financial Table Extractor for PDFs",
        "description": "Extract annual-report and 10-K table rows from PDF URLs into typed JSON with page, quote, and cell bbox evidence. Runs self-contained on Apify; no Okra API key required.",
        "version": "0.1",
        "x-build-id": "VIjMc0Ysg4MjgIBSx"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/dainty_dogfish~okra-financial-table-extractor/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-dainty_dogfish-okra-financial-table-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/dainty_dogfish~okra-financial-table-extractor/runs": {
            "post": {
                "operationId": "runs-sync-dainty_dogfish-okra-financial-table-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/dainty_dogfish~okra-financial-table-extractor/run-sync": {
            "post": {
                "operationId": "run-sync-dainty_dogfish-okra-financial-table-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "pdfUrls": {
                        "title": "PDF URLs",
                        "type": "array",
                        "description": "Direct URLs to born-digital PDF files. Public HTTP(S) URLs work best.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "tableHints": {
                        "title": "Table Hints",
                        "type": "array",
                        "description": "Table titles or captions to extract, such as Revenue by Reportable Segments, Balance sheet, or Table 3.",
                        "items": {
                            "type": "string"
                        },
                        "default": [
                            "Revenue by Reportable Segments"
                        ]
                    },
                    "maxPages": {
                        "title": "Max Pages",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Maximum pages to scan per PDF.",
                        "default": 120
                    },
                    "output": {
                        "title": "Output Options",
                        "type": "object",
                        "description": "Control dataset and key-value store outputs.",
                        "properties": {
                            "dataset": {
                                "title": "Dataset",
                                "type": "boolean",
                                "description": "Push extracted tables to the default dataset.",
                                "default": true
                            },
                            "storeJsonKey": {
                                "title": "JSON Store Key",
                                "type": "string",
                                "description": "Key for the full JSON result in the default key-value store.",
                                "default": "financial-tables.json"
                            }
                        }
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
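
For projects that call the REST API directly, the `run-sync-get-dataset-items` path from the specification above can be exercised with the Python standard library alone. This is a sketch under assumptions: the helper names and the 300-second client timeout are illustrative, and the token is passed as the `token` query parameter exactly as the specification describes.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "dainty_dogfish~okra-financial-table-extractor"


def build_sync_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Build the POST request for run-sync-get-dataset-items without sending it."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/acts/{ACTOR_PATH}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sync(token: str, actor_input: dict, timeout: float = 300):
    """Send the request and return the parsed JSON body (the dataset items)."""
    request = build_sync_request(token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Splitting request construction from sending keeps the URL and body easy to inspect or log (with the token redacted) before any platform usage is incurred.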
