PDF Table Extract MCP - Tables to JSON & CSV

Extract tables from PDF documents and convert to structured JSON. Uses PDFium backend (Apache 2.0). No Ghostscript required.

Pricing: Pay per event + usage
Rating: 0.0 (0 reviews)
Developer: daehwan kim (Maintained by Community)
Actor stats: 0 bookmarked · 1 total user · 0 monthly active users · last modified 6 hours ago

PDF Table Extract MCP

Extract tables from PDF documents and convert to structured JSON. No Ghostscript required.

Overview

This Apify MCP actor extracts tables from PDF documents using Camelot with PDFium backend (Apache 2.0 license). PDFium is a modern, open-source PDF engine with superior table extraction accuracy compared to legacy approaches.

Key Features

  • Two extraction methods: Lattice detection (line-based) and Stream method (text-based)
  • Structured output: Raw tables or auto-converted to JSON with headers as keys
  • No GPL dependencies: Uses PDFium (Apache 2.0), NOT Ghostscript (AGPL-3.0)
  • Page filtering: Extract specific pages or page ranges
  • Pay-per-event pricing: $0.05–$0.08 per API call
  • MCP integration: Works seamlessly with Claude Desktop and other MCP clients
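To illustrate the Claude Desktop integration, here is a hypothetical configuration sketch, assuming the actor is exposed through Apify's Actors MCP server; the actor ID `username/pdf-table-extract-mcp` is a placeholder, and the package name and flags should be checked against Apify's current MCP documentation:

```json
{
  "mcpServers": {
    "pdf-table-extract": {
      "command": "npx",
      "args": ["-y", "@apify/actors-mcp-server", "--actors", "username/pdf-table-extract-mcp"],
      "env": { "APIFY_TOKEN": "your_apify_token" }
    }
  }
}
```

With this entry in `claude_desktop_config.json`, the two tools below become callable from Claude Desktop conversations.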

Attribution

Powered by:

  • Camelot (MIT License) — Table extraction logic
  • PDFium (Apache 2.0) — PDF rendering engine
  • Apify (Proprietary) — Actor deployment platform

No Ghostscript (AGPL) included. All dependencies are permissive or Apache 2.0 licensed.

Usage

Tool 1: extract_tables ($0.05 per call)

Extract raw tables from a PDF as 2D arrays.

Input:

{
  "pdf_url": "https://example.com/document.pdf",
  "pages": "1,2,5-10",   // Optional. Default: "all"
  "flavor": "lattice"    // Optional. "lattice" (default) or "stream"
}

Output:

{
  "tables": [
    {
      "page": 1,
      "table_number": 0,
      "rows": 5,
      "columns": 3,
      "data": [
        ["Header 1", "Header 2", "Header 3"],
        ["Row 1 Col 1", "Row 1 Col 2", "Row 1 Col 3"],
        ["Row 2 Col 1", "Row 2 Col 2", "Row 2 Col 3"]
      ],
      "accuracy": 0.95
    }
  ],
  "total_tables": 12,
  "pages_processed": 10
}

Usage Example:

# Calling the extract_tables endpoint directly over HTTP
curl -X POST "https://api.ntriq.co.kr/ai/tools/extract_tables" \
  -H "Content-Type: application/json" \
  -H "X-YAP-Key: your_api_key" \
  -d '{
    "pdf_url": "https://example.com/report.pdf",
    "pages": "1-5"
  }'

Tool 2: tables_to_json ($0.08 per call)

Extract tables and automatically convert to structured JSON with headers as object keys.

Input:

{
  "pdf_url": "https://example.com/document.pdf",
  "pages": "all"   // Optional. Default: "all"
}

Output:

{
  "tables": [
    {
      "page": 1,
      "table_number": 0,
      "rows": 5,
      "columns": 3,
      "json_data": [
        {
          "Header 1": "Row 1 Col 1",
          "Header 2": "Row 1 Col 2",
          "Header 3": "Row 1 Col 3"
        },
        {
          "Header 1": "Row 2 Col 1",
          "Header 2": "Row 2 Col 2",
          "Header 3": "Row 2 Col 3"
        }
      ],
      "accuracy": 0.95
    }
  ],
  "total_tables": 12,
  "pages_processed": 10
}

Usage Example:

# Perfect for downstream processing
import pandas as pd

result = client.call_mcp_tool('tables_to_json', {
    'pdf_url': 'https://example.com/financial_report.pdf'
})
for table in result['tables']:
    df = pd.DataFrame(table['json_data'])  # headers become column names
    print(df)
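The actor's title also promises CSV; a minimal sketch of converting one table's `json_data` (the list of header-keyed dicts shown above) to a CSV file using only the standard library:

```python
import csv

def table_to_csv(json_data, path):
    """Write one table's json_data (a list of {header: value} dicts) to a CSV file."""
    if not json_data:
        return
    headers = list(json_data[0].keys())
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=headers)
        writer.writeheader()        # first row: the extracted headers
        writer.writerows(json_data) # remaining rows: cell values
```

Calling `table_to_csv(table['json_data'], f"table_{table['table_number']}.csv")` inside the loop above writes one CSV per extracted table.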

Extraction Methods

Lattice Method (Default)

  • Best for: Structured tables with visible grid lines
  • Uses: Line detection to identify cell boundaries
  • Accuracy: Very high for formal reports, spreadsheets
  • Speed: Fast

Stream Method

  • Best for: Text-only tables without visible grid lines
  • Uses: Whitespace and text positioning
  • Accuracy: Good for loose, unformatted data
  • Speed: Slightly slower
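Since the best flavor is not always known in advance, choosing one can be automated. A sketch, where `call_tool(name, args)` is a hypothetical wrapper around the MCP call that returns the tool's JSON output as a dict:

```python
def extract_with_fallback(call_tool, pdf_url, pages="all"):
    """Try lattice detection first; fall back to stream if no tables are found."""
    result = call_tool("extract_tables",
                       {"pdf_url": pdf_url, "pages": pages, "flavor": "lattice"})
    if result.get("total_tables", 0) > 0:
        return result
    # No ruled tables detected: retry with whitespace-based detection.
    return call_tool("extract_tables",
                     {"pdf_url": pdf_url, "pages": pages, "flavor": "stream"})
```

Note this doubles the per-call cost when the fallback fires, so it suits pipelines where table recall matters more than price.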

Technical Details

PDF Processing Pipeline

PDF URL / Base64
[PDFium Renderer]
[Camelot Extraction]
[Format Conversion]
JSON Output

Timeout Configuration

  • API timeout: 120 seconds + 15-second buffer
  • Recommended PDF size: < 50 MB
  • Max pages per call: All pages supported (split large PDFs if needed)
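Splitting a large PDF means generating `pages` range strings for separate calls. A small helper sketch (the chunk size of 20 pages is illustrative, not a documented limit):

```python
def page_chunks(total_pages, chunk_size=20):
    """Split a page count into 'start-end' strings accepted by the pages input."""
    return [
        f"{start}-{min(start + chunk_size - 1, total_pages)}"
        for start in range(1, total_pages + 1, chunk_size)
    ]
```

For a 45-page document, `page_chunks(45)` yields `["1-20", "21-40", "41-45"]`, each of which can be passed as the `pages` value of a separate call to stay within the timeout.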

Error Handling

{
  "success": false,
  "error": "Failed to download PDF from URL: Connection timeout after 30s",
  "toolName": "extract_tables"
}

Common errors:

  • Connection timeout — PDF URL unreachable
  • Invalid PDF format — Corrupted file or non-PDF content
  • No tables found — PDF contains no extractable tables
  • Extraction timeout — PDF too large or complex (retry with specific pages)
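A sketch of client-side handling for these errors, assuming the response shape shown above; `call_tool` is a hypothetical MCP-call wrapper, and which errors count as transient is a judgment call:

```python
import time

# Error messages worth retrying (network or load-related failures).
TRANSIENT = ("Connection timeout", "Extraction timeout")

def call_with_retry(call_tool, name, args, retries=2, delay=5.0):
    """Retry transient failures with a fixed delay; raise on permanent ones."""
    for attempt in range(retries + 1):
        result = call_tool(name, args)
        if result.get("success", True):
            return result
        error = result.get("error", "")
        if not any(marker in error for marker in TRANSIENT) or attempt == retries:
            raise RuntimeError(f"{name} failed: {error}")
        time.sleep(delay)
```

Since billing is per successful extraction (see Pricing below), retrying failed calls does not incur extra charges.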

Pricing

Tool             Cost    Use Case
--------------   -----   ----------------------------
extract_tables   $0.05   Raw 2D array extraction
tables_to_json   $0.08   Structured JSON with headers

Billing: Per successful extraction. Errors are not charged.

Requirements

  • Node.js: 20.0+
  • Docker: For Apify deployment
  • No external PDF tools: PDFium is bundled

Local Development

npm install
npm start
# Test with local input
cat > input.json << 'EOF'
{
  "pdf_url": "https://example.com/sample.pdf",
  "pages": "all"
}
EOF
APIFY_LOCAL_STORAGE_DIR=./storage node src/main.js

Deployment

$ apify push

Licensing

  • Camelot: MIT License
  • PDFium: Apache 2.0 License
  • This Actor: MIT License

Compliance Note: This actor uses open-source, permissive licenses. No GPL/AGPL dependencies. Safe for commercial use.

Support

  • Issues: Report via Apify platform or ntriq support
  • Documentation: ai.ntriq.co.kr → /document/extract-tables API docs
  • Rate limits: Handled by Apify (see account dashboard)

Changelog

v1.0.0 — 2026-03-29

  • Initial release
  • Two tools: extract_tables + tables_to_json
  • PDFium backend with Lattice/Stream flavors
  • PPE billing with Actor.charge() integration

Powered by Camelot (MIT) with PDFium (Apache 2.0)