PDF Table Extract MCP - Tables to JSON & CSV
Developer: daehwan kim
Pricing: Pay per event + usage
PDF Table Extract MCP
Extract tables from PDF documents and convert to structured JSON. No Ghostscript required.
Overview
This Apify MCP actor extracts tables from PDF documents using Camelot with a PDFium backend (Apache 2.0 license). PDFium is a modern, open-source PDF rendering engine that delivers high table-extraction accuracy without the AGPL-licensed Ghostscript dependency that Camelot traditionally requires.
Key Features
- ✅ Two extraction methods: Lattice detection (line-based) and Stream method (text-based)
- ✅ Structured output: Raw tables or auto-converted to JSON with headers as keys
- ✅ No GPL dependencies: Uses PDFium (Apache 2.0), NOT Ghostscript (AGPL-3.0)
- ✅ Page filtering: Extract specific pages or page ranges
- ✅ Pay-per-event pricing: $0.05–$0.08 per API call
- ✅ MCP integration: Works seamlessly with Claude Desktop and other MCP clients
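The MCP integration above can be wired into Claude Desktop via its `claude_desktop_config.json`. A hedged sketch, assuming the actor is exposed through Apify's `@apify/actors-mcp-server` (the actor ID and token placeholders are yours to fill in; check the Apify MCP documentation for the exact command-line flags):

```json
{
  "mcpServers": {
    "pdf-table-extract": {
      "command": "npx",
      "args": ["-y", "@apify/actors-mcp-server", "--actors", "your-username/pdf-table-extract"],
      "env": { "APIFY_TOKEN": "your-apify-token" }
    }
  }
}
```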
Attribution
Powered by:
- Camelot (MIT License) — Table extraction logic
- PDFium (Apache 2.0) — PDF rendering engine
- Apify (Proprietary) — Actor deployment platform
No Ghostscript (AGPL) included. All dependencies are permissive or Apache 2.0 licensed.
Usage
Tool 1: extract_tables ($0.05 per call)
Extract raw tables from a PDF as 2D arrays.
Input:
{"pdf_url": "https://example.com/document.pdf","pages": "1,2,5-10", // Optional. Default: "all""flavor": "lattice" // Optional. "lattice" (default) or "stream"}
Output:
{"tables": [{"page": 1,"table_number": 0,"rows": 5,"columns": 3,"data": [["Header 1", "Header 2", "Header 3"],["Row 1 Col 1", "Row 1 Col 2", "Row 1 Col 3"],["Row 2 Col 1", "Row 2 Col 2", "Row 2 Col 3"]],"accuracy": 0.95}],"total_tables": 12,"pages_processed": 10}
Usage Example:
```bash
# Using Claude API with MCP
curl -X POST "https://api.ntriq.co.kr/ai/tools/extract_tables" \
  -H "Content-Type: application/json" \
  -H "X-YAP-Key: your_api_key" \
  -d '{"pdf_url": "https://example.com/report.pdf", "pages": "1-5"}'
```
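The `data` field of each table is a plain 2D array, so converting to CSV (as the actor's title advertises) needs only the standard library. A minimal sketch, assuming the output shape shown above (`tables_to_csv` is an illustrative helper, not part of the actor):

```python
import csv
import io

def tables_to_csv(response):
    """Turn each extracted table's 2D `data` array into a CSV string."""
    csvs = []
    for table in response["tables"]:
        buf = io.StringIO()
        csv.writer(buf).writerows(table["data"])
        csvs.append(buf.getvalue())
    return csvs

# Demo with a tiny response in the shape documented above.
sample = {"tables": [{"page": 1, "data": [["Header 1", "Header 2"],
                                          ["Row 1 Col 1", "Row 1 Col 2"]]}]}
csv_text = tables_to_csv(sample)[0]
```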
Tool 2: tables_to_json ($0.08 per call)
Extract tables and automatically convert to structured JSON with headers as object keys.
Input:
{"pdf_url": "https://example.com/document.pdf","pages": "all" // Optional. Default: "all"}
Output:
{"tables": [{"page": 1,"table_number": 0,"rows": 5,"columns": 3,"json_data": [{"Header 1": "Row 1 Col 1","Header 2": "Row 1 Col 2","Header 3": "Row 1 Col 3"},{"Header 1": "Row 2 Col 1","Header 2": "Row 2 Col 2","Header 3": "Row 2 Col 3"}],"accuracy": 0.95}],"total_tables": 12,"pages_processed": 10}
Usage Example:
```python
# Perfect for downstream processing
import pandas as pd

result = client.call_mcp_tool('tables_to_json', {
    'pdf_url': 'https://example.com/financial_report.pdf'
})
for table in result['tables']:
    df = pd.DataFrame(table['json_data'])
    print(df)
```
Extraction Methods
Lattice Method (Default)
- Best for: Structured tables with visible grid lines
- Uses: Line detection to identify cell boundaries
- Accuracy: Very high for formal reports, spreadsheets
- Speed: Fast
Stream Method
- Best for: Text-only tables without visible grid lines
- Uses: Whitespace and text positioning
- Accuracy: Good for loose, unformatted data
- Speed: Slightly slower
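When you don't know in advance whether a PDF uses ruled tables, a cheap client-side strategy is to try lattice first and fall back to stream when nothing is found. A sketch assuming the MCP client interface from the usage examples (`FakeClient` is a stand-in for a real client, included only so the snippet runs on its own):

```python
class FakeClient:
    """Stand-in for a real MCP client; mirrors call_mcp_tool from the usage examples."""
    def call_mcp_tool(self, tool, params):
        # Pretend lattice finds nothing and stream finds one table.
        if params["flavor"] == "lattice":
            return {"tables": [], "total_tables": 0}
        return {"tables": [{"page": 1}], "total_tables": 1}

def extract_with_fallback(client, pdf_url, pages="all"):
    """Try the lattice flavor first; retry with stream if no tables are found."""
    result = client.call_mcp_tool("extract_tables",
        {"pdf_url": pdf_url, "pages": pages, "flavor": "lattice"})
    if result.get("total_tables", 0) == 0:
        result = client.call_mcp_tool("extract_tables",
            {"pdf_url": pdf_url, "pages": pages, "flavor": "stream"})
    return result

result = extract_with_fallback(FakeClient(), "https://example.com/doc.pdf")
```

Note that each attempt is billed separately, so the fallback call doubles the cost for line-less PDFs.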
Technical Details
PDF Processing Pipeline
```
PDF URL / Base64
        ↓
[PDFium Renderer]
        ↓
[Camelot Extraction]
        ↓
[Format Conversion]
        ↓
  JSON Output
```
Timeout Configuration
- API timeout: 120 seconds + 15-second buffer
- Recommended PDF size: < 50 MB
- Max pages per call: no hard limit, but split large PDFs across multiple calls to stay within the timeout
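To stay under the 120-second timeout on large documents, split the work into page-range chunks and issue one call per chunk. A minimal sketch (`page_chunks` is a hypothetical client-side helper; the chunk size of 25 is an arbitrary assumption):

```python
def page_chunks(total_pages, chunk_size=25):
    """Split a page count into range strings accepted by the `pages` parameter."""
    ranges = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        ranges.append(f"{start}-{end}")
    return ranges
```

Each returned string (e.g. `"1-25"`) can be passed directly as the `pages` input of a separate call.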
Error Handling
{"success": false,"error": "Failed to download PDF from URL: Connection timeout after 30s","toolName": "extract_tables"}
Common errors:
- Connection timeout: PDF URL unreachable
- Invalid PDF format: corrupted file or non-PDF content
- No tables found: PDF contains no extractable tables
- Extraction timeout: PDF too large or complex (retry with specific pages)
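Transient failures such as timeouts are worth retrying, while errors like an invalid PDF are not. A hypothetical client-side retry wrapper, keyed on the error messages listed above (`call_with_retry` is illustrative, not part of any SDK):

```python
import time

RETRYABLE = ("Connection timeout", "Extraction timeout")

def call_with_retry(call, payload, attempts=3, backoff=1.0):
    """Retry transient failures; return permanent errors immediately."""
    result = {}
    for attempt in range(1, attempts + 1):
        result = call(payload)
        if result.get("success", True):
            return result
        if not any(msg in result.get("error", "") for msg in RETRYABLE):
            return result  # permanent error, e.g. "Invalid PDF format"
        if attempt < attempts:
            time.sleep(backoff * attempt)
    return result

# Demo: fail once with a timeout, then succeed.
responses = iter([
    {"success": False, "error": "Extraction timeout", "toolName": "extract_tables"},
    {"success": True, "total_tables": 3},
])
result = call_with_retry(lambda payload: next(responses),
                         {"pdf_url": "https://example.com/report.pdf"}, backoff=0)
```

Since errors are not billed (see Pricing below), retries only cost money once a call succeeds.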
Pricing
| Tool | Cost | Use Case |
|---|---|---|
| extract_tables | $0.05 | Raw 2D array extraction |
| tables_to_json | $0.08 | Structured JSON with headers |
Billing: Per successful extraction. Errors are not charged.
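With per-event pricing, estimating a job's cost is simple arithmetic. A small sketch (prices hard-coded from the table above; `estimate_cost` is illustrative, not an official SDK helper):

```python
# Per-call prices from the pricing table above.
PRICES = {"extract_tables": 0.05, "tables_to_json": 0.08}

def estimate_cost(calls):
    """calls: mapping of tool name -> number of successful (billable) calls."""
    return round(sum(PRICES[tool] * n for tool, n in calls.items()), 2)
```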
Requirements
- Node.js: 20.0+
- Docker: For Apify deployment
- No external PDF tools: PDFium is bundled
Local Development
```bash
npm install
npm start

# Test with local input
cat > input.json << 'EOF'
{"pdf_url": "https://example.com/sample.pdf", "pages": "all"}
EOF
APIFY_LOCAL_STORAGE_DIR=./storage node src/main.js
```
Deployment
```bash
apify push
```
Licensing
- Camelot: MIT License
- PDFium: Apache 2.0 License
- This Actor: MIT License
Compliance Note: This actor uses open-source, permissive licenses. No GPL/AGPL dependencies. Safe for commercial use.
Support
- Issues: Report via Apify platform or ntriq support
- Documentation: ai.ntriq.co.kr → /document/extract-tables API docs
- Rate limits: Handled by Apify (see account dashboard)
Changelog
v1.0.0 — 2026-03-29
- Initial release
- Two tools: extract_tables + tables_to_json
- PDFium backend with Lattice/Stream flavors
- PPE billing with Actor.charge() integration
Powered by Camelot (MIT) with PDFium (Apache 2.0)