PDF Table Extract MCP - Tables to JSON & CSV

Extract tables from PDF documents and convert to structured JSON. Uses PDFium backend (Apache 2.0). No Ghostscript required.

Pricing: Pay per event + usage
Rating: 0.0 (0 reviews)
Developer: daehwan kim (Maintained by Community)
Actor stats: 0 bookmarked · 1 total user · 0 monthly active users · last modified 6 hours ago

PDF Table Extract MCP

Extract tables from PDF documents and convert to structured JSON. No Ghostscript required.

Overview

This Apify MCP actor extracts tables from PDF documents using Camelot with PDFium backend (Apache 2.0 license). PDFium is a modern, open-source PDF engine with superior table extraction accuracy compared to legacy approaches.

Key Features

  • Two extraction methods: Lattice detection (line-based) and Stream method (text-based)
  • Structured output: Raw tables or auto-converted to JSON with headers as keys
  • No GPL dependencies: Uses PDFium (Apache 2.0), NOT Ghostscript (AGPL-3.0)
  • Page filtering: Extract specific pages or page ranges
  • Pay-per-event pricing: $0.05–$0.08 per API call
  • MCP integration: Works seamlessly with Claude Desktop and other MCP clients
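To illustrate the Claude Desktop integration, here is a hypothetical configuration sketch, assuming the actor is exposed through Apify's Actors MCP server; the actor ID `username/pdf-table-extract-mcp` is a placeholder, and the package name and flags should be checked against Apify's current MCP documentation:

```json
{
  "mcpServers": {
    "pdf-table-extract": {
      "command": "npx",
      "args": ["-y", "@apify/actors-mcp-server", "--actors", "username/pdf-table-extract-mcp"],
      "env": { "APIFY_TOKEN": "your_apify_token" }
    }
  }
}
```

With this entry in `claude_desktop_config.json`, the two tools below become callable from Claude Desktop conversations.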

Attribution

Powered by:

  • Camelot (MIT License) — Table extraction logic
  • PDFium (Apache 2.0) — PDF rendering engine
  • Apify (Proprietary) — Actor deployment platform

No Ghostscript (AGPL) included. All dependencies are permissive or Apache 2.0 licensed.

Usage

Tool 1: extract_tables ($0.05 per call)

Extract raw tables from a PDF as 2D arrays.

Input:

{
  "pdf_url": "https://example.com/document.pdf",
  "pages": "1,2,5-10",   // Optional. Default: "all"
  "flavor": "lattice"    // Optional. "lattice" (default) or "stream"
}

Output:

{
  "tables": [
    {
      "page": 1,
      "table_number": 0,
      "rows": 5,
      "columns": 3,
      "data": [
        ["Header 1", "Header 2", "Header 3"],
        ["Row 1 Col 1", "Row 1 Col 2", "Row 1 Col 3"],
        ["Row 2 Col 1", "Row 2 Col 2", "Row 2 Col 3"]
      ],
      "accuracy": 0.95
    }
  ],
  "total_tables": 12,
  "pages_processed": 10
}

Usage Example:

# Calling the extract_tables endpoint directly over HTTP
curl -X POST "https://api.ntriq.co.kr/ai/tools/extract_tables" \
  -H "Content-Type: application/json" \
  -H "X-YAP-Key: your_api_key" \
  -d '{
    "pdf_url": "https://example.com/report.pdf",
    "pages": "1-5"
  }'

Tool 2: tables_to_json ($0.08 per call)

Extract tables and automatically convert to structured JSON with headers as object keys.

Input:

{
  "pdf_url": "https://example.com/document.pdf",
  "pages": "all"   // Optional. Default: "all"
}

Output:

{
  "tables": [
    {
      "page": 1,
      "table_number": 0,
      "rows": 5,
      "columns": 3,
      "json_data": [
        {
          "Header 1": "Row 1 Col 1",
          "Header 2": "Row 1 Col 2",
          "Header 3": "Row 1 Col 3"
        },
        {
          "Header 1": "Row 2 Col 1",
          "Header 2": "Row 2 Col 2",
          "Header 3": "Row 2 Col 3"
        }
      ],
      "accuracy": 0.95
    }
  ],
  "total_tables": 12,
  "pages_processed": 10
}

Usage Example:

# Perfect for downstream processing
import pandas as pd

result = client.call_mcp_tool('tables_to_json', {
    'pdf_url': 'https://example.com/financial_report.pdf'
})
for table in result['tables']:
    df = pd.DataFrame(table['json_data'])  # headers become column names
    print(df)
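The actor's title also promises CSV; a minimal sketch of converting one table's `json_data` (the list of header-keyed dicts shown above) to a CSV file using only the standard library:

```python
import csv

def table_to_csv(json_data, path):
    """Write one table's json_data (a list of {header: value} dicts) to a CSV file."""
    if not json_data:
        return
    headers = list(json_data[0].keys())
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=headers)
        writer.writeheader()        # first row: the extracted headers
        writer.writerows(json_data) # remaining rows: cell values
```

Calling `table_to_csv(table['json_data'], f"table_{table['table_number']}.csv")` inside the loop above writes one CSV per extracted table.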

Extraction Methods

Lattice Method (Default)

  • Best for: Structured tables with visible grid lines
  • Uses: Line detection to identify cell boundaries
  • Accuracy: Very high for formal reports, spreadsheets
  • Speed: Fast

Stream Method

  • Best for: Text-only tables without visible grid lines
  • Uses: Whitespace and text positioning
  • Accuracy: Good for loose, unformatted data
  • Speed: Slightly slower
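Since the best flavor is not always known in advance, choosing one can be automated. A sketch, where `call_tool(name, args)` is a hypothetical wrapper around the MCP call that returns the tool's JSON output as a dict:

```python
def extract_with_fallback(call_tool, pdf_url, pages="all"):
    """Try lattice detection first; fall back to stream if no tables are found."""
    result = call_tool("extract_tables",
                       {"pdf_url": pdf_url, "pages": pages, "flavor": "lattice"})
    if result.get("total_tables", 0) > 0:
        return result
    # No ruled tables detected: retry with whitespace-based detection.
    return call_tool("extract_tables",
                     {"pdf_url": pdf_url, "pages": pages, "flavor": "stream"})
```

Note this doubles the per-call cost when the fallback fires, so it suits pipelines where table recall matters more than price.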

Technical Details

PDF Processing Pipeline

PDF URL / Base64
[PDFium Renderer]
[Camelot Extraction]
[Format Conversion]
JSON Output

Timeout Configuration

  • API timeout: 120 seconds + 15-second buffer
  • Recommended PDF size: < 50 MB
  • Max pages per call: All pages supported (split large PDFs if needed)
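Splitting a large PDF means generating `pages` range strings for separate calls. A small helper sketch (the chunk size of 20 pages is illustrative, not a documented limit):

```python
def page_chunks(total_pages, chunk_size=20):
    """Split a page count into 'start-end' strings accepted by the pages input."""
    return [
        f"{start}-{min(start + chunk_size - 1, total_pages)}"
        for start in range(1, total_pages + 1, chunk_size)
    ]
```

For a 45-page document, `page_chunks(45)` yields `["1-20", "21-40", "41-45"]`, each of which can be passed as the `pages` value of a separate call to stay within the timeout.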

Error Handling

{
  "success": false,
  "error": "Failed to download PDF from URL: Connection timeout after 30s",
  "toolName": "extract_tables"
}

Common errors:

  • Connection timeout — PDF URL unreachable
  • Invalid PDF format — Corrupted file or non-PDF content
  • No tables found — PDF contains no extractable tables
  • Extraction timeout — PDF too large or complex (retry with specific pages)
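A sketch of client-side handling for these errors, assuming the response shape shown above; `call_tool` is a hypothetical MCP-call wrapper, and which errors count as transient is a judgment call:

```python
import time

# Error messages worth retrying (network or load-related failures).
TRANSIENT = ("Connection timeout", "Extraction timeout")

def call_with_retry(call_tool, name, args, retries=2, delay=5.0):
    """Retry transient failures with a fixed delay; raise on permanent ones."""
    for attempt in range(retries + 1):
        result = call_tool(name, args)
        if result.get("success", True):
            return result
        error = result.get("error", "")
        if not any(marker in error for marker in TRANSIENT) or attempt == retries:
            raise RuntimeError(f"{name} failed: {error}")
        time.sleep(delay)
```

Since billing is per successful extraction (see Pricing below), retrying failed calls does not incur extra charges.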

Pricing

Tool             Cost    Use Case
--------------   -----   ----------------------------
extract_tables   $0.05   Raw 2D array extraction
tables_to_json   $0.08   Structured JSON with headers

Billing: Per successful extraction. Errors are not charged.

Requirements

  • Node.js: 20.0+
  • Docker: For Apify deployment
  • No external PDF tools: PDFium is bundled

Local Development

npm install
npm start
# Test with local input
cat > input.json << 'EOF'
{
  "pdf_url": "https://example.com/sample.pdf",
  "pages": "all"
}
EOF
APIFY_LOCAL_STORAGE_DIR=./storage node src/main.js

Deployment

$ apify push

Licensing

  • Camelot: MIT License
  • PDFium: Apache 2.0 License
  • This Actor: MIT License

Compliance Note: This actor uses open-source, permissive licenses. No GPL/AGPL dependencies. Safe for commercial use.

Support

  • Issues: Report via Apify platform or ntriq support
  • Documentation: ai.ntriq.co.kr → /document/extract-tables API docs
  • Rate limits: Handled by Apify (see account dashboard)

Changelog

v1.0.0 — 2026-03-29

  • Initial release
  • Two tools: extract_tables + tables_to_json
  • PDFium backend with Lattice/Stream flavors
  • PPE billing with Actor.charge() integration

Powered by Camelot (MIT) with PDFium (Apache 2.0)