PDF Intelligence

Pricing

from $0.01 / 1,000 results

Stop fighting PDFs. Extract text, tables, and insights from any document, scanned or digital. Get RAG-ready chunks for LangChain & LlamaIndex. AI-powered summaries, classification, entity extraction. Use our API keys or bring your own (50% discount). From PDF chaos to clean data in minutes.

Developer: Marielise (maintained by Community)

PDF Intelligence - AI-Powered PDF Analysis, OCR & RAG Preparation

Extract text, tables and insights from PDFs using AI. Includes OCR for scanned documents and RAG-ready chunking.

Features

  • AI Document Analysis: Comprehensive analysis with executive summary, document classification, entity extraction, key topics, and action items
  • Text Extraction: Extract clean text from PDFs with optional page markers
  • AI-Powered OCR: Use Gemini Vision, GPT-4 Vision, or Claude Vision for scanned PDFs
  • Table Detection: Detect and extract tabular data with optional AI enhancement
  • Metadata Extraction: Get document information like title, author, dates
  • RAG Chunking: Split documents into semantically-aware chunks for vector databases
  • Semantic Chunking: AI-powered boundary detection for optimal RAG retrieval
  • Large Document Support: Memory-efficient streaming for documents with 100+ pages
  • Multiple Output Views: Summary, AI Analysis, Quality Report, Metadata, Content, RAG Chunks
  • Dual Operation Modes: One Click (zero config) or BYOK (bring your own keys)
  • HTTP REST API: External API access in addition to MCP protocol
  • Pay-Per-Event Pricing: Transparent PPE pricing with BYOK discounts

Operation Modes

One Click (Default)

Zero configuration. Platform-managed services with standard PPE billing.

  • Just works out of the box
  • No API keys needed for basic features
  • Standard pricing
  • AI features require API keys (see BYOK mode)

BYOK (Bring Your Own Keys)

Use your own API keys for AI features and discounted pricing.

  • Up to 50% savings on platform fees (configurable discount)
  • Provide your own OpenAI, Anthropic, or Gemini API keys
  • Auto-detected when any API key is provided
  • Required for AI features (OCR, AI extraction, semantic chunking)

AI Features

AI Document Analysis (Automatic)

Comprehensive AI-powered document analysis that runs automatically when AI keys are configured.

  • Supported Providers: Gemini (preferred), OpenAI, Anthropic
  • Cost: $0.04 per document
  • Output Includes:
    • Executive summary
    • Document type classification
    • Key topics and themes
    • Named entity extraction (people, organizations, dates, locations)
    • Action items and recommendations
    • Key findings and insights
    • Language detection

AI-Powered OCR (enableOcr)

Use Vision APIs to extract text from scanned or image-based PDFs.

  • Supported Providers: Gemini Vision (preferred), OpenAI (GPT-4V), Anthropic (Claude Vision)
  • Cost: $0.03 per page
  • Fallback: Tries providers in order until one succeeds
  • Note: Requires PDF pages to contain extractable embedded images
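
The fallback rule above can be pictured as a loop over providers. The following is a minimal sketch of that idea, not the Actor's implementation: `ocr_with_fallback` and the two stand-in provider functions are hypothetical names used only for illustration.

```python
from typing import Callable, Sequence


def ocr_with_fallback(
    page_image: bytes,
    providers: Sequence[tuple[str, Callable[[bytes], str]]],
) -> tuple[str, str]:
    """Try each (name, ocr_fn) pair in order; return (provider_name, text)."""
    errors = []
    for name, ocr_fn in providers:
        try:
            return name, ocr_fn(page_image)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all OCR providers failed: " + "; ".join(errors))


# Hypothetical stand-ins for real Vision API calls.
def failing_provider(_: bytes) -> str:
    raise TimeoutError("simulated outage")


def working_provider(_: bytes) -> str:
    return "recognized text"
```

Passing the providers as an ordered list keeps the preference order (Gemini first, per the list above) in one place and makes the loop easy to exercise without network access.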

AI Table Extraction (useAiExtraction)

Use AI to intelligently detect and structure tables from document text.

  • Supported Providers: Gemini (preferred), OpenAI, Anthropic
  • Cost: $0.015 per operation
  • Benefits: Better accuracy than rule-based detection, handles complex layouts

Semantic Chunking (semanticChunking)

Use AI to find natural semantic boundaries for document chunking.

  • Supported Providers: Gemini (preferred), OpenAI, Anthropic
  • Cost: $0.015 per document
  • Benefits: Improved RAG retrieval by splitting at conceptual breakpoints

Tools

extract_text

Extract text content from a PDF document.

Input Schema:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| content | string | No* | - | Base64-encoded PDF content |
| url | string | No* | - | URL to fetch PDF from |
| pages | number[] | No | all | Specific pages to extract (1-indexed) |
| preserveLayout | boolean | No | false | Preserve original text layout and spacing |
| includePageNumbers | boolean | No | true | Include page number markers in output |
| enableOcr | boolean | No | false | Use AI Vision OCR for scanned PDFs |

*Either content or url must be provided.

Output:

{
  "success": true,
  "overview": {
    "summary": "AI-generated executive summary of the document...",
    "documentType": "Technical Report",
    "pageCount": 5,
    "characterCount": 12500
  },
  "data": {
    "text": "Document content here...",
    "pageCount": 5,
    "extractedPages": [1, 2, 3, 4, 5],
    "characterCount": 12500,
    "aiAnalysis": {
      "executiveSummary": "Comprehensive summary...",
      "documentType": "Technical Report",
      "keyTopics": ["topic1", "topic2"],
      "entities": {
        "people": ["John Doe"],
        "organizations": ["Acme Corp"],
        "dates": ["2024-01-15"]
      },
      "actionItems": ["Review section 3", "Follow up on findings"]
    }
  },
  "intelligence": {
    "qualityScore": 95,
    "confidenceLevel": "high",
    "recommendations": []
  },
  "warnings": []
}
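
The `args` object for extract_text can be assembled from either raw PDF bytes or a URL. The helper below is a sketch under stated assumptions: `build_extract_text_args` is a hypothetical name, and because the schema does not say what happens when both `content` and `url` are supplied, this version sends `content` only in that case.

```python
import base64
from typing import Optional


def build_extract_text_args(
    pdf_bytes: Optional[bytes] = None,
    url: Optional[str] = None,
    pages: Optional[list[int]] = None,
    enable_ocr: bool = False,
) -> dict:
    """Build the `args` object for extract_text from raw bytes or a URL."""
    if pdf_bytes is None and url is None:
        raise ValueError("either pdf_bytes or url must be provided")
    args: dict = {"enableOcr": enable_ocr}
    if pdf_bytes is not None:
        # The schema expects the PDF as a Base64 string.
        args["content"] = base64.b64encode(pdf_bytes).decode("ascii")
    else:
        args["url"] = url
    if pages is not None:
        if any(page < 1 for page in pages):
            raise ValueError("pages are 1-indexed")
        args["pages"] = pages
    return args
```

Validating the 1-indexed page list on the client avoids spending a billed call on a request that would fail validation anyway.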

extract_tables

Extract tables from a PDF document.

Input Schema:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| content | string | No* | - | Base64-encoded PDF content |
| url | string | No* | - | URL to fetch PDF from |
| pages | number[] | No | all | Specific pages to process |
| outputFormat | string | No | "json" | Output format: "json", "csv", or "markdown" |
| detectHeaders | boolean | No | true | Attempt to detect table headers |
| useAiExtraction | boolean | No | false | Use AI for intelligent table detection |

*Either content or url must be provided.

Output:

{
  "success": true,
  "tables": [
    {
      "page": 1,
      "tableIndex": 0,
      "headers": ["Name", "Value", "Date"],
      "rows": [["Item 1", "100", "2024-01-01"]],
      "rowCount": 1,
      "columnCount": 3
    }
  ],
  "tableCount": 1,
  "warnings": []
}
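
For downstream use it is often convenient to turn each table row into a keyed record. The sketch below assumes the JSON output shape shown above; `tables_to_records` and the `_page` / `_tableIndex` keys are illustrative names, not part of the Actor's API.

```python
def tables_to_records(response: dict) -> list[dict]:
    """Flatten extract_tables output into one dict per table row."""
    records = []
    for table in response.get("tables", []):
        headers = table.get("headers") or []
        for row in table.get("rows", []):
            # Fall back to positional names when headers are missing or mismatched.
            if len(headers) == len(row):
                keys = headers
            else:
                keys = [f"col_{i}" for i in range(len(row))]
            record = dict(zip(keys, row))
            record["_page"] = table.get("page")
            record["_tableIndex"] = table.get("tableIndex")
            records.append(record)
    return records


sample = {
    "success": True,
    "tables": [
        {
            "page": 1,
            "tableIndex": 0,
            "headers": ["Name", "Value", "Date"],
            "rows": [["Item 1", "100", "2024-01-01"]],
            "rowCount": 1,
            "columnCount": 3,
        }
    ],
    "tableCount": 1,
    "warnings": [],
}
records = tables_to_records(sample)
```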

get_metadata

Extract metadata from a PDF document.

Input Schema:

| Parameter | Type | Required | Description |
|---|---|---|---|
| content | string | No* | Base64-encoded PDF content |
| url | string | No* | URL to fetch PDF from |

*Either content or url must be provided.

Output:

{
  "success": true,
  "metadata": {
    "title": "Document Title",
    "author": "Author Name",
    "subject": "Subject",
    "keywords": "keyword1, keyword2",
    "creator": "Creator App",
    "producer": "PDF Producer",
    "creationDate": "2024-01-01T12:00:00",
    "modificationDate": "2024-01-15T09:30:00",
    "pageCount": 10,
    "pdfVersion": "1.7",
    "isEncrypted": false,
    "isLinearized": true
  },
  "warnings": []
}
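
When cataloging documents, the date fields are easier to sort and compare once parsed. This is a small sketch that assumes the dates always arrive as ISO 8601 strings without a timezone, as in the sample above; `parse_pdf_dates` is a hypothetical helper name.

```python
from datetime import datetime


def parse_pdf_dates(metadata: dict) -> dict:
    """Return a copy of metadata with the two date fields parsed as datetime."""
    parsed = dict(metadata)
    for key in ("creationDate", "modificationDate"):
        value = metadata.get(key)
        if value:
            parsed[key] = datetime.fromisoformat(value)
    return parsed


meta = parse_pdf_dates({
    "title": "Document Title",
    "creationDate": "2024-01-01T12:00:00",
    "modificationDate": "2024-01-15T09:30:00",
})
# Time between creation and last modification, as a timedelta.
age = meta["modificationDate"] - meta["creationDate"]
```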

chunk_for_rag

Split PDF content into chunks optimized for RAG (Retrieval Augmented Generation).

Input Schema:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| content | string | No* | - | Base64-encoded PDF content |
| url | string | No* | - | URL to fetch PDF from |
| chunkSize | number | No | 1000 | Target chunk size in characters (100-10000) |
| chunkOverlap | number | No | 100 | Overlap between chunks (0-500) |
| splitByPage | boolean | No | false | Never split chunks across page boundaries |
| includeMetadata | boolean | No | true | Include position metadata in chunks |
| semanticChunking | boolean | No | false | Use AI for semantic boundary detection |

*Either content or url must be provided.

Output:

{
  "success": true,
  "chunks": [
    {
      "id": "chunk_0",
      "text": "Chunk content here...",
      "metadata": {
        "pageNumber": 1,
        "chunkIndex": 0,
        "startChar": 0,
        "endChar": 1000,
        "totalChunks": 5
      }
    }
  ],
  "chunkCount": 5,
  "totalCharacters": 4800,
  "averageChunkSize": 960,
  "warnings": []
}
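
To make chunkSize and chunkOverlap concrete, here is a plain sliding-window sketch. It illustrates how a character-based size and overlap interact and is not the Actor's chunking algorithm, which may also respect sentence, page, or semantic boundaries; `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[dict]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 100 <= chunk_size <= 10000:
        raise ValueError("chunk_size must be between 100 and 10000")
    if not 0 <= chunk_overlap <= 500:
        raise ValueError("chunk_overlap must be between 0 and 500")
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append({
            "id": f"chunk_{len(chunks)}",
            "text": text[start:end],
            "startChar": start,
            "endChar": end,
        })
        if end == len(text):
            break
        start += step
    return chunks


# A synthetic 2,500-character document for demonstration.
demo_text = "".join(chr(97 + i % 26) for i in range(2500))
demo_chunks = chunk_text(demo_text)
```

With the defaults, each window starts 900 characters after the previous one, so the last 100 characters of one chunk repeat at the start of the next.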

Output Views

When viewing results in the Apify Console, you can switch between different views:

| View | Description |
|---|---|
| Summary | AI-generated executive summary and key insights |
| AI Analysis | Full AI analysis with entities, topics, and action items |
| Quality Report | Quality score, confidence level, and recommendations |
| Metadata | Document metadata (title, author, dates, page count) |
| Content | Extracted text content and page information |
| RAG Chunks | Prepared chunks for vector database ingestion |
| Full Output | Complete raw output with all data |

HTTP REST API

When the HTTP API is enabled (the default), you can call tools via REST endpoints:

Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Health check |
| /info | GET | Actor info and available tools |
| /api | POST | Execute a tool |

Example Request

curl -X POST https://your-username--pdf-processor-mcp.apify.actor/api \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_APIFY_TOKEN" \
  -d '{
    "tool": "extract_text",
    "args": {
      "content": "<base64-pdf>",
      "pages": [1, 2, 3],
      "preserveLayout": false,
      "enableOcr": true
    },
    "includeBilling": true
  }'
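
The same request can be issued from Python with only the standard library. This is a sketch mirroring the curl example above: the host is the same placeholder, and `build_tool_request` and `call_tool` are hypothetical helper names. Building the request separately from sending it makes the payload easy to inspect before any billed call is made.

```python
import json
import urllib.request

# Placeholder host copied from the curl example; replace with your Actor's URL.
API_URL = "https://your-username--pdf-processor-mcp.apify.actor/api"


def build_tool_request(
    tool: str, args: dict, token: str, include_billing: bool = True
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /api endpoint."""
    body = json.dumps({"tool": tool, "args": args, "includeBilling": include_billing})
    return urllib.request.Request(
        API_URL,
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


def call_tool(tool: str, args: dict, token: str, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_tool_request(tool, args, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call would then look like `call_tool("extract_text", {"url": "https://example.com/file.pdf"}, token)`, where the URL and token are your own values.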

Example Response

{
  "success": true,
  "result": {
    "success": true,
    "text": "Document content...",
    "pageCount": 3,
    "extractedPages": [1, 2, 3],
    "characterCount": 5000,
    "warnings": ["OCR completed for 3 pages using AI Vision"]
  },
  "billing": {
    "totalCharged": 0.048,
    "operationMode": "byok",
    "discount": 50,
    "events": [
      { "event": "page-processed", "count": 3, "price": 0.003 },
      { "event": "ocr-page", "count": 3, "price": 0.045 }
    ]
  },
  "rateLimit": {
    "remaining": 97,
    "limit": 100,
    "resetIn": 60
  }
}

Pricing

Base Events (One Click / BYOK with 50% discount)

| Event | One Click | BYOK (50% off) | Description |
|---|---|---|---|
| page-processed | $0.002 | $0.001 | Per PDF page processed |
| document-analyzed | $0.01 | $0.005 | For metadata extraction |
| rag-chunking | $0.02 | $0.01 | For RAG chunking operation |
| api-call | $0.0002 | $0.0001 | Per HTTP REST API call |
| large-data-operation | $0.01 | $0.005 | For PDFs larger than 1MB |

AI Events (Requires API Keys)

| Event | One Click | BYOK (50% off) | Description |
|---|---|---|---|
| ocr-page | $0.03 | $0.015 | Per page OCR with Vision API |
| ai-table-extraction | $0.015 | $0.0075 | AI table detection |
| semantic-chunking | $0.015 | $0.0075 | AI semantic boundary detection |
| ai-document-analysis | $0.04 | $0.02 | Comprehensive AI document analysis |

Note: AI events require at least one API key (OpenAI, Anthropic, or Gemini) to be configured. In BYOK mode, you pay for the AI API usage directly to the provider, so the BYOK discount applies only to platform fees.

Pricing Examples

| Use Case | Events | Estimated Cost |
|---|---|---|
| 10-page PDF text extraction | 10 pages + analysis | ~$0.06 |
| 50-page PDF with AI analysis | 50 pages + AI analysis | ~$0.14 |
| RAG preparation (20 pages) | 20 pages + chunking | ~$0.06 |
| Scanned PDF OCR (5 pages) | 5 OCR pages + analysis | ~$0.19 |
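
The estimates above can be reproduced with a small calculator. This sketch copies the One Click prices from the pricing tables and assumes a flat 50% BYOK discount; which events a given run actually fires is an assumption on the caller's side, and `estimate_cost` is a hypothetical helper, not part of the Actor.

```python
from decimal import Decimal

# One Click prices in USD, copied from the pricing tables.
ONE_CLICK_PRICES = {
    "page-processed": Decimal("0.002"),
    "document-analyzed": Decimal("0.01"),
    "rag-chunking": Decimal("0.02"),
    "api-call": Decimal("0.0002"),
    "large-data-operation": Decimal("0.01"),
    "ocr-page": Decimal("0.03"),
    "ai-table-extraction": Decimal("0.015"),
    "semantic-chunking": Decimal("0.015"),
    "ai-document-analysis": Decimal("0.04"),
}


def estimate_cost(events: dict[str, int], byok: bool = False) -> Decimal:
    """Sum price * count over events; halve the total when byok is True."""
    total = sum(
        (ONE_CLICK_PRICES[name] * count for name, count in events.items()),
        Decimal("0"),
    )
    return total / 2 if byok else total


# 10 pages plus one AI document analysis, One Click pricing.
ten_page = estimate_cost({"page-processed": 10, "ai-document-analysis": 1})
# 20 pages plus one RAG chunking operation, One Click pricing.
rag_prep = estimate_cost({"page-processed": 20, "rag-chunking": 1})
```

Decimal is used instead of float so that fractions of a cent add up exactly. Provider-side AI charges in BYOK mode are billed separately and are not covered here.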

Configuration

Input Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| operationMode | string | "one-click" | "one-click" or "byok" |
| googleApiKey | string | - | Google API key for Gemini AI (recommended) |
| openaiApiKey | string | - | OpenAI API key (enables GPT-4V for OCR) |
| anthropicApiKey | string | - | Anthropic API key (enables Claude Vision) |
| preferredAiProvider | string | "auto" | "auto" (Gemini first), "gemini", "openai", or "anthropic" |
| enableHttpApi | boolean | true | Enable HTTP REST API |
| apiRateLimit | integer | 100 | Max requests per minute per client |
| debug | boolean | false | Enable debug logging |

Recommended Setup: For best results, provide a Google API key (Gemini). Gemini 2.5 Flash offers excellent performance at competitive pricing and is the default preferred provider.
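
One way to assemble the Actor input without hard-coding keys is to read them from the environment. This is a sketch only: the environment variable names are assumptions chosen for illustration, `build_actor_input` is a hypothetical helper, and setting `operationMode` explicitly is optional since BYOK is described as auto-detected when a key is present.

```python
import os
from typing import Mapping

# Assumed environment variable names mapped to the Actor's input fields.
KEY_FIELDS = {
    "GOOGLE_API_KEY": "googleApiKey",
    "OPENAI_API_KEY": "openaiApiKey",
    "ANTHROPIC_API_KEY": "anthropicApiKey",
}


def build_actor_input(env: Mapping[str, str] = os.environ) -> dict:
    """Build an input object using the documented defaults plus any keys found."""
    actor_input = {
        "preferredAiProvider": "auto",
        "enableHttpApi": True,
        "apiRateLimit": 100,
        "debug": False,
    }
    for env_name, field_name in KEY_FIELDS.items():
        value = env.get(env_name)
        if value:
            actor_input[field_name] = value
    has_key = any(field in actor_input for field in KEY_FIELDS.values())
    actor_input["operationMode"] = "byok" if has_key else "one-click"
    return actor_input
```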

Limitations

  • Maximum file size: 50MB
  • Large document handling: Documents with 100+ pages use memory-efficient streaming mode
  • Output truncation: Text output limited to 100k characters; RAG chunks limited to 50 in direct output (full data available via Apify dataset API)
  • OCR limitations: OCR works best with PDFs that contain embedded images. For PDFs that require rendering (vector graphics, native text rendered as images), a PDF-to-image conversion step may be needed first.
  • Rate limit: 100 requests/minute per client (configurable)
  • Memory: Recommended 4GB+ for large documents; Actor supports up to 16GB

Error Codes

| Code | Description | Retryable |
|---|---|---|
| VALIDATION_ERROR | Invalid input parameters | No |
| INVALID_PDF | PDF file is corrupted or invalid | No |
| PROCESSING_ERROR | Error during processing | Yes |
| RESOURCE_LIMIT | File size or page limit exceeded | No |
| RATE_LIMIT_EXCEEDED | Too many API requests | Yes |
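
A client can use the Retryable column to decide when to try again. The sketch below retries only the two retryable codes with exponential backoff. How the error code is surfaced in a response is not specified here, so `ToolError` is a hypothetical exception that a caller would raise after inspecting the response, and `call_with_retry` is likewise an illustrative name.

```python
import time
from typing import Callable

# Codes marked "Yes" in the Retryable column.
RETRYABLE_CODES = {"PROCESSING_ERROR", "RATE_LIMIT_EXCEEDED"}


class ToolError(Exception):
    """Hypothetical error carrying one of the documented error codes."""

    def __init__(self, code: str, message: str = ""):
        super().__init__(f"{code}: {message}")
        self.code = code


def call_with_retry(
    call: Callable[[], dict],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Run call(); retry retryable ToolErrors with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except ToolError as err:
            if err.code not in RETRYABLE_CODES or attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the backoff schedule can be checked without actually waiting.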

Use Cases

1. Contract Analysis with OCR

Extract text from scanned contracts for AI-powered clause analysis.

{
  "tool": "extract_text",
  "args": {
    "url": "https://example.com/scanned-contract.pdf",
    "preserveLayout": true,
    "enableOcr": true
  }
}

2. Invoice Data Extraction with AI

Extract tables from invoices with AI-powered accuracy.

{
  "tool": "extract_tables",
  "args": {
    "content": "<base64-invoice>",
    "detectHeaders": true,
    "useAiExtraction": true
  }
}

3. Research Paper RAG Processing

Chunk research papers with semantic awareness for better retrieval.

{
  "tool": "chunk_for_rag",
  "args": {
    "content": "<base64-paper>",
    "chunkSize": 500,
    "chunkOverlap": 50,
    "semanticChunking": true
  }
}
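
Once the chunk_for_rag response comes back, each chunk can be mapped to a document object for a vector store. The sketch below uses a local stand-in class with the `page_content` / `metadata` shape that LangChain documents use, so it runs without any framework installed; `chunks_to_documents`, `chunk_id`, and `source` are illustrative names.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Local stand-in with a page_content/metadata shape."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def chunks_to_documents(response: dict, source: str) -> list[Document]:
    """Convert chunk_for_rag output into Document objects."""
    documents = []
    for chunk in response.get("chunks", []):
        metadata = dict(chunk.get("metadata", {}))
        metadata["chunk_id"] = chunk["id"]
        metadata["source"] = source  # lets retrieved chunks be traced to a file
        documents.append(Document(page_content=chunk["text"], metadata=metadata))
    return documents


sample_response = {
    "success": True,
    "chunks": [
        {
            "id": "chunk_0",
            "text": "Chunk content here...",
            "metadata": {"pageNumber": 1, "chunkIndex": 0, "startChar": 0,
                         "endChar": 1000, "totalChunks": 5},
        }
    ],
    "chunkCount": 5,
}
docs = chunks_to_documents(sample_response, source="paper.pdf")
```

Keeping the page number and character offsets in the metadata allows a retrieval result to be cited back to its position in the original PDF.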

4. Document Cataloging

Extract metadata for organizing document libraries.

{
  "tool": "get_metadata",
  "args": {
    "content": "<base64-document>"
  }
}

Local Development

# Install dependencies
npm install
# Build
npm run build
# Run locally in Apify standby mode
npm run dev

Connect to Claude Desktop

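Add to claude_desktop_config.json (on macOS: ~/Library/Application Support/Claude/claude_desktop_config.json; on Windows: %APPDATA%\Claude\claude_desktop_config.json):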

{
  "mcpServers": {
    "pdf-processor": {
      "url": "https://your-username--pdf-processor-mcp.apify.actor/mcp",
      "headers": {
        "Authorization": "Bearer YOUR_APIFY_TOKEN"
      }
    }
  }
}

Technical Details

  • Runtime: Node.js 22
  • Transport: stdio (MCP standard) + HTTP REST API
  • PDF Libraries: pdf-parse, pdf-lib
  • AI SDKs: Google Generative AI (Gemini 2.5 Flash), OpenAI, Anthropic
  • Validation: Zod schemas for all inputs
  • Memory Management: 7GB heap with garbage collection optimization

Changelog

v3.0.0

  • AI Document Analysis: New comprehensive analysis with executive summary, classification, entities, topics, and action items
  • Output Views: Added 7 specialized views in Apify Console (Summary, AI Analysis, Quality Report, Metadata, Content, RAG Chunks, Full Output)
  • Large Document Support: Memory-efficient streaming for 100+ page documents
  • Output Optimization: Text truncation (100k chars) and chunk limiting (50) to prevent OOM
  • Updated Pricing: Adjusted pricing for sustainable 70%+ margins
  • Gemini 2.5 Flash: Updated to latest Gemini model as preferred AI provider
  • Improved Memory Management: 7GB heap limit with garbage collection optimization

v2.1.0

  • Added AI-powered OCR using Vision APIs (GPT-4V, Claude Vision, Gemini Vision)
  • Added AI-powered table extraction for better accuracy
  • Added semantic chunking with AI boundary detection
  • Added preferredAiProvider configuration
  • Updated pricing model with AI-specific events
  • Improved error messages and warnings

v2.0.0

  • Added dual operation modes (One Click and BYOK)
  • Added HTTP REST API for external clients
  • Added BYOK support with configurable discounts
  • Added coherent PPE pricing model
  • Added rate limiting for HTTP API

v1.0.0

  • Initial release with MCP tools

License

MIT