PDF Intelligence

Pricing

from $0.01 / 1,000 results

Stop fighting PDFs. Extract text, tables, and insights from any document, scanned or digital. Get RAG-ready chunks for LangChain & LlamaIndex. AI-powered summaries, classification, entity extraction. Use our API keys or bring your own (50% discount). From PDF chaos to clean data in minutes.

Developer: Marielise (maintained by Community)


PDF Intelligence - AI-Powered PDF Analysis, OCR & RAG Preparation

Extract text, tables, and AI insights from any PDF in seconds.

Transform PDFs into structured, actionable data with AI-powered analysis. Extract text with 95%+ accuracy, automatically OCR scanned documents, detect tables with AI precision, and prepare content for RAG workflows.


Quick Start

Get results in 30 seconds:

  1. Click Start - the default example PDF runs automatically
  2. View results in the Output tab
  3. Switch to AI Analysis view for intelligent insights

No configuration needed for basic extraction!


What This Actor Does

Core Features

  • Text Extraction - Clean text from any PDF
  • AI-Powered OCR - Convert scanned PDFs to text
  • Table Detection - Extract structured table data
  • RAG Chunking - Split for vector databases
  • AI Analysis - Summary, entities, classification

Output Includes

  • Executive summary of document
  • Document type classification
  • Named entity extraction (people, orgs, dates)
  • Key topics and themes
  • Action items and recommendations
  • Quality score and confidence level

Pricing

Transparent pay-per-use pricing. Only pay for what you process.

Base Processing

| Event | Price | Description |
| --- | --- | --- |
| Page Processed | $0.002 | Per PDF page extracted |
| Document Analyzed | $0.01 | Metadata extraction |
| RAG Chunking | $0.02 | Chunk preparation |

AI Features (Require API Key)

| Event | Price | Description |
| --- | --- | --- |
| OCR Page | $0.03 | AI Vision OCR per page |
| AI Table Extraction | $0.015 | Intelligent table detection |
| AI Document Analysis | $0.04 | Full AI analysis |

Pricing Examples

| Use Case | What You Get | Cost |
| --- | --- | --- |
| 10-page PDF text extraction | Text + metadata | ~$0.03 |
| 50-page PDF with AI analysis | Text + AI insights | ~$0.14 |
| RAG preparation (20 pages) | Chunks ready for vectors | ~$0.06 |
| Scanned PDF OCR (5 pages) | OCR text + analysis | ~$0.19 |
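
To budget a run ahead of time, the per-event prices listed above can simply be summed. The sketch below is illustrative only: `estimateCost` and its event keys are hypothetical names, and which events a given action actually triggers is an assumption inferred from the pricing examples rather than something this README specifies.

```javascript
// Per-event prices from the pricing tables, in thousandths of a dollar.
// Integer arithmetic avoids floating-point drift when summing.
const PRICE_MILLS = {
  pageProcessed: 2,       // $0.002 per PDF page extracted
  documentAnalyzed: 10,   // $0.01 metadata extraction
  ragChunking: 20,        // $0.02 chunk preparation
  ocrPage: 30,            // $0.03 AI Vision OCR per page
  aiTableExtraction: 15,  // $0.015 intelligent table detection
  aiDocumentAnalysis: 40  // $0.04 full AI analysis
};

// `events` maps an event name to the number of times it is charged.
// Deciding which events a run triggers is left to the caller (assumption).
function estimateCost(events) {
  let mills = 0;
  for (const [name, count] of Object.entries(events)) {
    if (!Object.prototype.hasOwnProperty.call(PRICE_MILLS, name)) {
      throw new Error(`Unknown event: ${name}`);
    }
    if (!Number.isInteger(count) || count < 0) {
      throw new Error(`Invalid count for ${name}: ${count}`);
    }
    mills += PRICE_MILLS[name] * count;
  }
  return mills / 1000; // dollars
}

// 10-page text extraction: 10 page events + 1 metadata event
console.log(estimateCost({ pageProcessed: 10, documentAnalyzed: 1 })); // 0.03
// 5-page scanned PDF: 5 OCR page events + 1 AI analysis event
console.log(estimateCost({ ocrPage: 5, aiDocumentAnalysis: 1 }));      // 0.19
```

Both figures match the corresponding rows in the pricing examples, which suggests the examples were computed the same way; treat the result as an estimate, not a quote.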

How to Use

Option 1: Apify Console (Easiest)

  1. Enter your PDF URL in the PDF URL field
  2. Select an action (Extract Text, Extract Tables, etc.)
  3. Click Start
  4. View results in the Output tab

Option 2: Apify API

```javascript
import { ApifyClient } from 'apify-client';

// Authenticate with your Apify API token
const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

// Start the Actor and wait for the run to finish
const run = await client.actor('cvs/pdf-intelligence').call({
    pdfUrl: 'https://example.com/document.pdf',
    action: 'full_analysis',
});

// Read the results from the run's default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0]);
```

Option 3: Direct HTTP API

```shell
curl -X POST "https://api.apify.com/v2/acts/cvs~pdf-intelligence/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "pdfUrl": "https://example.com/document.pdf",
    "action": "extract_text"
  }'
```
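
The same HTTP call can be made from JavaScript. The sketch below is a minimal, hedged example: `buildRunRequest` is a hypothetical helper, the endpoint path is copied from the curl command above, and it sends the token in an `Authorization: Bearer` header (which the Apify API accepts as an alternative to the `token` query parameter) so the token stays out of URLs and logs.

```javascript
// Build the URL and fetch options for starting a run of this Actor.
// Kept free of network calls so it can be inspected or tested offline.
function buildRunRequest(token, input) {
  if (!token) throw new Error('An Apify API token is required');
  return {
    url: 'https://api.apify.com/v2/acts/cvs~pdf-intelligence/runs',
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${token}`
      },
      body: JSON.stringify(input)
    }
  };
}

// Usage (needs network access and a real token, so it is left commented out):
// const { url, options } = buildRunRequest(process.env.APIFY_TOKEN, {
//   pdfUrl: 'https://example.com/document.pdf',
//   action: 'extract_text'
// });
// const response = await fetch(url, options);
// console.log(response.status);
```

Note that this endpoint only starts the run; the results still have to be read from the run's dataset once it finishes, as in the client example under Option 2.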

Option 4: Claude Desktop (MCP)

Add to your Claude Desktop config:

```json
{
  "mcpServers": {
    "pdf-intelligence": {
      "url": "https://cvs--pdf-intelligence.apify.actor/mcp",
      "headers": {
        "Authorization": "Bearer YOUR_APIFY_TOKEN"
      }
    }
  }
}
```

Input Parameters

Basic Settings

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| pdfUrl | string | Example PDF | URL of PDF to process |
| pdfContent | string | - | Base64-encoded PDF (alternative to URL) |
| action | string | extract_text | What to extract (see below) |
| maxPages | integer | 0 (all) | Limit pages to process |

Actions Available

| Action | Description |
| --- | --- |
| extract_text | Get all text content with page markers |
| extract_tables | Extract tabular data as JSON/CSV/Markdown |
| get_metadata | Document properties (title, author, dates) |
| chunk_for_rag | Split into chunks for vector databases |
| full_analysis | All of the above combined |
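
Checking the input on the client side before starting a run can catch a `VALIDATION_ERROR` early. The sketch below is one possible approach: `buildInput` is a hypothetical helper, the defaults come from the Basic Settings table, and rejecting a request that sets both `pdfUrl` and `pdfContent` is an assumption based on `pdfContent` being described as an alternative to the URL.

```javascript
// Actions listed in the "Actions Available" table.
const ACTIONS = [
  'extract_text',
  'extract_tables',
  'get_metadata',
  'chunk_for_rag',
  'full_analysis'
];

// Build an Actor input object, applying the documented defaults.
function buildInput({ pdfUrl, pdfContent, action = 'extract_text', maxPages = 0 } = {}) {
  if (!ACTIONS.includes(action)) {
    throw new Error(`Unknown action "${action}"; expected one of: ${ACTIONS.join(', ')}`);
  }
  if (!Number.isInteger(maxPages) || maxPages < 0) {
    throw new Error('maxPages must be a non-negative integer (0 means all pages)');
  }
  // Assumption: the two PDF sources are mutually exclusive.
  if (pdfUrl !== undefined && pdfContent !== undefined) {
    throw new Error('Provide either pdfUrl or pdfContent, not both');
  }
  const input = { action, maxPages };
  if (pdfUrl !== undefined) {
    new URL(pdfUrl); // throws a TypeError if the URL is malformed
    input.pdfUrl = pdfUrl;
  }
  if (pdfContent !== undefined) input.pdfContent = pdfContent;
  return input;
}

console.log(buildInput({ pdfUrl: 'https://example.com/document.pdf', action: 'chunk_for_rag' }));
```

Omitting both `pdfUrl` and `pdfContent` is allowed here because the table lists an example PDF as the default.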

AI Configuration

| Parameter | Type | Description |
| --- | --- | --- |
| googleApiKey | string | Google API key for Gemini (recommended) |
| openaiApiKey | string | OpenAI key for GPT-4 Vision OCR |
| anthropicApiKey | string | Anthropic key for Claude Vision |
| preferredAiProvider | string | "auto", "gemini", "openai", or "anthropic" |

Output Format

Example Output

```json
{
  "success": true,
  "overview": {
    "summary": "Technical report on web accessibility guidelines...",
    "documentType": "technical",
    "keyFindings": ["Contains accessibility standards", "Includes implementation examples"],
    "confidence": "high"
  },
  "stats": {
    "pageCount": 12,
    "wordCount": 3450,
    "tableCount": 3,
    "chunkCount": 15,
    "processingTimeMs": 2340
  },
  "quality": {
    "score": 92,
    "issues": [],
    "recommendations": []
  },
  "content": {
    "text": "Full extracted text...",
    "tables": [...],
    "metadata": {...}
  }
}
```
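
A consumer of this output will usually want a few headline fields rather than the whole object. The sketch below shows one way to read them defensively; `describeResult` is a hypothetical helper, and the field names are taken from the example output above, so any field the Actor omits for a given action falls back to a placeholder.

```javascript
// Turn one dataset item into a short, human-readable summary.
function describeResult(item) {
  if (!item || item.success !== true) return 'Run did not report success';
  const { overview = {}, stats = {}, quality = {} } = item;
  return [
    `${overview.documentType ?? 'unknown'} document, ` +
      `${stats.pageCount ?? '?'} pages, ${stats.wordCount ?? '?'} words`,
    `tables: ${stats.tableCount ?? 0}, chunks: ${stats.chunkCount ?? 0}`,
    `quality score: ${quality.score ?? '?'} (confidence: ${overview.confidence ?? 'unknown'})`
  ].join('\n');
}

// Sample item trimmed from the example output.
const sample = {
  success: true,
  overview: { documentType: 'technical', confidence: 'high' },
  stats: { pageCount: 12, wordCount: 3450, tableCount: 3, chunkCount: 15 },
  quality: { score: 92, issues: [], recommendations: [] }
};

console.log(describeResult(sample));
// technical document, 12 pages, 3450 words
// tables: 3, chunks: 15
// quality score: 92 (confidence: high)
```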

Output Views in Console

| View | What It Shows |
| --- | --- |
| Summary | AI-generated executive summary |
| AI Analysis | Entities, topics, action items |
| Quality Report | Score, confidence, recommendations |
| Metadata | Title, author, dates, page count |
| Content | Extracted text and tables |
| RAG Chunks | Prepared chunks for vector DBs |
| Full Output | Complete raw JSON |

Use Cases

📄 Invoice Processing

Extract line items, totals, and vendor information automatically.

```json
{
  "pdfUrl": "https://example.com/invoice.pdf",
  "action": "extract_tables",
  "googleApiKey": "your-key"
}
```

📋 Contract Analysis

Extract key clauses, parties, dates, and obligations from legal documents.

```json
{
  "pdfUrl": "https://example.com/contract.pdf",
  "action": "full_analysis",
  "googleApiKey": "your-key"
}
```

📚 Research Paper RAG

Chunk academic papers with semantic awareness for better retrieval.

```json
{
  "pdfUrl": "https://example.com/paper.pdf",
  "action": "chunk_for_rag",
  "chunkSize": 500,
  "semanticChunking": true,
  "googleApiKey": "your-key"
}
```

🔍 Scanned Document OCR

Convert scanned PDFs to searchable text.

```json
{
  "pdfUrl": "https://example.com/scanned.pdf",
  "action": "extract_text",
  "enableOcr": true,
  "googleApiKey": "your-key"
}
```

FAQ


Limitations

| Limitation | Details |
| --- | --- |
| Max file size | 50MB |
| Output truncation | Text: 100k chars, Chunks: 50 items (full data in dataset) |
| OCR requirement | Requires AI API key and embedded images in PDF |
| Rate limit | 100 requests/minute per client |
| Memory | 4GB recommended, up to 16GB for large documents |
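
When a PDF is sent as base64 in `pdfContent`, its decoded size can be checked against the file size limit before the run starts. The sketch below is a rough pre-flight check under stated assumptions: the helper names are hypothetical, and "50MB" is interpreted as 50 × 1024 × 1024 bytes, which the README does not specify.

```javascript
// Assumption: the 50MB limit means 50 MiB of decoded PDF data.
const MAX_PDF_BYTES = 50 * 1024 * 1024;

// Size of the data a base64 string decodes to, computed without decoding it.
// Every 4 base64 characters encode 3 bytes; each '=' pad removes one byte.
function base64DecodedBytes(b64) {
  const clean = b64.replace(/\s+/g, '');
  const padding = clean.endsWith('==') ? 2 : clean.endsWith('=') ? 1 : 0;
  return Math.floor((clean.length * 3) / 4) - padding;
}

function fitsSizeLimit(b64, maxBytes = MAX_PDF_BYTES) {
  return base64DecodedBytes(b64) <= maxBytes;
}

console.log(base64DecodedBytes('aGVsbG8=')); // 5 ("hello")
console.log(fitsSizeLimit('aGVsbG8='));      // true
```

Base64 inflates data by about a third, so a PDF near the limit produces a request body of roughly 67MB; passing a `pdfUrl` avoids that overhead.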

Error Codes

| Code | Description | Solution |
| --- | --- | --- |
| VALIDATION_ERROR | Invalid input | Check parameter types and values |
| INVALID_PDF | Corrupted PDF | Ensure PDF is valid and not encrypted |
| PROCESSING_ERROR | Runtime error | Retry the request |
| RESOURCE_LIMIT | File too large | Use smaller file or increase memory |
| RATE_LIMIT_EXCEEDED | Too many requests | Wait and retry |
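
The suggested solutions above split the codes into two groups: those worth retrying unchanged and those that need a different input or configuration first. The sketch below encodes that split with an exponential backoff schedule. The helper names and the delay values are illustrative assumptions, and the README does not say where in the response the error code appears, so the helpers take the code as a plain string.

```javascript
// Codes whose listed solution is "retry" or "wait and retry".
const RETRYABLE_CODES = new Set(['PROCESSING_ERROR', 'RATE_LIMIT_EXCEEDED']);

function isRetryable(code) {
  return RETRYABLE_CODES.has(code);
}

// Exponential backoff: 1s, 2s, 4s, ... capped at maxMs.
// `attempt` is zero-based; base and cap are arbitrary example values.
function backoffDelayMs(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

for (const code of ['RATE_LIMIT_EXCEEDED', 'INVALID_PDF']) {
  console.log(code, isRetryable(code) ? `retry in ${backoffDelayMs(0)} ms` : 'fix input first');
}
// RATE_LIMIT_EXCEEDED retry in 1000 ms
// INVALID_PDF fix input first
```

`RESOURCE_LIMIT` is deliberately left out of the retryable set because its listed solution requires a smaller file or more memory.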

Technical Details

  • Runtime: Node.js 22
  • Memory: 4GB default, 16GB max
  • PDF Libraries: pdf-parse, pdf-lib
  • AI Models: Gemini 2.5 Flash, GPT-4V, Claude Vision
  • Protocols: MCP (Model Context Protocol), REST API

Changelog

v3.0.0

  • AI Document Analysis with executive summary, entities, and classification
  • 7 specialized output views in Apify Console
  • Memory-efficient streaming for 100+ page documents
  • Gemini 2.5 Flash as default AI provider

v2.1.0

  • AI-powered OCR with Vision APIs
  • Semantic chunking with AI boundary detection
  • Multi-provider AI support (OpenAI, Anthropic, Gemini)

v2.0.0

  • Dual operation modes (One Click and BYOK)
  • HTTP REST API for external clients
  • Pay-per-event pricing model

Support

  • Issues: Report bugs on GitHub
  • Questions: Contact via Apify Console
  • Documentation: This README and input schema tooltips

Built with ❤️ using Apify SDK