Pricing

from $0.00005 / actor start

Extract text from PDF

Efficiently extract text content from PDF files, ideal for data processing, content analysis, and automation workflows. Supports various PDF structures and outputs clean, readable text.

Developer

Akash Kumar Naik

Actor stats

Total users: 110

Last modified: a month ago

PDF Text Extractor — Extract Text from Any PDF File

Extract text from PDF files with OCR support for scanned documents and image-based PDFs. Supports direct URLs and cloud storage links.

🎯 What It Does

Extract text from any PDF, whether digitally created or scanned
Cloud storage support: Google Drive, Dropbox, and OneDrive share links
OCR fallback: automatically runs Tesseract OCR on pages with little or no embedded text (fewer than 50 characters)
Multi-language OCR: 100+ languages supported
Mistral AI OCR fallback: when pdf.js-extract and Tesseract together yield too little text, optionally use Mistral OCR for state-of-the-art document understanding (tables, equations, complex layouts)
Page limiting: optionally cap extraction to a specific number of pages
Structured output: extracted text plus metadata (page count, source type, file size, timestamp)

📥 Input

{
  "pdfUrl": "https://drive.google.com/file/d/FILE_ID/view?usp=sharing",
  "maxPages": 0,
  "ocrFallback": true,
  "ocrLanguage": "eng",
  "mistralApiKey": "your-mistral-api-key"
}

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `pdfUrl` | string | none | PDF URL or cloud storage share link (Google Drive, Dropbox, OneDrive) |
| `maxPages` | integer | `0` | Maximum number of pages to extract. `0` = all pages |
| `ocrFallback` | boolean | `true` | Run Tesseract OCR on pages with fewer than 50 characters of embedded text |
| `ocrLanguage` | string | `eng` | Tesseract language code (e.g. `fra`, `deu`, `chi_sim`) |
| `mistralApiKey` | string | none | Optional Mistral AI API key. When provided, the PDF is sent to Mistral OCR if both pdf.js-extract and Tesseract fail to produce meaningful text |
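
As a minimal sketch of how these fields fit together, the Python helper below builds an input object using the documented defaults. The function name `build_input` and its validation rules are illustrative assumptions, not part of the actor.

```python
import json
from typing import Optional


def build_input(pdf_url: str,
                max_pages: int = 0,
                ocr_fallback: bool = True,
                ocr_language: str = "eng",
                mistral_api_key: Optional[str] = None) -> dict:
    """Build an actor input dict (hypothetical helper, defaults from the field table)."""
    if not pdf_url:
        raise ValueError("pdfUrl is required")
    if max_pages < 0:
        raise ValueError("maxPages must be 0 (all pages) or a positive integer")
    payload = {
        "pdfUrl": pdf_url,
        "maxPages": max_pages,
        "ocrFallback": ocr_fallback,
        "ocrLanguage": ocr_language,
    }
    # mistralApiKey is optional, so it is only included when provided.
    if mistral_api_key:
        payload["mistralApiKey"] = mistral_api_key
    return payload


# example.com is a placeholder URL used only for illustration.
print(json.dumps(build_input("https://example.com/report.pdf", max_pages=5), indent=2))
```

Leaving out `mistral_api_key` keeps the run on the pdf.js-extract and Tesseract stages only.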

📤 Output

{
  "originalPdfUrl": "https://drive.google.com/file/d/FILE_ID/view",
  "processedPdfUrl": "https://drive.google.com/uc?export=download&id=FILE_ID",
  "extractedText": "Full text content extracted from the PDF...",
  "pageCount": 12,
  "extractedPages": 12,
  "fileSizeBytes": 1048576,
  "sourceType": "google-drive",
  "ocrApplied": true,
  "mistralFallbackApplied": false,
  "timestamp": "2026-06-05T07:00:00.000Z",
  "success": true
}
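
The sample above shows a Google Drive share link in `originalPdfUrl` rewritten to a direct-download URL in `processedPdfUrl`. The following sketch shows one way such a rewrite could work, assuming share links follow the `/file/d/<FILE_ID>/` pattern. The function name `resolve_drive_url` and the `"direct"` label for unrecognised URLs are hypothetical; the actor's actual logic is not documented here, and Dropbox and OneDrive links are not covered.

```python
import re
from typing import Tuple

# Assumed shape of a Drive share link: https://drive.google.com/file/d/<FILE_ID>/...
_DRIVE_FILE_RE = re.compile(r"^https://drive\.google\.com/file/d/([^/?#]+)")


def resolve_drive_url(url: str) -> Tuple[str, str]:
    """Return (processed_url, source_type) for a PDF URL (illustrative only)."""
    match = _DRIVE_FILE_RE.match(url)
    if match:
        file_id = match.group(1)
        processed = f"https://drive.google.com/uc?export=download&id={file_id}"
        return processed, "google-drive"
    # "direct" is a made-up label for URLs that need no rewriting.
    return url, "direct"


print(resolve_drive_url("https://drive.google.com/file/d/FILE_ID/view?usp=sharing"))
```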

🔍 How It Works

Downloads the PDF via HTTP with retry logic (3 attempts, exponential backoff)
Stage 1: extracts embedded text using pdf.js-extract (fast for digital PDFs)
Stage 2 (when ocrFallback is enabled): pages with fewer than 50 characters of embedded text are rendered to PNG at 300 DPI via pdftoppm, then processed with Tesseract OCR
Stage 3 (when mistralApiKey is set): if Stages 1 and 2 produce fewer than 200 characters in total, the PDF URL is sent to Mistral AI's OCR API for state-of-the-art document understanding
Returns structured JSON with extracted text and metadata
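
The staged fallback described above can be sketched in Python as follows. This is a simplified model, not the actor's implementation: the OCR steps are passed in as plain callables, the 1-second backoff base is an assumption, and the names `extract` and `backoff_delays` are hypothetical. Only the thresholds (50 characters per page, 200 characters in total) and the three download attempts come from the description.

```python
from typing import Callable, List, Optional

PAGE_MIN_CHARS = 50    # per-page threshold that triggers Tesseract (Stage 2)
TOTAL_MIN_CHARS = 200  # total threshold that triggers Mistral OCR (Stage 3)


def backoff_delays(attempts: int = 3, base_seconds: float = 1.0) -> List[float]:
    """Waits between download attempts; the 1-second base is an assumption."""
    return [base_seconds * 2 ** i for i in range(attempts - 1)]


def extract(embedded_pages: List[str],
            ocr_page: Optional[Callable[[int], str]] = None,
            mistral_ocr: Optional[Callable[[], str]] = None) -> dict:
    """Combine Stage 1 text with the optional Stage 2 and Stage 3 fallbacks."""
    pages = list(embedded_pages)  # Stage 1 result: embedded text per page
    ocr_applied = False
    if ocr_page is not None:
        for index, text in enumerate(pages):
            # Stage 2: OCR only the pages with too little embedded text.
            if len(text.strip()) < PAGE_MIN_CHARS:
                pages[index] = ocr_page(index)
                ocr_applied = True
    full_text = "\n".join(pages)
    mistral_applied = False
    # Stage 3: whole-document fallback when the total is still too short.
    if mistral_ocr is not None and len(full_text.strip()) < TOTAL_MIN_CHARS:
        full_text = mistral_ocr()
        mistral_applied = True
    return {
        "extractedText": full_text,
        "ocrApplied": ocr_applied,
        "mistralFallbackApplied": mistral_applied,
    }


# Stand-in OCR callable: page 0 has enough text, page 1 is treated as scanned.
result = extract(["x" * 300, ""], ocr_page=lambda i: "scanned page text")
print(result["ocrApplied"], result["mistralFallbackApplied"])
```

Passing the OCR steps as callables keeps the sketch runnable without Tesseract or a Mistral API key; in a real pipeline they would wrap the actual OCR calls.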

💰 Pricing

Pay-per-event — charged only on successful extractions.

| Event | Price | Trigger |
| --- | --- | --- |
| `pdf-processed` | $0.005 | Per successfully processed PDF |
| `page-extracted` | $0.0005 | Per page (only when `extractedPages > 1`) |
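
To make the arithmetic concrete, here is a small cost estimator in Python. It assumes that when more than one page is extracted, every extracted page is billed as a `page-extracted` event; the table does not say whether the first page is included, so treat that as an assumption. The function name `estimate_cost` is hypothetical.

```python
from decimal import Decimal

PDF_PROCESSED = Decimal("0.005")    # per successfully processed PDF
PAGE_EXTRACTED = Decimal("0.0005")  # per page, only when extractedPages > 1


def estimate_cost(extracted_pages: int, success: bool = True) -> Decimal:
    """Estimated charge in USD for one run (illustrative only)."""
    if not success:
        # Failed extractions are not charged.
        return Decimal("0")
    cost = PDF_PROCESSED
    if extracted_pages > 1:
        # Assumption: all extracted pages are billed, not just those after the first.
        cost += PAGE_EXTRACTED * extracted_pages
    return cost


print(estimate_cost(12))
```

Under that assumption, the 12-page sample run comes to $0.005 + 12 × $0.0005 = $0.011, and a single-page PDF costs $0.005.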

🚀 Use Cases

Document processing — invoices, contracts, reports, scanned paper copies
Research — academic papers, white papers, archival PDFs
Data pipelines — feed PDF content into NLP or search systems
Content management — index PDF archives for full-text search
Automation — process PDFs at scale via Apify API, Zapier, or Make

⚡ Tips

Scanned PDFs: `ocrFallback` defaults to `true`, so OCR works out of the box
Large PDFs: set `maxPages` to limit processing time and cost
Non-English docs: set `ocrLanguage` to the matching Tesseract language code
Failed extractions: not charged; error details are returned in the `errorMessage` field
