Pricing

from $0.00005 / actor start

Extract text from PDF

Efficiently extract text content from PDF files, ideal for data processing, content analysis, and automation workflows. Supports various PDF structures and outputs clean, readable text.

Rating

0.0

(0)

Developer

Akash Kumar Naik

Actor stats

Total users: 108

Last modified: 16 days ago

PDF Text Extractor — Extract Text from Any PDF File

Extract text from PDF files with OCR support for scanned documents and image-based PDFs. Supports direct URLs and cloud storage links.

🎯 What It Does

Extract text from any PDF — digitally created or scanned
Cloud storage support — Google Drive, Dropbox, and OneDrive share links
OCR fallback — automatically runs Tesseract OCR on pages with no embedded text
Multi-language OCR — 100+ languages supported
Mistral AI OCR fallback — when both pdf.js-extract and Tesseract fail, optionally use Mistral OCR for state-of-the-art document understanding (tables, equations, complex layouts)
Page limiting — optionally cap extraction to a specific number of pages
Structured output — extracted text plus metadata (page count, source type, file size, timestamp)

📥 Input

{
  "pdfUrl": "https://drive.google.com/file/d/FILE_ID/view?usp=sharing",
  "maxPages": 0,
  "ocrFallback": true,
  "ocrLanguage": "eng",
  "mistralApiKey": "your-mistral-api-key"
}

Field	Type	Default	Description
`pdfUrl`	string	—	PDF URL or cloud storage share link (Google Drive, Dropbox, OneDrive)
`maxPages`	integer	`0`	Max pages to extract. `0` = all pages
`ocrFallback`	boolean	`true`	Run Tesseract OCR on pages with fewer than 50 characters of embedded text
`ocrLanguage`	string	`eng`	Tesseract language code (e.g. `fra`, `deu`, `chi_sim`)
`mistralApiKey`	string	—	Optional Mistral AI API key. When provided, if both pdf.js-extract and Tesseract fail to produce meaningful text, the PDF is sent to Mistral OCR for premium document understanding
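
As a rough illustration of the input table, the sketch below assembles an input object with the documented field names and defaults. The helper name `build_input` and its validation rules are assumptions made for this example; the actor's own validation may differ.

```python
import json


def build_input(pdf_url, max_pages=0, ocr_fallback=True,
                ocr_language="eng", mistral_api_key=None):
    """Assemble an input object using the field names and defaults from the table.

    The checks below are illustrative assumptions, not the actor's documented rules.
    """
    if not isinstance(pdf_url, str) or not pdf_url.startswith(("http://", "https://")):
        raise ValueError("pdfUrl must be an http(s) URL or share link")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_pages, bool) or not isinstance(max_pages, int) or max_pages < 0:
        raise ValueError("maxPages must be a non-negative integer (0 = all pages)")

    payload = {
        "pdfUrl": pdf_url,
        "maxPages": max_pages,
        "ocrFallback": bool(ocr_fallback),
        "ocrLanguage": ocr_language,
    }
    # mistralApiKey is optional, so it is only included when a key is supplied.
    if mistral_api_key:
        payload["mistralApiKey"] = mistral_api_key
    return payload


if __name__ == "__main__":
    example = build_input("https://example.com/report.pdf", max_pages=5, ocr_language="deu")
    print(json.dumps(example, indent=2))
```

Leaving `mistralApiKey` out entirely keeps the run on the pdf.js-extract and Tesseract stages only.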

📤 Output

{
  "originalPdfUrl": "https://drive.google.com/file/d/FILE_ID/view",
  "processedPdfUrl": "https://drive.google.com/uc?export=download&id=FILE_ID",
  "extractedText": "Full text content extracted from the PDF...",
  "pageCount": 12,
  "extractedPages": 12,
  "fileSizeBytes": 1048576,
  "sourceType": "google-drive",
  "ocrApplied": true,
  "mistralFallbackApplied": false,
  "timestamp": "2026-06-05T07:00:00.000Z",
  "success": true
}
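
The sample output shows a Google Drive share link (`originalPdfUrl`) rewritten into a direct-download form (`processedPdfUrl`). How the actor performs this rewrite is not documented; the following is a minimal sketch of one way to do it, and the function name `to_direct_download` and the regular expression are assumptions for illustration. Dropbox and OneDrive links are left unchanged here.

```python
import re
from urllib.parse import urlparse

# Matches the file ID segment in paths such as /file/d/FILE_ID/view
_DRIVE_FILE_ID = re.compile(r"/file/d/([^/?#]+)")


def to_direct_download(url):
    """Rewrite a Google Drive share link to the uc?export=download form.

    Any URL that is not a recognizable Drive file link is returned unchanged.
    """
    parsed = urlparse(url)
    if parsed.hostname == "drive.google.com":
        match = _DRIVE_FILE_ID.search(parsed.path)
        if match:
            return f"https://drive.google.com/uc?export=download&id={match.group(1)}"
    return url


if __name__ == "__main__":
    print(to_direct_download("https://drive.google.com/file/d/FILE_ID/view?usp=sharing"))
```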

🔍 How It Works

Downloads the PDF via HTTP with retry logic (3 attempts, exponential backoff)
Stage 1: extracts embedded text using pdf.js-extract (fast for digital PDFs)
Stage 2, Tesseract OCR fallback (when enabled): pages with fewer than 50 characters of embedded text are rendered to PNG at 300 DPI via pdftoppm, then processed with Tesseract OCR
Stage 3, Mistral OCR premium fallback (when mistralApiKey is set): if Stages 1 and 2 produce fewer than 200 characters in total, the PDF URL is sent to Mistral AI's OCR API
Returns structured JSON with extracted text and metadata
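
The steps above can be sketched as a small decision cascade. This is not the actor's source code: the function names, the `base_delay` value, and the choice to measure text length after stripping whitespace are assumptions, while the 3 attempts and the 50 and 200 character thresholds come from the description. The OCR engines are passed in as plain callables so the control flow can be read on its own.

```python
import time

PAGE_TEXT_THRESHOLD = 50    # pages below this many characters go to Tesseract
TOTAL_TEXT_THRESHOLD = 200  # below this total, Mistral OCR is tried if available


def download_with_retry(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to `attempts` times, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x ... the base delay


def extract(pages, ocr_page, mistral_ocr=None, ocr_fallback=True):
    """Run the fallback cascade over per-page embedded text (the Stage 1 result).

    pages:       list of embedded-text strings, one per page
    ocr_page:    callable(page_index) -> text, standing in for Tesseract (Stage 2)
    mistral_ocr: optional callable() -> text, standing in for Mistral OCR (Stage 3)
    """
    ocr_applied = False
    texts = []
    for index, text in enumerate(pages):
        if ocr_fallback and len(text.strip()) < PAGE_TEXT_THRESHOLD:
            text = ocr_page(index)
            ocr_applied = True
        texts.append(text)

    combined = "\n".join(texts)
    mistral_applied = False
    if mistral_ocr is not None and len(combined.strip()) < TOTAL_TEXT_THRESHOLD:
        combined = mistral_ocr()
        mistral_applied = True

    return {
        "extractedText": combined,
        "ocrApplied": ocr_applied,
        "mistralFallbackApplied": mistral_applied,
    }
```

Passing `mistral_ocr=None` mirrors a run without `mistralApiKey`: the cascade stops after the Tesseract stage regardless of how little text was found.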

💰 Pricing

Pay-per-event — charged only on successful extractions.

Event	Price	Trigger
`pdf-processed`	$0.005	Per successfully processed PDF
`page-extracted`	$0.0005	Per page (only when `extractedPages > 1`)
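
To make the arithmetic concrete, here is a small cost estimate built from the two event prices in the table. One point is an assumption: the table says `page-extracted` applies only when `extractedPages > 1`, and this sketch reads that as charging for every extracted page in that case, not only the pages after the first. The function name `estimate_cost` is hypothetical.

```python
from decimal import Decimal

PDF_PROCESSED = Decimal("0.005")    # per successfully processed PDF
PAGE_EXTRACTED = Decimal("0.0005")  # per page, only when extractedPages > 1


def estimate_cost(extracted_pages, success=True):
    """Estimate the pay-per-event charge in USD for a single run."""
    if not success:
        return Decimal("0")  # failed extractions are not charged
    cost = PDF_PROCESSED
    if extracted_pages > 1:
        # Assumption: all extracted pages are billed once the count exceeds 1.
        cost += PAGE_EXTRACTED * extracted_pages
    return cost


if __name__ == "__main__":
    for pages in (1, 12, 100):
        print(pages, "pages:", estimate_cost(pages), "USD")
```

Under this reading, a successful 12-page run comes to 0.005 + 12 × 0.0005 = $0.011, and a single-page PDF costs only the $0.005 `pdf-processed` event.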

🚀 Use Cases

Document processing — invoices, contracts, reports, scanned paper copies
Research — academic papers, white papers, archival PDFs
Data pipelines — feed PDF content into NLP or search systems
Content management — index PDF archives for full-text search
Automation — process PDFs at scale via Apify API, Zapier, or Make

⚡ Tips

Scanned PDFs: ocrFallback defaults to true, so scanned documents work without extra configuration
Large PDFs: set maxPages to limit processing time and cost
Non-English docs: set ocrLanguage to the matching Tesseract language code
Failed extractions: not charged; error details are returned in the errorMessage field
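
When consuming results downstream, it can help to branch on `success` before reading the text. The sketch below summarizes a single output item using the field names from the sample output and the `errorMessage` field mentioned in the tips. That a failed item carries `success: false` is an assumption, and the helper name `summarize_result` is hypothetical.

```python
def summarize_result(item):
    """Return a one-line summary of a single output item (a dict)."""
    if not item.get("success"):
        # Assumption: failed items set success to false and include errorMessage.
        return f"failed: {item.get('errorMessage', 'no error message provided')}"

    if item.get("mistralFallbackApplied"):
        method = "Mistral OCR"
    elif item.get("ocrApplied"):
        method = "embedded text + Tesseract OCR"
    else:
        method = "embedded text only"

    text_length = len(item.get("extractedText", ""))
    return (
        f"{item.get('extractedPages')}/{item.get('pageCount')} pages via {method}, "
        f"{text_length} characters"
    )


if __name__ == "__main__":
    print(summarize_result({"success": False, "errorMessage": "Download failed"}))
```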
