Extract text from PDF
Pricing
from $0.00005 / actor start
Efficiently extract text content from PDF files, ideal for data processing, content analysis, and automation workflows. Supports various PDF structures and outputs clean, readable text.
Developer: Akash Kumar Naik
PDF Text Extractor: Extract Text from Any PDF File
PDF Text Extractor is an Apify Actor that extracts text from PDF files. It supports direct URLs and Google Drive / Dropbox / OneDrive share links. For PDFs that have no embedded text (scanned documents, image-based PDFs), an optional OCR fallback powered by Tesseract renders each image-only page and reads the text from the rendered image.
Key Features
- Direct URL support: Fetch and extract text from any publicly accessible PDF URL
- Google Drive: Auto-converts Google Drive share links to direct download URLs
- Dropbox & OneDrive: Converts Dropbox and OneDrive share links automatically
- OCR fallback: Tesseract OCR on pages with no embedded text (scanned / image PDFs)
- Multi-language OCR: 100+ languages supported via Tesseract trained data
- Page limiting: Optionally cap extraction at a specific number of pages
- Page labels in output: When `maxPages` is set, each page's text is prefixed with `[Page N]`
- Retry logic: Up to 3 attempts with 1.5× exponential backoff on network failures
- Rotating user agents: Uses randomised browser user-agent headers to avoid blocks
- Pay-per-event pricing: Pay only for successful extractions
- Structured JSON output: Extracted text plus metadata (page count, file size, source type, OCR flag, timestamp)
Input Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `pdfUrl` | string | Yes | none | URL of the PDF. Accepts direct URLs, or Google Drive / Dropbox / OneDrive share links. |
| `maxPages` | integer | No | 0 | Maximum pages to extract. 0 = all pages. Range: 0 to 10 000. When > 0, each page's text is prefixed with `[Page N]`. |
| `ocrFallback` | boolean | No | false | Enable OCR for pages with fewer than 50 characters of embedded text. Required for scanned or image-based PDFs. Increases processing time and memory usage. |
| `ocrLanguage` | string | No | eng | Tesseract language code. Only used when `ocrFallback` is true. See Supported OCR Language Codes. |
Example Input: Standard PDF

    {
      "pdfUrl": "https://drive.google.com/file/d/FILE_ID/view?usp=sharing",
      "maxPages": 0
    }

Example Input: Scanned / Image PDF with OCR

    {
      "pdfUrl": "https://example.com/scanned-document.pdf",
      "ocrFallback": true,
      "ocrLanguage": "eng"
    }
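The parameter table above can be turned into a small pre-flight check before a run is started. The following is a minimal sketch, not the Actor's own validation: the function name `validate_input` is hypothetical, and the rules are only those stated in the table (required `pdfUrl`, `maxPages` between 0 and 10 000, and the listed defaults).

```python
from typing import Any


def validate_input(run_input: dict[str, Any]) -> dict[str, Any]:
    """Check a run input against the README's parameter table and fill in defaults.

    Hypothetical helper: the Actor's real validation may differ.
    """
    pdf_url = run_input.get("pdfUrl")
    if not isinstance(pdf_url, str) or not pdf_url.strip():
        raise ValueError("pdfUrl is required and must be a non-empty string")

    max_pages = run_input.get("maxPages", 0)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_pages, bool) or not isinstance(max_pages, int):
        raise ValueError("maxPages must be an integer")
    if not 0 <= max_pages <= 10_000:
        raise ValueError("maxPages must be between 0 and 10000")

    ocr_fallback = run_input.get("ocrFallback", False)
    if not isinstance(ocr_fallback, bool):
        raise ValueError("ocrFallback must be a boolean")

    ocr_language = run_input.get("ocrLanguage", "eng")
    if not isinstance(ocr_language, str) or not ocr_language:
        raise ValueError("ocrLanguage must be a non-empty string")

    return {
        "pdfUrl": pdf_url.strip(),
        "maxPages": max_pages,
        "ocrFallback": ocr_fallback,
        "ocrLanguage": ocr_language,
    }
```

Calling `validate_input({"pdfUrl": "https://example.com/scanned-document.pdf", "ocrFallback": True})` returns the same input with `maxPages` and `ocrLanguage` filled in from the defaults.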
Output Format
Each run pushes one item to the default dataset.
Successful extraction (text layer)

    {
      "originalPdfUrl": "https://drive.google.com/file/d/FILE_ID/view?usp=sharing",
      "processedPdfUrl": "https://drive.google.com/uc?export=download&id=FILE_ID",
      "extractedText": "Full text content extracted from the PDF...",
      "pageCount": 12,
      "extractedPages": 12,
      "fileSizeBytes": 1048576,
      "sourceType": "google-drive",
      "ocrApplied": false,
      "timestamp": "2026-02-18T07:00:00.000Z",
      "success": true
    }

Successful extraction (OCR applied to some pages)

    {
      "originalPdfUrl": "https://example.com/scanned.pdf",
      "processedPdfUrl": "https://example.com/scanned.pdf",
      "extractedText": "Text recovered by OCR from the scanned pages...",
      "pageCount": 5,
      "extractedPages": 5,
      "fileSizeBytes": 2097152,
      "sourceType": "direct-url",
      "ocrApplied": true,
      "timestamp": "2026-02-18T07:00:00.000Z",
      "success": true
    }

Failed extraction

    {
      "originalPdfUrl": "https://example.com/missing.pdf",
      "processedPdfUrl": "https://example.com/missing.pdf",
      "extractedText": "",
      "pageCount": 0,
      "extractedPages": 0,
      "fileSizeBytes": 0,
      "sourceType": "direct-url",
      "ocrApplied": false,
      "timestamp": "2026-02-18T07:00:00.000Z",
      "success": false,
      "errorMessage": "HTTP 404: Not Found"
    }
Output field reference
| Field | Type | Description |
|---|---|---|
| `originalPdfUrl` | string | The input URL exactly as provided |
| `processedPdfUrl` | string | The URL actually used to download the file (after cloud-link conversion) |
| `extractedText` | string | Full extracted text. When `maxPages` > 0, each page is prefixed with `[Page N]` |
| `pageCount` | integer | Total number of pages in the PDF |
| `extractedPages` | integer | Pages actually extracted (≤ `pageCount` when `maxPages` is set) |
| `fileSizeBytes` | integer | Downloaded file size in bytes |
| `sourceType` | string | One of: `direct-url`, `google-drive`, `dropbox`, `onedrive` |
| `ocrApplied` | boolean | true if OCR was used on at least one page |
| `timestamp` | string | ISO 8601 timestamp of when extraction completed |
| `success` | boolean | true if extraction succeeded |
| `errorMessage` | string | Present only on failure; describes the error |
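When `maxPages` is set, the `[Page N]` prefixes make it possible to split `extractedText` back into pages on the consumer side. The sketch below assumes each marker sits at the start of a line in exactly the form `[Page N]`; the README does not spell out the surrounding whitespace, so treat the pattern and the helper name `split_pages` as assumptions.

```python
import re

# Assumed marker format: "[Page N]" at the start of a line.
PAGE_MARKER = re.compile(r"^\[Page (\d+)\]\s*", re.MULTILINE)


def split_pages(extracted_text: str) -> dict[int, str]:
    """Map page number -> page text for output produced with maxPages > 0."""
    matches = list(PAGE_MARKER.finditer(extracted_text))
    pages: dict[int, str] = {}
    for i, match in enumerate(matches):
        # A page's text runs from the end of its marker to the next marker.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(extracted_text)
        pages[int(match.group(1))] = extracted_text[match.end():end].strip()
    return pages
```

Text without any markers (for example, a run with `maxPages` left at 0) yields an empty dictionary, so callers can fall back to the raw string.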
How Text Extraction Works
Step 1: Text layer extraction (always)
pdf.js-extract parses the PDF and reads the embedded text layer from every page. This is fast and accurate for digitally-created PDFs.
Step 2: OCR fallback (optional, `ocrFallback: true`)
Any page that returns fewer than 50 characters of embedded text is treated as an image-only page. The Actor:
- Renders that page to a PNG image at 300 DPI using `pdftoppm` (from `poppler-utils`)
- Runs Tesseract OCR (`tesseract.js`) on the image
- Replaces the page's text with the OCR result if it yields more characters
This hybrid approach means both regular and scanned PDFs are handled in a single run. Pages that already have good embedded text are never sent to OCR, keeping processing fast.
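The per-page decision described above can be sketched as follows. This is an illustration of the stated rules (a 50-character threshold, and OCR text kept only when it is longer), not the Actor's source: `merge_with_ocr` and the `run_ocr` callable are hypothetical names, and counting characters after stripping whitespace is an assumption.

```python
from typing import Callable

OCR_THRESHOLD = 50  # pages with fewer embedded characters are treated as image-only


def merge_with_ocr(
    page_texts: list[str],
    run_ocr: Callable[[int], str],
    ocr_fallback: bool = False,
) -> tuple[list[str], list[int]]:
    """Return the final per-page text and the page numbers replaced by OCR.

    run_ocr is a hypothetical callable: 1-based page number -> OCR text.
    """
    result: list[str] = []
    ocr_pages: list[int] = []
    for index, text in enumerate(page_texts):
        page_number = index + 1
        # Assumption: length is measured on whitespace-stripped text.
        if ocr_fallback and len(text.strip()) < OCR_THRESHOLD:
            ocr_text = run_ocr(page_number)
            # Keep the OCR result only if it recovered more characters.
            if len(ocr_text.strip()) > len(text.strip()):
                text = ocr_text
                ocr_pages.append(page_number)
        result.append(text)
    return result, ocr_pages
```

Pages at or above the threshold never reach `run_ocr`, which mirrors the point that pages with a good text layer are not sent to OCR.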
OCR accuracy notes:
- Works best on clean, high-contrast scans
- Handwritten text accuracy varies by quality
- Complex multi-column layouts may have word-ordering issues
- Non-Latin scripts require the matching `ocrLanguage` code
Supported OCR Language Codes
Pass any Tesseract language code as ocrLanguage:
| Language | Code |
|---|---|
| English | eng |
| French | fra |
| German | deu |
| Spanish | spa |
| Portuguese | por |
| Italian | ita |
| Chinese (Simplified) | chi_sim |
| Chinese (Traditional) | chi_tra |
| Japanese | jpn |
| Korean | kor |
| Arabic | ara |
| Hindi | hin |
| Russian | rus |
Supported Link Types
| Source | Example |
|---|---|
| Direct URL | https://example.com/document.pdf |
| Google Drive | https://drive.google.com/file/d/FILE_ID/view |
| Google Drive (open) | https://drive.google.com/open?id=FILE_ID |
| Dropbox | https://www.dropbox.com/s/HASH/filename.pdf |
| OneDrive (1drv.ms) | https://1drv.ms/b/s!SHARE_ID |
Cloud share links are automatically converted to direct download URLs before fetching.
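As a rough illustration of the conversion, the sketch below classifies a link into one of the `sourceType` values and rewrites it where the target form is known. Only the Google Drive target (`uc?export=download&id=...`) comes from the output example in this README; forcing `dl=1` on Dropbox links is a common convention assumed here, and OneDrive links are returned unchanged because the rewrite rule is not documented. The function name `classify_and_convert` is hypothetical.

```python
import re
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def classify_and_convert(url: str) -> tuple[str, str]:
    """Return (sourceType, download URL) for a PDF link."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    query = parse_qs(parts.query)

    if host == "drive.google.com":
        # Handles both /file/d/FILE_ID/view and /open?id=FILE_ID.
        match = re.search(r"/file/d/([^/]+)", parts.path)
        file_id = match.group(1) if match else query.get("id", [None])[0]
        if file_id:
            return "google-drive", f"https://drive.google.com/uc?export=download&id={file_id}"

    if host == "dropbox.com" or host.endswith(".dropbox.com"):
        # Assumption: dl=1 requests a direct download from Dropbox.
        query["dl"] = ["1"]
        return "dropbox", urlunsplit(parts._replace(query=urlencode(query, doseq=True)))

    if host in ("1drv.ms", "onedrive.live.com"):
        # The rewrite for OneDrive is not documented here, so the URL is left as-is.
        return "onedrive", url

    return "direct-url", url
```

For example, `classify_and_convert("https://drive.google.com/open?id=FILE_ID")` yields the same download URL as the `/file/d/FILE_ID/view` form.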
Pricing
Pay-per-event pricing: charged only on successful extractions.
| Event | Price | Trigger |
|---|---|---|
| `pdf-processed` | $0.005 | Per successfully processed PDF |
| `page-extracted` | $0.0005 | Per page extracted (only when `extractedPages` > 1) |
Cost examples
| PDF | Cost |
|---|---|
| 1-page PDF | $0.0050 |
| 3-page PDF | $0.0065 ($0.005 + 3 × $0.0005) |
| 10-page PDF | $0.0100 ($0.005 + 10 × $0.0005) |
Failed extractions are not charged. Spending limits can be controlled via `ACTOR_MAX_TOTAL_CHARGE_USD`.
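The cost examples follow directly from the two event prices. A minimal sketch of that arithmetic, using `Decimal` to avoid floating-point rounding on small amounts; the function name `estimate_cost` is hypothetical, and actual billing is determined by the platform.

```python
from decimal import Decimal

PDF_PROCESSED = Decimal("0.005")    # per successfully processed PDF
PAGE_EXTRACTED = Decimal("0.0005")  # per page, only when more than one page is extracted


def estimate_cost(extracted_pages: int, success: bool = True) -> Decimal:
    """Estimate the charge in USD for one run, following the pricing table."""
    if not success:
        return Decimal("0")  # failed extractions are not charged
    cost = PDF_PROCESSED
    if extracted_pages > 1:
        cost += PAGE_EXTRACTED * extracted_pages
    return cost
```

With these rules, `estimate_cost(3)` gives 0.0065 and `estimate_cost(10)` gives 0.0100, matching the table.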
Technical Details
| Property | Value |
|---|---|
| Runtime | Node.js 20 |
| Actor version | 1.4 |
| Memory | 256 MB min, 512 MB max |
| Download timeout | 120 seconds (fixed) |
| Max pages input | 0 to 10 000 (0 = all pages) |
| PDF library | pdf.js-extract ^0.2.1 |
| OCR engine | tesseract.js ^5.1.0 (LSTM mode) |
| PDF-to-image | pdftoppm from poppler-utils at 300 DPI |
| OCR threshold | Pages with < 50 embedded chars |
| HTTP client | node-fetch ^2.7.0 |
| Retry attempts | 3 (1.5× exponential backoff, 1.5 s base delay) |
| Output format | JSON dataset |
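The retry settings in the table (3 attempts, 1.5 s base delay, 1.5× growth) imply waits of roughly 1.5 s and 2.25 s between attempts. The sketch below shows one way such a schedule could be implemented; it is an assumption about the exact schedule, written in Python for illustration rather than taken from the Actor's Node.js code, and `backoff_delays` and `with_retries` are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int = 3, base_delay: float = 1.5, factor: float = 1.5) -> list[float]:
    """Delays in seconds slept between consecutive attempts."""
    return [base_delay * factor**i for i in range(attempts - 1)]


def with_retries(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.5,
    factor: float = 1.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on network-level errors with exponential backoff."""
    delays = backoff_delays(attempts, base_delay, factor)
    for attempt in range(attempts):
        try:
            return operation()
        except OSError:  # ConnectionError and TimeoutError are subclasses
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[attempt])
    raise RuntimeError("attempts must be at least 1")
```

Passing a custom `sleep` makes the schedule easy to inspect in tests without actually waiting.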
Use Cases
- Document processing: Invoices, contracts, reports, forms (including scanned paper copies)
- Research: Extract text from academic papers, white papers, and archival PDFs
- Data pipelines: Feed PDF content into downstream NLP or search systems
- Content management: Index PDF archives for full-text search
- Automation: Process PDFs at scale via the Apify API or Zapier/Make integrations
- Historical documents: OCR old scanned records and books
Integration
| Platform | Details |
|---|---|
| Apify API | Full REST API access |
| Apify SDK (Python / Node.js) | Official SDKs supported |
| Zapier | Connect with 5 000+ apps |
| Make (Integromat) | Visual workflow automation |
| Webhooks | Real-time completion notifications |
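For the REST API route, one option is Apify's synchronous endpoint that runs an Actor and returns its dataset items in a single call. The sketch below only builds the request with the standard library; the Actor ID shown in the usage comment (`username~actor-name`) is a placeholder, since this README does not state the Actor's ID, and `build_run_request` is a hypothetical helper name.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Build a POST request that runs an Actor and returns its dataset items."""
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Usage (requires network access and a real token and Actor ID):
#   request = build_run_request("username~actor-name", "MY_APIFY_TOKEN",
#                               {"pdfUrl": "https://example.com/document.pdf"})
#   with urllib.request.urlopen(request, timeout=300) as response:
#       items = json.load(response)  # list with one item per run
```

Each returned item has the shape described under Output Format, so `items[0]["success"]` can be checked before reading `extractedText`.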
Security & Privacy
- Processing runs inside Apify's secure cloud infrastructure
- Data is not persisted beyond the Actor run's dataset retention period
- All transfers use HTTPS
- Spending limits enforced via Apify's pay-per-event system
Local Development

    # Install dependencies
    npm install

    # Run Actor locally (reads from input.json)
    apify run --input-file input.json

    # Validate input/output schemas
    apify validate-schema

    # Deploy to Apify platform
    apify push

    # Run on the platform
    apify call

Local run storage is written to ./storage/ (git-ignored).
Note for local OCR testing: `pdftoppm` must be installed locally. The Apify platform uses Alpine Linux and installs it via `apk add poppler-utils` automatically. For local dev: `brew install poppler` (macOS), `apk add poppler-utils` (Alpine), `apt-get install poppler-utils` (Debian/Ubuntu), or `choco install poppler` (Windows).
Quick Start
- Open the Actor on Apify Store
- Click Try for free
- Enter a PDF URL in the `pdfUrl` field
- For scanned PDFs, toggle Enable OCR Fallback to `true`
- Click Start and wait for the run to finish
- Download extracted text from the Dataset tab in JSON, CSV, or XLSX format
Ready to extract text from any PDF? Start using PDF Text Extractor.
