Extract text from PDF

Pricing: from $0.00005 / actor start

Efficiently extract text content from PDF files, ideal for data processing, content analysis, and automation workflows. Supports various PDF structures and outputs clean, readable text.

Developer: Akash Kumar Naik (maintained by Community)

PDF Text Extractor: Extract Text from Any PDF File

PDF Text Extractor is an Apify Actor that extracts text from PDF files. It supports direct URLs and Google Drive / Dropbox / OneDrive share links. For PDFs that have no embedded text (scanned documents, image-based PDFs), an optional OCR fallback powered by Tesseract renders each image-only page and reads the text from the rendered image.


🚀 Key Features

  • Direct URL support: Fetch and extract text from any publicly accessible PDF URL
  • Google Drive: Auto-converts Google Drive share links to direct download URLs
  • Dropbox & OneDrive: Converts Dropbox and OneDrive share links automatically
  • OCR fallback: Tesseract OCR on pages with no embedded text (scanned / image PDFs)
  • Multi-language OCR: 100+ languages supported via Tesseract trained data
  • Page limiting: Optionally cap extraction at a specific number of pages
  • Page labels in output: When maxPages is set, each page's text is prefixed with [Page N]
  • Retry logic: Up to 3 attempts with 1.5× exponential backoff on network failures
  • Rotating user agents: Uses randomised browser user-agent headers to avoid blocks
  • Pay-per-event pricing: Pay only for successful extractions
  • Structured JSON output: Extracted text plus metadata (page count, file size, source type, OCR flag, timestamp)

📥 Input Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| pdfUrl | string | ✅ Yes | - | URL of the PDF. Accepts direct URLs, or Google Drive / Dropbox / OneDrive share links. |
| maxPages | integer | No | 0 | Maximum pages to extract. 0 = all pages. Range: 0–10 000. When > 0, each page's text is prefixed with [Page N]. |
| ocrFallback | boolean | No | false | Enable OCR for pages with fewer than 50 characters of embedded text. Required for scanned or image-based PDFs. Increases processing time and memory usage. |
| ocrLanguage | string | No | eng | Tesseract language code. Only used when ocrFallback is true. See supported languages. |

Example Input: Standard PDF

{
  "pdfUrl": "https://drive.google.com/file/d/FILE_ID/view?usp=sharing",
  "maxPages": 0
}

Example Input: Scanned / Image PDF with OCR

{
  "pdfUrl": "https://example.com/scanned-document.pdf",
  "ocrFallback": true,
  "ocrLanguage": "eng"
}
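
The input objects above can also be assembled and checked in code before a run is started. The following is a minimal Python sketch, not part of the Actor itself: the helper name build_input is hypothetical, and the http(s) check on pdfUrl is an assumption. The field names, defaults, and the 0–10 000 range come from the parameter table.

```python
from typing import Any

MAX_PAGES_LIMIT = 10_000  # upper bound taken from the parameter table


def build_input(pdf_url: str, max_pages: int = 0,
                ocr_fallback: bool = False,
                ocr_language: str = "eng") -> dict[str, Any]:
    """Build an Actor input dict (hypothetical helper, not shipped with the Actor)."""
    # Assumption: only http(s) URLs are meaningful; the docs just say "URL".
    if not isinstance(pdf_url, str) or not pdf_url.startswith(("http://", "https://")):
        raise ValueError("pdfUrl must be an http(s) URL")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_pages, bool) or not isinstance(max_pages, int):
        raise ValueError("maxPages must be an integer")
    if not 0 <= max_pages <= MAX_PAGES_LIMIT:
        raise ValueError(f"maxPages must be between 0 and {MAX_PAGES_LIMIT}")

    run_input: dict[str, Any] = {
        "pdfUrl": pdf_url,
        "maxPages": max_pages,
        "ocrFallback": ocr_fallback,
    }
    # ocrLanguage is only used when ocrFallback is true, so omit it otherwise.
    if ocr_fallback:
        run_input["ocrLanguage"] = ocr_language
    return run_input
```

Calling build_input("https://example.com/scanned-document.pdf", ocr_fallback=True) produces an input equivalent to the OCR example, with maxPages set explicitly to its default of 0.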

📤 Output Format

Each run pushes one item to the default dataset.

Successful extraction (text layer)

{
  "originalPdfUrl": "https://drive.google.com/file/d/FILE_ID/view?usp=sharing",
  "processedPdfUrl": "https://drive.google.com/uc?export=download&id=FILE_ID",
  "extractedText": "Full text content extracted from the PDF...",
  "pageCount": 12,
  "extractedPages": 12,
  "fileSizeBytes": 1048576,
  "sourceType": "google-drive",
  "ocrApplied": false,
  "timestamp": "2026-02-18T07:00:00.000Z",
  "success": true
}

Successful extraction (OCR applied to some pages)

{
  "originalPdfUrl": "https://example.com/scanned.pdf",
  "processedPdfUrl": "https://example.com/scanned.pdf",
  "extractedText": "Text recovered by OCR from the scanned pages...",
  "pageCount": 5,
  "extractedPages": 5,
  "fileSizeBytes": 2097152,
  "sourceType": "direct-url",
  "ocrApplied": true,
  "timestamp": "2026-02-18T07:00:00.000Z",
  "success": true
}

Failed extraction

{
  "originalPdfUrl": "https://example.com/missing.pdf",
  "processedPdfUrl": "https://example.com/missing.pdf",
  "extractedText": "",
  "pageCount": 0,
  "extractedPages": 0,
  "fileSizeBytes": 0,
  "sourceType": "direct-url",
  "ocrApplied": false,
  "timestamp": "2026-02-18T07:00:00.000Z",
  "success": false,
  "errorMessage": "HTTP 404: Not Found"
}

Output field reference

| Field | Type | Description |
|---|---|---|
| originalPdfUrl | string | The input URL exactly as provided |
| processedPdfUrl | string | The URL actually used to download the file (after cloud-link conversion) |
| extractedText | string | Full extracted text. When maxPages > 0, each page is prefixed with [Page N] |
| pageCount | integer | Total number of pages in the PDF |
| extractedPages | integer | Pages actually extracted (≤ pageCount when maxPages is set) |
| fileSizeBytes | integer | Downloaded file size in bytes |
| sourceType | string | One of: direct-url, google-drive, dropbox, onedrive |
| ocrApplied | boolean | true if OCR was used on at least one page |
| timestamp | string | ISO 8601 timestamp of when extraction completed |
| success | boolean | true if extraction succeeded |
| errorMessage | string | Present only on failure; describes the error |
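
A consumer of the dataset item typically checks success first and, when maxPages was set, splits extractedText on the page labels. The sketch below assumes the labels appear literally as [Page N], as described above; the function names text_or_raise and split_pages are hypothetical and not part of the Actor.

```python
import re

# Assumption: labels look exactly like "[Page 3]" in extractedText.
PAGE_LABEL = re.compile(r"\[Page (\d+)\]")


def text_or_raise(item: dict) -> str:
    """Return extractedText, or raise with the item's errorMessage on failure."""
    if not item.get("success"):
        raise RuntimeError(item.get("errorMessage", "extraction failed"))
    return item["extractedText"]


def split_pages(extracted_text: str) -> dict[int, str]:
    """Map page number to page text for output produced with maxPages > 0."""
    # re.split with one capture group yields:
    # [text before first label, "1", text of page 1, "2", text of page 2, ...]
    parts = PAGE_LABEL.split(extracted_text)
    return {int(num): text.strip() for num, text in zip(parts[1::2], parts[2::2])}
```

Text without labels (maxPages left at 0) yields an empty mapping. A page whose body itself contains a string such as "[Page 7]" would be split incorrectly, so this is only suitable for quick post-processing.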

🔍 How Text Extraction Works

Step 1: Text layer extraction (always)

pdf.js-extract parses the PDF and reads the embedded text layer from every page. This is fast and accurate for digitally-created PDFs.

Step 2: OCR fallback (optional, ocrFallback: true)

Any page that returns fewer than 50 characters of embedded text is treated as an image-only page. The Actor:

  1. Renders that page to a PNG image at 300 DPI using pdftoppm (from poppler-utils)
  2. Runs Tesseract OCR (tesseract.js) on the image
  3. Replaces the page's text with the OCR result if it yields more characters

This hybrid approach means both regular and scanned PDFs are handled in a single run. Pages that already have good embedded text are never sent to OCR, keeping processing fast.
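
The per-page decision described above can be sketched in a few lines of Python. This is an illustration of the rule, not the Actor's actual code (which runs on Node.js): choose_page_text is a hypothetical name, the OCR step is passed in as a callable, and measuring the raw character count without trimming is an assumption.

```python
from typing import Callable

OCR_THRESHOLD = 50  # pages with fewer embedded characters count as image-only


def choose_page_text(embedded: str, run_ocr: Callable[[], str],
                     ocr_fallback: bool = True) -> tuple[str, bool]:
    """Return (page_text, ocr_used) for one page.

    run_ocr stands in for "render the page to PNG and run Tesseract on it";
    it is only called when the page falls below the threshold.
    """
    if not ocr_fallback or len(embedded) >= OCR_THRESHOLD:
        return embedded, False
    ocr_text = run_ocr()
    # Keep the OCR result only if it recovered more characters.
    if len(ocr_text) > len(embedded):
        return ocr_text, True
    return embedded, False
```

Because run_ocr is invoked lazily, pages with a usable text layer never pay the rendering and recognition cost.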

OCR accuracy notes:

  • Works best on clean, high-contrast scans
  • Handwritten text accuracy varies by quality
  • Complex multi-column layouts may have word-ordering issues
  • Non-Latin scripts require the matching ocrLanguage code

🌐 Supported OCR Language Codes

Pass any Tesseract language code as ocrLanguage:

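| Language | Code |
|---|---|
| English | eng |
| French | fra |
| German | deu |
| Spanish | spa |
| Portuguese | por |
| Italian | ita |
| Chinese (Simplified) | chi_sim |
| Chinese (Traditional) | chi_tra |
| Japanese | jpn |
| Korean | kor |
| Arabic | ara |
| Hindi | hin |
| Russian | rus |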

The pdfUrl input accepts the following link formats:

| Source | Example |
|---|---|
| Direct URL | https://example.com/document.pdf |
| Google Drive | https://drive.google.com/file/d/FILE_ID/view |
| Google Drive (open) | https://drive.google.com/open?id=FILE_ID |
| Dropbox | https://www.dropbox.com/s/HASH/filename.pdf |
| OneDrive (1drv.ms) | https://1drv.ms/b/s!SHARE_ID |

Cloud share links are automatically converted to direct download URLs before fetching.
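
As an illustration of the conversion, here is a minimal Python sketch for the Google Drive case, using the uc?export=download form shown in the sample output. The function name convert_share_link is hypothetical. The exact rewrites the Actor applies to Dropbox and OneDrive links are not documented here, so the sketch only classifies those links and leaves them unchanged; matching onedrive.live.com is likewise an assumption.

```python
import re
from urllib.parse import parse_qs, urlparse


def convert_share_link(url: str) -> tuple[str, str]:
    """Return (processed_url, source_type) for a PDF link."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()

    if host == "drive.google.com":
        # Handles both /file/d/FILE_ID/view and /open?id=FILE_ID
        match = re.match(r"^/file/d/([^/]+)", parsed.path)
        file_id = match.group(1) if match else parse_qs(parsed.query).get("id", [None])[0]
        if file_id:
            return f"https://drive.google.com/uc?export=download&id={file_id}", "google-drive"
        return url, "google-drive"

    if host in ("dropbox.com", "www.dropbox.com"):
        return url, "dropbox"  # conversion rule not shown in this README

    if host in ("1drv.ms", "onedrive.live.com"):  # second host is an assumption
        return url, "onedrive"

    return url, "direct-url"
```

For the sample input above, convert_share_link("https://drive.google.com/file/d/FILE_ID/view?usp=sharing") returns the same processedPdfUrl and sourceType that appear in the successful-extraction example.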


💰 Pricing

Pay-per-event pricing: charged only on successful extractions.

| Event | Price | Trigger |
|---|---|---|
| pdf-processed | $0.005 | Per successfully processed PDF |
| page-extracted | $0.0005 | Per page extracted (only when extractedPages > 1) |

Cost examples

| PDF | Cost |
|---|---|
| 1-page PDF | $0.0050 |
| 3-page PDF | $0.0065 ($0.005 + 3 × $0.0005) |
| 10-page PDF | $0.0100 ($0.005 + 10 × $0.0005) |

Failed extractions are not charged. Spending limits can be controlled via ACTOR_MAX_TOTAL_CHARGE_USD.
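
The cost examples follow directly from the two events, and the arithmetic can be reproduced with a short Python sketch. estimate_cost is a hypothetical helper; it covers only the pdf-processed and page-extracted events and ignores any per-start fee listed on the Store page. Decimal is used so the sums are exact.

```python
from decimal import Decimal

PDF_PROCESSED = Decimal("0.005")    # per successfully processed PDF
PAGE_EXTRACTED = Decimal("0.0005")  # per page, only when extractedPages > 1


def estimate_cost(extracted_pages: int, success: bool = True) -> Decimal:
    """Estimate the pay-per-event charge in USD for one run."""
    if not success:
        return Decimal("0")  # failed extractions are not charged
    cost = PDF_PROCESSED
    if extracted_pages > 1:
        cost += PAGE_EXTRACTED * extracted_pages
    return cost
```

Note that a single-page PDF incurs no page-extracted charge, while a multi-page PDF is charged for every extracted page, including the first.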


βš™οΈ Technical Details

PropertyValue
RuntimeNode.js 20
Actor version1.4
Memory256 MB min β€” 512 MB max
Download timeout120 seconds (fixed)
Max pages input0–10 000 (0 = all pages)
PDF librarypdf.js-extract ^0.2.1
OCR enginetesseract.js ^5.1.0 (LSTM mode)
PDF-to-imagepdftoppm from poppler-utils at 300 DPI
OCR thresholdPages with < 50 embedded chars
HTTP clientnode-fetch ^2.7.0
Retry attempts3 (1.5Γ— exponential backoff, 1.5 s base delay)
Output formatJSON dataset
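
To make the retry settings concrete, the sketch below shows one plausible reading of them in Python: three attempts, with waits of 1.5 s and then 2.25 s between them. The formula base × factor^n and the helper names are assumptions; the Actor's own Node.js implementation may differ in detail (for example, by adding jitter).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

MAX_ATTEMPTS = 3
BASE_DELAY_S = 1.5
BACKOFF_FACTOR = 1.5


def backoff_delays(attempts: int = MAX_ATTEMPTS, base: float = BASE_DELAY_S,
                   factor: float = BACKOFF_FACTOR) -> list[float]:
    """Delays (seconds) to wait between consecutive attempts."""
    return [base * factor ** i for i in range(attempts - 1)]


def with_retries(operation: Callable[[], T],
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run operation, retrying on network-style errors (OSError and subclasses)."""
    delays = backoff_delays()
    for attempt in range(MAX_ATTEMPTS - 1):
        try:
            return operation()
        except OSError:
            sleep(delays[attempt])
    # Final attempt: let any error propagate to the caller.
    return operation()
```

Passing sleep as a parameter keeps the function easy to test without actually waiting.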

🔧 Use Cases

  • Document processing: Invoices, contracts, reports, forms (including scanned paper copies)
  • Research: Extract text from academic papers, white papers, and archival PDFs
  • Data pipelines: Feed PDF content into downstream NLP or search systems
  • Content management: Index PDF archives for full-text search
  • Automation: Process PDFs at scale via the Apify API or Zapier/Make integrations
  • Historical documents: OCR old scanned records and books

🌐 Integration

| Platform | Details |
|---|---|
| Apify API | Full REST API access |
| Apify SDK (Python / Node.js) | Official SDKs supported |
| Zapier | Connect with 5 000+ apps |
| Make (Integromat) | Visual workflow automation |
| Webhooks | Real-time completion notifications |
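
For the REST API route, a run can be started and its dataset items fetched in one call through Apify's run-sync-get-dataset-items endpoint. The Python sketch below only builds the request using the standard library; build_run_request is a hypothetical helper, and the Actor ID is a placeholder you would copy from the Actor's API tab, since it is not stated in this README.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Build a POST request that runs the Actor and returns its dataset items.

    actor_id is a placeholder such as "<username>~<actor-name>".
    """
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Sending the request (requires network access and a valid API token):
#   request = build_run_request("<username>~<actor-name>", "<APIFY_TOKEN>",
#                               {"pdfUrl": "https://example.com/document.pdf"})
#   with urllib.request.urlopen(request, timeout=300) as response:
#       items = json.load(response)  # list with one item per run
```

The synchronous endpoint suits short runs; for large or OCR-heavy PDFs, starting the run asynchronously and using a webhook for completion may be the better fit.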

🔒 Security & Privacy

  • Processing runs inside Apify's secure cloud infrastructure
  • Data is not persisted beyond the Actor run's dataset retention period
  • All transfers use HTTPS
  • Spending limits enforced via Apify's pay-per-event system

πŸ› οΈ Local Development

# Install dependencies
npm install
# Run Actor locally (reads from input.json)
apify run --input-file input.json
# Validate input/output schemas
apify validate-schema
# Deploy to Apify platform
apify push
# Run on the platform
apify call

Local run storage is written to ./storage/ (git-ignored).

Note for local OCR testing: pdftoppm must be installed locally. The Apify platform uses Alpine Linux and installs it via apk add poppler-utils automatically. For local dev: brew install poppler (macOS), apk add poppler-utils (Alpine), apt-get install poppler-utils (Debian/Ubuntu), or choco install poppler (Windows).


📖 Quick Start

  1. Open the Actor on Apify Store
  2. Click Try for free
  3. Enter a PDF URL in the pdfUrl field
  4. For scanned PDFs, toggle Enable OCR Fallback to true
  5. Click Start and wait for the run to finish
  6. Download extracted text from the Dataset tab in JSON, CSV, or XLSX format

Ready to extract text from any PDF? Start using PDF Text Extractor →