
PDF to Text Extractor

Pricing

from $1.00 / 1,000 pages extracted


Extract text from text-based PDFs with native parsing. Per-page granularity, paragraph structure preserved. Batch process multiple URLs. Output as plain text, JSON, or combined document. Ideal for data pipelines.


Rating

0.0 (0)

Developer

junipr

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 2 days ago


Extract clean text from PDF files — with full metadata, optional page-by-page output, and multiple output formats. Process batches of PDFs by URL with configurable concurrency, progress logging, and structured JSON results.


Features

  • Text extraction from text-based PDFs using the proven pdf-parse library
  • Metadata extraction — title, author, subject, creator, producer, creation date, modification date, and PDF version
  • Page-by-page output — get individual page text and character counts instead of one combined blob
  • Multiple output formats — plain text, markdown (paragraph-structured), or full JSON
  • Batch processing — provide many PDF URLs and process them concurrently (up to 10 at once)
  • Max pages limit — extract only the first N pages for cost control on large documents
  • Progress logging — detailed logs for each PDF: download size, parse status, page count
  • Error resilience — per-PDF error capture so one bad PDF doesn't abort the batch
  • Zero-config — runs immediately with the default W3C sample PDF, no setup required
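The batch-processing and error-resilience bullets above describe a common pattern: a semaphore caps in-flight downloads, and each PDF's failure is recorded on its own result object instead of being raised. The actor itself is Node-based (it uses pdf-parse), so this Python sketch is purely illustrative of the pattern, not the actor's internals:

```python
import asyncio

async def process_batch(urls, worker, max_concurrency=3):
    """Run `worker` over `urls`, at most `max_concurrency` at a time.
    Failures are captured per item so one bad PDF can't abort the batch."""
    sem = asyncio.Semaphore(max_concurrency)

    async def run_one(url):
        async with sem:
            try:
                return {"url": url, "text": await worker(url), "errors": []}
            except Exception as exc:
                return {"url": url, "text": "", "errors": [str(exc)]}

    # gather() preserves input order, so results line up with urls
    return await asyncio.gather(*(run_one(u) for u in urls))

# Demo with a fake worker: one "PDF" fails, the others succeed.
async def fake_worker(url):
    if "bad" in url:
        raise ValueError("download failed")
    return f"text of {url}"

results = asyncio.run(process_batch(["a.pdf", "bad.pdf", "c.pdf"], fake_worker, 2))
```

The failed item ends up with an empty `text` and a populated `errors` array, mirroring the dataset shape described in the FAQ.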

Input

Field            Type     Default      Description
pdfUrls          array    W3C sample   List of { url, label? } objects to process
outputFormat     string   "text"       text, markdown, or json
extractMetadata  boolean  true         Extract PDF metadata (title, author, dates, etc.)
pageByPage       boolean  false        Output each page separately with character counts
maxPages         integer  0 (all)      Max pages per PDF (0 = no limit)
maxConcurrency   integer  3            Simultaneous PDFs (1–10)
requestTimeout   integer  60000        Download timeout in milliseconds

Input Example

{
  "pdfUrls": [
    { "url": "https://example.com/report.pdf", "label": "annual-report" },
    { "url": "https://example.com/manual.pdf", "label": "user-manual" }
  ],
  "outputFormat": "text",
  "extractMetadata": true,
  "pageByPage": true,
  "maxPages": 50,
  "maxConcurrency": 5,
  "requestTimeout": 90000
}

Output

Each processed PDF produces one dataset item. Results are available as JSON/CSV via the Apify dataset API.

Output Example

{
  "url": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
  "label": "sample",
  "fileName": "dummy.pdf",
  "metadata": {
    "title": "Dummy PDF",
    "author": null,
    "subject": null,
    "creator": "Writer",
    "producer": "LibreOffice 3.3",
    "creationDate": "D:20100909004945-07'00'",
    "modDate": null,
    "pdfVersion": "1.4"
  },
  "text": "Dummy PDF file\n\nThis is a dummy PDF...",
  "pageCount": 1,
  "pages": [
    {
      "pageNumber": 1,
      "text": "Dummy PDF file\n\nThis is a dummy PDF...",
      "charCount": 247
    }
  ],
  "extractedAt": "2025-01-01T12:00:00.000Z",
  "errors": []
}
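Note that creationDate is a raw PDF date string (D:YYYYMMDDHHmmSS plus an offset), not ISO 8601. If your pipeline needs a real timestamp, a small parser covering the full form shown above could look like this (a sketch; shorter forms permitted by the PDF spec are not handled):

```python
import re
from datetime import datetime, timedelta, timezone

def parse_pdf_date(s):
    """Convert a full PDF date string such as D:20100909004945-07'00'
    into a timezone-aware datetime. Returns None if the string does not
    match the full D:YYYYMMDDHHmmSS[+/-HH'mm'] form."""
    m = re.match(
        r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
        r"(?:([+\-])(\d{2})'(\d{2})'?|Z)?",
        s or "",
    )
    if not m:
        return None
    y, mo, d, h, mi, sec = (int(g) for g in m.groups()[:6])
    tz = timezone.utc
    if m.group(7):  # explicit +HH'mm' / -HH'mm' offset
        offset = timedelta(hours=int(m.group(8)), minutes=int(m.group(9)))
        tz = timezone(-offset if m.group(7) == "-" else offset)
    return datetime(y, mo, d, h, mi, sec, tzinfo=tz)

dt = parse_pdf_date("D:20100909004945-07'00'")
```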

Cost

Pricing is pay-per-page-extracted. You are only charged for pages that are successfully extracted — failed downloads and parse errors are free.

Usage            Estimated cost
1,000 pages      ~$1.00
10,000 pages     ~$10.00
100,000 pages    ~$100.00

Use the maxPages setting to cap extraction per PDF and control costs on large documents.
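At the listed rate of $1.00 per 1,000 pages, a run's cost is simple arithmetic. A hypothetical helper (not part of the actor) that also accounts for a maxPages cap:

```python
def estimate_cost(total_pages, max_pages_per_pdf=0, pdf_count=1,
                  rate_per_1000=1.00):
    """Estimate the charge for a run: only extracted pages bill, and a
    non-zero maxPages caps how many pages each PDF can contribute."""
    if max_pages_per_pdf > 0:
        total_pages = min(total_pages, max_pages_per_pdf * pdf_count)
    return total_pages * rate_per_1000 / 1000

cost = estimate_cost(10_000)  # 10,000 pages, uncapped
# 20 PDFs capped at 50 pages each: at most 1,000 billable pages
capped = estimate_cost(10_000, max_pages_per_pdf=50, pdf_count=20)
```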


Limitations

  • Text-based PDFs only — scanned/image PDFs require OCR and are not supported by this actor. Text extraction will return empty strings for image-only pages.
  • No password-protected PDFs — encrypted PDFs that require a password are not supported.
  • URL access required — PDFs must be publicly accessible via HTTP/HTTPS. PDFs behind login walls or requiring cookies will fail to download.
  • Memory — very large PDFs (500+ pages, 100MB+) may require more than the default 2 GB memory. Increase the memory limit in the run options if you encounter out-of-memory errors.
  • No OCR fallback — if you need to extract text from scanned PDFs, consider pairing this actor with an OCR service.

Use Cases

  • RAG / LLM pipelines — extract clean text from documents for embedding and retrieval
  • Document search — build searchable indexes from PDF libraries
  • Data extraction — pull structured content from reports, manuals, and whitepapers
  • Compliance and archival — convert PDFs to plain text for long-term storage and auditing
  • Batch processing — process hundreds of PDFs concurrently with a single actor run

Competitive Advantage vs Other Extractors

The leading PDF extractor on Apify Store (928+ users) extracts text but provides no metadata, no page-level output, and no progress logging. This actor adds:

  • Full metadata — title, author, dates, PDF version, and creator information
  • Page-by-page output — get individual pages with character counts, ideal for chunked LLM ingestion
  • Structured JSON — every result is a typed dataset item, not a raw text blob
  • Progress logs — know exactly which PDFs succeeded, how many pages were extracted, and what failed
  • Multiple output formats — plain text, markdown-structured, or full JSON with metadata embedded

Related Actors

  • PDF to HTML Converter — Convert PDF documents to semantic HTML with heading detection, table extraction, and image support
  • RAG Web Extractor — Extract clean, chunked text from web pages for LLM pipelines
  • Website to RSS — Turn any website into an RSS feed for monitoring and automation

FAQ

Does this work on scanned PDFs?

No. This actor uses text extraction from the PDF content stream. Scanned PDFs are essentially images embedded in a PDF container — there is no text layer to extract. If your PDFs are scanned, you need an OCR solution.

Can I process PDFs from Google Drive or Dropbox?

Only if the PDF is served as a direct public download URL (e.g., a shared link with dl=1 for Dropbox). Links that redirect to a preview page won't work. Use the direct download URL format for your cloud storage provider.
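For Dropbox, turning a share link into a direct download usually just means forcing dl=1 on the query string. A sketch using only the standard library (assumes the usual www.dropbox.com share-link shape):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def dropbox_direct_url(share_url):
    """Rewrite a Dropbox share link so it serves the file directly (dl=1)
    instead of redirecting to the preview page (dl=0)."""
    parts = urlsplit(share_url)
    query = dict(parse_qsl(parts.query))
    query["dl"] = "1"
    return urlunsplit(parts._replace(query=urlencode(query)))

url = dropbox_direct_url("https://www.dropbox.com/s/abc123/report.pdf?dl=0")
```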

What happens if one PDF in my batch fails?

The actor continues processing the remaining PDFs. The failed PDF will have an empty text field and a non-empty errors array in the dataset. Successful PDFs are unaffected.
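Downstream, separating good results from failed ones is a one-line filter on the errors array. For example, with dataset items abbreviated to the relevant fields:

```python
def split_results(items):
    """Partition dataset items into (successes, failures) using the
    errors array each item carries."""
    ok = [i for i in items if not i.get("errors")]
    failed = [i for i in items if i.get("errors")]
    return ok, failed

items = [
    {"url": "a.pdf", "text": "some text", "errors": []},
    {"url": "b.pdf", "text": "", "errors": ["HTTP 404 while downloading"]},
]
ok, failed = split_results(items)
```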

How do I use the pageByPage output for LLM chunking?

Set pageByPage: true and each dataset item will include a pages array where every element has pageNumber, text, and charCount. You can further filter or chunk pages in your downstream pipeline based on character count.
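One simple strategy over the pages array is to greedily merge consecutive pages until a character budget is hit. A sketch (field names match the output example above; the 4,000-character budget is an arbitrary assumption):

```python
def chunk_pages(pages, max_chars=4000):
    """Greedily pack consecutive pages into chunks of at most max_chars.
    A single page longer than max_chars becomes its own chunk."""
    chunks, current, size = [], [], 0
    for page in pages:
        if current and size + page["charCount"] > max_chars:
            chunks.append("\n\n".join(p["text"] for p in current))
            current, size = [], 0
        current.append(page)
        size += page["charCount"]
    if current:
        chunks.append("\n\n".join(p["text"] for p in current))
    return chunks

pages = [{"pageNumber": n, "text": f"page {n}", "charCount": 2500} for n in (1, 2, 3)]
chunks = chunk_pages(pages, max_chars=4000)  # each 2,500-char page gets its own chunk
```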

What is the outputFormat: "markdown" option?

Markdown mode normalizes the extracted text into paragraph-separated blocks (double newlines between paragraphs). It does not add headers, bullets, or tables — the PDF's raw text doesn't contain enough structure for reliable markdown formatting. For heading detection and rich HTML, use the PDF to HTML Converter instead.
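That normalization can be approximated as: treat blank lines as paragraph breaks, collapse line-wrapped text inside each paragraph, and join paragraphs with double newlines. A rough sketch, not the actor's actual implementation:

```python
import re

def to_paragraph_blocks(raw_text):
    """Normalize raw extracted text into paragraph-separated blocks:
    blank lines delimit paragraphs; single newlines inside a paragraph
    (PDF line wrapping) are collapsed to spaces."""
    paragraphs = re.split(r"\n\s*\n", raw_text.strip())
    cleaned = [" ".join(p.split()) for p in paragraphs if p.strip()]
    return "\n\n".join(cleaned)

md = to_paragraph_blocks("Dummy PDF file\n\nThis is a dummy\nPDF used for\ntesting.\n")
```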