
PDF to Text Extractor

Pricing

from $1.00 / 1,000 pages extracted


Extract text from text-based PDFs with native parsing. Per-page granularity, paragraph structure preserved. Batch process multiple URLs. Output as plain text, JSON, or combined document. Ideal for data pipelines.


Rating

0.0 (0)

Developer

junipr

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 2 days ago


Extract clean text from PDF files — with full metadata, optional page-by-page output, and multiple output formats. Process batches of PDFs by URL with configurable concurrency, progress logging, and structured JSON results.


Features

  • Text extraction from text-based PDFs using the proven pdf-parse library
  • Metadata extraction — title, author, subject, creator, producer, creation date, modification date, and PDF version
  • Page-by-page output — get individual page text and character counts instead of one combined blob
  • Multiple output formats — plain text, markdown (paragraph-structured), or full JSON
  • Batch processing — provide many PDF URLs and process them concurrently (up to 10 at once)
  • Max pages limit — extract only the first N pages for cost control on large documents
  • Progress logging — detailed logs for each PDF: download size, parse status, page count
  • Error resilience — per-PDF error capture so one bad PDF doesn't abort the batch
  • Zero-config — runs immediately with the default W3C sample PDF, no setup required
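The batch-processing and error-resilience bullets above describe a common pattern: a semaphore caps in-flight downloads, and each PDF's failure is recorded on its own result object instead of being raised. The actor itself is Node-based (it uses pdf-parse), so this Python sketch is purely illustrative of the pattern, not the actor's internals:

```python
import asyncio

async def process_batch(urls, worker, max_concurrency=3):
    """Run `worker` over `urls`, at most `max_concurrency` at a time.
    Failures are captured per item so one bad PDF can't abort the batch."""
    sem = asyncio.Semaphore(max_concurrency)

    async def run_one(url):
        async with sem:
            try:
                return {"url": url, "text": await worker(url), "errors": []}
            except Exception as exc:
                return {"url": url, "text": "", "errors": [str(exc)]}

    # gather() preserves input order, so results line up with urls
    return await asyncio.gather(*(run_one(u) for u in urls))

# Demo with a fake worker: one "PDF" fails, the others succeed.
async def fake_worker(url):
    if "bad" in url:
        raise ValueError("download failed")
    return f"text of {url}"

results = asyncio.run(process_batch(["a.pdf", "bad.pdf", "c.pdf"], fake_worker, 2))
```

The failed item ends up with an empty `text` and a populated `errors` array, mirroring the dataset shape described in the FAQ.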

Input

Field            Type     Default      Description
pdfUrls          array    W3C sample   List of { url, label? } objects to process
outputFormat     string   "text"       text, markdown, or json
extractMetadata  boolean  true         Extract PDF metadata (title, author, dates, etc.)
pageByPage       boolean  false        Output each page separately with character counts
maxPages         integer  0 (all)      Max pages per PDF (0 = no limit)
maxConcurrency   integer  3            Simultaneous PDFs (1–10)
requestTimeout   integer  60000        Download timeout in milliseconds

Input Example

{
  "pdfUrls": [
    { "url": "https://example.com/report.pdf", "label": "annual-report" },
    { "url": "https://example.com/manual.pdf", "label": "user-manual" }
  ],
  "outputFormat": "text",
  "extractMetadata": true,
  "pageByPage": true,
  "maxPages": 50,
  "maxConcurrency": 5,
  "requestTimeout": 90000
}

Output

Each processed PDF produces one dataset item. Results are available as JSON/CSV via the Apify dataset API.

Output Example

{
  "url": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
  "label": "sample",
  "fileName": "dummy.pdf",
  "metadata": {
    "title": "Dummy PDF",
    "author": null,
    "subject": null,
    "creator": "Writer",
    "producer": "LibreOffice 3.3",
    "creationDate": "D:20100909004945-07'00'",
    "modDate": null,
    "pdfVersion": "1.4"
  },
  "text": "Dummy PDF file\n\nThis is a dummy PDF...",
  "pageCount": 1,
  "pages": [
    {
      "pageNumber": 1,
      "text": "Dummy PDF file\n\nThis is a dummy PDF...",
      "charCount": 247
    }
  ],
  "extractedAt": "2025-01-01T12:00:00.000Z",
  "errors": []
}
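Note that creationDate is a raw PDF date string (D:YYYYMMDDHHmmSS plus an offset), not ISO 8601. If your pipeline needs a real timestamp, a small parser covering the full form shown above could look like this (a sketch; shorter forms permitted by the PDF spec are not handled):

```python
import re
from datetime import datetime, timedelta, timezone

def parse_pdf_date(s):
    """Convert a full PDF date string such as D:20100909004945-07'00'
    into a timezone-aware datetime. Returns None if the string does not
    match the full D:YYYYMMDDHHmmSS[+/-HH'mm'] form."""
    m = re.match(
        r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
        r"(?:([+\-])(\d{2})'(\d{2})'?|Z)?",
        s or "",
    )
    if not m:
        return None
    y, mo, d, h, mi, sec = (int(g) for g in m.groups()[:6])
    tz = timezone.utc
    if m.group(7):  # explicit +HH'mm' / -HH'mm' offset
        offset = timedelta(hours=int(m.group(8)), minutes=int(m.group(9)))
        tz = timezone(-offset if m.group(7) == "-" else offset)
    return datetime(y, mo, d, h, mi, sec, tzinfo=tz)

dt = parse_pdf_date("D:20100909004945-07'00'")
```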

Cost

Pricing is pay-per-page-extracted. You are only charged for pages that are successfully extracted — failed downloads and parse errors are free.

Usage            Estimated cost
1,000 pages      ~$1.00
10,000 pages     ~$10.00
100,000 pages    ~$100.00

Use the maxPages setting to cap extraction per PDF and control costs on large documents.
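At the listed rate of $1.00 per 1,000 pages, a run's cost is simple arithmetic. A hypothetical helper (not part of the actor) that also accounts for a maxPages cap:

```python
def estimate_cost(total_pages, max_pages_per_pdf=0, pdf_count=1,
                  rate_per_1000=1.00):
    """Estimate the charge for a run: only extracted pages bill, and a
    non-zero maxPages caps how many pages each PDF can contribute."""
    if max_pages_per_pdf > 0:
        total_pages = min(total_pages, max_pages_per_pdf * pdf_count)
    return total_pages * rate_per_1000 / 1000

cost = estimate_cost(10_000)  # 10,000 pages, uncapped
# 20 PDFs capped at 50 pages each: at most 1,000 billable pages
capped = estimate_cost(10_000, max_pages_per_pdf=50, pdf_count=20)
```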


Limitations

  • Text-based PDFs only — scanned/image PDFs require OCR and are not supported by this actor. Text extraction will return empty strings for image-only pages.
  • No password-protected PDFs — encrypted PDFs that require a password are not supported.
  • URL access required — PDFs must be publicly accessible via HTTP/HTTPS. PDFs behind login walls or requiring cookies will fail to download.
  • Memory — very large PDFs (500+ pages, 100MB+) may require more than the default 2 GB memory. Increase the memory limit in the run options if you encounter out-of-memory errors.
  • No OCR fallback — if you need to extract text from scanned PDFs, consider pairing this actor with an OCR service.

Use Cases

  • RAG / LLM pipelines — extract clean text from documents for embedding and retrieval
  • Document search — build searchable indexes from PDF libraries
  • Data extraction — pull structured content from reports, manuals, and whitepapers
  • Compliance and archival — convert PDFs to plain text for long-term storage and auditing
  • Batch processing — process hundreds of PDFs concurrently with a single actor run

Competitive Advantage vs Other Extractors

The leading PDF extractor on Apify Store (928+ users) extracts text but provides no metadata, no page-level output, and no progress logging. This actor adds:

  • Full metadata — title, author, dates, PDF version, and creator information
  • Page-by-page output — get individual pages with character counts, ideal for chunked LLM ingestion
  • Structured JSON — every result is a typed dataset item, not a raw text blob
  • Progress logs — know exactly which PDFs succeeded, how many pages were extracted, and what failed
  • Multiple output formats — plain text, markdown-structured, or full JSON with metadata embedded

Related Actors

  • PDF to HTML Converter — Convert PDF documents to semantic HTML with heading detection, table extraction, and image support
  • RAG Web Extractor — Extract clean, chunked text from web pages for LLM pipelines
  • Website to RSS — Turn any website into an RSS feed for monitoring and automation

FAQ

Does this work on scanned PDFs?

No. This actor uses text extraction from the PDF content stream. Scanned PDFs are essentially images embedded in a PDF container — there is no text layer to extract. If your PDFs are scanned, you need an OCR solution.

Can I process PDFs from Google Drive or Dropbox?

Only if the PDF is served as a direct public download URL (e.g., a shared link with dl=1 for Dropbox). Links that redirect to a preview page won't work. Use the direct download URL format for your cloud storage provider.
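For Dropbox, turning a share link into a direct download usually just means forcing dl=1 on the query string. A sketch using only the standard library (assumes the usual www.dropbox.com share-link shape):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def dropbox_direct_url(share_url):
    """Rewrite a Dropbox share link so it serves the file directly (dl=1)
    instead of redirecting to the preview page (dl=0)."""
    parts = urlsplit(share_url)
    query = dict(parse_qsl(parts.query))
    query["dl"] = "1"
    return urlunsplit(parts._replace(query=urlencode(query)))

url = dropbox_direct_url("https://www.dropbox.com/s/abc123/report.pdf?dl=0")
```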

What happens if one PDF in my batch fails?

The actor continues processing the remaining PDFs. The failed PDF will have an empty text field and a non-empty errors array in the dataset. Successful PDFs are unaffected.
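Downstream, separating good results from failed ones is a one-line filter on the errors array. For example, with dataset items abbreviated to the relevant fields:

```python
def split_results(items):
    """Partition dataset items into (successes, failures) using the
    errors array each item carries."""
    ok = [i for i in items if not i.get("errors")]
    failed = [i for i in items if i.get("errors")]
    return ok, failed

items = [
    {"url": "a.pdf", "text": "some text", "errors": []},
    {"url": "b.pdf", "text": "", "errors": ["HTTP 404 while downloading"]},
]
ok, failed = split_results(items)
```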

How do I use the pageByPage output for LLM chunking?

Set pageByPage: true and each dataset item will include a pages array where every element has pageNumber, text, and charCount. You can further filter or chunk pages in your downstream pipeline based on character count.
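One simple strategy over the pages array is to greedily merge consecutive pages until a character budget is hit. A sketch (field names match the output example above; the 4,000-character budget is an arbitrary assumption):

```python
def chunk_pages(pages, max_chars=4000):
    """Greedily pack consecutive pages into chunks of at most max_chars.
    A single page longer than max_chars becomes its own chunk."""
    chunks, current, size = [], [], 0
    for page in pages:
        if current and size + page["charCount"] > max_chars:
            chunks.append("\n\n".join(p["text"] for p in current))
            current, size = [], 0
        current.append(page)
        size += page["charCount"]
    if current:
        chunks.append("\n\n".join(p["text"] for p in current))
    return chunks

pages = [{"pageNumber": n, "text": f"page {n}", "charCount": 2500} for n in (1, 2, 3)]
chunks = chunk_pages(pages, max_chars=4000)  # each 2,500-char page gets its own chunk
```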

What is the outputFormat: "markdown" option?

Markdown mode normalizes the extracted text into paragraph-separated blocks (double newlines between paragraphs). It does not add headers, bullets, or tables — the PDF's raw text doesn't contain enough structure for reliable markdown formatting. For heading detection and rich HTML, use the PDF to HTML Converter instead.
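That normalization can be approximated as: treat blank lines as paragraph breaks, collapse line-wrapped text inside each paragraph, and join paragraphs with double newlines. A rough sketch, not the actor's actual implementation:

```python
import re

def to_paragraph_blocks(raw_text):
    """Normalize raw extracted text into paragraph-separated blocks:
    blank lines delimit paragraphs; single newlines inside a paragraph
    (PDF line wrapping) are collapsed to spaces."""
    paragraphs = re.split(r"\n\s*\n", raw_text.strip())
    cleaned = [" ".join(p.split()) for p in paragraphs if p.strip()]
    return "\n\n".join(cleaned)

md = to_paragraph_blocks("Dummy PDF file\n\nThis is a dummy\nPDF used for\ntesting.\n")
```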