PDF Text Extractor API - URL to Text, Per-Page, Batch

Pricing

from $2.00 / 1,000 pages extracted

Turn any public PDF URL into clean text and metadata. Per-page output, batch processing, and a synchronous API mode for AI agents. Pay per page extracted, cheaper than the alternatives.

Developer

Jimmy A

Maintained by Community

Give it public PDF URLs, get back clean text and document metadata. One block per page or per document, batch-capable, and callable as a synchronous API so AI agents and automations can extract PDFs on demand.

No OCR needed for digital PDFs, no upload step, no key. Pay per page extracted - cheaper than comparable actors charging $0.022-0.04 per page.

What it does

  1. Fetches each PDF URL (redirects followed, 60s timeout)
  2. Extracts text page by page with line reconstruction (not one giant word soup)
  3. Reads the document's own metadata (title, author, producer, dates) as published in the file
  4. Outputs one structured record per document, with per-page text blocks if you want them

Use cases

  • RAG / AI pipelines: turn report URLs into chunks for embedding, page-aligned
  • Agents: call the standby endpoint as a tool - "read this PDF and answer"
  • Document monitoring: pair with a scheduler to extract recurring reports (filings, government publications, price lists)
  • Data entry automation: pull text from invoices, spec sheets, catalogs you have rights to process
  • Research: batch-extract paper PDFs into searchable text
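For the RAG use case, page-aligned chunking could look like the sketch below. It assumes the record shape shown under Output (a pages list of objects with page and text). The function name page_chunks and the max_chars limit are illustrative and not part of the actor.

```python
def page_chunks(record, max_chars=1000):
    """Split each page's text into chunks that never cross a page boundary.

    Lines are accumulated until adding the next one would exceed max_chars.
    A single line longer than max_chars is kept whole as its own chunk.
    """
    chunks = []
    for page in record.get("pages", []):
        buf = ""
        for line in page["text"].splitlines():
            candidate = f"{buf}\n{line}" if buf else line
            if buf and len(candidate) > max_chars:
                # Flush the current chunk and start a new one with this line
                chunks.append({"url": record["url"], "page": page["page"], "text": buf})
                buf = line
            else:
                buf = candidate
        if buf:
            chunks.append({"url": record["url"], "page": page["page"], "text": buf})
    return chunks


# Hypothetical two-page record, used only to exercise the function
sample = {
    "url": "https://example.com/annual-report.pdf",
    "pages": [
        {"page": 1, "text": "aaaa\nbbbb\ncccc"},
        {"page": 2, "text": "dd"},
    ],
}
chunks = page_chunks(sample, max_chars=9)
```

Each chunk keeps its source URL and page number, so a retrieved passage can be cited back to the page it came from.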

Input

{
  "pdfUrls": [
    "https://arxiv.org/pdf/1706.03762",
    "https://example.com/annual-report.pdf"
  ],
  "perPage": true,
  "maxPages": 500
}

Output

{
  "url": "https://arxiv.org/pdf/1706.03762",
  "pageCount": 15,
  "pagesExtracted": 15,
  "truncated": false,
  "metadata": { "title": null, "author": null, "producer": "pdfTeX", "creationDate": "..." },
  "pages": [
    { "page": 1, "text": "Attention Is All You Need\n..." }
  ]
}

Set perPage: false for a single text field per document. Failed URLs produce a record with an error field instead of killing the run.
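A consumer of the dataset has to handle three record shapes: per-page output, single-text output, and failed URLs. The sketch below shows one way to do that. The key names "error" and "text" are assumptions based on the description above; only "pages" appears in the sample output.

```python
def document_text(record):
    """Return the full text of one output record, or None if the URL failed."""
    if record.get("error"):
        # Assumed key name: failed URLs carry an error field instead of text
        return None
    if "pages" in record:
        # perPage: true -> join the per-page blocks with a blank line between pages
        return "\n\n".join(page["text"] for page in record["pages"])
    # perPage: false -> single text field (key name "text" is an assumption)
    return record.get("text", "")


# Hypothetical records, used only to exercise the function
per_page = {"url": "a", "pages": [{"page": 1, "text": "A"}, {"page": 2, "text": "B"}]}
failed = {"url": "b", "error": "fetch timeout"}
single = {"url": "c", "text": "whole document"}
results = [document_text(r) for r in (per_page, failed, single)]
```

Checking the truncated flag alongside pagesExtracted and pageCount tells you whether the maxPages cap cut a document short.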

API / Standby mode for AI agents

GET /?url=https://example.com/file.pdf&perPage=true&maxPages=50

Returns the full extraction JSON synchronously. Works as a tool for agent frameworks that support Apify actors.
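PDF URLs that contain their own query string need to be percent-encoded before they go into the url parameter. A minimal sketch of building the request URL with the standard library follows. The base URL https://standby.example is a placeholder, not the actor's real standby host, and any authentication the platform requires is not shown.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def standby_url(base_url, pdf_url, per_page=True, max_pages=50):
    """Build the GET URL for the standby endpoint with a safely encoded PDF URL."""
    query = urlencode({
        "url": pdf_url,  # urlencode percent-encodes ':', '/', '?', '&' and '='
        "perPage": "true" if per_page else "false",
        "maxPages": max_pages,
    })
    return f"{base_url.rstrip('/')}/?{query}"


# Placeholder host; the PDF URL has its own query string to show the encoding
request_url = standby_url(
    "https://standby.example",
    "https://example.com/file.pdf?v=1&lang=en",
)
# Round-trip check: the server side would decode the parameters back like this
params = parse_qs(urlsplit(request_url).query)
```

Without the encoding, the &lang=en part of the PDF URL would be read as a separate parameter of the standby call.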

Pricing

Event                 Price
Actor start           $0.0005
Per page extracted    $0.002
API call (standby)    $0.02

A 40-page report costs $0.08 in page charges (40 x $0.002), before the actor start or standby call fee. Comparable actors charge $0.022-0.04 per page - 11-20x the per-page rate.
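The arithmetic can be checked with a small estimator. This sketch assumes a batch run is billed as one actor start plus the per-page charges, and a standby request as one API call plus the per-page charges. How the events combine is an assumption, since only the individual prices are listed.

```python
from decimal import Decimal

# Event prices from the pricing list; Decimal avoids float rounding noise
ACTOR_START = Decimal("0.0005")
PER_PAGE = Decimal("0.002")
STANDBY_CALL = Decimal("0.02")


def batch_run_cost(pages_per_document):
    """Assumed billing: one actor start plus every extracted page in the run."""
    return ACTOR_START + PER_PAGE * sum(pages_per_document)


def standby_call_cost(pages):
    """Assumed billing: one standby API call plus the extracted pages."""
    return STANDBY_CALL + PER_PAGE * pages


report_batch = batch_run_cost([40])     # single 40-page report in a batch run
report_standby = standby_call_cost(40)  # same report through the standby endpoint
ratio_low = Decimal("0.022") / PER_PAGE
ratio_high = Decimal("0.04") / PER_PAGE
```

Under these assumptions the start fee is negligible for batch runs, while the standby call fee matters most for short documents.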

FAQ

Does it do OCR on scanned PDFs? Not in this version. It extracts the text layer of digital PDFs (the overwhelming majority of reports, papers, and filings). Scanned-image PDFs return empty pages; an OCR tier is planned - ask in Issues if you need it.

How are lines handled? Text items are regrouped by their position on the page, so paragraphs read naturally instead of being one long line.
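As an illustration of position-based regrouping, here is a minimal sketch. It is not the actor's actual code. It assumes each text item is an (x, y, text) tuple with y increasing down the page, and the y_tolerance value is arbitrary.

```python
def reconstruct_lines(items, y_tolerance=2.0):
    """Group (x, y, text) items into lines of text.

    Items whose y is within y_tolerance of the current line's first item
    share that line; within a line, items are ordered left to right by x.
    """
    lines = []  # each entry: [line_y, [(x, text), ...]]
    for x, y, text in sorted(items, key=lambda item: (item[1], item[0])):
        if lines and abs(y - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    return [" ".join(text for _, text in sorted(row)) for _, row in lines]


# Hypothetical items in scrambled order with slightly uneven baselines
items = [
    (50, 100.5, "World"),
    (10, 100.0, "Hello"),
    (10, 120.0, "Second"),
    (60, 119.5, "line"),
]
lines = reconstruct_lines(items)
```

A fixed tolerance is the simplest choice; scaling it by font size would cope better with pages that mix text sizes.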

Maximum size? The default cap is 500 pages per document (configurable via maxPages). Very large files are limited by the 60s fetch timeout.

Password-protected PDFs? Not supported. Public, unencrypted documents only.

CSV/Excel export? Every Apify dataset exports as JSON, CSV, or Excel via the platform.