PDF Text Extractor API - URL to Text, Per-Page, Batch
Pricing
from $2.00 / 1,000 pages extracted
Turn any public PDF URL into clean text and metadata. Per-page output, batch processing, and a synchronous API mode for AI agents. Pay per page extracted, cheaper than the alternatives.
Developer
Jimmy A
Maintained by Community
Give it public PDF URLs, get back clean text and document metadata. One block per page or per document, batch-capable, and callable as a synchronous API so AI agents and automations can extract PDFs on demand.
No OCR needed for digital PDFs, no upload step, no key. Pay per page extracted - cheaper than comparable actors charging $0.022-0.04 per page.
What it does
- Fetches each PDF URL (redirects followed, 60s timeout)
- Extracts text page by page with line reconstruction (not one giant word soup)
- Reads the document's own metadata (title, author, producer, dates) as published in the file
- Outputs one structured record per document, with per-page text blocks if you want them
Use cases
- RAG / AI pipelines: turn report URLs into chunks for embedding, page-aligned
- Agents: call the standby endpoint as a tool - "read this PDF and answer"
- Document monitoring: pair with a scheduler to extract recurring reports (filings, government publications, price lists)
- Data entry automation: pull text from invoices, spec sheets, catalogs you have rights to process
- Research: batch-extract paper PDFs into searchable text
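For the RAG use case, page-aligned chunking can be sketched as follows. This is a minimal illustration, not part of the actor: `page_chunks` and `max_chars` are hypothetical names, and the record shape is taken from the Output example on this page.

```python
from typing import Dict, Iterator


def page_chunks(record: Dict, max_chars: int = 1000) -> Iterator[Dict]:
    """Yield chunks that never span two pages, each tagged with its page number."""
    for page in record.get("pages", []):
        text = page.get("text", "").strip()
        # Long pages are cut into fixed-size windows; short pages stay whole.
        # Empty pages (e.g. scanned images) produce no chunks at all.
        for start in range(0, len(text), max_chars):
            yield {
                "url": record["url"],
                "page": page["page"],
                "text": text[start:start + max_chars],
            }


# Sample record in the shape of the Output example (contents are made up).
record = {
    "url": "https://example.com/annual-report.pdf",
    "pages": [
        {"page": 1, "text": "Summary of the year."},
        {"page": 2, "text": "x" * 2500},
        {"page": 3, "text": ""},
    ],
}
chunks = list(page_chunks(record))
print([(c["page"], len(c["text"])) for c in chunks])
```

Keeping the page number on every chunk lets a retrieval result point back to the exact page of the source PDF.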
Input
{
  "pdfUrls": [
    "https://arxiv.org/pdf/1706.03762",
    "https://example.com/annual-report.pdf"
  ],
  "perPage": true,
  "maxPages": 500
}
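If you assemble the input programmatically, a small helper can catch bad URLs before the run starts. This is a sketch under assumptions: `build_run_input` is a hypothetical helper, and the validation rules (http/https only, duplicates dropped) are illustrative choices, not checks the actor is known to perform. Only the three field names come from the example above.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlsplit


def build_run_input(urls: Iterable[str], per_page: bool = True,
                    max_pages: int = 500) -> Dict:
    """Build an input object with the pdfUrls / perPage / maxPages fields."""
    cleaned: List[str] = []
    for url in urls:
        url = url.strip()
        parts = urlsplit(url)
        # Assumption: only http(s) URLs with a host count as "public PDF URLs".
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an http(s) URL: {url!r}")
        if url not in cleaned:  # drop duplicates, keep first-seen order
            cleaned.append(url)
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return {"pdfUrls": cleaned, "perPage": per_page, "maxPages": max_pages}


run_input = build_run_input([
    "https://arxiv.org/pdf/1706.03762",
    " https://arxiv.org/pdf/1706.03762 ",  # duplicate once whitespace is stripped
    "https://example.com/annual-report.pdf",
])
print(run_input)
```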
Output
{
  "url": "https://arxiv.org/pdf/1706.03762",
  "pageCount": 15,
  "pagesExtracted": 15,
  "truncated": false,
  "metadata": { "title": null, "author": null, "producer": "pdfTeX", "creationDate": "..." },
  "pages": [
    { "page": 1, "text": "Attention Is All You Need\n..." }
  ]
}
Set perPage: false for a single text field per document. Failed URLs produce a record with an error field instead of killing the run.
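Because failed URLs come back as ordinary records, downstream code should separate them from successful extractions. A minimal sketch, with two labeled assumptions: that an error record still carries its `url`, and that the single-field output for `perPage: false` is named `text`. Neither detail is confirmed on this page.

```python
from typing import Dict, List, Tuple


def split_results(records: List[Dict]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (texts_by_url, errors_by_url) from a list of output records."""
    texts: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for rec in records:
        url = rec.get("url", "<unknown>")  # assumption: error records include url
        if "error" in rec:
            errors[url] = str(rec["error"])
        elif "pages" in rec:
            # perPage: true -> join the per-page blocks into one document text.
            texts[url] = "\n\n".join(p["text"] for p in rec["pages"])
        else:
            # perPage: false -> assumed to be a single "text" field.
            texts[url] = rec.get("text", "")
    return texts, errors


# Made-up records for illustration only.
records = [
    {"url": "https://example.com/a.pdf",
     "pages": [{"page": 1, "text": "First"}, {"page": 2, "text": "Second"}]},
    {"url": "https://example.com/b.pdf", "text": "Whole document"},
    {"url": "https://example.com/missing.pdf", "error": "fetch failed"},
]
texts, errors = split_results(records)
print(texts)
print(errors)
```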
API / Standby mode for AI agents
GET /?url=https://example.com/file.pdf&perPage=true&maxPages=50
Returns the full extraction JSON synchronously. Works as a tool for agent frameworks that support Apify actors.
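When building the request from code, the PDF URL should be percent-encoded so that any query string of its own is not mistaken for standby parameters. The sketch below only constructs the URL. `STANDBY_BASE` is a placeholder, not the actor's real endpoint, and any authentication the platform requires is left out.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical placeholder; substitute the actor's actual standby URL.
STANDBY_BASE = "https://standby.example.invalid/"


def build_standby_url(pdf_url: str, per_page: bool = True,
                      max_pages: int = 50, base: str = STANDBY_BASE) -> str:
    """Build GET /?url=...&perPage=...&maxPages=... with safe encoding."""
    query = urlencode({
        "url": pdf_url,                      # percent-encoded by urlencode
        "perPage": str(per_page).lower(),    # "true" / "false", as in the example
        "maxPages": max_pages,
    })
    return f"{base}?{query}"


request_url = build_standby_url("https://example.com/file.pdf?v=2&x=1",
                                per_page=False, max_pages=10)
print(request_url)
# Round-trip check: the PDF URL survives intact, including its own query string.
print(parse_qs(urlsplit(request_url).query))
```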
Pricing
| Event | Price |
|---|---|
| Actor start | $0.0005 |
| Per page extracted | $0.002 |
| API call (standby) | $0.02 |
A 40-page report costs $0.08 in page charges (40 × $0.002), plus the $0.0005 actor start fee. Comparable actors charge $0.022-0.04 per page, roughly 11-20x more.
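The batch prices in the table can be turned into a quick estimator. This sketch assumes the actor start fee is charged once per run and covers batch runs only; how standby calls combine with page charges is not stated here, so it is not modeled. `run_cost` is a hypothetical helper.

```python
from decimal import Decimal
from typing import Iterable

ACTOR_START = Decimal("0.0005")  # from the pricing table
PER_PAGE = Decimal("0.002")      # from the pricing table


def run_cost(page_counts: Iterable[int]) -> Decimal:
    """Estimated cost in USD of one batch run over documents of the given sizes."""
    # Assumption: one start fee per run, then a flat fee per extracted page.
    return ACTOR_START + PER_PAGE * sum(page_counts)


print(run_cost([40]))          # one 40-page report
print(run_cost([15, 40, 120])) # three documents in a single run
```

`Decimal` is used instead of floats so that sub-cent prices add up exactly.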
FAQ
Does it do OCR on scanned PDFs? Not in this version. It extracts the text layer of digital PDFs (the overwhelming majority of reports, papers, and filings). Scanned-image PDFs return empty pages; an OCR tier is planned - ask in Issues if you need it.
How are lines handled? Text items are regrouped by their position on the page, so paragraphs read naturally instead of being one long line.
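To illustrate the general idea of position-based regrouping, here is a simplified sketch. It is not the actor's actual implementation: the `(x, y, text)` item shape, the `y_tolerance` value, and the function name are all assumptions made for the example.

```python
from typing import List, Tuple

Item = Tuple[float, float, str]  # (x, y, text); PDF y grows upward


def rebuild_lines(items: List[Item], y_tolerance: float = 2.0) -> List[str]:
    """Group text items into lines by vertical position, then order by x."""
    lines: List[List[Item]] = []
    # Visit items top to bottom (largest y first).
    for item in sorted(items, key=lambda it: -it[1]):
        # Same line if the baseline is within tolerance of the line's first item.
        if lines and abs(lines[-1][0][1] - item[1]) <= y_tolerance:
            lines[-1].append(item)
        else:
            lines.append([item])
    # Within each line, read left to right.
    return [
        " ".join(text for _, _, text in sorted(line, key=lambda it: it[0]))
        for line in lines
    ]


# Made-up coordinates: three words on one baseline, one word further down.
items = [
    (72.0, 700.0, "Attention"),
    (145.0, 699.8, "All"),
    (72.0, 680.0, "Abstract"),
    (130.0, 700.5, "Is"),
]
print(rebuild_lines(items))
```

The tolerance absorbs small baseline jitter between fragments of the same line. Real documents also need handling for columns and varying font sizes, which this sketch ignores.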
Maximum size? The default cap is 500 pages per document (configurable via maxPages). Very large files are also limited by the 60s fetch timeout.
Password-protected PDFs? Not supported. Public, unencrypted documents only.
CSV/Excel export? Every Apify dataset exports as JSON, CSV, or Excel via the platform.