PDF Text Extractor
Pricing
Pay per event
Extract text, metadata, and page-by-page content from PDF files. Provide PDF URLs and get structured JSON with full text, per-page text, page count, author, title, creation date, and more. Export as JSON, CSV, or Excel. No browser or proxy needed.
Developer
Stas Persiianenko
What does PDF Text Extractor do?
PDF Text Extractor downloads PDF files from any public URL and extracts structured text, metadata, and per-page content. It returns clean JSON with the full document text, individual page text, page count, and all PDF metadata (title, author, creation date, producer, and more).
Unlike browser-based PDF tools, this actor uses pure server-side processing with no browser overhead. It processes PDFs in parallel for maximum throughput and handles errors gracefully -- if one PDF fails, the rest still complete.
Try it now on the Apify Store with the prefilled example URLs.
Who is PDF Text Extractor for?
AI/ML Engineers and Data Scientists
- Extract text from research papers, whitepapers, and technical documentation for RAG pipelines
- Build training datasets from large PDF collections
- Feed document content into LLMs for summarization and analysis
Legal and Compliance Teams
- Extract text from contracts, filings, and regulatory documents
- Build searchable archives from PDF-only document repositories
- Automate document review workflows
Researchers and Academics
- Bulk-extract text from academic papers and journal articles
- Build citation databases from PDF collections
- Convert lecture notes and course materials to searchable text
Developers and Automation Engineers
- Integrate PDF text extraction into data pipelines via API
- Process invoices, receipts, and forms at scale
- Extract metadata for document management systems
Why use PDF Text Extractor?
- Pure server-side processing -- no browser, no proxy, near-zero cost per PDF
- Per-page text extraction -- get text for each individual page, not just the whole document
- Rich metadata -- title, author, subject, keywords, creator, producer, creation/modification dates, PDF version
- Parallel processing -- configure concurrency to process multiple PDFs simultaneously
- Graceful error handling -- failed PDFs don't stop the entire batch
- API access -- integrate with 5,000+ apps via Zapier, Make, and the Apify API
- Scheduled runs -- set up recurring extractions for document monitoring
- Multiple export formats -- JSON, CSV, Excel, XML, HTML
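The parallel-processing and graceful-error-handling bullets above describe a common batch pattern, sketched here in Python. This is not the actor's actual implementation: `extract_all`, the `fetch_pdf` callback, and `fake_fetch` are hypothetical names used only for illustration, and setting `fullText` to `None` on failure is an assumption. Each URL is processed in a thread pool, and a failure is recorded in that item's `error` field instead of aborting the batch.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def extract_all(
    urls: list[str],
    fetch_pdf: Callable[[str], str],
    max_concurrency: int = 5,
) -> list[dict]:
    """Run fetch_pdf over urls in parallel; one failure never stops the rest."""

    def process(url: str) -> dict:
        try:
            return {"url": url, "fullText": fetch_pdf(url), "error": None}
        except Exception as exc:  # record the failure instead of raising
            return {"url": url, "fullText": None, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # pool.map returns results in the same order as the input URLs
        return list(pool.map(process, urls))


# Hypothetical stand-in for a real download-and-parse step.
def fake_fetch(url: str) -> str:
    if url.endswith(".html"):
        raise ValueError("Invalid PDF structure")
    return f"text of {url}"


results = extract_all(["a.pdf", "b.html", "c.pdf"], fake_fetch, max_concurrency=2)
```

In this sketch the second URL fails, yet the first and third still produce text, which mirrors the behavior the bullets describe.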
What data can you extract?
| Category | Fields |
|---|---|
| Document text | Full text, per-page text array |
| Metadata | Title, author, subject, keywords |
| Producer info | Creator application, producer application |
| Dates | Creation date, modification date (ISO 8601) |
| Technical | Page count, PDF version, file size in bytes |
| Error handling | Error message (null when successful) |
Each PDF produces one dataset row with 16 structured fields.
How much does it cost to extract text from PDFs?
PDF Text Extractor uses pay-per-event pricing. You only pay for what you use:
| Event | FREE tier | BRONZE | SILVER | GOLD |
|---|---|---|---|---|
| Run started (one-time) | $0.005 | $0.005 | $0.005 | $0.005 |
| Per PDF extracted | $0.00345 | $0.003 | $0.00234 | $0.0018 |
Example costs (BRONZE tier):
- 10 PDFs: $0.005 + 10 x $0.003 = $0.035
- 100 PDFs: $0.005 + 100 x $0.003 = $0.305
- 1,000 PDFs: $0.005 + 1,000 x $0.003 = $3.005
With the free $5 Apify credit, you can extract text from approximately 1,400 PDFs in a single run at FREE-tier pricing (($5 - $0.005) / $0.00345 ≈ 1,447) at no cost.
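The pricing arithmetic above can be reproduced with a small helper. This is a minimal sketch that assumes a single run (one start fee) and uses only the prices from the table; the function names `estimate_cost` and `pdfs_within_budget` are hypothetical. `Decimal` is used so that sub-cent prices do not pick up floating-point rounding noise.

```python
from decimal import Decimal

# Prices copied from the pricing table above (USD).
RUN_START_FEE = Decimal("0.005")
PER_PDF_PRICE = {
    "FREE": Decimal("0.00345"),
    "BRONZE": Decimal("0.003"),
    "SILVER": Decimal("0.00234"),
    "GOLD": Decimal("0.0018"),
}


def estimate_cost(pdf_count: int, tier: str = "BRONZE") -> Decimal:
    """Cost of one run: the start fee plus the per-PDF price times the count."""
    return RUN_START_FEE + pdf_count * PER_PDF_PRICE[tier]


def pdfs_within_budget(budget: Decimal, tier: str = "FREE") -> int:
    """Largest number of PDFs a single run can process within the budget."""
    return int((budget - RUN_START_FEE) // PER_PDF_PRICE[tier])


bronze_100 = estimate_cost(100, "BRONZE")  # Decimal("0.305"), as in the example above
```

Splitting the same PDFs across several runs adds one start fee per run, so the estimate is a lower bound in that case.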
How to extract text from PDF files
- Go to the PDF Text Extractor page on Apify Store
- Click Try for free to open the actor in Apify Console
- Paste your PDF URLs into the PDF URLs field (one per line)
- Adjust concurrency and timeout settings if needed
- Click Start to begin extraction
- Download results in JSON, CSV, or Excel format
Example input
{
    "urls": [
        "https://example.com/report-2024.pdf",
        "https://example.com/whitepaper.pdf",
        "https://example.com/invoice-january.pdf"
    ],
    "includePages": true,
    "maxConcurrency": 5
}
Minimal input
{"urls": ["https://example.com/document.pdf"]}
Input parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| urls | array of strings | (required) | Direct URLs to PDF files |
| includePages | boolean | true | Include per-page text breakdown |
| maxConcurrency | integer | 5 | Parallel PDF downloads (1-20) |
| timeoutPerPdfSecs | integer | 60 | Download timeout per PDF in seconds |
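The parameter table above can be turned into a small input builder that applies the documented defaults before a run is started. This is a sketch, not part of the actor: `build_input` is a hypothetical helper, and the HTTP(S) scheme check and the positive-timeout check are assumptions rather than documented rules. Only the 1-20 range for `maxConcurrency` comes from the table.

```python
def build_input(
    urls: list[str],
    include_pages: bool = True,
    max_concurrency: int = 5,
    timeout_per_pdf_secs: int = 60,
) -> dict:
    """Build an actor input dict using the defaults from the parameter table."""
    if not urls:
        raise ValueError("urls is required and must contain at least one URL")
    for url in urls:
        # Assumption: only http(s) URLs are useful as direct PDF links.
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"not an HTTP(S) URL: {url!r}")
    if not 1 <= max_concurrency <= 20:
        raise ValueError("maxConcurrency must be between 1 and 20")
    if timeout_per_pdf_secs <= 0:  # assumption: a timeout must be positive
        raise ValueError("timeoutPerPdfSecs must be positive")
    return {
        "urls": list(urls),
        "includePages": include_pages,
        "maxConcurrency": max_concurrency,
        "timeoutPerPdfSecs": timeout_per_pdf_secs,
    }


minimal = build_input(["https://example.com/document.pdf"])
```

Validating locally catches an out-of-range value before a run is started, which matters because the run-start fee is charged per run.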
Output example
{
    "url": "https://www.orimi.com/pdf-test.pdf",
    "fileName": "pdf-test.pdf",
    "title": "PDF Test Page",
    "author": "Yukon Department of Education",
    "subject": null,
    "keywords": null,
    "creator": "Acrobat PDFMaker 7.0.7 for Word",
    "producer": "Acrobat Distiller 7.0.5 (Windows)",
    "creationDate": "2008-06-04T15:44:00.000Z",
    "modificationDate": "2008-06-04T15:47:36.000Z",
    "pageCount": 1,
    "fullText": "PDF Test File Congratulations, your computer is equipped with a PDF reader...",
    "pages": [
        {
            "pageNumber": 1,
            "text": "PDF Test File Congratulations, your computer is equipped with a PDF reader..."
        }
    ],
    "pdfVersion": "1.6",
    "fileSizeBytes": 20597,
    "error": null
}
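As an illustration of consuming one output item, the sketch below parses the ISO 8601 date fields and counts characters per page. The `item` dict is a trimmed copy of the sample above (its text is truncated exactly as shown there), `parse_iso` is a hypothetical helper, and treating a date field as possibly `null` is an assumption.

```python
from datetime import datetime

# Trimmed copy of the sample output item shown above.
item = {
    "fileName": "pdf-test.pdf",
    "creationDate": "2008-06-04T15:44:00.000Z",
    "modificationDate": "2008-06-04T15:47:36.000Z",
    "pageCount": 1,
    "pages": [
        {
            "pageNumber": 1,
            "text": "PDF Test File Congratulations, your computer is equipped with a PDF reader...",
        }
    ],
    "error": None,
}


def parse_iso(value):
    """Parse an ISO 8601 timestamp ending in 'Z'; pass None through unchanged."""
    if value is None:
        return None
    # Replace the 'Z' suffix so fromisoformat also works before Python 3.11.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


created = parse_iso(item["creationDate"])
modified = parse_iso(item["modificationDate"])
edit_window = modified - created  # time between creation and last modification

# Characters of extracted text per page, keyed by page number.
# "pages" may be absent when includePages is false, hence the fallback.
chars_per_page = {p["pageNumber"]: len(p["text"]) for p in item.get("pages") or []}
```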
Tips for best results
- Start small -- test with 2-3 PDFs first to verify the URLs work and output meets your needs
- Use direct PDF URLs -- the URL must point directly to a .pdf file, not a page that contains a PDF viewer
- Disable per-page text for large PDFs -- set includePages: false to reduce output size when processing documents with hundreds of pages
- Increase timeout for large files -- if you are processing PDFs over 50 MB, increase timeoutPerPdfSecs to 120 or more
- Check the error field -- failed PDFs still appear in results with an error message, so you can identify and retry them
- Schedule recurring runs -- use Apify's scheduler to automatically extract new PDFs on a daily or weekly basis
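The "check the error field" and "increase timeout" tips above can be combined into a retry step. The sketch below splits dataset items by their `error` field and builds a follow-up input containing only the failed URLs with a longer timeout. `split_results`, `build_retry_input`, the sample items, and the "Download timed out" message are all hypothetical and serve only as an illustration.

```python
def split_results(items: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate items that extracted cleanly from items with an error message."""
    succeeded = [i for i in items if i.get("error") is None]
    failed = [i for i in items if i.get("error") is not None]
    return succeeded, failed


def build_retry_input(failed: list[dict], timeout_per_pdf_secs: int = 120) -> dict:
    """Actor input that re-runs only the failed URLs with a longer timeout."""
    return {
        "urls": [i["url"] for i in failed],
        "timeoutPerPdfSecs": timeout_per_pdf_secs,
    }


# Hypothetical dataset items; the error text is made up for this example.
sample_items = [
    {"url": "https://example.com/report.pdf", "error": None},
    {"url": "https://example.com/huge.pdf", "error": "Download timed out"},
]

succeeded, failed = split_results(sample_items)
retry_input = build_retry_input(failed)
```

Retrying only the failures avoids paying the per-PDF price again for documents that already extracted successfully.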
Integrations
- PDF Text Extractor + Google Sheets -- automatically populate a spreadsheet with extracted text and metadata from new PDF uploads
- PDF Text Extractor + Slack -- get notified when PDF extraction completes, with a summary of pages processed and any errors
- PDF Text Extractor + Make/Zapier -- trigger PDF extraction when new files are uploaded to Google Drive, Dropbox, or S3
- PDF Text Extractor + OpenAI/LLM -- chain extraction with AI summarization to create document summaries from PDF collections
- Scheduled runs -- monitor a document repository and extract text from newly published PDFs on a schedule
- Webhooks -- trigger downstream processing immediately when extraction completes
Using the Apify API
Node.js
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

// Start the actor and wait for the run to finish
const run = await client.actor('automation-lab/pdf-text-extractor').call({
    urls: [
        'https://example.com/report.pdf',
        'https://example.com/whitepaper.pdf',
    ],
    includePages: true,
});

// Each dataset item corresponds to one input PDF
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach(item => {
    console.log(`${item.fileName}: ${item.pageCount} pages, ${item.fullText.length} chars`);
});
Python
from apify_client import ApifyClient

client = ApifyClient('YOUR_APIFY_TOKEN')

# Start the actor and wait for the run to finish
run = client.actor('automation-lab/pdf-text-extractor').call(run_input={
    'urls': [
        'https://example.com/report.pdf',
        'https://example.com/whitepaper.pdf',
    ],
    'includePages': True,
})

# Each dataset item corresponds to one input PDF
items = client.dataset(run['defaultDatasetId']).list_items().items
for item in items:
    print(f"{item['fileName']}: {item['pageCount']} pages, {len(item['fullText'])} chars")
cURL
curl -X POST "https://api.apify.com/v2/acts/automation-lab~pdf-text-extractor/runs?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com/report.pdf"], "includePages": true}'
Use with AI agents via MCP
PDF Text Extractor is available as a tool for AI assistants that support the Model Context Protocol (MCP).
Add the Apify MCP server to your AI client -- this gives you access to all Apify actors, including this one:
Setup for Claude Code
claude mcp add --transport http apify "https://mcp.apify.com"
Setup for Claude Desktop, Cursor, or VS Code
Add this to your MCP config file:
{
    "mcpServers": {
        "apify": {
            "url": "https://mcp.apify.com"
        }
    }
}
Your AI assistant will use OAuth to authenticate with your Apify account on first use.
Example prompts
Once connected, try asking your AI assistant:
- "Use automation-lab/pdf-text-extractor to extract all text from this research paper: https://arxiv.org/pdf/1706.03762"
- "Extract metadata and page count from these 5 PDF invoices and summarize the results"
- "Download and extract text from all PDFs linked on this page, then create a summary of each document"
Learn more in the Apify MCP documentation.
Is it legal to extract text from PDFs?
PDF Text Extractor processes publicly accessible PDF files that you provide URLs for. The actor downloads files the same way a web browser would. Always ensure you have the right to access and process the documents you are extracting text from.
For personal data, comply with GDPR and applicable privacy laws. Review the terms of service for any document repositories you are accessing. Apify provides a general web scraping legality guide for reference.
FAQ
How fast is PDF Text Extractor?
Processing speed depends on PDF file size and download speed. A typical 1 MB PDF takes 1-3 seconds to download and parse. With maxConcurrency: 10, you can process 100 average-sized PDFs in under a minute.
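The throughput figure above can be sanity-checked with a rough model. This sketch assumes every PDF takes the same time and that all workers stay busy, and it ignores run startup overhead, so it gives an order-of-magnitude estimate rather than a guarantee. `estimate_batch_seconds` is a hypothetical helper.

```python
import math


def estimate_batch_seconds(
    pdf_count: int, max_concurrency: int, seconds_per_pdf: float
) -> float:
    """Rough wall-clock estimate: PDFs are processed in waves of max_concurrency."""
    waves = math.ceil(pdf_count / max_concurrency)
    return waves * seconds_per_pdf


# 100 PDFs at maxConcurrency 10, using the 1-3 second range quoted above.
best_case = estimate_batch_seconds(100, 10, 1.0)   # 10 waves x 1 s = 10.0
worst_case = estimate_batch_seconds(100, 10, 3.0)  # 10 waves x 3 s = 30.0
```

Under these assumptions the batch takes roughly 10 to 30 seconds, which is consistent with the "under a minute" figure.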
How much does it cost to extract text from 1,000 PDFs?
At BRONZE tier pricing: $0.005 (start) + 1,000 x $0.003 (per PDF) = $3.005 total. With the free $5 credit, you can process about 1,400 PDFs at FREE-tier pricing ($0.00345 per PDF) at no cost.
Does it work with scanned PDFs?
No. This actor extracts embedded text from PDFs. Scanned documents that contain only images (no selectable text) will return empty text. For scanned PDFs, you would need an OCR (Optical Character Recognition) solution.
Why are some PDF fields returning null?
Not all PDFs include metadata. The title, author, subject, and keywords fields depend on what the PDF creator set when generating the document. Many auto-generated PDFs leave these fields empty.
Why did a PDF fail with "Invalid PDF structure"?
The URL may not point to an actual PDF file. Ensure the URL returns a direct PDF download, not an HTML page with an embedded PDF viewer. Some servers also require specific headers or authentication.
Can I extract text from password-protected PDFs?
No. Password-protected (encrypted) PDFs cannot be parsed without the password. The actor will return an error for these files.
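The FAQ answers above suggest a simple triage of results: items with an error, items that parsed but contain no text (likely scanned documents that need OCR), and items that extracted normally. The sketch below is one way to express that. `classify` and its labels are hypothetical, and treating whitespace-only text as empty is an assumption.

```python
def classify(item: dict) -> str:
    """Label a dataset item as 'failed', 'needs_ocr', or 'ok'."""
    if item.get("error") is not None:
        return "failed"
    # No error but no extracted text: probably an image-only (scanned) PDF.
    if not (item.get("fullText") or "").strip():
        return "needs_ocr"
    return "ok"


# Hypothetical items covering the three cases from the FAQ.
samples = [
    {"url": "https://example.com/report.pdf", "fullText": "Quarterly report", "error": None},
    {"url": "https://example.com/scan.pdf", "fullText": "", "error": None},
    {"url": "https://example.com/page.html", "fullText": None, "error": "Invalid PDF structure"},
]

labels = [classify(s) for s in samples]
```

Items labeled `needs_ocr` can then be routed to a separate OCR step, while `failed` items can be checked against the URL and authentication causes listed above.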
Other PDF and document tools
- Markdown to PDF Converter -- convert Markdown text into formatted PDF documents
- HTML to PDF Converter -- convert web pages and HTML into PDF files
- Webpage to Markdown Converter -- extract clean Markdown from any webpage
- Fake Test Data Generator -- generate bulk test data in JSON, CSV, or Excel
- Unicode Text Inspector -- analyze text encoding and hidden characters