PDF Text Extractor

Extract text, metadata, and page-by-page content from PDF files. Provide PDF URLs and get structured JSON with full text, per-page text, page count, author, title, creation date, and more. Export as JSON, CSV, or Excel. No browser or proxy needed.

Pricing: Pay per event

Developer: Stas Persiianenko (maintained by Community)

What does PDF Text Extractor do?

PDF Text Extractor downloads PDF files from any public URL and extracts structured text, metadata, and per-page content. It returns clean JSON with the full document text, individual page text, page count, and all PDF metadata (title, author, creation date, producer, and more).

Unlike browser-based PDF tools, this actor uses pure server-side processing with no browser overhead. It processes PDFs in parallel for maximum throughput and handles errors gracefully -- if one PDF fails, the rest still complete.
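The parallel, fail-soft behavior described above can be illustrated with a minimal Python sketch. This is not the actor's actual implementation: `extract_all`, `fake_extract`, and the shape of the failure row are hypothetical names and assumptions used only to show the pattern of isolating each PDF's errors.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def extract_all(urls: list[str], extract_one: Callable[[str], dict],
                max_concurrency: int = 5) -> list[dict]:
    """Run extract_one over urls in parallel; a failure becomes a row, not a crash."""

    def safe_extract(url: str) -> dict:
        try:
            row = extract_one(url)
            return {"url": url, **row, "error": None}
        except Exception as exc:
            # Isolate the failure to this one URL (assumed failure-row shape).
            return {"url": url, "fullText": None, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # pool.map preserves input order, so rows line up with urls.
        return list(pool.map(safe_extract, urls))


def fake_extract(url: str) -> dict:
    # Hypothetical stand-in for real download-and-parse logic.
    if "broken" in url:
        raise ValueError("Invalid PDF structure")
    return {"fullText": f"text of {url}"}


rows = extract_all(
    [
        "https://example.com/a.pdf",
        "https://example.com/broken.pdf",
        "https://example.com/c.pdf",
    ],
    fake_extract,
    max_concurrency=2,
)
for row in rows:
    print(row["url"], "->", row["error"] or "ok")
```

The broken URL yields a row with an error message while the other two rows complete normally.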

Try it now on the Apify Store with the prefilled example URLs.

Who is PDF Text Extractor for?

AI/ML Engineers and Data Scientists

  • Extract text from research papers, whitepapers, and technical documentation for RAG pipelines
  • Build training datasets from large PDF collections
  • Feed document content into LLMs for summarization and analysis

Legal and Compliance Teams

  • Extract text from contracts, filings, and regulatory documents
  • Build searchable archives from PDF-only document repositories
  • Automate document review workflows

Researchers and Academics

  • Bulk-extract text from academic papers and journal articles
  • Build citation databases from PDF collections
  • Convert lecture notes and course materials to searchable text

Developers and Automation Engineers

  • Integrate PDF text extraction into data pipelines via API
  • Process invoices, receipts, and forms at scale
  • Extract metadata for document management systems

Why use PDF Text Extractor?

  • Pure server-side processing -- no browser, no proxy, near-zero cost per PDF
  • Per-page text extraction -- get text for each individual page, not just the whole document
  • Rich metadata -- title, author, subject, keywords, creator, producer, creation/modification dates, PDF version
  • Parallel processing -- configure concurrency to process multiple PDFs simultaneously
  • Graceful error handling -- failed PDFs don't stop the entire batch
  • API access -- integrate with 5,000+ apps via Zapier, Make, and the Apify API
  • Scheduled runs -- set up recurring extractions for document monitoring
  • Multiple export formats -- JSON, CSV, Excel, XML, HTML

What data can you extract?

| Category | Fields |
| --- | --- |
| Document text | Full text, per-page text array |
| Metadata | Title, author, subject, keywords |
| Producer info | Creator application, producer application |
| Dates | Creation date, modification date (ISO 8601) |
| Technical | Page count, PDF version, file size in bytes |
| Error handling | Error message (null when successful) |

Each PDF produces one dataset row with 16 structured fields.

How much does it cost to extract text from PDFs?

PDF Text Extractor uses pay-per-event pricing. You only pay for what you use:

| Event | FREE tier | BRONZE | SILVER | GOLD |
| --- | --- | --- | --- | --- |
| Run started (one-time) | $0.005 | $0.005 | $0.005 | $0.005 |
| Per PDF extracted | $0.00345 | $0.003 | $0.00234 | $0.0018 |

Example costs (BRONZE tier):

  • 10 PDFs: $0.005 + 10 x $0.003 = $0.035
  • 100 PDFs: $0.005 + 100 x $0.003 = $0.305
  • 1,000 PDFs: $0.005 + 1,000 x $0.003 = $3.005

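With the free $5 Apify credit, you can extract text from roughly 1,400 PDFs at no cost at the FREE tier rate (($5 - $0.005) / $0.00345 ≈ 1,447 PDFs in a single run). At the BRONZE rate, the same $5 covers about 1,660 PDFs.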
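The pricing arithmetic can be reproduced with a short Python sketch. The prices are copied from the table above; the function names are hypothetical, and the sketch assumes all PDFs are processed in a single run, so the start fee is charged once.

```python
from decimal import Decimal

RUN_START_FEE = Decimal("0.005")
PER_PDF_PRICE = {
    "FREE": Decimal("0.00345"),
    "BRONZE": Decimal("0.003"),
    "SILVER": Decimal("0.00234"),
    "GOLD": Decimal("0.0018"),
}


def estimate_cost(pdf_count: int, tier: str = "BRONZE") -> Decimal:
    """Cost in USD of one run that extracts pdf_count PDFs."""
    return RUN_START_FEE + pdf_count * PER_PDF_PRICE[tier]


def pdfs_within_budget(budget: str, tier: str = "BRONZE") -> int:
    """Largest number of PDFs a single run can extract for the given budget."""
    remaining = Decimal(budget) - RUN_START_FEE
    if remaining < 0:
        return 0
    return int(remaining // PER_PDF_PRICE[tier])


for count in (10, 100, 1000):
    print(count, "PDFs on BRONZE:", estimate_cost(count))

for tier in PER_PDF_PRICE:
    print("$5 budget on", tier, "covers", pdfs_within_budget("5", tier), "PDFs")
```

`Decimal` is used instead of floats so that sub-cent prices add up exactly.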

How to extract text from PDF files

  1. Go to the PDF Text Extractor page on Apify Store
  2. Click Try for free to open the actor in Apify Console
  3. Paste your PDF URLs into the PDF URLs field (one per line)
  4. Adjust concurrency and timeout settings if needed
  5. Click Start to begin extraction
  6. Download results in JSON, CSV, or Excel format

Example input

{
    "urls": [
        "https://example.com/report-2024.pdf",
        "https://example.com/whitepaper.pdf",
        "https://example.com/invoice-january.pdf"
    ],
    "includePages": true,
    "maxConcurrency": 5
}

Minimal input

{
    "urls": ["https://example.com/document.pdf"]
}

Input parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| urls | array of strings | (required) | Direct URLs to PDF files |
| includePages | boolean | true | Include per-page text breakdown |
| maxConcurrency | integer | 5 | Parallel PDF downloads (1-20) |
| timeoutPerPdfSecs | integer | 60 | Download timeout per PDF in seconds |
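If you build the input programmatically, a small client-side check can catch mistakes before a run is started. The sketch below is a hypothetical helper (`build_input` is not part of any Apify library); it only encodes the defaults and the 1-20 concurrency range listed in the table above, and the actor may apply further validation of its own.

```python
def build_input(urls, include_pages=True, max_concurrency=5, timeout_per_pdf_secs=60):
    """Return an actor input dict after basic sanity checks (hypothetical helper)."""
    if not urls:
        raise ValueError("urls is required and must contain at least one URL")
    for url in urls:
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"not an http(s) URL: {url!r}")
    if not 1 <= max_concurrency <= 20:
        raise ValueError("maxConcurrency must be between 1 and 20")
    if timeout_per_pdf_secs <= 0:
        raise ValueError("timeoutPerPdfSecs must be positive")
    return {
        "urls": list(urls),
        "includePages": include_pages,
        "maxConcurrency": max_concurrency,
        "timeoutPerPdfSecs": timeout_per_pdf_secs,
    }


run_input = build_input(["https://example.com/document.pdf"], max_concurrency=10)
print(run_input)
```

The returned dict can be passed as the run input in the API examples further down.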

Output example

{
    "url": "https://www.orimi.com/pdf-test.pdf",
    "fileName": "pdf-test.pdf",
    "title": "PDF Test Page",
    "author": "Yukon Department of Education",
    "subject": null,
    "keywords": null,
    "creator": "Acrobat PDFMaker 7.0.7 for Word",
    "producer": "Acrobat Distiller 7.0.5 (Windows)",
    "creationDate": "2008-06-04T15:44:00.000Z",
    "modificationDate": "2008-06-04T15:47:36.000Z",
    "pageCount": 1,
    "fullText": "PDF Test File Congratulations, your computer is equipped with a PDF reader...",
    "pages": [
        {
            "pageNumber": 1,
            "text": "PDF Test File Congratulations, your computer is equipped with a PDF reader..."
        }
    ],
    "pdfVersion": "1.6",
    "fileSizeBytes": 20597,
    "error": null
}
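For typed Python code, the 16 fields in the output example can be written down as a `TypedDict`. This is a sketch inferred from the example above, not a published schema: the class names and `missing_fields` are hypothetical, and marking most fields as `Optional` is an assumption about what a failed or metadata-poor PDF returns.

```python
from typing import Optional, TypedDict


class PageText(TypedDict):
    pageNumber: int
    text: str


class PdfRow(TypedDict):
    url: str
    fileName: Optional[str]
    title: Optional[str]
    author: Optional[str]
    subject: Optional[str]
    keywords: Optional[str]
    creator: Optional[str]
    producer: Optional[str]
    creationDate: Optional[str]      # ISO 8601 timestamp
    modificationDate: Optional[str]  # ISO 8601 timestamp
    pageCount: Optional[int]
    fullText: Optional[str]
    pages: Optional[list[PageText]]
    pdfVersion: Optional[str]
    fileSizeBytes: Optional[int]
    error: Optional[str]             # None when extraction succeeded


def missing_fields(row: dict) -> list[str]:
    """Return the expected field names that are absent from a dataset row."""
    return sorted(PdfRow.__annotations__.keys() - row.keys())


print(len(PdfRow.__annotations__), "fields expected")
print(missing_fields({"url": "https://example.com/a.pdf", "error": None}))
```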

Tips for best results

  • Start small -- test with 2-3 PDFs first to verify the URLs work and output meets your needs
  • Use direct PDF URLs -- the URL must point directly to a .pdf file, not a page that contains a PDF viewer
  • Disable per-page text for large PDFs -- set includePages: false to reduce output size when processing documents with hundreds of pages
  • Increase timeout for large files -- if you are processing PDFs over 50 MB, increase timeoutPerPdfSecs to 120 or more
  • Check the error field -- failed PDFs still appear in results with an error message, so you can identify and retry them
  • Schedule recurring runs -- use Apify's scheduler to automatically extract new PDFs on a daily or weekly basis
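The "check the error field" tip can be turned into a small post-processing step. The sketch below uses hypothetical helper names and assumes, as the tip implies, that failed rows keep their `url` and carry a non-null `error`; it separates successes from failures and builds an input for a retry run with a longer timeout.

```python
def split_results(items):
    """Separate dataset rows into (succeeded, failed) based on the error field."""
    succeeded = [item for item in items if item.get("error") is None]
    failed = [item for item in items if item.get("error") is not None]
    return succeeded, failed


def retry_input(failed, timeout_per_pdf_secs=120):
    """Build an actor input that retries only the failed URLs."""
    return {
        "urls": [item["url"] for item in failed],
        "timeoutPerPdfSecs": timeout_per_pdf_secs,
    }


# Illustrative rows; real rows come from the run's dataset.
items = [
    {"url": "https://example.com/a.pdf", "pageCount": 3, "error": None},
    {"url": "https://example.com/big.pdf", "pageCount": None, "error": "Download timed out"},
]
succeeded, failed = split_results(items)
print(len(succeeded), "succeeded,", len(failed), "failed")
print(retry_input(failed))
```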

Integrations

  • PDF Text Extractor + Google Sheets -- automatically populate a spreadsheet with extracted text and metadata from new PDF uploads
  • PDF Text Extractor + Slack -- get notified when PDF extraction completes, with a summary of pages processed and any errors
  • PDF Text Extractor + Make/Zapier -- trigger PDF extraction when new files are uploaded to Google Drive, Dropbox, or S3
  • PDF Text Extractor + OpenAI/LLM -- chain extraction with AI summarization to create document summaries from PDF collections
  • Scheduled runs -- monitor a document repository and extract text from newly published PDFs on a schedule
  • Webhooks -- trigger downstream processing immediately when extraction completes
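When chaining extraction with an LLM, long documents usually need to be split before summarization. One possible approach, sketched below with a hypothetical `chunk_pages` helper and an arbitrary character budget, is to group consecutive entries of the `pages` array into chunks. It assumes the run used `includePages: true`; it is not a feature of the actor itself.

```python
def chunk_pages(item: dict, max_chars: int = 4000) -> list[dict]:
    """Group consecutive pages into chunks of roughly max_chars characters.

    A single page longer than max_chars is kept whole as its own chunk.
    """
    chunks = []
    current_pages, current_text = [], ""
    for page in item.get("pages") or []:
        text = page["text"]
        # Start a new chunk when adding this page would exceed the budget.
        if current_text and len(current_text) + 1 + len(text) > max_chars:
            chunks.append({"url": item["url"], "pages": current_pages, "text": current_text})
            current_pages, current_text = [], ""
        current_text = f"{current_text}\n{text}" if current_text else text
        current_pages.append(page["pageNumber"])
    if current_text:
        chunks.append({"url": item["url"], "pages": current_pages, "text": current_text})
    return chunks


# Illustrative row with three short pages.
item = {
    "url": "https://example.com/report.pdf",
    "pages": [
        {"pageNumber": 1, "text": "a" * 10},
        {"pageNumber": 2, "text": "b" * 10},
        {"pageNumber": 3, "text": "c" * 10},
    ],
}
for chunk in chunk_pages(item, max_chars=25):
    print(chunk["pages"], len(chunk["text"]))
```

Each chunk records which page numbers it covers, so a summary can be traced back to its source pages.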

Using the Apify API

Node.js

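import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

// Start the actor and wait for the run to finish.
const run = await client.actor('automation-lab/pdf-text-extractor').call({
    urls: [
        'https://example.com/report.pdf',
        'https://example.com/whitepaper.pdf',
    ],
    includePages: true,
});

// Each dataset item corresponds to one input PDF.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    if (item.error) {
        // Failed PDFs still appear in the results, with an error message.
        console.log(`${item.url}: failed (${item.error})`);
        return;
    }
    console.log(`${item.fileName}: ${item.pageCount} pages, ${item.fullText.length} chars`);
});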

Python

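from apify_client import ApifyClient

client = ApifyClient('YOUR_APIFY_TOKEN')

# Start the actor and wait for the run to finish.
run = client.actor('automation-lab/pdf-text-extractor').call(run_input={
    'urls': [
        'https://example.com/report.pdf',
        'https://example.com/whitepaper.pdf',
    ],
    'includePages': True,
})

# Each dataset item corresponds to one input PDF.
items = client.dataset(run['defaultDatasetId']).list_items().items
for item in items:
    if item.get('error'):
        # Failed PDFs still appear in the results, with an error message.
        print(f"{item['url']}: failed ({item['error']})")
        continue
    print(f"{item['fileName']}: {item['pageCount']} pages, {len(item['fullText'])} chars")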

cURL

curl -X POST "https://api.apify.com/v2/acts/automation-lab~pdf-text-extractor/runs?token=YOUR_APIFY_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
        "urls": ["https://example.com/report.pdf"],
        "includePages": true
    }'
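The cURL call above only starts a run; the extracted rows are stored in the run's default dataset. The Python sketch below builds the URL for fetching them. It assumes the run response wraps the run object under a `data` key with a `defaultDatasetId` field, and that the dataset items endpoint accepts a `format` query parameter, as in the Apify API v2; `dataset_items_url` is a hypothetical helper, and the response dict is illustrative.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"


def dataset_items_url(run_response: dict, fmt: str = "json") -> str:
    """Build the dataset items URL for a run, in the requested export format."""
    dataset_id = run_response["data"]["defaultDatasetId"]
    query = urlencode({"format": fmt})
    return f"{API_BASE}/datasets/{dataset_id}/items?{query}"


# Illustrative response body; a real one is returned by the POST request above.
run_response = {"data": {"id": "RUN_ID", "defaultDatasetId": "DATASET_ID"}}
print(dataset_items_url(run_response))
print(dataset_items_url(run_response, fmt="csv"))
```

Because the run is asynchronous, the dataset is only complete once the run has finished. Authenticate the request with your token, for example in an `Authorization: Bearer` header.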

Use with AI agents via MCP

PDF Text Extractor is available as a tool for AI assistants that support the Model Context Protocol (MCP).

Add the Apify MCP server to your AI client -- this gives you access to all Apify actors, including this one:

Setup for Claude Code

$ claude mcp add --transport http apify "https://mcp.apify.com"

Setup for Claude Desktop, Cursor, or VS Code

Add this to your MCP config file:

{
    "mcpServers": {
        "apify": {
            "url": "https://mcp.apify.com"
        }
    }
}

Your AI assistant will use OAuth to authenticate with your Apify account on first use.

Example prompts

Once connected, try asking your AI assistant:

  • "Use automation-lab/pdf-text-extractor to extract all text from this research paper: https://arxiv.org/pdf/1706.03762"
  • "Extract metadata and page count from these 5 PDF invoices and summarize the results"
  • "Download and extract text from all PDFs linked on this page, then create a summary of each document"

Learn more in the Apify MCP documentation.

PDF Text Extractor processes publicly accessible PDF files that you provide URLs for. The actor downloads files the same way a web browser would. Always ensure you have the right to access and process the documents you are extracting text from.

For personal data, comply with GDPR and applicable privacy laws. Review the terms of service for any document repositories you are accessing. Apify provides a general web scraping legality guide for reference.

FAQ

How fast is PDF Text Extractor? Processing speed depends on PDF file size and download speed. A typical 1 MB PDF takes 1-3 seconds to download and parse. With maxConcurrency: 10, you can process 100 average-sized PDFs in under a minute.
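The "under a minute" figure follows from a simple back-of-the-envelope model, sketched below in Python. The model is an assumption: it treats PDFs as processed in waves of `maxConcurrency`, with every PDF in a wave taking the same time, and ignores actor startup overhead.

```python
import math


def estimate_seconds(pdf_count: int, max_concurrency: int, secs_per_pdf: float) -> float:
    """Rough wall-clock estimate: number of waves times time per PDF."""
    waves = math.ceil(pdf_count / max_concurrency)
    return waves * secs_per_pdf


# 100 PDFs at maxConcurrency 10, with 1 to 3 seconds per PDF.
fastest = estimate_seconds(100, 10, 1)
slowest = estimate_seconds(100, 10, 3)
print(f"between {fastest} and {slowest} seconds")
```

Under these assumptions, 100 PDFs need 10 waves, or about 10 to 30 seconds.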

How much does it cost to extract text from 1,000 PDFs? At BRONZE tier pricing: $0.005 (start) + 1,000 x $0.003 (per PDF) = $3.005 total. With the free $5 credit, you can process about 1,400 PDFs at no cost at the FREE tier rate of $0.00345 per PDF.

Does it work with scanned PDFs? No. This actor extracts embedded text from PDFs. Scanned documents that contain only images (no selectable text) will return empty text. For scanned PDFs, you would need an OCR (Optical Character Recognition) solution.
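Since scanned PDFs come back with empty text rather than an error, you can flag likely candidates for OCR after a run. The check below is a heuristic sketch with a hypothetical function name: it treats a successful row with at least one page but no non-whitespace text as probably image-only.

```python
def looks_scanned(item: dict) -> bool:
    """Heuristic: extraction succeeded and pages exist, but no text was found."""
    if item.get("error") is not None:
        return False  # a failed download or parse says nothing about scanning
    text = item.get("fullText") or ""
    return (item.get("pageCount") or 0) > 0 and not text.strip()


# Illustrative rows; real rows come from the run's dataset.
rows = [
    {"url": "https://example.com/scan.pdf", "pageCount": 12, "fullText": "  \n", "error": None},
    {"url": "https://example.com/report.pdf", "pageCount": 3, "fullText": "Annual report", "error": None},
]
needs_ocr = [row["url"] for row in rows if looks_scanned(row)]
print(needs_ocr)
```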

Why are some PDF fields returning null? Not all PDFs include metadata. The title, author, subject, and keywords fields depend on what the PDF creator set when generating the document. Many auto-generated PDFs leave these fields empty.

Why did a PDF fail with "Invalid PDF structure"? The URL may not point to an actual PDF file. Ensure the URL returns a direct PDF download, not an HTML page with an embedded PDF viewer. Some servers also require specific headers or authentication.
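One way to pre-check a URL is to inspect the first bytes of the response: PDF files start with a `%PDF-` header, while an HTML viewer page does not. The sketch below is a hypothetical helper that works on bytes you have already downloaded. Searching the first 1024 bytes instead of only byte 0 is a leniency some PDF readers apply, and whether the actor's parser is equally tolerant is not known.

```python
def looks_like_pdf(head: bytes) -> bool:
    """Return True if the leading bytes of a response contain a PDF header."""
    # Some producers emit a few stray bytes before the header, so search
    # the first 1024 bytes rather than requiring an exact prefix match.
    return b"%PDF-" in head[:1024]


print(looks_like_pdf(b"%PDF-1.6\n%\xe2\xe3\xcf\xd3\n1 0 obj"))
print(looks_like_pdf(b"<!DOCTYPE html><html><body>PDF viewer</body></html>"))
```

In practice, `head` could come from a ranged HTTP request for the first kilobyte of the file.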

Can I extract text from password-protected PDFs? No. Password-protected (encrypted) PDFs cannot be parsed without the password. The actor will return an error for these files.

Other PDF and document tools