PDF Text Extractor

Extract text, metadata, and page-by-page content from PDF files. Provide PDF URLs and get structured JSON with full text, per-page text, page count, author, title, creation date, and more. Export as JSON, CSV, or Excel. No browser or proxy needed.

Pricing: Pay per event
Developer: Stas Persiianenko
Total users: 187
Last modified: 11 days ago
What does PDF Text Extractor do?

PDF Text Extractor downloads PDF files from any public URL and extracts structured text, metadata, and per-page content. It returns clean JSON with the full document text, individual page text, page count, and all PDF metadata (title, author, creation date, producer, and more).

Unlike browser-based PDF tools, this actor uses pure server-side processing with no browser overhead. It processes PDFs in parallel for maximum throughput and handles errors gracefully -- if one PDF fails, the rest still complete.
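The batch behaviour described above can be pictured with a short sketch. This is not the actor's actual implementation: `extract_one` is a hypothetical stand-in for the download-and-parse step, and a thread pool is only one way to get bounded parallelism. The point it illustrates is that a failure becomes an `error` value on that row instead of aborting the batch.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_one(url: str) -> dict:
    """Hypothetical stand-in for downloading and parsing one PDF."""
    if not url.lower().endswith(".pdf"):
        raise ValueError("Invalid PDF structure")
    return {"url": url, "fullText": "...", "error": None}


def safe_extract(url: str) -> dict:
    """Run extract_one and turn any exception into an error row."""
    try:
        return extract_one(url)
    except Exception as exc:  # one bad PDF must not stop the batch
        # Assumption: failed rows carry no text, only the error message.
        return {"url": url, "fullText": None, "error": str(exc)}


def extract_batch(urls: list[str], max_concurrency: int = 5) -> list[dict]:
    """Process URLs in parallel; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(safe_extract, urls))


rows = extract_batch([
    "https://example.com/a.pdf",
    "https://example.com/page.html",
])
```

Here the second URL fails, yet both rows are returned and only the second has a non-null `error`.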

Try it now on the Apify Store with the prefilled example URLs.

Who is it for?

PDF Text Extractor is built for teams that need reliable PDF text extraction at scale without maintaining their own parsing infrastructure.

AI/ML Engineers and Data Scientists

Extract text from research papers, whitepapers, and technical documentation for RAG pipelines
Build training datasets from large PDF collections
Feed document content into LLMs for summarization and analysis

Legal and Compliance Teams

Extract text from contracts, filings, and regulatory documents
Build searchable archives from PDF-only document repositories
Automate document review workflows

Researchers and Academics

Bulk-extract text from academic papers and journal articles
Build citation databases from PDF collections
Convert lecture notes and course materials to searchable text

Developers and Automation Engineers

Integrate PDF text extraction into data pipelines via API
Process invoices, receipts, and forms at scale
Extract metadata for document management systems

Why use PDF Text Extractor?

Pure server-side processing -- no browser, no proxy, near-zero cost per PDF
Per-page text extraction -- get text for each individual page, not just the whole document
Rich metadata -- title, author, subject, keywords, creator, producer, creation/modification dates, PDF version
Parallel processing -- configure concurrency to process multiple PDFs simultaneously
Graceful error handling -- failed PDFs don't stop the entire batch
API access -- integrate with 5,000+ apps via Zapier, Make, and the Apify API
Scheduled runs -- set up recurring extractions for document monitoring
Multiple export formats -- JSON, CSV, Excel, XML, HTML

What data can you extract?

Category	Fields
Document text	Full text, per-page text array
Metadata	Title, author, subject, keywords
Producer info	Creator application, producer application
Dates	Creation date, modification date (ISO 8601)
Technical	Page count, PDF version, file size in bytes
Error handling	Error message (null when successful)

Each PDF produces one dataset row with 16 structured fields.

How much does it cost to extract text from PDFs?

PDF Text Extractor uses pay-per-event pricing. You only pay for what you use:

Event	FREE tier	BRONZE	SILVER	GOLD
Run started (one-time)	$0.005	$0.005	$0.005	$0.005
Per PDF extracted	$0.00345	$0.003	$0.00234	$0.0018

Example costs (BRONZE tier):

10 PDFs: $0.005 + 10 x $0.003 = $0.035
100 PDFs: $0.005 + 100 x $0.003 = $0.305
1,000 PDFs: $0.005 + 1,000 x $0.003 = $3.005

With the free $5 Apify credit, billed at the FREE tier rate, you can extract text from approximately 1,400 PDFs in a single run at no cost: ($5 - $0.005) / $0.00345 is about 1,447.
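The pricing arithmetic above can be reproduced with a small calculator. This is a sketch that simply hard-codes the prices from the table; the function names are illustrative, and `Decimal` is used so that sub-cent prices do not pick up floating-point noise.

```python
from decimal import Decimal

# Prices copied from the pricing table above (USD).
RUN_START_FEE = Decimal("0.005")
PRICE_PER_PDF = {
    "FREE": Decimal("0.00345"),
    "BRONZE": Decimal("0.003"),
    "SILVER": Decimal("0.00234"),
    "GOLD": Decimal("0.0018"),
}


def estimate_cost(pdf_count: int, tier: str = "BRONZE") -> Decimal:
    """Cost in USD of one run that extracts pdf_count PDFs."""
    return RUN_START_FEE + pdf_count * PRICE_PER_PDF[tier]


def pdfs_within_budget(budget_usd: str, tier: str) -> int:
    """Largest number of PDFs a single run can extract within the budget."""
    remaining = Decimal(budget_usd) - RUN_START_FEE
    return max(0, int(remaining // PRICE_PER_PDF[tier]))


print(estimate_cost(100))  # 0.305
```

Calling `pdfs_within_budget("5", tier)` with the tier that applies to your plan shows how far a given credit stretches.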

How to extract text from PDF files

Go to the PDF Text Extractor page on Apify Store
Click Try for free to open the actor in Apify Console
Paste your PDF URLs into the PDF URLs field (one per line)
Adjust concurrency and timeout settings if needed
Click Start to begin extraction
Download results in JSON, CSV, or Excel format

Example input

{
    "urls": [
        "https://example.com/report-2024.pdf",
        "https://example.com/whitepaper.pdf",
        "https://example.com/invoice-january.pdf"
    ],
    "includePages": true,
    "maxConcurrency": 5
}

Minimal input

{
    "urls": ["https://example.com/document.pdf"]
}

Input parameters

Parameter	Type	Default	Description
`urls`	array of strings	(required)	Direct URLs to PDF files
`includePages`	boolean	`true`	Include per-page text breakdown
`maxConcurrency`	integer	`5`	Parallel PDF downloads (1-20)
`timeoutPerPdfSecs`	integer	`60`	Download timeout per PDF in seconds
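When building the input in code, it can help to check values against the table before starting a run. The helper below is a hypothetical convenience, not part of the actor or the Apify client. It applies the documented defaults and the 1-20 concurrency range; the URL-scheme and positive-timeout checks are extra assumptions.

```python
def build_input(urls, include_pages=True, max_concurrency=5, timeout_per_pdf_secs=60):
    """Return an actor input dict, rejecting values the parameter table rules out."""
    if not urls:
        raise ValueError("urls is required and must not be empty")
    if not all(isinstance(u, str) and u.startswith(("http://", "https://")) for u in urls):
        # Assumption: only http(s) URLs make sense for a public download.
        raise ValueError("every entry in urls must be an http(s) URL string")
    if not 1 <= max_concurrency <= 20:
        raise ValueError("maxConcurrency must be between 1 and 20")
    if timeout_per_pdf_secs <= 0:
        raise ValueError("timeoutPerPdfSecs must be positive")
    return {
        "urls": list(urls),
        "includePages": include_pages,
        "maxConcurrency": max_concurrency,
        "timeoutPerPdfSecs": timeout_per_pdf_secs,
    }


run_input = build_input(["https://example.com/document.pdf"])
```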

Output example

{
    "url": "https://www.orimi.com/pdf-test.pdf",
    "fileName": "pdf-test.pdf",
    "title": "PDF Test Page",
    "author": "Yukon Department of Education",
    "subject": null,
    "keywords": null,
    "creator": "Acrobat PDFMaker 7.0.7 for Word",
    "producer": "Acrobat Distiller 7.0.5 (Windows)",
    "creationDate": "2008-06-04T15:44:00.000Z",
    "modificationDate": "2008-06-04T15:47:36.000Z",
    "pageCount": 1,
    "fullText": "PDF Test File  Congratulations, your computer is equipped with a PDF reader...",
    "pages": [
        {
            "pageNumber": 1,
            "text": "PDF Test File  Congratulations, your computer is equipped with a PDF reader..."
        }
    ],
    "pdfVersion": "1.6",
    "fileSizeBytes": 20597,
    "error": null
}
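A record shaped like the one above can be post-processed with the standard library alone. The sketch below uses a trimmed copy of the example record; `parse_iso` and `summarize` are illustrative names, and the code assumes dates always arrive in the ISO 8601 form shown.

```python
from datetime import datetime


def parse_iso(value):
    """Parse a timestamp such as 2008-06-04T15:44:00.000Z, or return None."""
    if value is None:
        return None
    # Older Python versions reject a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize(record):
    """Condense one dataset row into a few derived values."""
    pages = record.get("pages") or []  # absent when includePages is false
    return {
        "fileName": record["fileName"],
        "created": parse_iso(record.get("creationDate")),
        "wordsPerPage": [len(page["text"].split()) for page in pages],
        "missingMetadata": [
            key for key in ("title", "author", "subject", "keywords")
            if record.get(key) is None
        ],
    }


record = {
    "fileName": "pdf-test.pdf",
    "title": "PDF Test Page",
    "author": "Yukon Department of Education",
    "subject": None,
    "keywords": None,
    "creationDate": "2008-06-04T15:44:00.000Z",
    "pageCount": 1,
    "pages": [{"pageNumber": 1, "text": "PDF Test File  Congratulations"}],
}
summary = summarize(record)
```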

Tips for best results

Start small -- test with 2-3 PDFs first to verify the URLs work and output meets your needs
Use direct PDF URLs -- the URL must point directly to a .pdf file, not a page that contains a PDF viewer
Disable per-page text for large PDFs -- set includePages: false to reduce output size when processing documents with hundreds of pages
Increase timeout for large files -- if you are processing PDFs over 50 MB, increase timeoutPerPdfSecs to 120 or more
Check the error field -- failed PDFs still appear in results with an error message, so you can identify and retry them
Schedule recurring runs -- use Apify's scheduler to automatically extract new PDFs on a daily or weekly basis
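The "check the error field" tip can be turned into a small retry step. This is a sketch under assumptions: `split_results` and `build_retry_input` are hypothetical helpers, and doubling the timeout and lowering concurrency for the retry is just one plausible policy.

```python
def split_results(items):
    """Separate dataset rows into successful and failed extractions."""
    succeeded = [item for item in items if not item.get("error")]
    failed = [item for item in items if item.get("error")]
    return succeeded, failed


def build_retry_input(failed, timeout_per_pdf_secs=120, max_concurrency=2):
    """Build an input that retries only the failed URLs with gentler settings."""
    return {
        "urls": [item["url"] for item in failed],
        "timeoutPerPdfSecs": timeout_per_pdf_secs,
        "maxConcurrency": max_concurrency,
    }


items = [
    {"url": "https://example.com/report.pdf", "error": None},
    {"url": "https://example.com/huge.pdf", "error": "Download timed out"},
]
succeeded, failed = split_results(items)
retry_input = build_retry_input(failed)
```

A retry only helps with transient failures such as timeouts; URLs that do not point to a PDF will fail again.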

Integrations

PDF Text Extractor + Google Sheets -- automatically populate a spreadsheet with extracted text and metadata from new PDF uploads
PDF Text Extractor + Slack -- get notified when PDF extraction completes, with a summary of pages processed and any errors
PDF Text Extractor + Make/Zapier -- trigger PDF extraction when new files are uploaded to Google Drive, Dropbox, or S3
PDF Text Extractor + OpenAI/LLM -- chain extraction with AI summarization to create document summaries from PDF collections
Scheduled runs -- monitor a document repository and extract text from newly published PDFs on a schedule
Webhooks -- trigger downstream processing immediately when extraction completes
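For the OpenAI/LLM integration, extracted text usually has to be cut into pieces that fit a model's context window. The sketch below shows one simple approach, fixed-size character chunks with overlap. The chunk sizes and the `chunk_text` and `record_to_chunks` names are assumptions for illustration, and real pipelines often split on tokens or sentences instead.

```python
def chunk_text(text, max_chars=2000, overlap=200):
    """Split text into overlapping character chunks."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    if not text:
        return []
    chunks, start = [], 0
    while True:
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last chunk reached the end of the text
        start += max_chars - overlap
    return chunks


def record_to_chunks(record, max_chars=2000, overlap=200):
    """Chunk per page when pages are present, otherwise chunk fullText."""
    pages = record.get("pages")
    if pages:
        sources = [(p["pageNumber"], p["text"]) for p in pages]
    else:
        sources = [(None, record.get("fullText") or "")]
    return [
        {"fileName": record.get("fileName"), "pageNumber": number, "text": chunk}
        for number, text in sources
        for chunk in chunk_text(text, max_chars, overlap)
    ]


chunks = chunk_text("a" * 5000)
```

Keeping `pageNumber` on each chunk lets a summary or RAG answer cite the page it came from.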

API usage

Use PDF Text Extractor from code with the Apify API whenever you need automated PDF text extraction in a data pipeline, RAG workflow, document archive, or scheduled monitoring job.

Using the Apify API

Node.js

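import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

// Start the actor and wait for the run to finish.
const run = await client.actor('automation-lab/pdf-text-extractor').call({
    urls: [
        'https://example.com/report.pdf',
        'https://example.com/whitepaper.pdf',
    ],
    includePages: true,
});

// Each PDF is one dataset item; failed PDFs carry a non-null `error`.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
for (const item of items) {
    if (item.error) {
        console.error(`${item.url}: ${item.error}`);
        continue;
    }
    // Guard against missing text so one odd row cannot crash the loop.
    const chars = (item.fullText ?? '').length;
    console.log(`${item.fileName}: ${item.pageCount} pages, ${chars} chars`);
}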

Python

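from apify_client import ApifyClient

client = ApifyClient('YOUR_APIFY_TOKEN')

# Start the actor and wait for the run to finish.
run = client.actor('automation-lab/pdf-text-extractor').call(run_input={
    'urls': [
        'https://example.com/report.pdf',
        'https://example.com/whitepaper.pdf',
    ],
    'includePages': True,
})

# Each PDF is one dataset item; failed PDFs carry a non-null 'error'.
items = client.dataset(run['defaultDatasetId']).list_items().items
for item in items:
    if item.get('error'):
        print(f"{item['url']}: {item['error']}")
        continue
    # Guard against missing text so one odd row cannot crash the loop.
    chars = len(item.get('fullText') or '')
    print(f"{item['fileName']}: {item['pageCount']} pages, {chars} chars")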

cURL

curl -X POST "https://api.apify.com/v2/acts/automation-lab~pdf-text-extractor/runs?token=YOUR_APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com/report.pdf"],
    "includePages": true
  }'
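The same request can be built in Python without extra dependencies. This sketch sends the token in an `Authorization: Bearer` header, which the Apify API accepts and which keeps the token out of URLs and logs. `build_run_request` is an illustrative helper, and the network call is left commented out.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(token, run_input, actor_id="automation-lab~pdf-text-extractor"):
    """Build a POST request that starts an actor run."""
    return urllib.request.Request(
        url=f"{API_BASE}/acts/{actor_id}/runs",
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_run_request(
    "YOUR_APIFY_TOKEN",
    {"urls": ["https://example.com/report.pdf"], "includePages": True},
)

# To actually start the run (needs a real token and network access):
# with urllib.request.urlopen(request) as response:
#     run = json.load(response)["data"]
```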

Use with AI agents via MCP

PDF Text Extractor is available as a tool for AI assistants that support the Model Context Protocol (MCP).

Add the Apify MCP server to your AI client -- this gives you access to all Apify actors, including this one:

Setup for Claude Code

claude mcp add --transport http apify "https://mcp.apify.com"

Setup for Claude Desktop, Cursor, or VS Code

Add this to your MCP config file:

{
    "mcpServers": {
        "apify": {
            "url": "https://mcp.apify.com"
        }
    }
}

Your AI assistant will use OAuth to authenticate with your Apify account on first use.

Example prompts

Once connected, try asking your AI assistant:

"Use automation-lab/pdf-text-extractor to extract all text from this research paper: https://arxiv.org/pdf/1706.03762"
"Extract metadata and page count from these 5 PDF invoices and summarize the results"
"Download and extract text from all PDFs linked on this page, then create a summary of each document"

Learn more in the Apify MCP documentation.

Is it legal to extract text from PDFs?

PDF Text Extractor processes publicly accessible PDF files that you provide URLs for. The actor downloads files the same way a web browser would. Always ensure you have the right to access and process the documents you are extracting text from.

For personal data, comply with GDPR and applicable privacy laws. Review the terms of service for any document repositories you are accessing. Apify provides a general web scraping legality guide for reference.

FAQ

How fast is PDF Text Extractor? Processing speed depends on PDF file size and download speed. A typical 1 MB PDF takes 1-3 seconds to download and parse. With maxConcurrency: 10, you can process 100 average-sized PDFs in under a minute.
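The throughput figure above follows from a back-of-the-envelope model: PDFs are processed in waves of `maxConcurrency`, and each wave takes about as long as one PDF. The sketch below encodes that simplified assumption and ignores actor startup time and uneven file sizes.

```python
import math


def estimate_seconds(pdf_count, max_concurrency, secs_per_pdf):
    """Rough run time: number of waves times the time per PDF."""
    waves = math.ceil(pdf_count / max_concurrency)
    return waves * secs_per_pdf


# 100 PDFs, 10 at a time, at the slow end of the 1-3 second range.
print(estimate_seconds(100, 10, 3))  # 30
```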

How much does it cost to extract text from 1,000 PDFs? At BRONZE tier pricing: $0.005 (start) + 1,000 x $0.003 (per PDF) = $3.005 total. With the free $5 credit, billed at the FREE tier rate of $0.00345 per PDF, you can process about 1,400 PDFs at no cost.

Does it work with scanned PDFs? No. This actor extracts embedded text from PDFs. Scanned documents that contain only images (no selectable text) will return empty text. For scanned PDFs, you would need an OCR (Optical Character Recognition) solution.
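Scanned documents can be flagged after a run by looking for rows that parsed without error but contain no text. This is a heuristic sketch, not a feature of the actor, and `likely_scanned` is a hypothetical helper name.

```python
def likely_scanned(record):
    """True when a PDF parsed fine and has pages but no extractable text."""
    if record.get("error"):
        return False  # a failed row says nothing about the content
    has_pages = (record.get("pageCount") or 0) > 0
    has_text = bool((record.get("fullText") or "").strip())
    return has_pages and not has_text


records = [
    {"url": "https://example.com/scan.pdf", "pageCount": 3, "fullText": "  \n", "error": None},
    {"url": "https://example.com/report.pdf", "pageCount": 2, "fullText": "Annual report", "error": None},
]
needs_ocr = [r["url"] for r in records if likely_scanned(r)]
```

The URLs collected in `needs_ocr` are the ones to route to an OCR tool.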

Why are some PDF fields returning null? Not all PDFs include metadata. The title, author, subject, and keywords fields depend on what the PDF creator set when generating the document. Many auto-generated PDFs leave these fields empty.

Why did a PDF fail with "Invalid PDF structure"? The URL may not point to an actual PDF file. Ensure the URL returns a direct PDF download, not an HTML page with an embedded PDF viewer. Some servers also require specific headers or authentication.
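A quick way to tell a real PDF from an HTML viewer page is to look at the first bytes of the response: PDF files begin with a `%PDF-` header. The sketch below applies that check to bytes you have already downloaded. Searching the first 1024 bytes rather than only the very start is a lenient assumption modelled on what many readers tolerate, and `looks_like_pdf` is an illustrative name.

```python
def looks_like_pdf(data: bytes) -> bool:
    """True when the %PDF- header appears near the start of the data."""
    return b"%PDF-" in data[:1024]


pdf_bytes = b"%PDF-1.6\n%\xe2\xe3\xcf\xd3\n1 0 obj"
html_bytes = b"<!DOCTYPE html><html><body><embed src='file.pdf'></body></html>"

print(looks_like_pdf(pdf_bytes))   # True
print(looks_like_pdf(html_bytes))  # False
```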

Can I extract text from password-protected PDFs? No. Password-protected (encrypted) PDFs cannot be parsed without the password. The actor will return an error for these files.

Related actors

Markdown to PDF Converter -- convert Markdown text into formatted PDF documents
HTML to PDF Converter -- convert web pages and HTML into PDF files
Webpage to Markdown Converter -- extract clean Markdown from any webpage
Fake Test Data Generator -- generate bulk test data in JSON, CSV, or Excel
Unicode Text Inspector -- analyze text encoding and hidden characters
