PDF Parser API

Instant API that parses any PDF from a URL -- extracts full text, page count, metadata (title, author, dates), and PDF version. Returns structured JSON. Perfect for document processing pipelines and AI agents.

Pricing: Pay per usage
Developer: George Kioko (Maintained by Community)

PDF Parser API - Extract Text & Metadata from PDF Files

A fast, reliable PDF parser API that extracts text content, metadata, page count, and word count from any publicly accessible PDF file. Simply provide a PDF URL and get back structured JSON with the full text and document properties -- perfect for RAG pipelines, document processing, and AI training data preparation.

Built as an always-on Standby API on Apify, it responds instantly with no cold starts, no queues, and no SDK required.

Key Features

  • Full text extraction -- get every word from any PDF, ready for indexing or NLP
  • Rich metadata -- title, author, subject, creator, producer, creation/modification dates
  • Page & word counts -- instant document statistics without downloading the file yourself
  • PDF version detection -- know exactly what PDF spec the document uses
  • GET and POST endpoints -- use query parameters or JSON body, your choice
  • CORS enabled -- call directly from browser-based apps
  • Magic-byte validation -- rejects non-PDF files before wasting parse time
  • Password-protected detection -- returns a clear error instead of crashing
  • Streaming size guard -- enforces the 50 MB limit even when Content-Length is missing
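
The last two guards can be sketched in a few lines of Python. This is an illustrative sketch of the technique, not the Actor's actual source; `is_pdf_magic` and `read_with_limit` are hypothetical names:

```python
# Illustrative sketch (not the Actor's source) of how magic-byte validation and
# a streaming size guard can work together.
MAX_BYTES = 50 * 1024 * 1024  # the documented 50 MB limit

def is_pdf_magic(first_chunk: bytes) -> bool:
    """Real PDFs start with the magic bytes %PDF- (a stray BOM may precede them)."""
    return first_chunk.lstrip(b"\xef\xbb\xbf").startswith(b"%PDF-")

def read_with_limit(chunks, limit=MAX_BYTES) -> bytes:
    """Accumulate download chunks, failing as soon as the running total passes
    the limit -- so the cap holds even when Content-Length is missing."""
    total, parts = 0, []
    for chunk in chunks:
        total += len(chunk)
        if total > limit:
            raise ValueError(f"download exceeds {limit} bytes")
        parts.append(chunk)
    return b"".join(parts)
```

Checking the first chunk's magic bytes before reading the rest is what lets the API reject non-PDF files without downloading them in full.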

How It Works

flowchart LR
A["Client\n(curl / Python / JS)"] -->|HTTP GET or POST\nwith PDF URL| B["PDF Parser API\n(Apify Standby)"]
B -->|Download PDF| C["Remote PDF\nServer"]
C -->|PDF binary| B
B -->|pdf-parse\nprocessing| D["Extracted Data"]
D -->|JSON response| A
style A fill:#e8f4fd,stroke:#2196F3
style B fill:#fff3e0,stroke:#FF9800
style C fill:#f3e5f5,stroke:#9C27B0
style D fill:#e8f5e9,stroke:#4CAF50

Endpoints

Method | Path                 | Description
GET    | /parse?url=<pdf_url> | Parse a PDF by passing the URL as a query parameter
POST   | /parse               | Parse a PDF by sending {"url": "<pdf_url>"} as JSON body
GET    | /health              | Health check -- returns {"status": "ok"}
GET    | /                    | Service info with usage instructions

Input

GET request

Pass the PDF URL as a query parameter:

GET /parse?url=https://example.com/document.pdf
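
If the PDF URL contains its own query string, percent-encode it before placing it in the `url` parameter so its `?`, `&`, and `=` characters don't break the outer query. A sketch using Python's standard library (the example URL is made up):

```python
from urllib.parse import urlencode

# A PDF URL that itself carries query parameters (hypothetical example)
pdf_url = "https://example.com/report.pdf?version=2&lang=en"

# urlencode percent-encodes the nested URL safely
query = urlencode({"url": pdf_url})
request_url = f"https://pdf-parser-api.apify.actor/parse?{query}"
```

Most HTTP client libraries (e.g. `requests` with `params=`) do this encoding automatically.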

POST request

Send a JSON body with the url field:

{
  "url": "https://example.com/document.pdf"
}

Output

A successful response returns structured JSON:

{
  "success": true,
  "pages": 12,
  "text": "Full extracted text content of the PDF document...",
  "metadata": {
    "title": "Annual Report 2025",
    "author": "Jane Smith",
    "subject": "Financial Summary",
    "creator": "Microsoft Word",
    "producer": "macOS Quartz PDFContext",
    "creationDate": "D:20250115102030Z",
    "modDate": "D:20250120083000Z"
  },
  "pdfVersion": "1.7",
  "textLength": 48320,
  "wordCount": 7841,
  "processingTimeMs": 342
}

Error response

{
  "success": false,
  "error": "PDF is password-protected and cannot be parsed."
}
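
A client should branch on the `success` flag rather than assume the text fields are present. A minimal sketch (the helper name is hypothetical):

```python
# Branch on the `success` flag returned by the /parse endpoint.
def handle_response(payload: dict) -> str:
    """Return the extracted text, or raise with the API's error message."""
    if payload.get("success"):
        return payload["text"]
    raise RuntimeError(payload.get("error", "unknown parser error"))
```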

How to Use

Using curl (GET)

curl "https://pdf-parser-api.apify.actor/parse?url=https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"

Using curl (POST)

curl -X POST "https://pdf-parser-api.apify.actor/parse" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}'

Health check

curl "https://pdf-parser-api.apify.actor/health"

Integration Examples

Python

import requests

response = requests.get(
    "https://pdf-parser-api.apify.actor/parse",
    params={"url": "https://example.com/report.pdf"},
)
data = response.json()
print(f"Pages: {data['pages']}")
print(f"Words: {data['wordCount']}")
print(f"Title: {data['metadata']['title']}")
print(f"Text preview: {data['text'][:500]}")

Node.js

const response = await fetch("https://pdf-parser-api.apify.actor/parse", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://example.com/report.pdf",
  }),
});
const data = await response.json();
console.log(`Pages: ${data.pages}`);
console.log(`Words: ${data.wordCount}`);
console.log(`Text preview: ${data.text.slice(0, 500)}`);

RAG Pipeline (Python + LangChain)

import requests
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Extract text from PDF
resp = requests.get(
    "https://pdf-parser-api.apify.actor/parse",
    params={"url": "https://example.com/knowledge-base.pdf"},
)
pdf_data = resp.json()

# Chunk for vector store
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(pdf_data["text"])

# Each chunk is ready for embedding and indexing
print(f"Split {pdf_data['wordCount']} words into {len(chunks)} chunks")

Use Cases

  • RAG pipelines -- extract text from PDFs and chunk it for vector databases (Pinecone, Weaviate, Chroma)
  • Document processing -- batch-process invoices, contracts, and reports into structured data
  • AI training data -- convert PDF corpora into clean text for fine-tuning language models
  • Legal & compliance -- parse regulatory filings, court documents, and compliance reports at scale
  • Academic research -- extract text from research papers for citation analysis or literature reviews
  • Content migration -- pull text from legacy PDF archives into modern CMS platforms
  • Search indexing -- feed PDF content into Elasticsearch, Algolia, or Meilisearch
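
For the batch-processing use cases above, requests can be fanned out concurrently. A hedged sketch: `parse_many` is a hypothetical helper, and the `fetch` callable is injected so the fan-out logic stays independent of the HTTP client:

```python
from concurrent.futures import ThreadPoolExecutor

def parse_many(urls, fetch, max_workers=4):
    """Parse several PDFs concurrently; `fetch` maps a URL to a parsed-JSON
    dict. Results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

In practice `fetch` would wrap the API, e.g. `fetch = lambda u: requests.get("https://pdf-parser-api.apify.actor/parse", params={"url": u}).json()`.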

Pricing

Event                   | Cost
PDF parsed successfully | $0.004 per PDF

You only pay when a PDF is successfully parsed. Failed requests (invalid URL, timeout, password-protected files) are not charged.

Limitations

Constraint              | Limit
Maximum file size       | 50 MB
Download timeout        | 60 seconds
Request body size       | 1 MB (for POST requests)
Scanned PDFs            | No OCR -- only digitally created PDFs with embedded text are supported
Password-protected PDFs | Not supported -- returns a clear error message
Protocols               | HTTP and HTTPS only -- no local file paths or FTP

FAQ

Does this API support scanned PDFs or images inside PDFs?

No. This API extracts embedded text from digitally created PDFs. If a PDF was created by scanning paper documents and contains only images, the extracted text will be empty or minimal. For scanned PDFs, you would need an OCR service as a preprocessing step.

What happens if the PDF is too large or the download times out?

The API enforces a 50 MB file size limit and a 60-second download timeout. If either limit is exceeded, you will receive a clear error response with the appropriate HTTP status code (413 for size, 408 for timeout). You are not charged for failed requests.
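
A client can use those status codes to decide whether a retry is worthwhile. A sketch; treating 408 and 5xx as retryable is an assumption on our part, only 413 and 408 are documented above:

```python
# Map status codes to a client action. 413 (too large) can never succeed on
# retry; 408 (timeout) and 5xx (assumed transient) may. This is a client-side
# heuristic, not documented API behavior.
def next_action(status: int) -> str:
    if status == 200:
        return "ok"
    if status == 413:
        return "give_up"
    if status == 408 or status >= 500:
        return "retry"
    return "give_up"
```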

Can I parse PDFs that require authentication or are behind a login?

The API fetches PDFs from the URL you provide using a standard HTTP request. If the PDF requires cookies, authentication headers, or is behind a login wall, the download will likely fail. The PDF must be publicly accessible or accessible via a direct URL with any required tokens embedded in the query string.

What metadata fields are extracted?

The API extracts seven metadata fields when available: title, author, subject, creator (the application that created the document), producer (the PDF library used), creation date, and modification date. Not all PDFs contain all metadata fields -- missing fields are returned as null.
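
The date fields use the PDF date-string convention (`D:YYYYMMDDHHmmSS...`), as in the sample output above. A minimal sketch for converting them to `datetime`; it handles the `Z` suffix but ignores numeric timezone offsets, and the helper name is our own:

```python
from datetime import datetime, timezone

def parse_pdf_date(value):
    """Convert a PDF 'D:YYYYMMDDHHmmSS' metadata date to a datetime.
    Returns None for missing (null) fields; offset suffixes beyond 'Z'
    are ignored in this sketch."""
    if not value:
        return None
    digits = value[2:16] if value.startswith("D:") else value[:14]
    dt = datetime.strptime(digits, "%Y%m%d%H%M%S")
    return dt.replace(tzinfo=timezone.utc) if "Z" in value else dt
```

For example, the sample `creationDate` of `"D:20250115102030Z"` parses to 2025-01-15 10:20:30 UTC.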


Built by George The Developer on Apify.