PDF Parser API - Extract Text & Metadata from PDF Files

A fast, reliable PDF parser API that extracts text content, metadata, page count, and word count from any publicly accessible PDF file. Simply provide a PDF URL and get back structured JSON with the full text and document properties -- perfect for RAG pipelines, document processing, and AI training data preparation.

Built as an always-on Standby API on Apify, it responds instantly with no cold starts, no queues, and no SDK required.

Key Features

  • Full text extraction -- get every word from any PDF, ready for indexing or NLP
  • Rich metadata -- title, author, subject, creator, producer, creation/modification dates
  • Page & word counts -- instant document statistics without downloading the file yourself
  • PDF version detection -- know exactly what PDF spec the document uses
  • GET and POST endpoints -- use query parameters or JSON body, your choice
  • CORS enabled -- call directly from browser-based apps
  • Magic-byte validation -- rejects non-PDF files before wasting parse time
  • Password-protected detection -- returns a clear error instead of crashing
  • Streaming size guard -- enforces the 50 MB limit even when Content-Length is missing
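The magic-byte check and streaming size guard can also be approximated client-side before you spend a request. A minimal sketch (function names are mine, not part of the API):

```python
MAX_BYTES = 50 * 1024 * 1024  # mirrors the API's 50 MB limit

def is_pdf_header(first_bytes: bytes) -> bool:
    """PDF files begin with the magic bytes %PDF (e.g. b'%PDF-1.7')."""
    return first_bytes.startswith(b"%PDF")

def read_with_size_guard(chunks, max_bytes: int = MAX_BYTES) -> bytes:
    """Accumulate streamed chunks, failing as soon as the running total
    exceeds max_bytes -- this works even when no Content-Length is sent."""
    total = 0
    buf = bytearray()
    for chunk in chunks:
        total += len(chunk)
        if total > max_bytes:
            raise ValueError(f"stream exceeded {max_bytes} bytes")
        buf.extend(chunk)
    return bytes(buf)
```

Feed `read_with_size_guard` the chunk iterator of any streaming download (for example `response.iter_content()` from `requests`) to get the same fail-fast behavior the API applies server-side.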

How It Works

```mermaid
flowchart LR
    A["Client\n(curl / Python / JS)"] -->|HTTP GET or POST\nwith PDF URL| B["PDF Parser API\n(Apify Standby)"]
    B -->|Download PDF| C["Remote PDF\nServer"]
    C -->|PDF binary| B
    B -->|pdf-parse\nprocessing| D["Extracted Data"]
    D -->|JSON response| A
    style A fill:#e8f4fd,stroke:#2196F3
    style B fill:#fff3e0,stroke:#FF9800
    style C fill:#f3e5f5,stroke:#9C27B0
    style D fill:#e8f5e9,stroke:#4CAF50
```

Endpoints

| Method | Path | Description |
|--------|------|-------------|
| GET | `/parse?url=<pdf_url>` | Parse a PDF by passing the URL as a query parameter |
| POST | `/parse` | Parse a PDF by sending `{"url": "<pdf_url>"}` as a JSON body |
| GET | `/health` | Health check -- returns `{"status": "ok"}` |
| GET | `/` | Service info with usage instructions |

Input

GET request

Pass the PDF URL as a query parameter:

```
GET /parse?url=https://example.com/document.pdf
```

POST request

Send a JSON body with the url field:

```json
{
  "url": "https://example.com/document.pdf"
}
```

Output

A successful response returns structured JSON:

```json
{
  "success": true,
  "pages": 12,
  "text": "Full extracted text content of the PDF document...",
  "metadata": {
    "title": "Annual Report 2025",
    "author": "Jane Smith",
    "subject": "Financial Summary",
    "creator": "Microsoft Word",
    "producer": "macOS Quartz PDFContext",
    "creationDate": "D:20250115102030Z",
    "modDate": "D:20250120083000Z"
  },
  "pdfVersion": "1.7",
  "textLength": 48320,
  "wordCount": 7841,
  "processingTimeMs": 342
}
```
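Note that `creationDate` and `modDate` use the PDF specification's date format (`D:YYYYMMDDHHmmSS` plus a timezone marker) rather than ISO 8601. A minimal converter for the common UTC case (the function name is mine, and this sketch ignores the spec's `+HH'mm'` offset variants):

```python
from datetime import datetime, timezone

def parse_pdf_date(raw: str) -> datetime:
    """Convert a PDF-spec date like 'D:20250115102030Z' to a datetime.
    Only the UTC ('Z') and bare variants are handled in this sketch."""
    s = raw[2:] if raw.startswith("D:") else raw
    s = s.rstrip("Z")
    return datetime.strptime(s[:14], "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
```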

Error response

```json
{
  "success": false,
  "error": "PDF is password-protected and cannot be parsed."
}
```
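Because success and error responses share the `success` flag, a client can branch on it with one helper. A minimal sketch (the helper name is mine, not part of the API):

```python
def check_response(data: dict) -> dict:
    """Pass a successful /parse response through; otherwise raise with the
    API's own error message. Pair it with any HTTP client, e.g.:
        data = check_response(requests.get(...).json())
    """
    if not data.get("success"):
        raise RuntimeError(data.get("error", "unknown API error"))
    return data
```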

How to Use

Using curl (GET)

```bash
curl "https://pdf-parser-api.apify.actor/parse?url=https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
```

Using curl (POST)

```bash
curl -X POST "https://pdf-parser-api.apify.actor/parse" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}'
```

Health check

```bash
curl "https://pdf-parser-api.apify.actor/health"
```

Integration Examples

Python

```python
import requests

response = requests.get(
    "https://pdf-parser-api.apify.actor/parse",
    params={"url": "https://example.com/report.pdf"},
)
data = response.json()
print(f"Pages: {data['pages']}")
print(f"Words: {data['wordCount']}")
print(f"Title: {data['metadata']['title']}")
print(f"Text preview: {data['text'][:500]}")
```

Node.js

```javascript
const response = await fetch("https://pdf-parser-api.apify.actor/parse", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://example.com/report.pdf",
  }),
});
const data = await response.json();
console.log(`Pages: ${data.pages}`);
console.log(`Words: ${data.wordCount}`);
console.log(`Text preview: ${data.text.slice(0, 500)}`);
```

RAG Pipeline (Python + LangChain)

```python
import requests
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Extract text from the PDF
resp = requests.get(
    "https://pdf-parser-api.apify.actor/parse",
    params={"url": "https://example.com/knowledge-base.pdf"},
)
pdf_data = resp.json()

# Chunk for a vector store
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(pdf_data["text"])

# Each chunk is ready for embedding and indexing
print(f"Split {pdf_data['wordCount']} words into {len(chunks)} chunks")
```

Use Cases

  • RAG pipelines -- extract text from PDFs and chunk it for vector databases (Pinecone, Weaviate, Chroma)
  • Document processing -- batch-process invoices, contracts, and reports into structured data
  • AI training data -- convert PDF corpora into clean text for fine-tuning language models
  • Legal & compliance -- parse regulatory filings, court documents, and compliance reports at scale
  • Academic research -- extract text from research papers for citation analysis or literature reviews
  • Content migration -- pull text from legacy PDF archives into modern CMS platforms
  • Search indexing -- feed PDF content into Elasticsearch, Algolia, or Meilisearch
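For the batch-oriented use cases above, several PDFs can be parsed concurrently with a thread pool. A minimal sketch (the function names are mine; `fetch` is any callable that takes a PDF URL and returns the parsed JSON, such as a thin wrapper around the `requests` example above):

```python
from concurrent.futures import ThreadPoolExecutor

def parse_many(urls, fetch, workers: int = 5):
    """Run `fetch` across a thread pool; results come back in input order.
    `fetch` should GET /parse for one URL and return the decoded JSON."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

Keeping `workers` modest avoids hammering either the API or the servers hosting the source PDFs.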

Pricing

| Event | Cost |
|-------|------|
| PDF parsed successfully | $0.004 per PDF |

You only pay when a PDF is successfully parsed. Failed requests (invalid URL, timeout, password-protected files) are not charged.

Limitations

| Constraint | Limit |
|------------|-------|
| Maximum file size | 50 MB |
| Download timeout | 60 seconds |
| Request body size | 1 MB (for POST requests) |
| Scanned PDFs | No OCR -- only digitally created PDFs with embedded text are supported |
| Password-protected PDFs | Not supported -- returns a clear error message |
| Protocols | HTTP and HTTPS only -- no local file paths or FTP |
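You can preflight the file-size limit yourself by issuing a HEAD request to the PDF URL and inspecting its headers. A minimal sketch (the function name is mine; note that some servers omit `Content-Length`, in which case only the API's streaming guard can decide):

```python
MAX_BYTES = 50 * 1024 * 1024  # the API's 50 MB limit

def within_size_limit(headers: dict, max_bytes: int = MAX_BYTES) -> bool:
    """Check a HEAD response's headers; treat a missing or malformed
    Content-Length as unknown and defer to the server-side streaming guard."""
    length = headers.get("Content-Length")
    if length is None or not str(length).isdigit():
        return True  # unknown size -- let the API decide
    return int(length) <= max_bytes
```

Call it with the header mapping from your HTTP client, e.g. `within_size_limit(requests.head(url, allow_redirects=True).headers)`.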

FAQ

Does this API support scanned PDFs or images inside PDFs?

No. This API extracts embedded text from digitally created PDFs. If a PDF was created by scanning paper documents and contains only images, the extracted text will be empty or minimal. For scanned PDFs, you would need an OCR service as a preprocessing step.

What happens if the PDF is too large or the download times out?

The API enforces a 50 MB file size limit and a 60-second download timeout. If either limit is exceeded, you will receive a clear error response with the appropriate HTTP status code (413 for size, 408 for timeout). You are not charged for failed requests.
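Since the two failure modes differ in kind, a client can retry a 408 (the remote server may simply have been slow) while treating a 413 as permanent. A minimal retry-decision sketch under that assumption (the function name is mine, not part of the API):

```python
RETRYABLE_STATUSES = {408}  # download timeout may be transient

def should_retry(status: int, attempt: int, max_attempts: int = 3) -> bool:
    """Retry only on the timeout status; 413 (file too large) will never
    succeed on retry. `attempt` is zero-based."""
    return status in RETRYABLE_STATUSES and attempt < max_attempts - 1
```

Combine this with exponential backoff (e.g. `time.sleep(2 ** attempt)`) between attempts.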

Can I parse PDFs that require authentication or are behind a login?

The API fetches PDFs from the URL you provide using a standard HTTP request. If the PDF requires cookies, authentication headers, or is behind a login wall, the download will likely fail. The PDF must be publicly accessible or accessible via a direct URL with any required tokens embedded in the query string.

What metadata fields are extracted?

The API extracts seven metadata fields when available: title, author, subject, creator (the application that created the document), producer (the PDF library used), creation date, and modification date. Not all PDFs contain all metadata fields -- missing fields are returned as null.
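Because any of those seven fields may come back as null, downstream code should substitute defaults rather than assume presence. A minimal sketch (the function name and placeholder strings are mine):

```python
def metadata_summary(data: dict) -> str:
    """One-line summary of a /parse response that tolerates null or
    missing metadata fields."""
    meta = data.get("metadata") or {}
    title = meta.get("title") or "(untitled)"
    author = meta.get("author") or "(unknown author)"
    return f"{title} by {author}, {data.get('pages', '?')} pages"
```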


Built by George The Developer on Apify.