PDF Parser API
Instant API that parses any PDF from a URL -- extracts full text, page count, metadata (title, author, dates), and PDF version. Returns structured JSON. Perfect for document processing pipelines and AI agents.
Developer: George Kioko
Pricing: Pay per usage
PDF Parser API - Extract Text & Metadata from PDF Files
A fast, reliable PDF parser API that extracts text content, metadata, page count, and word count from any publicly accessible PDF file. Simply provide a PDF URL and get back structured JSON with the full text and document properties -- perfect for RAG pipelines, document processing, and AI training data preparation.
Built as an always-on Standby API on Apify, it responds instantly with no cold starts, no queues, and no SDK required.
Key Features
- Full text extraction -- get every word from any PDF, ready for indexing or NLP
- Rich metadata -- title, author, subject, creator, producer, creation/modification dates
- Page & word counts -- instant document statistics without downloading the file yourself
- PDF version detection -- know exactly what PDF spec the document uses
- GET and POST endpoints -- use query parameters or JSON body, your choice
- CORS enabled -- call directly from browser-based apps
- Magic-byte validation -- rejects non-PDF files before wasting parse time
- Password-protected detection -- returns a clear error instead of crashing
- Streaming size guard -- enforces the 50 MB limit even when Content-Length is missing
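To illustrate the magic-byte validation listed above, here is a minimal Python sketch. The `looks_like_pdf` helper is illustrative only, not the Actor's actual code:

```python
def looks_like_pdf(first_bytes: bytes) -> bool:
    """Return True if the buffer starts with the PDF magic bytes.

    Every valid PDF begins with the ASCII marker "%PDF-" followed by a
    version number (e.g. "%PDF-1.7"). Checking the first few bytes of a
    download lets a service reject non-PDF files before spending any
    time on parsing.
    """
    return first_bytes.startswith(b"%PDF-")

print(looks_like_pdf(b"%PDF-1.7\n"))   # True: a real PDF header
print(looks_like_pdf(b"<html><body>"))  # False: an HTML error page
```

The same idea generalizes to any file type with a known signature; only the first handful of bytes need to be read.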
How It Works
```mermaid
flowchart LR
    A["Client\n(curl / Python / JS)"] -->|HTTP GET or POST\nwith PDF URL| B["PDF Parser API\n(Apify Standby)"]
    B -->|Download PDF| C["Remote PDF\nServer"]
    C -->|PDF binary| B
    B -->|pdf-parse\nprocessing| D["Extracted Data"]
    D -->|JSON response| A
    style A fill:#e8f4fd,stroke:#2196F3
    style B fill:#fff3e0,stroke:#FF9800
    style C fill:#f3e5f5,stroke:#9C27B0
    style D fill:#e8f5e9,stroke:#4CAF50
```
Endpoints
| Method | Path | Description |
|---|---|---|
| GET | /parse?url=<pdf_url> | Parse a PDF by passing the URL as a query parameter |
| POST | /parse | Parse a PDF by sending {"url": "<pdf_url>"} as a JSON body |
| GET | /health | Health check -- returns {"status": "ok"} |
| GET | / | Service info with usage instructions |
Input
GET request
Pass the PDF URL as a query parameter:
```
GET /parse?url=https://example.com/document.pdf
```
POST request
Send a JSON body with the url field:
```json
{"url": "https://example.com/document.pdf"}
```
Output
A successful response returns structured JSON:
```json
{
  "success": true,
  "pages": 12,
  "text": "Full extracted text content of the PDF document...",
  "metadata": {
    "title": "Annual Report 2025",
    "author": "Jane Smith",
    "subject": "Financial Summary",
    "creator": "Microsoft Word",
    "producer": "macOS Quartz PDFContext",
    "creationDate": "D:20250115102030Z",
    "modDate": "D:20250120083000Z"
  },
  "pdfVersion": "1.7",
  "textLength": 48320,
  "wordCount": 7841,
  "processingTimeMs": 342
}
```
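The creationDate and modDate fields use the raw PDF date format (D:YYYYMMDDHHMMSS...). A small client-side helper can convert them to a standard datetime; this is an illustrative sketch, not part of the API, and it ignores optional timezone suffixes such as +02'00':

```python
from datetime import datetime

def parse_pdf_date(raw: str) -> datetime:
    """Convert a PDF-format date like 'D:20250115102030Z' to a datetime.

    Only the common 'D:YYYYMMDDHHMMSS' core is handled here; any
    trailing timezone marker is dropped.
    """
    digits = raw.removeprefix("D:")[:14]  # keep YYYYMMDDHHMMSS
    return datetime.strptime(digits, "%Y%m%d%H%M%S")

created = parse_pdf_date("D:20250115102030Z")
print(created.isoformat())  # 2025-01-15T10:20:30
```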
Error response
```json
{"success": false, "error": "PDF is password-protected and cannot be parsed."}
```
How to Use
Using curl (GET)
```bash
curl "https://pdf-parser-api.apify.actor/parse?url=https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
```
Using curl (POST)
```bash
curl -X POST "https://pdf-parser-api.apify.actor/parse" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}'
```
Health check
```bash
curl "https://pdf-parser-api.apify.actor/health"
```
Integration Examples
Python
```python
import requests

response = requests.get(
    "https://pdf-parser-api.apify.actor/parse",
    params={"url": "https://example.com/report.pdf"},
)
data = response.json()
print(f"Pages: {data['pages']}")
print(f"Words: {data['wordCount']}")
print(f"Title: {data['metadata']['title']}")
print(f"Text preview: {data['text'][:500]}")
```
Node.js
```javascript
const response = await fetch("https://pdf-parser-api.apify.actor/parse", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://example.com/report.pdf",
  }),
});
const data = await response.json();
console.log(`Pages: ${data.pages}`);
console.log(`Words: ${data.wordCount}`);
console.log(`Text preview: ${data.text.slice(0, 500)}`);
```
RAG Pipeline (Python + LangChain)
```python
import requests
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Extract text from PDF
resp = requests.get(
    "https://pdf-parser-api.apify.actor/parse",
    params={"url": "https://example.com/knowledge-base.pdf"},
)
pdf_data = resp.json()

# Chunk for vector store
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(pdf_data["text"])

# Each chunk is ready for embedding and indexing
print(f"Split {pdf_data['wordCount']} words into {len(chunks)} chunks")
```
Use Cases
- RAG pipelines -- extract text from PDFs and chunk it for vector databases (Pinecone, Weaviate, Chroma)
- Document processing -- batch-process invoices, contracts, and reports into structured data
- AI training data -- convert PDF corpora into clean text for fine-tuning language models
- Legal & compliance -- parse regulatory filings, court documents, and compliance reports at scale
- Academic research -- extract text from research papers for citation analysis or literature reviews
- Content migration -- pull text from legacy PDF archives into modern CMS platforms
- Search indexing -- feed PDF content into Elasticsearch, Algolia, or Meilisearch
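For the search-indexing use case, the API response maps naturally onto an index document. The field names below are illustrative; adapt them to your Elasticsearch, Algolia, or Meilisearch schema:

```python
def to_search_doc(api_response: dict) -> dict:
    """Map a successful parser response to a generic search-index document."""
    meta = api_response.get("metadata") or {}
    return {
        "title": meta.get("title") or "Untitled",
        "author": meta.get("author"),
        "body": api_response["text"],
        "pages": api_response["pages"],
        "words": api_response["wordCount"],
    }

sample = {
    "success": True,
    "pages": 12,
    "text": "Full extracted text...",
    "metadata": {"title": "Annual Report 2025", "author": "Jane Smith"},
    "wordCount": 7841,
}
print(to_search_doc(sample)["title"])  # Annual Report 2025
```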
Pricing
| Event | Cost |
|---|---|
| PDF parsed successfully | $0.004 per PDF |
You only pay when a PDF is successfully parsed. Failed requests (invalid URL, timeout, password-protected files) are not charged.
Limitations
| Constraint | Limit |
|---|---|
| Maximum file size | 50 MB |
| Download timeout | 60 seconds |
| Request body size | 1 MB (for POST requests) |
| Scanned PDFs | No OCR -- only digitally created PDFs with embedded text are supported |
| Password-protected PDFs | Not supported -- returns a clear error message |
| Protocols | HTTP and HTTPS only -- no local file paths or FTP |
FAQ
Does this API support scanned PDFs or images inside PDFs?
No. This API extracts embedded text from digitally created PDFs. If a PDF was created by scanning paper documents and contains only images, the extracted text will be empty or minimal. For scanned PDFs, you would need an OCR service as a preprocessing step.
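A simple client-side heuristic can flag parse results that probably came from a scanned PDF, so those documents can be routed to an OCR step instead. The threshold below is an illustrative assumption, not part of the API:

```python
def likely_scanned(pages: int, word_count: int, min_words_per_page: int = 20) -> bool:
    """Heuristic: flag a parse result that probably came from a scanned PDF.

    Digitally created PDFs typically yield dozens to hundreds of words
    per page; a near-zero word count across many pages suggests the
    pages are images with no embedded text.
    """
    if pages == 0:
        return False
    return (word_count / pages) < min_words_per_page

print(likely_scanned(pages=12, word_count=7841))  # False: plenty of text
print(likely_scanned(pages=30, word_count=15))    # True: almost no text
```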
What happens if the PDF is too large or the download times out?
The API enforces a 50 MB file size limit and a 60-second download timeout. If either limit is exceeded, you will receive a clear error response with the appropriate HTTP status code (413 for size, 408 for timeout). You are not charged for failed requests.
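A client can branch on the two documented status codes when a request fails. The helper below is a sketch of that pattern; only 413 and 408 are documented here, so anything else is reported generically:

```python
def describe_failure(status_code: int) -> str:
    """Map the documented error status codes to a human-readable reason."""
    reasons = {
        413: "PDF exceeds the 50 MB size limit",
        408: "download did not finish within the 60-second timeout",
    }
    return reasons.get(status_code, f"request failed with HTTP {status_code}")

print(describe_failure(413))  # PDF exceeds the 50 MB size limit
print(describe_failure(408))  # download did not finish within the 60-second timeout
print(describe_failure(502))  # request failed with HTTP 502
```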
Can I parse PDFs that require authentication or are behind a login?
The API fetches PDFs from the URL you provide using a standard HTTP request. If the PDF requires cookies, authentication headers, or is behind a login wall, the download will likely fail. The PDF must be publicly accessible or accessible via a direct URL with any required tokens embedded in the query string.
What metadata fields are extracted?
The API extracts seven metadata fields when available: title, author, subject, creator (the application that created the document), producer (the PDF library used), creation date, and modification date. Not all PDFs contain all metadata fields -- missing fields are returned as null.
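Since missing metadata fields come back as null, downstream code may want to drop them before display or indexing. A minimal sketch of that cleanup, assuming the seven-field metadata object described above:

```python
def present_metadata(metadata: dict) -> dict:
    """Keep only the metadata fields the PDF actually defined.

    Fields the document lacks are returned by the API as null (None in
    Python), so filtering them out leaves just the usable values.
    """
    return {k: v for k, v in metadata.items() if v is not None}

sample = {"title": "Annual Report 2025", "author": None, "subject": None}
print(present_metadata(sample))  # {'title': 'Annual Report 2025'}
```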
Built by George The Developer on Apify.