PDF Parser API
Pricing
from $4.00 / 1,000 PDFs parsed
Instant API that parses any PDF from a URL — extracts full text, page count, metadata (title, author, dates), and PDF version. Returns structured JSON. Perfect for document processing pipelines and AI agents.
Developer: George Kioko
Last modified: 13 days ago
PDF Parser API - Extract Text & Metadata from PDF Files
A fast, reliable PDF parser API that extracts text content, metadata, page count, and word count from any publicly accessible PDF file. Simply provide a PDF URL and get back structured JSON with the full text and document properties -- perfect for RAG pipelines, document processing, and AI training data preparation.
Built as an always-on Standby API on Apify, it responds instantly with no cold starts, no queues, and no SDK required.
Key Features
- Full text extraction -- get every word from any PDF, ready for indexing or NLP
- Rich metadata -- title, author, subject, creator, producer, creation/modification dates
- Page & word counts -- instant document statistics without downloading the file yourself
- PDF version detection -- know exactly what PDF spec the document uses
- GET and POST endpoints -- use query parameters or JSON body, your choice
- CORS enabled -- call directly from browser-based apps
- Magic-byte validation -- rejects non-PDF files before wasting parse time
- Password-protected detection -- returns a clear error instead of crashing
- Streaming size guard -- enforces the 50 MB limit even when Content-Length is missing
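The magic-byte check mentioned above can also be mirrored client-side to skip obviously invalid files before spending a request. A minimal sketch (the helper name is illustrative, not part of the API):

```python
def looks_like_pdf(data: bytes) -> bool:
    """Check the PDF magic bytes before sending a URL for parsing.

    Mirrors the server-side pre-check: every real PDF file begins
    with the ASCII marker "%PDF-".
    """
    return data[:5] == b"%PDF-"
```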
How It Works
```mermaid
flowchart LR
    A["Client\n(curl / Python / JS)"] -->|HTTP GET or POST\nwith PDF URL| B["PDF Parser API\n(Apify Standby)"]
    B -->|Download PDF| C["Remote PDF\nServer"]
    C -->|PDF binary| B
    B -->|pdf-parse\nprocessing| D["Extracted Data"]
    D -->|JSON response| A
    style A fill:#e8f4fd,stroke:#2196F3
    style B fill:#fff3e0,stroke:#FF9800
    style C fill:#f3e5f5,stroke:#9C27B0
    style D fill:#e8f5e9,stroke:#4CAF50
```
Endpoints
| Method | Path | Description |
|---|---|---|
| GET | `/parse?url=<pdf_url>` | Parse a PDF by passing the URL as a query parameter |
| POST | `/parse` | Parse a PDF by sending `{"url": "<pdf_url>"}` as JSON body |
| GET | `/health` | Health check -- returns `{"status": "ok"}` |
| GET | `/` | Service info with usage instructions |
Input
GET request
Pass the PDF URL as a query parameter:
```
GET /parse?url=https://example.com/document.pdf
```
POST request
Send a JSON body with the url field:
```json
{ "url": "https://example.com/document.pdf" }
```
Output
A successful response returns structured JSON:
```json
{
  "success": true,
  "pages": 12,
  "text": "Full extracted text content of the PDF document...",
  "metadata": {
    "title": "Annual Report 2025",
    "author": "Jane Smith",
    "subject": "Financial Summary",
    "creator": "Microsoft Word",
    "producer": "macOS Quartz PDFContext",
    "creationDate": "D:20250115102030Z",
    "modDate": "D:20250120083000Z"
  },
  "pdfVersion": "1.7",
  "textLength": 48320,
  "wordCount": 7841,
  "processingTimeMs": 342
}
```
Error response
```json
{ "success": false, "error": "PDF is password-protected and cannot be parsed." }
```
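Client code can tell the two response shapes apart by branching on the `success` flag. A minimal sketch (the helper name is illustrative):

```python
def summarize_response(data: dict) -> str:
    """Return a one-line summary for a parse response.

    Successful responses carry pages/wordCount; failures carry an
    "error" message, as shown in the examples above.
    """
    if data.get("success"):
        return f"Parsed {data['pages']} pages, {data['wordCount']} words"
    return f"Parse failed: {data.get('error', 'unknown error')}"
```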
How to Use
Using curl (GET)
```bash
curl "https://pdf-parser-api.apify.actor/parse?url=https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
```
Using curl (POST)
```bash
curl -X POST "https://pdf-parser-api.apify.actor/parse" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}'
```
Health check
```bash
curl "https://pdf-parser-api.apify.actor/health"
```
Integration Examples
Python
```python
import requests

response = requests.get(
    "https://pdf-parser-api.apify.actor/parse",
    params={"url": "https://example.com/report.pdf"},
)
data = response.json()
print(f"Pages: {data['pages']}")
print(f"Words: {data['wordCount']}")
print(f"Title: {data['metadata']['title']}")
print(f"Text preview: {data['text'][:500]}")
```
Node.js
```javascript
const response = await fetch("https://pdf-parser-api.apify.actor/parse", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://example.com/report.pdf",
  }),
});
const data = await response.json();
console.log(`Pages: ${data.pages}`);
console.log(`Words: ${data.wordCount}`);
console.log(`Text preview: ${data.text.slice(0, 500)}`);
```
RAG Pipeline (Python + LangChain)
```python
import requests
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Extract text from PDF
resp = requests.get(
    "https://pdf-parser-api.apify.actor/parse",
    params={"url": "https://example.com/knowledge-base.pdf"},
)
pdf_data = resp.json()

# Chunk for vector store
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(pdf_data["text"])

# Each chunk is ready for embedding and indexing
print(f"Split {pdf_data['wordCount']} words into {len(chunks)} chunks")
```
Use Cases
- RAG pipelines -- extract text from PDFs and chunk it for vector databases (Pinecone, Weaviate, Chroma)
- Document processing -- batch-process invoices, contracts, and reports into structured data
- AI training data -- convert PDF corpora into clean text for fine-tuning language models
- Legal & compliance -- parse regulatory filings, court documents, and compliance reports at scale
- Academic research -- extract text from research papers for citation analysis or literature reviews
- Content migration -- pull text from legacy PDF archives into modern CMS platforms
- Search indexing -- feed PDF content into Elasticsearch, Algolia, or Meilisearch
Pricing
| Event | Cost |
|---|---|
| PDF parsed successfully | $0.004 per PDF |
You only pay when a PDF is successfully parsed. Failed requests (invalid URL, timeout, password-protected files) are not charged.
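Because only successful parses are billed, batch spend can be estimated directly from the responses. A sketch (the helper is illustrative; the $0.004 default mirrors the pricing table above):

```python
def estimate_cost(responses: list[dict], price_per_pdf: float = 0.004) -> float:
    """Estimate spend for a batch run.

    Only responses with success=True are billed; failed requests
    (invalid URL, timeout, password-protected) cost nothing.
    """
    billed = sum(1 for r in responses if r.get("success"))
    return round(billed * price_per_pdf, 6)
```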
Limitations
| Constraint | Limit |
|---|---|
| Maximum file size | 50 MB |
| Download timeout | 60 seconds |
| Request body size | 1 MB (for POST requests) |
| Scanned PDFs | No OCR -- only digitally created PDFs with embedded text are supported |
| Password-protected PDFs | Not supported -- returns a clear error message |
| Protocols | HTTP and HTTPS only -- no local file paths or FTP |
FAQ
Does this API support scanned PDFs or images inside PDFs?
No. This API extracts embedded text from digitally created PDFs. If a PDF was created by scanning paper documents and contains only images, the extracted text will be empty or minimal. For scanned PDFs, you would need an OCR service as a preprocessing step.
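One way to flag likely-scanned documents client-side is to check how much text came back per page. A rough heuristic sketch (the 20-characters-per-page threshold is an arbitrary illustrative value, not part of the API):

```python
def probably_scanned(data: dict, min_chars_per_page: int = 20) -> bool:
    """Flag parse responses whose text yield is suspiciously low.

    Image-only (scanned) PDFs come back with empty or near-empty text,
    so a very low textLength-per-page ratio suggests an OCR
    preprocessing step is needed. The threshold is illustrative.
    """
    pages = max(data.get("pages", 1), 1)
    return data.get("textLength", 0) / pages < min_chars_per_page
```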
What happens if the PDF is too large or the download times out?
The API enforces a 50 MB file size limit and a 60-second download timeout. If either limit is exceeded, you will receive a clear error response with the appropriate HTTP status code (413 for size, 408 for timeout). You are not charged for failed requests.
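Those status codes make failure handling straightforward. A sketch of one possible retry policy (the policy itself is an assumption; only the 408 and 413 codes come from the API behavior described above):

```python
def should_retry(status_code: int) -> bool:
    """Decide whether a failed parse request is worth retrying.

    408 (download timeout) may succeed on retry if the remote server
    was briefly slow; 413 (file too large) is permanent, as are other
    4xx errors. 5xx errors are treated as transient.
    """
    if status_code == 408:
        return True
    if 500 <= status_code < 600:
        return True
    return False
```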
Can I parse PDFs that require authentication or are behind a login?
The API fetches PDFs from the URL you provide using a standard HTTP request. If the PDF requires cookies, authentication headers, or is behind a login wall, the download will likely fail. The PDF must be publicly accessible or accessible via a direct URL with any required tokens embedded in the query string.
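When a token does travel in the query string, the PDF URL must itself be URL-encoded before being passed as the `url` parameter, or its own `?` and `=` characters will break the outer query string. A sketch using the standard library (the `token` parameter name is hypothetical and depends entirely on the server hosting the PDF):

```python
from urllib.parse import urlencode

# Hypothetical token-protected document URL
pdf_url = "https://example.com/private/report.pdf?" + urlencode({"token": "abc123"})

# urlencode percent-encodes the nested URL so it survives as a single
# query parameter value
parse_url = "https://pdf-parser-api.apify.actor/parse?" + urlencode({"url": pdf_url})
```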
What metadata fields are extracted?
The API extracts seven metadata fields when available: title, author, subject, creator (the application that created the document), producer (the PDF library used), creation date, and modification date. Not all PDFs contain all metadata fields -- missing fields are returned as null.
Built by George The Developer on Apify.