PDF Parser API

Instant API that parses any PDF from a URL -- extracts full text, page count, metadata (title, author, dates), and PDF version. Returns structured JSON. Perfect for document processing pipelines and AI agents.

Pricing: Pay per usage
Developer: George Kioko (Maintained by Community)

PDF Parser API - Extract Text & Metadata from PDF Files

A fast, reliable PDF parser API that extracts text content, metadata, page count, and word count from any publicly accessible PDF file. Simply provide a PDF URL and get back structured JSON with the full text and document properties -- perfect for RAG pipelines, document processing, and AI training data preparation.

Built as an always-on Standby API on Apify, it responds instantly with no cold starts, no queues, and no SDK required.

Key Features

  • Full text extraction -- get every word from any PDF, ready for indexing or NLP
  • Rich metadata -- title, author, subject, creator, producer, creation/modification dates
  • Page & word counts -- instant document statistics without downloading the file yourself
  • PDF version detection -- know exactly what PDF spec the document uses
  • GET and POST endpoints -- use query parameters or JSON body, your choice
  • CORS enabled -- call directly from browser-based apps
  • Magic-byte validation -- rejects non-PDF files before wasting parse time
  • Password-protected detection -- returns a clear error instead of crashing
  • Streaming size guard -- enforces the 50 MB limit even when Content-Length is missing
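
The last two guards can be sketched in a few lines of Python. This is an illustrative sketch of the technique, not the Actor's actual source; `is_pdf_magic` and `read_with_limit` are hypothetical names:

```python
# Illustrative sketch (not the Actor's source) of how magic-byte validation and
# a streaming size guard can work together.
MAX_BYTES = 50 * 1024 * 1024  # the documented 50 MB limit

def is_pdf_magic(first_chunk: bytes) -> bool:
    """Real PDFs start with the magic bytes %PDF- (a stray BOM may precede them)."""
    return first_chunk.lstrip(b"\xef\xbb\xbf").startswith(b"%PDF-")

def read_with_limit(chunks, limit=MAX_BYTES) -> bytes:
    """Accumulate download chunks, failing as soon as the running total passes
    the limit -- so the cap holds even when Content-Length is missing."""
    total, parts = 0, []
    for chunk in chunks:
        total += len(chunk)
        if total > limit:
            raise ValueError(f"download exceeds {limit} bytes")
        parts.append(chunk)
    return b"".join(parts)
```

Checking the first chunk's magic bytes before reading the rest is what lets the API reject non-PDF files without downloading them in full.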

How It Works

flowchart LR
A["Client\n(curl / Python / JS)"] -->|HTTP GET or POST\nwith PDF URL| B["PDF Parser API\n(Apify Standby)"]
B -->|Download PDF| C["Remote PDF\nServer"]
C -->|PDF binary| B
B -->|pdf-parse\nprocessing| D["Extracted Data"]
D -->|JSON response| A
style A fill:#e8f4fd,stroke:#2196F3
style B fill:#fff3e0,stroke:#FF9800
style C fill:#f3e5f5,stroke:#9C27B0
style D fill:#e8f5e9,stroke:#4CAF50

Endpoints

Method | Path                 | Description
GET    | /parse?url=<pdf_url> | Parse a PDF by passing the URL as a query parameter
POST   | /parse               | Parse a PDF by sending {"url": "<pdf_url>"} as JSON body
GET    | /health              | Health check -- returns {"status": "ok"}
GET    | /                    | Service info with usage instructions

Input

GET request

Pass the PDF URL as a query parameter:

GET /parse?url=https://example.com/document.pdf
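
If the PDF URL contains its own query string, percent-encode it before placing it in the `url` parameter so its `?`, `&`, and `=` characters don't break the outer query. A sketch using Python's standard library (the example URL is made up):

```python
from urllib.parse import urlencode

# A PDF URL that itself carries query parameters (hypothetical example)
pdf_url = "https://example.com/report.pdf?version=2&lang=en"

# urlencode percent-encodes the nested URL safely
query = urlencode({"url": pdf_url})
request_url = f"https://pdf-parser-api.apify.actor/parse?{query}"
```

Most HTTP client libraries (e.g. `requests` with `params=`) do this encoding automatically.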

POST request

Send a JSON body with the url field:

{
  "url": "https://example.com/document.pdf"
}

Output

A successful response returns structured JSON:

{
  "success": true,
  "pages": 12,
  "text": "Full extracted text content of the PDF document...",
  "metadata": {
    "title": "Annual Report 2025",
    "author": "Jane Smith",
    "subject": "Financial Summary",
    "creator": "Microsoft Word",
    "producer": "macOS Quartz PDFContext",
    "creationDate": "D:20250115102030Z",
    "modDate": "D:20250120083000Z"
  },
  "pdfVersion": "1.7",
  "textLength": 48320,
  "wordCount": 7841,
  "processingTimeMs": 342
}

Error response

{
  "success": false,
  "error": "PDF is password-protected and cannot be parsed."
}
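
A client should branch on the `success` flag rather than assume the text fields are present. A minimal sketch (the helper name is hypothetical):

```python
# Branch on the `success` flag returned by the /parse endpoint.
def handle_response(payload: dict) -> str:
    """Return the extracted text, or raise with the API's error message."""
    if payload.get("success"):
        return payload["text"]
    raise RuntimeError(payload.get("error", "unknown parser error"))
```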

How to Use

Using curl (GET)

curl "https://pdf-parser-api.apify.actor/parse?url=https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"

Using curl (POST)

curl -X POST "https://pdf-parser-api.apify.actor/parse" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"}'

Health check

curl "https://pdf-parser-api.apify.actor/health"

Integration Examples

Python

import requests

response = requests.get(
    "https://pdf-parser-api.apify.actor/parse",
    params={"url": "https://example.com/report.pdf"},
)
data = response.json()
print(f"Pages: {data['pages']}")
print(f"Words: {data['wordCount']}")
print(f"Title: {data['metadata']['title']}")
print(f"Text preview: {data['text'][:500]}")

Node.js

const response = await fetch("https://pdf-parser-api.apify.actor/parse", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://example.com/report.pdf",
  }),
});
const data = await response.json();
console.log(`Pages: ${data.pages}`);
console.log(`Words: ${data.wordCount}`);
console.log(`Text preview: ${data.text.slice(0, 500)}`);

RAG Pipeline (Python + LangChain)

import requests
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Extract text from PDF
resp = requests.get(
    "https://pdf-parser-api.apify.actor/parse",
    params={"url": "https://example.com/knowledge-base.pdf"},
)
pdf_data = resp.json()

# Chunk for vector store
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(pdf_data["text"])

# Each chunk is ready for embedding and indexing
print(f"Split {pdf_data['wordCount']} words into {len(chunks)} chunks")

Use Cases

  • RAG pipelines -- extract text from PDFs and chunk it for vector databases (Pinecone, Weaviate, Chroma)
  • Document processing -- batch-process invoices, contracts, and reports into structured data
  • AI training data -- convert PDF corpora into clean text for fine-tuning language models
  • Legal & compliance -- parse regulatory filings, court documents, and compliance reports at scale
  • Academic research -- extract text from research papers for citation analysis or literature reviews
  • Content migration -- pull text from legacy PDF archives into modern CMS platforms
  • Search indexing -- feed PDF content into Elasticsearch, Algolia, or Meilisearch
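
For the batch-processing use cases above, requests can be fanned out concurrently. A hedged sketch: `parse_many` is a hypothetical helper, and the `fetch` callable is injected so the fan-out logic stays independent of the HTTP client:

```python
from concurrent.futures import ThreadPoolExecutor

def parse_many(urls, fetch, max_workers=4):
    """Parse several PDFs concurrently; `fetch` maps a URL to a parsed-JSON
    dict. Results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

In practice `fetch` would wrap the API, e.g. `fetch = lambda u: requests.get("https://pdf-parser-api.apify.actor/parse", params={"url": u}).json()`.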

Pricing

Event                   | Cost
PDF parsed successfully | $0.004 per PDF

You only pay when a PDF is successfully parsed. Failed requests (invalid URL, timeout, password-protected files) are not charged.

Limitations

Constraint              | Limit
Maximum file size       | 50 MB
Download timeout        | 60 seconds
Request body size       | 1 MB (for POST requests)
Scanned PDFs            | No OCR -- only digitally created PDFs with embedded text are supported
Password-protected PDFs | Not supported -- returns a clear error message
Protocols               | HTTP and HTTPS only -- no local file paths or FTP

FAQ

Does this API support scanned PDFs or images inside PDFs?

No. This API extracts embedded text from digitally created PDFs. If a PDF was created by scanning paper documents and contains only images, the extracted text will be empty or minimal. For scanned PDFs, you would need an OCR service as a preprocessing step.

What happens if the PDF is too large or the download times out?

The API enforces a 50 MB file size limit and a 60-second download timeout. If either limit is exceeded, you will receive a clear error response with the appropriate HTTP status code (413 for size, 408 for timeout). You are not charged for failed requests.
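
A client can use those status codes to decide whether a retry is worthwhile. A sketch; treating 408 and 5xx as retryable is an assumption on our part, only 413 and 408 are documented above:

```python
# Map status codes to a client action. 413 (too large) can never succeed on
# retry; 408 (timeout) and 5xx (assumed transient) may. This is a client-side
# heuristic, not documented API behavior.
def next_action(status: int) -> str:
    if status == 200:
        return "ok"
    if status == 413:
        return "give_up"
    if status == 408 or status >= 500:
        return "retry"
    return "give_up"
```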

Can I parse PDFs that require authentication or are behind a login?

The API fetches PDFs from the URL you provide using a standard HTTP request. If the PDF requires cookies, authentication headers, or is behind a login wall, the download will likely fail. The PDF must be publicly accessible or accessible via a direct URL with any required tokens embedded in the query string.

What metadata fields are extracted?

The API extracts seven metadata fields when available: title, author, subject, creator (the application that created the document), producer (the PDF library used), creation date, and modification date. Not all PDFs contain all metadata fields -- missing fields are returned as null.
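
The date fields use the PDF date-string convention (`D:YYYYMMDDHHmmSS...`), as in the sample output above. A minimal sketch for converting them to `datetime`; it handles the `Z` suffix but ignores numeric timezone offsets, and the helper name is our own:

```python
from datetime import datetime, timezone

def parse_pdf_date(value):
    """Convert a PDF 'D:YYYYMMDDHHmmSS' metadata date to a datetime.
    Returns None for missing (null) fields; offset suffixes beyond 'Z'
    are ignored in this sketch."""
    if not value:
        return None
    digits = value[2:16] if value.startswith("D:") else value[:14]
    dt = datetime.strptime(digits, "%Y%m%d%H%M%S")
    return dt.replace(tzinfo=timezone.utc) if "Z" in value else dt
```

For example, the sample `creationDate` of `"D:20250115102030Z"` parses to 2025-01-15 10:20:30 UTC.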


Built by George The Developer on Apify.