PDF to Text API | Document Extraction for LLMs & RAG

Pricing

from $1.00 / 1,000 converted documents

Convert PDF documents in bulk, by URL, into clean raw text. A document scraper for LLMs, vector databases, and RAG pipelines.

Rating

0.0 (0)

Developer

Andok

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: a day ago

PDF to Text Converter for AI & RAG

Extract clean text and metadata from PDF documents at scale for RAG pipelines, search indexing, and LLM ingestion. Point the actor at any PDF URL and get structured text output without installing local tools. Process entire document libraries in a single run.

Features

  • Full text extraction — extracts all readable text from PDF documents using pdf-parse
  • Metadata parsing — captures page count, PDF version, author, title, and creation date
  • Bulk processing — convert hundreds of PDFs in a single run
  • URL-based input — no file uploads needed, just provide URLs pointing to PDF files
  • Configurable concurrency — process 1 to 50 PDFs in parallel
  • Error resilience — failed documents are reported with error details, not skipped silently

Input

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| urls | array | Yes | | List of URLs pointing to PDF files to extract text from |
| timeoutSeconds | integer | No | 30 | Maximum seconds to wait for each PDF download |
| concurrency | integer | No | 5 | Number of PDFs to process in parallel (1-50) |

Input Example

{
  "urls": [
    "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
  ],
  "timeoutSeconds": 30,
  "concurrency": 5
}
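
As a minimal sketch, the input object above could be assembled and checked in Python before a run is started. The helper name `build_run_input` is hypothetical, and the URL-scheme and positive-timeout checks are assumptions added for illustration; only the 1-50 concurrency range and the defaults come from the input table.

```python
from typing import Any


def build_run_input(
    urls: list[str],
    timeout_seconds: int = 30,
    concurrency: int = 5,
) -> dict[str, Any]:
    """Build the actor input object, checking the documented constraints first."""
    if not urls:
        raise ValueError("urls is required and must contain at least one PDF URL")
    for url in urls:
        # Assumption: the actor downloads over HTTP(S), so other schemes are rejected here.
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"not an HTTP(S) URL: {url!r}")
    # The input table documents a 1-50 range for concurrency.
    if not 1 <= concurrency <= 50:
        raise ValueError("concurrency must be between 1 and 50")
    # Assumption: a zero or negative timeout is never meaningful.
    if timeout_seconds <= 0:
        raise ValueError("timeoutSeconds must be a positive integer")
    return {
        "urls": list(urls),
        "timeoutSeconds": timeout_seconds,
        "concurrency": concurrency,
    }


if __name__ == "__main__":
    print(build_run_input(
        ["https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"]
    ))
```

The returned dictionary has the same shape as the JSON example, so it can be passed as the run input by whichever Apify client is used to start the actor.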

Output

Each PDF produces one dataset item containing the extracted text and document metadata.

Key output fields:

  • inputUrl (string) — the original PDF URL provided
  • status (number) — HTTP status code from the download
  • pageCount (number) — number of pages in the PDF
  • info (object) — PDF metadata including title, author, creator, producer, and dates
  • text (string) — the full extracted text content
  • error (string) — error message if extraction failed, otherwise absent
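
Because `error` is present only on failed documents, downstream code can separate failures from usable text with a single check. The sketch below assumes the dataset items have already been loaded into a list of dictionaries; the function name `split_results` and the sample items are hypothetical.

```python
from typing import Any


def split_results(
    items: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Separate successfully extracted documents from failed ones."""
    extracted: list[dict[str, Any]] = []
    failed: list[dict[str, Any]] = []
    for item in items:
        # `error` is absent on success, so a non-empty value marks a failure.
        if item.get("error"):
            failed.append(item)
        else:
            extracted.append(item)
    return extracted, failed


if __name__ == "__main__":
    # Hypothetical items shaped like the documented output fields.
    sample = [
        {"inputUrl": "https://example.com/a.pdf", "status": 200, "pageCount": 1, "text": "Hello"},
        {"inputUrl": "https://example.com/b.pdf", "status": 404, "error": "Download failed"},
    ]
    ok, bad = split_results(sample)
    for item in bad:
        print(f"{item['inputUrl']}: {item['error']}")
    print(f"{len(ok)} extracted, {len(bad)} failed")
```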

Output Example

{
  "inputUrl": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
  "status": 200,
  "pageCount": 1,
  "info": {
    "Title": "Dummy PDF file",
    "Author": null,
    "Creator": "Writer",
    "Producer": "OpenOffice.org 2.1",
    "CreationDate": "D:20070223175637+02'00'"
  },
  "text": "Dummy PDF file\n\nThis is a dummy PDF file for testing purposes."
}
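
The `CreationDate` value in the example uses the PDF date string format (`D:YYYYMMDDHHmmSS` followed by an optional UTC offset) rather than ISO 8601. If a pipeline needs a real timestamp, a small parser along these lines may help. This is a sketch that assumes the actor passes the raw PDF string through unchanged, as the example suggests; `parse_pdf_date` is a hypothetical helper, not part of the actor.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional Z, +HH'mm' or -HH'mm' offset.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([+\-Z])(?:(\d{2})'?(?:(\d{2})'?)?)?)?$"
)


def parse_pdf_date(value: str) -> Optional[datetime]:
    """Parse a PDF date string into a datetime, or return None if it does not match."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    year, month, day, hour, minute, second, sign, tz_hour, tz_minute = match.groups()
    try:
        # Omitted components fall back to the earliest valid value.
        parsed = datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0),
        )
    except ValueError:
        return None  # e.g. month 13 or day 32
    if sign is None:
        return parsed  # no offset given: return a naive datetime
    if sign == "Z":
        return parsed.replace(tzinfo=timezone.utc)
    offset = timedelta(hours=int(tz_hour or 0), minutes=int(tz_minute or 0))
    if sign == "-":
        offset = -offset
    return parsed.replace(tzinfo=timezone(offset))


if __name__ == "__main__":
    created = parse_pdf_date("D:20070223175637+02'00'")
    print(created.isoformat())  # 2007-02-23T17:56:37+02:00
```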

Pricing

| Event | Cost |
| --- | --- |
| Document Converted | Pay-per-event (see actor pricing page) |

The actor respects the per-run max charge limit. Processing stops automatically when the spending cap is reached.
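
To get a rough feel for how the spending cap interacts with batch size, the arithmetic can be sketched as follows. The rate used here is the listing's "from $1.00 / 1,000" starting price, which is an assumption: the actual per-event rate may differ, so treat the numbers as estimates. Both function names are hypothetical.

```python
from decimal import Decimal, ROUND_FLOOR

# Assumption: the listing's starting rate; check the actor pricing page for the real one.
PRICE_PER_1000_DOCS = Decimal("1.00")


def estimate_cost(document_count: int, price_per_1000: Decimal = PRICE_PER_1000_DOCS) -> Decimal:
    """Estimated charge in USD for converting `document_count` documents."""
    cost = Decimal(document_count) * price_per_1000 / Decimal(1000)
    return cost.quantize(Decimal("0.001"))


def max_documents(max_charge_usd: Decimal, price_per_1000: Decimal = PRICE_PER_1000_DOCS) -> int:
    """Largest number of documents that fits under a per-run max charge."""
    count = max_charge_usd * Decimal(1000) / price_per_1000
    # Round down: a partially paid document does not get converted.
    return int(count.to_integral_value(rounding=ROUND_FLOOR))


if __name__ == "__main__":
    print(estimate_cost(2500))            # 2.500
    print(max_documents(Decimal("5")))    # 5000
```

If a batch is larger than `max_documents` for the chosen cap, the remaining URLs would need a second run or a higher limit, since processing stops once the cap is reached.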

Use Cases

  • RAG document ingestion — extract text from PDF knowledge bases for vector database indexing
  • Search indexing — make PDF content searchable by extracting and indexing the text
  • Compliance review — bulk-extract text from policy documents and contracts for automated analysis
  • Academic research — convert research papers to plain text for NLP processing and citation analysis
  • Data migration — extract content from legacy PDF archives into structured text formats

| Actor | What it adds |
| --- | --- |
| Web Page to Markdown Converter for LLMs | Convert web pages to Markdown alongside your PDF pipeline |
| Article Text Extractor for TTS & AI | Extract article text from web pages for a complete content pipeline |
| HTML Table Extractor | Extract structured table data from web pages |