PDF to Text API | Document Extraction for LLMs & RAG
Pricing
from $1.00 / 1,000 documents converted
Convert bulk PDF documents via URL into clean, raw text. The perfect document scraper for LLMs, vector databases, and RAG pipelines.
Developer: Andok
PDF to Text Converter for AI & RAG
Extract clean text and metadata from PDF documents at scale for RAG pipelines, search indexing, and LLM ingestion. Point the actor at any PDF URL and get structured text output without installing local tools. Process entire document libraries in a single run.
Features
- Full text extraction — extracts all readable text from PDF documents using pdf-parse
- Metadata parsing — captures page count, PDF version, author, title, and creation date
- Bulk processing — convert hundreds of PDFs in a single run
- URL-based input — no file uploads needed, just provide URLs pointing to PDF files
- Configurable concurrency — process 1 to 50 PDFs in parallel
- Error resilience — failed documents are reported with error details, not skipped silently
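The concurrency and error-reporting behaviour listed above can be illustrated with a small sketch. This is not the actor's code (the actor itself uses pdf-parse, a JavaScript library); it is a Python illustration of the general pattern, and `process_all` and the `worker` callback are hypothetical names introduced for the example.

```python
import asyncio


async def process_all(urls, worker, concurrency=5):
    """Run worker(url) for every URL with at most `concurrency` in flight.

    `worker` is a hypothetical coroutine that downloads and parses one PDF
    and returns a dict of output fields.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(url):
        async with semaphore:
            try:
                return {"inputUrl": url, **(await worker(url))}
            except Exception as exc:
                # Report the failure as an item instead of dropping it.
                return {"inputUrl": url, "error": str(exc)}

    # gather() keeps results in the same order as the input URLs.
    return await asyncio.gather(*(run_one(url) for url in urls))
```

A failed document still yields one result item, so the number of results always matches the number of input URLs.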
Input
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `urls` | array | Yes | none | List of URLs pointing to PDF files to extract text from |
| `timeoutSeconds` | integer | No | 30 | Maximum seconds to wait for each PDF download |
| `concurrency` | integer | No | 5 | Number of PDFs to process in parallel (1-50) |
Input Example
    {
      "urls": [
        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
      ],
      "timeoutSeconds": 30,
      "concurrency": 5
    }
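When the input is generated by a script, it can help to validate it before starting a run. The following is a minimal sketch based on the Input table; `build_input` is a hypothetical helper, and the http(s) check and the positive-timeout check are assumptions that go beyond what the table states.

```python
from urllib.parse import urlparse


def build_input(urls, timeout_seconds=30, concurrency=5):
    """Return an input dict shaped like the Input table, or raise ValueError."""
    if not urls:
        raise ValueError("urls must contain at least one PDF URL")
    for url in urls:
        parsed = urlparse(url)
        # Assumption: only absolute http(s) URLs are useful as input.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
    if not 1 <= concurrency <= 50:
        raise ValueError("concurrency must be between 1 and 50")
    if timeout_seconds <= 0:
        raise ValueError("timeoutSeconds must be positive")
    return {
        "urls": list(urls),
        "timeoutSeconds": timeout_seconds,
        "concurrency": concurrency,
    }
```

Calling `build_input` with only a list of URLs fills in the documented defaults (30 seconds, 5 parallel downloads).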
Output
Each PDF produces one dataset item containing the extracted text and document metadata.
Key output fields:
- `inputUrl` (string): the original PDF URL provided
- `status` (number): HTTP status code from the download
- `pageCount` (number): number of pages in the PDF
- `info` (object): PDF metadata including title, author, creator, producer, and dates
- `text` (string): the full extracted text content
- `error` (string): error message if extraction failed, otherwise absent
Output Example
    {
      "inputUrl": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
      "status": 200,
      "pageCount": 1,
      "info": {
        "Title": "Dummy PDF file",
        "Author": null,
        "Creator": "Writer",
        "Producer": "OpenOffice.org 2.1",
        "CreationDate": "D:20070223175637+02'00'"
      },
      "text": "Dummy PDF file\n\nThis is a dummy PDF file for testing purposes."
    }
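Downstream code usually needs to separate failed items from successful ones and to turn the raw PDF date string into a real timestamp. The sketch below assumes items shaped like the example above; `split_results` and `parse_pdf_date` are hypothetical helpers, and the date parser only handles the full 14-digit form with an optional offset, returning `None` for anything else.

```python
import re
from datetime import datetime, timedelta, timezone


def split_results(items):
    """Split dataset items into (succeeded, failed) using the `error` field."""
    succeeded = [item for item in items if not item.get("error")]
    failed = [item for item in items if item.get("error")]
    return succeeded, failed


def parse_pdf_date(value):
    """Parse a PDF date such as D:20070223175637+02'00' into a datetime."""
    match = re.fullmatch(r"D:(\d{14})(Z|[+-]\d{2}'\d{2}'?)?", value or "")
    if not match:
        return None
    stamp, offset = match.groups()
    parsed = datetime.strptime(stamp, "%Y%m%d%H%M%S")
    if offset is None:
        return parsed  # no offset given, so the result is a naive datetime
    if offset == "Z":
        return parsed.replace(tzinfo=timezone.utc)
    sign = 1 if offset[0] == "+" else -1
    delta = timedelta(hours=int(offset[1:3]), minutes=int(offset[4:6]))
    return parsed.replace(tzinfo=timezone(sign * delta))
```

Metadata values can be `null` (as with `Author` in the example), which is why the parser accepts `None` and returns `None` rather than raising.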
Pricing
| Event | Cost |
|---|---|
| Document Converted | Pay-per-event (see actor pricing page) |
The actor respects the per-run max charge limit. Processing stops automatically when the spending cap is reached.
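For budgeting, the relationship between the spending cap and the number of documents can be worked out with simple arithmetic. The sketch below assumes the listed "from $1.00 / 1,000" rate and that converted documents are the only charged event; the actual rate is on the actor pricing page, and both function names are hypothetical.

```python
import math

# Assumption: the listed "from" price; the real rate may differ.
PRICE_PER_1000_USD = 1.00


def estimate_cost(doc_count, price_per_1000=PRICE_PER_1000_USD):
    """Estimated charge in USD for converting `doc_count` documents."""
    return doc_count * price_per_1000 / 1000


def max_documents(max_charge_usd, price_per_1000=PRICE_PER_1000_USD):
    """Largest number of documents that fits under a per-run spending cap."""
    # The small epsilon guards against floating-point rounding just below an integer.
    return math.floor(max_charge_usd * 1000 / price_per_1000 + 1e-9)
```

Under these assumptions, a cap of $2.50 would allow roughly 2,500 documents before processing stops.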
Use Cases
- RAG document ingestion — extract text from PDF knowledge bases for vector database indexing
- Search indexing — make PDF content searchable by extracting and indexing the text
- Compliance review — bulk-extract text from policy documents and contracts for automated analysis
- Academic research — convert research papers to plain text for NLP processing and citation analysis
- Data migration — extract content from legacy PDF archives into structured text formats
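For the RAG ingestion use case, the extracted `text` field is typically split into chunks before embedding. The following is a minimal sketch of fixed-size character chunking with overlap; `chunk_text` and its default sizes are illustrative assumptions, not something the actor provides.

```python
def chunk_text(text, max_chars=1000, overlap=100):
    """Split text into chunks of up to `max_chars` that overlap by `overlap`."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    chunks = []
    step = max_chars - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + max_chars]
        if chunk.strip():  # skip whitespace-only chunks
            chunks.append(chunk)
        if start + max_chars >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Character-based chunking is the simplest option. Splitting on sentences or tokens generally gives better retrieval quality, but it depends on the embedding model in use.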
Related Actors
| Actor | What it adds |
|---|---|
| Web Page to Markdown Converter for LLMs | Convert web pages to Markdown alongside your PDF pipeline |
| Article Text Extractor for TTS & AI | Extract article text from web pages for a complete content pipeline |
| HTML Table Extractor | Extract structured table data from web pages |