Pricing

from $1.00 / 1,000 documents converted

PDF to Text API | Document Extraction for LLMs & RAG

Convert PDF documents in bulk from URLs into clean plain text. Built for feeding LLMs, vector databases, and RAG pipelines.

Developer

Andok

Last modified: 4 months ago

PDF to Text Converter for AI & RAG

Extract clean text and metadata from PDF documents at scale for RAG pipelines, search indexing, and LLM ingestion. Point the actor at any PDF URL and get structured text output without installing local tools. Process entire document libraries in a single run.

Features

Full text extraction — extracts all readable text from PDF documents using pdf-parse
Metadata parsing — captures page count, PDF version, author, title, and creation date
Bulk processing — convert hundreds of PDFs in a single run
URL-based input — no file uploads needed, just provide URLs pointing to PDF files
Configurable concurrency — process 1 to 50 PDFs in parallel
Error resilience — failed documents are reported with error details, not skipped silently

Input

Field	Type	Required	Default	Description
`urls`	`array`	Yes	—	List of URLs pointing to PDF files to extract text from
`timeoutSeconds`	`integer`	No	`30`	Maximum seconds to wait for each PDF download
`concurrency`	`integer`	No	`5`	Number of PDFs to process in parallel (1-50)

Input Example

{
  "urls": [
    "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
  ],
  "timeoutSeconds": 30,
  "concurrency": 5
}
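The input above can also be assembled and checked in code before a run is started. The following is a minimal sketch, not part of the actor: `build_input` is a hypothetical helper, and the http(s) scheme check and the positive-timeout check are assumptions added for illustration. Only the required `urls` field, the defaults, and the 1-50 concurrency range come from the Input table.

```python
import json
from urllib.parse import urlparse


def build_input(urls, timeout_seconds=30, concurrency=5):
    """Return a run-input dict shaped like the Input table.

    Hypothetical helper for illustration; the actor does not ship it.
    """
    if not urls:
        raise ValueError("urls is required and must not be empty")
    for url in urls:
        parsed = urlparse(url)
        # Assumption: only http(s) URLs are meaningful inputs.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an http(s) URL: {url!r}")
    # Documented range for concurrency is 1-50.
    if not 1 <= concurrency <= 50:
        raise ValueError("concurrency must be between 1 and 50")
    # Assumption: a zero or negative timeout is never intended.
    if timeout_seconds <= 0:
        raise ValueError("timeoutSeconds must be positive")
    return {
        "urls": list(urls),
        "timeoutSeconds": timeout_seconds,
        "concurrency": concurrency,
    }


if __name__ == "__main__":
    payload = build_input(
        ["https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"]
    )
    print(json.dumps(payload, indent=2))
```

Validating locally catches an out-of-range `concurrency` or an empty `urls` list before a run is started.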

Output

Each PDF produces one dataset item containing the extracted text and document metadata.

Key output fields:

inputUrl (string) — the original PDF URL provided
status (number) — HTTP status code from the download
pageCount (number) — number of pages in the PDF
info (object) — PDF metadata including title, author, creator, producer, and dates
text (string) — the full extracted text content
error (string) — error message if extraction failed, otherwise absent

Output Example

{
  "inputUrl": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
  "status": 200,
  "pageCount": 1,
  "info": {
    "Title": "Dummy PDF file",
    "Author": null,
    "Creator": "Writer",
    "Producer": "OpenOffice.org 2.1",
    "CreationDate": "D:20070223175637+02'00'"
  },
  "text": "Dummy PDF file\n\nThis is a dummy PDF file for testing purposes."
}
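Dataset items shaped like the example above usually need two small post-processing steps: separating failed documents from successful ones, and turning the PDF-style `CreationDate` string into a real timestamp. The sketch below shows one way to do both. `split_items` and `parse_pdf_date` are hypothetical helper names, and the failed item in the demo is invented for illustration; the source only states that `error` is present when extraction failed.

```python
import re
from datetime import datetime, timedelta, timezone

# PDF date strings look like D:YYYYMMDDHHmmSS+HH'mm'; parts after the year are optional.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([+\-Z])(?:(\d{2})'?(?:(\d{2})'?)?)?)?$"
)


def parse_pdf_date(value):
    """Convert a PDF date string to a datetime, or return None if it cannot be parsed."""
    match = _PDF_DATE.match(value or "")
    if not match:
        return None
    year, month, day, hour, minute, second, sign, tz_h, tz_m = match.groups()
    tz = None
    if sign == "Z":
        tz = timezone.utc
    elif sign in ("+", "-"):
        offset = timedelta(hours=int(tz_h or 0), minutes=int(tz_m or 0))
        tz = timezone(offset if sign == "+" else -offset)
    try:
        return datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0), tzinfo=tz,
        )
    except ValueError:
        # Out-of-range fields such as month 13.
        return None


def split_items(items):
    """Separate successful items from failed ones using the 'error' field."""
    ok, failed = [], []
    for item in items:
        (failed if item.get("error") else ok).append(item)
    return ok, failed


if __name__ == "__main__":
    items = [
        {
            "inputUrl": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
            "status": 200,
            "pageCount": 1,
            "info": {"Title": "Dummy PDF file", "CreationDate": "D:20070223175637+02'00'"},
            "text": "Dummy PDF file\n\nThis is a dummy PDF file for testing purposes.",
        },
        # Invented failure item; the real error shape may differ.
        {"inputUrl": "https://example.com/missing.pdf", "status": 404, "error": "download failed"},
    ]
    ok, failed = split_items(items)
    for item in ok:
        created = parse_pdf_date(item["info"].get("CreationDate"))
        print(item["inputUrl"], item["pageCount"], created)
    for item in failed:
        print("FAILED", item["inputUrl"], item["error"])
```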

Pricing

Event	Cost
Document Converted	Pay-per-event (see actor pricing page)

The actor respects the per-run max charge limit. Processing stops automatically when the spending cap is reached.
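To get a rough sense of how the per-document charge and the spending cap interact, the arithmetic can be sketched as follows. This assumes the listed starting price of $1.00 per 1,000 documents converted applies as a flat rate; the actual rate on the actor pricing page may differ, and the function names are hypothetical.

```python
from decimal import Decimal

# Assumption: the listed "from" price, treated as a flat rate.
PRICE_PER_1000 = Decimal("1.00")


def estimate_cost(documents, price_per_1000=PRICE_PER_1000):
    """Estimated charge in dollars for converting the given number of documents."""
    return Decimal(documents) * price_per_1000 / 1000


def max_documents(max_charge, price_per_1000=PRICE_PER_1000):
    """Largest number of documents that fits under a per-run spending cap."""
    return int(Decimal(max_charge) * 1000 // price_per_1000)


if __name__ == "__main__":
    print(estimate_cost(2500))
    print(max_documents("5"))
```

Under that assumption, a $5 cap would stop a run after about 5,000 converted documents, so a larger library would need either a higher cap or several runs.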

Use Cases

RAG document ingestion — extract text from PDF knowledge bases for vector database indexing
Search indexing — make PDF content searchable by extracting and indexing the text
Compliance review — bulk-extract text from policy documents and contracts for automated analysis
Academic research — convert research papers to plain text for NLP processing and citation analysis
Data migration — extract content from legacy PDF archives into structured text formats
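For the RAG ingestion use case, the extracted `text` field is typically split into overlapping chunks before embedding. The sketch below shows a simple character-based splitter. `chunk_text` and its default sizes are illustrative assumptions, not something the actor provides; the actor returns the full text of each document as a single string.

```python
def chunk_text(text, max_chars=1000, overlap=200):
    """Split text into chunks of at most max_chars, each sharing `overlap` chars with the previous one."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    chunks = []
    step = max_chars - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        # Stop once the current chunk reaches the end of the text.
        if start + max_chars >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "Dummy PDF file\n\nThis is a dummy PDF file for testing purposes."
    for i, chunk in enumerate(chunk_text(sample, max_chars=30, overlap=5)):
        print(i, repr(chunk))
```

Character windows are the simplest option; splitting on paragraphs or by token count is often a better fit for a specific embedding model.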

Related Actors

Actor	What it adds
Web Page to Markdown Converter for LLMs	Convert web pages to Markdown alongside your PDF pipeline
Article Text Extractor for TTS & AI	Extract article text from web pages for a complete content pipeline
HTML Table Extractor	Extract structured table data from web pages
