PDF to Text Extractor

Status: Under maintenance
Pricing: Pay per usage
Developer: Donny Nguyen
Last modified: a day ago

What does it do?

PDF to Text Extractor downloads PDFs from the URLs you provide and extracts their text content, metadata, and page counts; it can optionally detect tables. It processes PDFs in bulk and produces structured output with clean text suitable for AI processing, search indexing, document analysis, and data extraction workflows.

Why use this actor?

Processing PDFs at scale is a common requirement for data pipelines, document management systems, and AI applications. This actor handles the entire workflow: downloading PDFs from any publicly accessible URL, parsing the binary content, extracting text and metadata, detecting tables, and delivering structured results. It eliminates the need to set up PDF processing infrastructure yourself.

How to use it

1. Go to the actor's page on the Apify platform.
2. Click Start to open the input configuration.
3. Enter one or more PDF URLs to process.
4. Choose whether to extract tables.
5. Click Start and wait for the results.
6. Download your extracted text from the Dataset tab.

The actor handles PDFs of various sizes and formats, extracting all available text content.
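
The same run can also be started without the web UI. The sketch below is a minimal illustration, assuming the Apify API's synchronous `run-sync-get-dataset-items` endpoint and a bearer token. The actor ID and token passed in are placeholders, since this page does not state the actor's API identifier, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Build a POST request that runs an actor and returns its dataset items."""
    # Actor IDs in API paths are written "username~actor-name"; keep the tilde as-is.
    path = urllib.parse.quote(actor_id, safe="~")
    url = f"{API_BASE}/acts/{path}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def run_actor(actor_id: str, token: str, run_input: dict) -> list:
    """Send the request and decode the JSON array of dataset items."""
    request = build_run_request(actor_id, token, run_input)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `run_actor` with a real actor ID and token blocks until the run finishes, so this approach fits short runs; for longer jobs, starting the run asynchronously and reading the dataset afterwards is likely the safer choice.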

Input configuration

Field	Type	Description	Default
pdfUrls	array	URLs of PDFs to extract text from	["https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"]
extractTables	boolean	Detect and extract table data	true
proxyConfiguration	object	Proxy settings	Apify Proxy
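
As an illustration of the fields above, the following sketch assembles an input object before a run. The helper name `build_input` is hypothetical, the URL checks are an assumption about what the actor accepts, and the `{"useApifyProxy": true}` shape is assumed from the common Apify proxy input convention; the table itself only says "Apify Proxy".

```python
from urllib.parse import urlparse


def build_input(pdf_urls, extract_tables=True, use_apify_proxy=True):
    """Return an input object with the three documented fields."""
    cleaned = []
    for url in pdf_urls:
        url = url.strip()
        parts = urlparse(url)
        # Assumption: the actor expects absolute http(s) URLs.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
        if url not in cleaned:  # drop duplicates, keep the original order
            cleaned.append(url)
    if not cleaned:
        raise ValueError("pdfUrls must contain at least one URL")
    return {
        "pdfUrls": cleaned,
        "extractTables": bool(extract_tables),
        # Assumed shape of the proxy settings object.
        "proxyConfiguration": {"useApifyProxy": bool(use_apify_proxy)},
    }
```

Validating locally catches relative paths and typos before any platform credits are spent on a run that would fail to download.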

Output data

Each item in the dataset contains:

{
    "url": "https://example.com/report.pdf",
    "title": "Annual Report 2025",
    "text": "This report covers the financial performance...",
    "pageCount": 24,
    "wordCount": 15200,
    "charCount": 89400,
    "author": "Finance Department",
    "tables": ["Header1\\tHeader2\\tHeader3\\nVal1\\tVal2\\tVal3"],
    "tableCount": 1,
    "fileSizeKB": 450,
    "scrapedAt": "2026-02-19T14:30:00.000Z"
}
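
Each entry in `tables` is a single string, so downstream code usually needs to split it back into rows and cells. The sketch below assumes that, after JSON decoding, rows are separated by newline characters and cells by tab characters. The doubled backslashes in the sample above look like an escaping artifact, but that is not confirmed. The function names are hypothetical.

```python
import json


def parse_table(raw: str) -> list[list[str]]:
    """Split one table string into rows of cells, skipping blank lines."""
    rows = []
    for line in raw.split("\n"):
        if line.strip():
            rows.append(line.split("\t"))
    return rows


def table_as_records(raw: str) -> list[dict]:
    """Treat the first row as the header and map each later row onto it."""
    rows = parse_table(raw)
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# A trimmed-down dataset item; json.loads turns the \t and \n escapes
# into real tab and newline characters.
item = json.loads(r'{"tables": ["Header1\tHeader2\tHeader3\nVal1\tVal2\tVal3"]}')
records = [table_as_records(table) for table in item["tables"]]
```

Because `zip` stops at the shorter sequence, a row with fewer cells than the header yields a record with fewer keys instead of raising an error.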

Cost of usage

This actor processes PDFs using CheerioCrawler and the pdf-parse library. A typical run processing 10 PDFs takes about 1-2 minutes and costs under $0.02 in platform credits, depending on PDF size. The actor is priced at $0.75 per 1,000 results with pay-per-event pricing. Large PDFs may require more memory.
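
The per-result price can be turned into a quick estimate. This is a rough sketch, assuming one dataset result per PDF; how the per-result charge combines with the platform-credit figure quoted above is not specified here, so only the per-result part is computed. The function name is hypothetical.

```python
from decimal import Decimal

# USD per 1,000 results, taken from the pricing stated above.
PRICE_PER_1000_RESULTS = Decimal("0.75")


def result_charge(results: int) -> Decimal:
    """Estimate the pay-per-event charge for a number of results."""
    if results < 0:
        raise ValueError("results must be non-negative")
    return PRICE_PER_1000_RESULTS * results / 1000
```

Under these assumptions, 10 PDFs come to $0.0075 and 1,000 PDFs to $0.75 in per-result charges.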

Tips

PDFs must be publicly accessible via URL for the actor to download them.
Text extraction works best with digitally created PDFs. Scanned PDFs (images of text) produce no text output because OCR is not included.
Table detection uses heuristics based on tab separators and whitespace patterns.
Set the memory to 512 MB or higher when processing large PDFs (50+ pages).
The author and creation date come from the PDF metadata, which may not always be present.
Use this alongside the URL to LLM Dataset actor for a complete AI data pipeline.
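
Two of the tips above can be checked after a run: scanned PDFs yield no text, and metadata such as the author may be missing. The sketch below is one possible post-processing check over dataset items, using the field names from the output sample. The "pages but no text" rule is an assumed heuristic, not something the actor reports, and both function names are hypothetical.

```python
def likely_scanned(item: dict) -> bool:
    """Heuristic: the PDF has pages, but no text was extracted from them."""
    text = (item.get("text") or "").strip()
    return (item.get("pageCount") or 0) > 0 and not text


def summarize(item: dict) -> str:
    """One-line summary that tolerates missing metadata."""
    author = item.get("author") or "unknown author"  # metadata may be absent
    if likely_scanned(item):
        status = "no text extracted, may need OCR"
    else:
        status = f'{item.get("wordCount", 0)} words'
    return f'{item.get("url", "?")}: {status} ({author})'
```

Items flagged this way could be routed to an OCR-capable tool instead of being treated as empty documents.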

Built with Crawlee and Apify SDK. See more scrapers by consummate_mandala on Apify Store.
