PDF to Text Extractor

Under maintenance

Pricing

Pay per usage

Rating

0.0

(0)

Developer

Donny Nguyen

Maintained by Community

Actor stats

0

Bookmarked

2

Total users

1

Monthly active users

a day ago

Last modified

What does it do?

PDF to Text Extractor downloads PDFs from the URLs you provide and extracts their text content, metadata, and page counts; it can optionally detect tables as well. It processes PDFs in bulk and produces structured output with clean text suitable for AI processing, search indexing, document analysis, and data extraction workflows.

Why use this actor?

Processing PDFs at scale is a common requirement for data pipelines, document management systems, and AI applications. This actor handles the entire workflow: downloading PDFs from any URL, parsing the binary content, extracting text and metadata, detecting tables, and delivering structured results. It eliminates the need to set up PDF processing infrastructure yourself.

How to use it

  1. Go to the actor's page on the Apify platform.
  2. Click Start to open the input configuration.
  3. Enter one or more PDF URLs to process.
  4. Choose whether to extract tables.
  5. Click Start and wait for the results.
  6. Download your extracted text from the Dataset tab.

The actor handles PDFs of various sizes and formats, extracting all available text content.

Input configuration

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| pdfUrls | array | URLs of PDFs to extract text from | ["https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"] |
| extractTables | boolean | Detect and extract table data | true |
| proxyConfiguration | object | Proxy settings | Apify Proxy |
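
For programmatic runs, the input fields listed above map onto a plain JSON object. The following is a minimal, unofficial sketch: `build_run_input` and `run_extractor` are hypothetical helper names, the actor ID is a placeholder you must copy from the actor's page, and the `{"useApifyProxy": True}` shape for `proxyConfiguration` is an assumption based on the usual Apify proxy input format. The second function assumes the `apify-client` Python package is installed.

```python
from urllib.parse import urlparse


def build_run_input(pdf_urls, extract_tables=True):
    """Build the actor input from the documented fields (hypothetical helper)."""
    if not pdf_urls:
        raise ValueError("pdfUrls must contain at least one URL")
    for url in pdf_urls:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"Not a valid http(s) URL: {url!r}")
    return {
        "pdfUrls": list(pdf_urls),
        "extractTables": bool(extract_tables),
        # Assumption: the usual Apify proxy input shape.
        "proxyConfiguration": {"useApifyProxy": True},
    }


def run_extractor(token, actor_id, pdf_urls, extract_tables=True):
    """Start a run and return its dataset items (needs `pip install apify-client`)."""
    from apify_client import ApifyClient  # third-party package, imported lazily

    client = ApifyClient(token)
    # actor_id is a placeholder such as "<username>/<actor-name>".
    run = client.actor(actor_id).call(
        run_input=build_run_input(pdf_urls, extract_tables)
    )
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Validating the URLs locally before starting a run avoids paying for a run that fails on a malformed entry.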

Output data

Each item in the dataset contains:

{
  "url": "https://example.com/report.pdf",
  "title": "Annual Report 2025",
  "text": "This report covers the financial performance...",
  "pageCount": 24,
  "wordCount": 15200,
  "charCount": 89400,
  "author": "Finance Department",
  "tables": ["Header1\\tHeader2\\tHeader3\\nVal1\\tVal2\\tVal3"],
  "tableCount": 3,
  "fileSizeKB": 450,
  "scrapedAt": "2026-02-19T14:30:00.000Z"
}
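
As a rough illustration of consuming these items, the sketch below splits a `tables` entry into rows and cells and flags items that look like scanned PDFs. It assumes rows are newline-separated and cells are tab-separated, as the sample suggests; because the sample shows doubled backslashes, it also normalises literal `\t` and `\n` sequences. `parse_table` and `looks_scanned` are hypothetical helper names, not part of the actor.

```python
def parse_table(raw):
    """Split one `tables` string into a list of rows, each a list of cells."""
    # Assumption: the string may hold real tab/newline characters or the
    # literal two-character sequences "\t" and "\n"; handle both.
    normalised = raw.replace("\\t", "\t").replace("\\n", "\n")
    return [line.split("\t") for line in normalised.split("\n") if line]


def looks_scanned(item):
    """Heuristic: the PDF has pages but no extracted words (likely image-only)."""
    return item.get("pageCount", 0) > 0 and item.get("wordCount", 0) == 0


item = {
    "pageCount": 24,
    "wordCount": 15200,
    "tables": ["Header1\tHeader2\tHeader3\nVal1\tVal2\tVal3"],
}
rows = parse_table(item["tables"][0])
print(rows[0])              # header row
print(looks_scanned(item))  # False for this sample
```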

Cost of usage

This actor processes PDFs using CheerioCrawler and the pdf-parse library. A typical run processing 10 PDFs takes about 1-2 minutes and costs under $0.02 in platform credits, depending on PDF size. The actor is priced at $0.75 per 1,000 results with pay-per-event pricing. Large PDFs may require more memory.
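
The per-result price can be turned into a quick estimate. This sketch covers only the stated $0.75 per 1,000 results and assumes one result per PDF; it does not model any platform usage that may be billed separately. `estimate_result_charge` is a hypothetical helper name.

```python
def estimate_result_charge(num_results, price_per_1000=0.75):
    """Return the per-result charge in USD for a given number of results."""
    if num_results < 0:
        raise ValueError("num_results must be non-negative")
    return num_results * price_per_1000 / 1000


# 10 PDFs at $0.75 per 1,000 results comes to well under one cent.
print(f"${estimate_result_charge(10):.4f}")
```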

Tips

  • PDFs must be publicly accessible via URL for the actor to download them.
  • Scanned PDFs (images of text) produce no text output because OCR is not included; extraction works best with digitally created PDFs.
  • Table detection uses heuristics based on tab separators and whitespace patterns.
  • Set the memory to 512 MB or higher when processing large PDFs (50+ pages).
  • The author and creation date come from the PDF metadata, which may not always be present.
  • Use this alongside the URL to LLM Dataset actor for a complete AI data pipeline.
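
To make the table-detection tip concrete, here is one way a heuristic based on tab separators and whitespace patterns could work. It is an illustrative sketch, not the actor's actual implementation: it treats a line as a table row if tabs or runs of two or more spaces split it into several cells, and groups consecutive rows with the same column count. The names `split_cells` and `detect_tables` and the thresholds are assumptions.

```python
import re

# A cell boundary is a tab or a run of two or more spaces.
_CELL_SPLIT = re.compile(r"\t| {2,}")


def split_cells(line):
    """Split a text line into non-empty, trimmed cells."""
    return [c.strip() for c in _CELL_SPLIT.split(line.strip()) if c.strip()]


def detect_tables(text, min_rows=2, min_cols=2):
    """Group consecutive lines with the same cell count into tables."""
    tables, current = [], []
    for line in text.splitlines() + [""]:  # the empty sentinel flushes the last block
        cells = split_cells(line)
        is_row = len(cells) >= min_cols
        if is_row and (not current or len(cells) == len(current[0])):
            current.append(cells)
            continue
        if len(current) >= min_rows:
            tables.append(current)
        # A row with a different column count starts a new candidate table.
        current = [cells] if is_row else []
    return tables


sample = "Intro paragraph here.\nName\tQty\tPrice\nApple\t3\t1.20\nPear  5  0.80\n\nClosing line."
for table in detect_tables(sample):
    print(table)
```

A heuristic like this will miss tables whose columns are separated by single spaces and may misread aligned prose as a table, so the detected tables are worth spot-checking.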

Built with Crawlee and Apify SDK. See more scrapers by consummate_mandala on Apify Store.