Deprecated

Pricing

Pay per usage

AI Data Extraction from PDF

Extract text data from PDF files using AI. Upload PDFs directly or provide URLs. Supports text chunking for LLM workflows.

Rating

0.0

(0)

Developer

Actor4you

Last modified

a month ago

What is AI Data Extraction from PDF?

AI Data Extraction from PDF is a cloud-based tool that extracts text from PDF documents at scale. Upload PDF files directly in the Apify Console or provide URLs to PDF files hosted online - no coding required. It supports text chunking for LLM and RAG pipelines and is suited to batch processing of many PDFs in a single run.

What can AI Data Extraction from PDF do?

Dual input method - Upload PDFs directly or paste URLs to online PDF files. Both methods can be combined in the same run.
Smart text chunking - Split extracted text into configurable chunks with customizable overlap, purpose-built for RAG and AI workflows.
Batch PDF processing - Process hundreds of PDF documents in a single run. Convert PDF to text format at scale.
REST API access - Call the text extraction API programmatically from any language or platform using the Apify API.
Scheduling - Set up recurring runs to process new PDFs automatically on a schedule.
Webhooks & integrations - Connect to Slack, Google Sheets, Zapier, Make, or your own endpoints. Get notified when PDF data extraction completes.
Cloud-based - No local installation, no dependencies. Runs on Apify's infrastructure with automatic scaling.
Export anywhere - Download results as JSON, CSV, XML, or Excel. Push data directly to databases or APIs.

What data can you extract from PDF?

Field	Type	Description
`url`	String	Source URL of the processed PDF file
`index`	Number	Page or chunk number (starting from 0)
`text`	String	Extracted text content - clean, structured, and ready for processing

Each PDF produces one or more dataset items depending on the number of pages and your chunking configuration. The output is structured for immediate use in data pipelines, spreadsheets, or AI applications.
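
Because one PDF can produce several dataset items, downstream code usually needs to regroup them. The following is a minimal sketch, assuming the items have already been downloaded as a list of dictionaries with the `url`, `index`, and `text` fields described above; the function name `group_by_document` is illustrative and not part of the Actor.

```python
from collections import defaultdict


def group_by_document(items):
    """Group dataset items by source URL and order each group by `index`."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["url"]].append(item)
    # Sort each document's items by index and keep only the text.
    return {
        url: [entry["text"] for entry in sorted(entries, key=lambda e: e["index"])]
        for url, entries in grouped.items()
    }


# Sample items in arbitrary order (made-up data for illustration).
items = [
    {"url": "https://example.com/a.pdf", "index": 1, "text": "second"},
    {"url": "https://example.com/b.pdf", "index": 0, "text": "only"},
    {"url": "https://example.com/a.pdf", "index": 0, "text": "first"},
]
documents = group_by_document(items)
print(documents["https://example.com/a.pdf"])
```

If chunk overlap was enabled, consecutive texts share characters, so joining them directly would duplicate the overlapping parts.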

How to use AI Data Extraction from PDF to extract text

Go to the Actor page - Navigate to AI Data Extraction from PDF on Apify Store and click Try for free.
Upload your PDFs or add URLs - Use the Upload PDF Files field to drag and drop your documents, or paste direct links into the PDF URLs field. You can use both methods simultaneously.
Configure chunking (optional) - Toggle Perform Chunking if you need the text split into smaller segments. Set your preferred Chunk Size (characters per chunk) and Chunk Overlap (characters shared between consecutive chunks).
Start the extraction - Click Start and wait for the run to complete. The Actor processes each PDF and pushes extracted text to the dataset.
Download your data - Open the Dataset tab to preview results. Export as JSON, CSV, XML, or Excel, or access results via the API.
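
The same steps can be driven over HTTP instead of the Console. The sketch below builds a request for the Apify API v2 endpoint that runs an Actor synchronously and returns its dataset items. It uses only the Python standard library. The Actor ID `username~actor-name` and the token value are placeholders, not real identifiers, and the endpoint path should be checked against the current Apify API reference before use.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id, token, run_input):
    """Build (but do not send) a POST request that starts an Actor run
    and asks for the dataset items in the response body."""
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_actor(actor_id, token, run_input, timeout=300):
    """Send the request and parse the JSON array of dataset items."""
    request = build_run_request(actor_id, token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Placeholder Actor ID and token: replace both before calling run_actor().
request = build_run_request(
    "username~actor-name",
    "YOUR_APIFY_TOKEN",
    {"urls": ["https://example.com/report-2024.pdf"], "performChunking": False},
)
print(request.get_method(), request.full_url)
```

Only `build_run_request` is executed here, so the example runs without network access; `run_actor` performs the actual call.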

How much does it cost to extract data from PDF?

AI Data Extraction from PDF runs on the Apify Free plan, which gives you $5 of free platform credits every month. A typical PDF extraction run costs well under $0.01 per document, meaning you can process hundreds of PDFs for free each month.

For higher volumes, paid plans offer more compute and storage. Platform usage is billed per compute unit consumed - there is no per-document fee. Check the Apify pricing page for current rates.
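
The "hundreds of PDFs for free" figure follows from simple arithmetic. As a rough check, the sketch below treats $0.01 as an upper bound on the per-document cost (the text says "well under") and works in whole cents to avoid floating-point rounding. Actual costs depend on compute units consumed, so this is a lower-bound estimate only.

```python
MONTHLY_CREDIT_CENTS = 500    # $5 of free credits per month, from the text above
MAX_COST_PER_DOC_CENTS = 1    # assumed upper bound: $0.01 per document


def min_documents_covered(credit_cents, max_cost_per_doc_cents):
    """Lower bound on documents processed before the credit runs out."""
    return credit_cents // max_cost_per_doc_cents


min_documents = min_documents_covered(MONTHLY_CREDIT_CENTS, MAX_COST_PER_DOC_CENTS)
print(min_documents)  # 500
```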

Input - configuration options

Field	Type	Default	Description
`pdfFiles`	File Upload (array)	-	Upload one or more PDF files directly in the Apify Console. Files are stored in a key-value store and processed automatically.
`urls`	String List (array)	-	URLs of PDF files hosted online. Paste direct links to `.pdf` files.
`performChunking`	Boolean	`false`	Enable text chunking to split extracted content into smaller segments. Essential for LLM and RAG workflows.
`chunkSize`	Integer	`1000`	Maximum number of characters per chunk. Only applies when chunking is enabled.
`chunkOverlap`	Integer	`0`	Number of overlapping characters between consecutive chunks. Helps preserve context at chunk boundaries.

You must provide at least one PDF - either via upload or URL. Both input methods can be used together in the same run.
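
For programmatic runs, the input can be assembled and checked before it is sent. This is a minimal sketch using the field names and defaults from the table above. The helper `build_run_input` is hypothetical, and the check that `chunkOverlap` is smaller than `chunkSize` is an added sanity rule, not a documented constraint of the Actor. The exact value format for `pdfFiles` is not specified in this document, so it is passed through unchanged.

```python
def build_run_input(urls=None, pdf_files=None, perform_chunking=False,
                    chunk_size=1000, chunk_overlap=0):
    """Assemble an input object and reject obviously invalid combinations."""
    urls = list(urls or [])
    pdf_files = list(pdf_files or [])
    if not urls and not pdf_files:
        raise ValueError("Provide at least one PDF via `urls` or `pdfFiles`.")
    if chunk_size <= 0:
        raise ValueError("`chunkSize` must be a positive integer.")
    if not 0 <= chunk_overlap < chunk_size:
        # Assumed rule: an overlap as large as the chunk would never advance.
        raise ValueError("`chunkOverlap` must be >= 0 and smaller than `chunkSize`.")

    run_input = {
        "performChunking": perform_chunking,
        "chunkSize": chunk_size,
        "chunkOverlap": chunk_overlap,
    }
    if urls:
        run_input["urls"] = urls
    if pdf_files:
        run_input["pdfFiles"] = pdf_files
    return run_input


run_input = build_run_input(
    urls=["https://example.com/report-2024.pdf"],
    perform_chunking=True,
    chunk_size=500,
    chunk_overlap=50,
)
print(run_input)
```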

Output example - extracted text from PDF

[
    {
        "url": "https://example.com/report-2024.pdf",
        "index": 0,
        "text": "Annual Report 2024. Executive Summary. This report presents the financial results and strategic initiatives undertaken during the fiscal year 2024. Total revenue increased by 12% year-over-year, driven primarily by growth in digital services..."
    },
    {
        "url": "https://example.com/report-2024.pdf",
        "index": 1,
        "text": "...driven primarily by growth in digital services and international expansion. Operating margins improved to 18.3%, reflecting cost optimization measures implemented in Q2. The company invested $45M in research and development..."
    },
    {
        "url": "https://example.com/invoice-march.pdf",
        "index": 0,
        "text": "Invoice #INV-2024-0342. Date: March 15, 2024. Bill To: Acme Corporation. Description: Cloud infrastructure services - March 2024. Amount: $12,500.00. Payment Terms: Net 30."
    }
]

Use cases - who should use this PDF data extraction tool?

Finance & accounting - Extract data from invoices, receipts, and financial statements. Automate document-to-text conversion for bookkeeping workflows.
Research & academia - Pull text from research papers, journals, and academic PDFs. Build searchable databases of scientific literature.
Business intelligence - Convert PDF reports into structured data for analysis. Feed quarterly reports, market research, and white papers into your data pipeline.
AI & LLM pipelines - Use the built-in chunking feature to prepare PDF content for retrieval-augmented generation (RAG). Feed properly sized text chunks directly into vector databases or language models.
Legal document processing - Extract text from contracts, court filings, and regulatory documents. Process large volumes of legal PDFs for review and analysis.
Enterprise batch processing - Process hundreds of PDFs in a single run. Schedule daily or weekly extractions for incoming document streams using Apify's scheduling and webhook features.

FAQ - PDF data extraction questions

Is it legal to extract text from PDF files?

Generally, yes. Extracting text from PDF files you own or have permission to access is legal in most cases. This tool processes only the documents you provide - it does not crawl third-party websites on its own. Always ensure you have the right to process the PDFs you upload or link to.

Can this tool handle scanned PDFs or images inside PDFs?

This Actor works best with text-based PDFs - documents where the text is embedded as selectable content. Scanned PDFs that contain only images may return limited or no text. For scanned documents, consider using an OCR-capable tool first, then processing the output with this Actor.
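
One way to spot scanned PDFs after a run is to look for documents whose extracted text is empty or nearly empty. The sketch below assumes the dataset items are available as dictionaries with `url` and `text` fields; the helper name and the 20-character threshold are arbitrary choices for illustration.

```python
def urls_needing_ocr(items, min_chars=20):
    """Return URLs whose total extracted text is shorter than `min_chars`."""
    totals = {}
    for item in items:
        text = item.get("text") or ""
        totals[item["url"]] = totals.get(item["url"], 0) + len(text.strip())
    return sorted(url for url, total in totals.items() if total < min_chars)


# Made-up sample: the first document yielded no text on either page.
items = [
    {"url": "https://example.com/scan.pdf", "index": 0, "text": ""},
    {"url": "https://example.com/scan.pdf", "index": 1, "text": "  "},
    {"url": "https://example.com/report.pdf", "index": 0,
     "text": "Annual Report 2024. Executive Summary."},
]
candidates = urls_needing_ocr(items)
print(candidates)
```

The flagged URLs can then be routed to an OCR-capable tool.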

How does text chunking work, and when should I use it?

When Perform Chunking is enabled, the extracted text is split into segments of up to chunkSize characters. The chunkOverlap parameter controls how many characters are shared between consecutive chunks, which helps preserve context at boundaries. Use chunking when you plan to feed the text into a large language model, vector database, or any system with token or character limits.
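
The interaction between `chunkSize` and `chunkOverlap` can be illustrated with a fixed-size sliding window. This is a sketch of one common character-based scheme, not the Actor's actual implementation, which is not documented here. It assumes each chunk starts `chunkSize - chunkOverlap` characters after the previous one.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=0):
    """Split `text` into chunks of at most `chunk_size` characters, where
    consecutive chunks share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

In this scheme a larger overlap yields more chunks for the same text, which increases the number of dataset items and any downstream embedding cost.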

Is there a limit on the number or size of PDFs I can process?

There is no hard limit on the number of PDFs per run. Processing time and cost scale with the total volume of data. Very large PDFs (hundreds of pages) will produce more dataset items and use more compute time. For extremely large batches, consider splitting your input across multiple runs.
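
Splitting a large input across multiple runs amounts to batching the URL list. A minimal sketch, assuming the input is a plain list of URLs; the batch size of 2 is only for demonstration and should be chosen based on document size and run time.

```python
def split_into_batches(urls, batch_size):
    """Split a list of URLs into consecutive batches of at most `batch_size`."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


urls = [f"https://example.com/doc-{n}.pdf" for n in range(5)]
batches = split_into_batches(urls, batch_size=2)
print([len(batch) for batch in batches])  # [2, 2, 1]
```

Each batch can then be passed as the `urls` input of a separate run.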

What output formats are available?

The Actor outputs structured data to an Apify Dataset. You can export results as JSON, CSV, XML, Excel, or RSS. You can also access the data programmatically via the Apify API, or push it directly to external services using integrations and webhooks.
