PDF URL to Markdown, Tables & RAG Extractor

Under maintenance · Maintained by Community

Extract clean Markdown, page text, tables, metadata, summaries, and AI-ready RAG chunks from PDF URLs.

Pricing: from $1.50 / 1,000 results
Developer: Inus Grobler
PDF to Markdown & AI-Ready Document Extractor

Convert PDF URLs into clean Markdown and structured JSON for AI agents, RAG pipelines, document processing workflows, scraping pipelines, and downstream Apify Actors.

This Actor downloads one PDF URL, extracts page-level content, converts the document to Markdown, optionally uses an OpenRouter LLM for cleanup, and can create source-aware RAG chunks.

Features

  • Convert PDF URLs to clean Markdown.
  • Extract page-level text and page-level Markdown.
  • Extract PDF metadata such as title, author, subject, creator, producer, dates, page count, file size, hash, and final URL.
  • LLM modes enable table extraction and OCR fallback by default.
  • Optional LLM cleanup with either the cheap or premium OpenRouter model.
  • RAG-ready chunks with page references and source URL.
  • Dynamic memory defaults: 512 MB for no_llm, 1024 MB for llm_cheap, and 2048 MB for llm_premium.
  • Robust download logic with redirects, realistic headers, retries, PDF signature checks, size limits, and proxy fallback only when needed.
  • One dataset item per processed page, so Apify's default pay-per-result event maps cleanly to per-page pricing.

Input Options

The public Apify input form has two fields: one PDF URL and one mode.

{
  "pdfUrl": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
  "mode": "no_llm"
}

LLM cleanup example:

{
  "pdfUrl": "https://example.com/document.pdf",
  "mode": "llm_cheap"
}

Visible fields:

  • pdfUrl: one PDF URL.
  • mode: no_llm, llm_cheap, or llm_premium.

Advanced JSON/API fields are still supported for automation and legacy integrations, but they are not shown in the public form.
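For automation, the two visible fields can be checked before a run. A minimal sketch in Python; the field names mirror the public form, while the helper itself is illustrative and not part of the Actor:

```python
# Validate the two visible input fields (pdfUrl, mode) before calling the Actor.
# VALID_MODES mirrors the documented modes; this helper is a sketch, not Actor code.
VALID_MODES = {"no_llm", "llm_cheap", "llm_premium"}

def validate_input(run_input: dict) -> dict:
    pdf_url = run_input.get("pdfUrl", "")
    if not pdf_url.lower().startswith(("http://", "https://")):
        raise ValueError("pdfUrl must be an http(s) URL")
    mode = run_input.get("mode", "no_llm")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    return {"pdfUrl": pdf_url, "mode": mode}
```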

Modes

  • no_llm: fast PDF extraction with no LLM, OCR, or table extraction. This is the lowest-cost mode.
  • llm_cheap: AI-ready extraction with RAG chunks, table extraction, OCR fallback, and OpenRouter cheap-model cleanup.
  • llm_premium: AI-ready extraction with RAG chunks, table extraction, OCR fallback, and OpenRouter premium-model cleanup.
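The mode matrix above, together with the memory defaults from the Features list, can be summarized as a lookup table. A sketch; the key names mirror the documented behavior, not the Actor's internal configuration:

```python
# Documented per-mode feature set and default memory, per the README.
# Key names are illustrative, not the Actor's internal config names.
MODE_FEATURES = {
    "no_llm":      {"tables": False, "ocr": False, "llm_cleanup": False, "rag_chunks": False, "memory_mb": 512},
    "llm_cheap":   {"tables": True,  "ocr": True,  "llm_cleanup": True,  "rag_chunks": True,  "memory_mb": 1024},
    "llm_premium": {"tables": True,  "ocr": True,  "llm_cleanup": True,  "rag_chunks": True,  "memory_mb": 2048},
}
```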

LLM Configuration

LLM usage is optional and off by default for cost control. In production, AI features use the OpenRouter API key configured on the Actor.

OpenRouter is the native provider used by the existing Actor path:

  • OPENROUTER_API_KEY
  • OPENROUTER_CHEAP_MODEL
  • OPENROUTER_PREMIUM_MODEL
  • OPENROUTER_MODEL
  • OPENROUTER_INPUT_COST_PER_MILLION
  • OPENROUTER_OUTPUT_COST_PER_MILLION
  • PDFPLUMBER_ENABLE_TEXT_TABLES for aggressive whitespace-based table detection, disabled by default to avoid false positives.

Do not ask users to paste API keys into the input form. Configure keys as Actor environment variables or Apify secrets.
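A minimal sketch of reading the documented environment variables, assuming they are set as Actor environment variables or secrets; the placeholder default model ids are assumptions, not the Actor's actual defaults:

```python
import os

# Read the documented OpenRouter / pdfplumber variables.
# Placeholder model ids below are illustrative defaults, not the Actor's.
def load_llm_config() -> dict:
    return {
        "api_key": os.environ.get("OPENROUTER_API_KEY", ""),
        "cheap_model": os.environ.get("OPENROUTER_CHEAP_MODEL", "openrouter/cheap-model-placeholder"),
        "premium_model": os.environ.get("OPENROUTER_PREMIUM_MODEL", "openrouter/premium-model-placeholder"),
        "input_cost_per_million": float(os.environ.get("OPENROUTER_INPUT_COST_PER_MILLION", "0")),
        "output_cost_per_million": float(os.environ.get("OPENROUTER_OUTPUT_COST_PER_MILLION", "0")),
        "text_tables": os.environ.get("PDFPLUMBER_ENABLE_TEXT_TABLES", "false").lower() == "true",
    }
```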

Output Format

The Actor pushes one dataset item per processed page, so Apify's apify-default-dataset-item result event effectively bills per page for successful PDFs. Failed PDFs still push one failure row.

Document-level Markdown and optional artifacts are saved in the key-value store. Each page row includes document metadata plus the current page's text, Markdown, tables, links, and matching RAG chunks.

{
  "sourceUrl": "https://example.com/document.pdf",
  "finalUrl": "https://example.com/document.pdf",
  "status": "success",
  "recordType": "page",
  "fileName": "document.pdf",
  "fileSizeBytes": 842193,
  "contentHash": "sha256-hash-here",
  "title": "Document title",
  "author": "Author",
  "subject": "Subject",
  "createdDate": "2026-01-01T00:00:00Z",
  "modifiedDate": "2026-02-01T00:00:00Z",
  "pageCount": 12,
  "processedPageCount": 12,
  "language": "en",
  "processedAt": "2026-05-07T00:00:00.000Z",
  "processingDurationMs": 1842,
  "mode": "llm_cheap",
  "inputMode": "llm_cheap",
  "processingMode": "ai_ready",
  "llmPreset": "llm_cheap",
  "page": 1,
  "pageNumber": 1,
  "pageIndex": 0,
  "isFirstPage": true,
  "isLastProcessedPage": false,
  "markdownText": "Markdown for this page",
  "markdown": "Markdown for this page",
  "text": "Raw page text...",
  "pageMarkdownText": "Markdown for this page",
  "pageMarkdown": "Markdown for this page",
  "pageText": "Raw page text...",
  "pages": [
    {
      "page": 1,
      "text": "Raw page text...",
      "markdown": "Markdown for this page",
      "tables": [],
      "links": [],
      "textCharCount": 1234,
      "markdownCharCount": 1250,
      "tableCount": 0,
      "linkCount": 0,
      "source": "native",
      "qualityScore": 260
    }
  ],
  "tables": [
    {
      "tableIndex": 0,
      "page": 1,
      "markdown": "| Item | Price |\n| --- | --- |\n| Example | R120 |",
      "rows": [
        {
          "Item": "Example",
          "Price": "R120"
        }
      ],
      "rowCount": 2,
      "columnCount": 2,
      "confidence": 0.82,
      "extractionMethod": "pdfplumber"
    }
  ],
  "ragChunks": [
    {
      "chunkId": "stable-short-id",
      "chunkIndex": 0,
      "pageStart": 1,
      "pageEnd": 2,
      "text": "Chunk text...",
      "markdown": "Chunk markdown...",
      "charCount": 842,
      "tokenEstimate": 211,
      "headings": ["Document heading"],
      "sourceUrl": "https://example.com/document.pdf"
    }
  ],
  "summary": "Optional summary.",
  "keywords": ["optional", "keywords"],
  "extractedData": null,
  "documentStats": {
    "markdownCharCount": 58214,
    "rawTextCharCount": 54008,
    "tableCount": 3,
    "ragChunkCount": 49,
    "emptyPageCount": 0,
    "ocrUsed": false,
    "llmCleanupUsed": false
  },
  "download": {
    "attempts": 1,
    "usedProxy": false,
    "contentType": "application/pdf"
  },
  "outputKeys": {
    "markdown": "OUTPUT_MARKDOWN"
  },
  "documentMarkdownKey": "OUTPUT_MARKDOWN",
  "warnings": [],
  "errors": []
}
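Downstream consumers can reassemble per-page rows into one document. A minimal sketch in Python; the field names (sourceUrl, pageNumber, pageMarkdown, status, recordType) come from the example item above, and the joining strategy is an assumption:

```python
from collections import defaultdict

# Group successful page rows by source URL and rejoin their Markdown in
# page order. Field names follow the example dataset item in this README.
def assemble_documents(items: list) -> dict:
    pages = defaultdict(list)
    for item in items:
        if item.get("status") == "success" and item.get("recordType") == "page":
            pages[item["sourceUrl"]].append((item["pageNumber"], item["pageMarkdown"]))
    return {
        url: "\n\n".join(md for _, md in sorted(page_list))
        for url, page_list in pages.items()
    }
```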

Failed items are still pushed:

{
  "sourceUrl": "https://example.com/broken.pdf",
  "status": "failed",
  "recordType": "failure",
  "processedAt": "2026-05-07T00:00:00.000Z",
  "errors": [
    {
      "step": "download",
      "message": "Failed to download PDF after retries"
    }
  ],
  "warnings": []
}

The full document Markdown is stored in the key-value store under OUTPUT_MARKDOWN for single-PDF runs, or OUTPUT_MARKDOWN_001, OUTPUT_MARKDOWN_002, and so on for batches. The Actor does not build one combined Markdown file for all PDFs, which keeps batch memory usage lower. Dataset items include documentStats, download, and outputKeys objects for monitoring and downstream automation.
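The key naming above is easy to reproduce when fetching batch results from the key-value store; a small sketch of the zero-padded scheme described here:

```python
# Build the key-value store key for the full document Markdown:
# single runs use OUTPUT_MARKDOWN, batch runs use a zero-padded suffix.
def markdown_key(index=None) -> str:
    if index is None:
        return "OUTPUT_MARKDOWN"
    return f"OUTPUT_MARKDOWN_{index:03d}"
```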

Use Cases

  • Convert PDFs to Markdown for AI prompts and agents.
  • Prepare PDFs for RAG ingestion and vector databases.
  • Extract page-level text with source references.
  • Extract tables for finance, procurement, research, and compliance workflows.
  • Clean messy PDF text with optional LLM cleanup.
  • Process scanned PDFs with OCR fallback.
  • Feed downstream Apify Actors with consistent document JSON.

Cost Notes

  • no_llm is the cheapest mode.
  • llm_cheap uses the cheaper OpenRouter model.
  • llm_premium uses the premium OpenRouter model for harder PDFs.
  • The Actor uses 512 MB for no_llm, 1024 MB for llm_cheap, and 2048 MB for llm_premium by default.
  • The default run timeout on Apify is 3600 seconds, so large LLM PDFs have room to finish.
  • OCR and table extraction are off in no_llm mode to keep runs cheap.
  • OCR fallback and table extraction are enabled in llm_cheap and llm_premium because those modes carry the higher paid feature set.
  • Large text PDFs use a fast native extraction path before heavier cleanup, which keeps llm_cheap more efficient.
  • Table extraction uses conservative pdfplumber strategies by default. Enable PDFPLUMBER_ENABLE_TEXT_TABLES=true only when whitespace-based tables are more important than speed/noise control.
  • Long documents are compacted before document-level LLM tasks.
  • Page-image export, source PDF saving, diagnostics, OCR, and LLM tasks can increase compute or storage costs.
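The per-million-token prices exposed through OPENROUTER_INPUT_COST_PER_MILLION and OPENROUTER_OUTPUT_COST_PER_MILLION make a rough LLM cost estimate straightforward. A sketch of that arithmetic, not the Actor's billing logic:

```python
# Rough LLM cost estimate: tokens / 1e6 * price-per-million, summed over
# input and output. Illustrative only; not the Actor's billing logic.
def estimate_llm_cost(input_tokens: int, output_tokens: int,
                      input_cost_per_million: float,
                      output_cost_per_million: float) -> float:
    return (input_tokens / 1_000_000 * input_cost_per_million
            + output_tokens / 1_000_000 * output_cost_per_million)
```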

Limitations

  • Some scanned PDFs require OCR, and OCR quality depends on installed Tesseract language packs.
  • Complex, nested, or visually designed tables may not extract perfectly.
  • LLM cleanup can improve formatting but may introduce interpretation.
  • Very large PDFs may take longer or need advanced page limits for testing.
  • Password-protected or encrypted PDFs are not supported.
  • Full embedded image extraction is not implemented yet; page PNG export is available for review.

Local Checks

Install dependencies and run tests:

python3 -m venv .venv-local
.venv-local/bin/pip install -r requirements.txt
.venv-local/bin/python -m unittest discover

Run locally through the Apify runtime or CLI with an input similar to:

{
  "pdfUrl": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
  "mode": "no_llm"
}

FAQ

Does it use an LLM by default?

No. The default no_llm mode does not use the LLM.

Can it process multiple PDFs?

The public form is single-URL by design. Advanced/API batch input with pdfUrls is still accepted for automation.

Does it support RAG?

Yes. llm_cheap and llm_premium create source-aware chunks by default. Advanced API users can also enable chunks in other modes.
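Because each chunk carries page references and the source URL, mapping it to a vector-store record is mechanical. A sketch; the payload shape is a generic assumption, and the field names come from the ragChunks example in Output Format:

```python
# Map one ragChunk (schema shown in Output Format) to a generic
# vector-store payload. The payload shape here is illustrative.
def chunk_to_payload(chunk: dict) -> dict:
    return {
        "id": chunk["chunkId"],
        "text": chunk["text"],
        "metadata": {
            "sourceUrl": chunk["sourceUrl"],
            "pageStart": chunk["pageStart"],
            "pageEnd": chunk["pageEnd"],
            "headings": chunk.get("headings", []),
        },
    }
```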

Does it extract tables?

no_llm skips table extraction for speed and cost. llm_cheap and llm_premium enable table extraction by default using pdfplumber heuristics. Complex tables may still need review.
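The rows array in each table item can be turned back into a Markdown table for prompts or review. A minimal sketch over the list-of-dicts shape shown in Output Format:

```python
# Rebuild a Markdown table from the "rows" list-of-dicts shape in the
# example output; column order follows the first row's keys.
def rows_to_markdown(rows: list) -> str:
    if not rows:
        return ""
    headers = list(rows[0].keys())
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(str(row.get(h, "")) for h in headers) + " |")
    return "\n".join(lines)
```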

What happens on broken URLs?

The Actor pushes a failed dataset item with status: "failed" and an errors array describing the failed step.

Why are there only two inputs?

The Apify form shows only the options clients actually need: a PDF URL and an LLM mode. Advanced controls remain available through JSON/API input for power users and integrations.