PDF URL to Markdown, Tables & RAG Extractor

Under maintenance · Maintained by Community

Extract clean Markdown, page text, tables, metadata, summaries, and AI-ready RAG chunks from PDF URLs.

Pricing: from $1.50 / 1,000 results
Developer: Inus Grobler
PDF to Markdown & AI-Ready Document Extractor

Convert PDF URLs into clean Markdown and structured JSON for AI agents, RAG pipelines, document processing workflows, scraping pipelines, and downstream Apify Actors.

This Actor downloads one PDF URL, extracts page-level content, converts the document to Markdown, optionally uses an OpenRouter LLM for cleanup, and can create source-aware RAG chunks.

Features

  • Convert PDF URLs to clean Markdown.
  • Extract page-level text and page-level Markdown.
  • Extract PDF metadata such as title, author, subject, creator, producer, dates, page count, file size, hash, and final URL.
  • LLM modes enable table extraction and OCR fallback by default.
  • Optional LLM cleanup with either the cheap or premium OpenRouter model.
  • RAG-ready chunks with page references and source URL.
  • Dynamic memory defaults: 512 MB for no_llm, 1024 MB for llm_cheap, and 2048 MB for llm_premium.
  • Robust download logic with redirects, realistic headers, retries, PDF signature checks, size limits, and proxy fallback only when needed.
  • One dataset item per processed page, so Apify's default pay-per-result event maps cleanly to per-page pricing.

Input Options

The public Apify input form has two fields: one PDF URL and one mode.

{
  "pdfUrl": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
  "mode": "no_llm"
}

LLM cleanup example:

{
  "pdfUrl": "https://example.com/document.pdf",
  "mode": "llm_cheap"
}

Visible fields:

  • pdfUrl: one PDF URL.
  • mode: no_llm, llm_cheap, or llm_premium.

Advanced JSON/API fields are still supported for automation and legacy integrations, but they are not shown in the public form.
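For automation, the two visible fields can be checked before a run. A minimal sketch in Python; the field names mirror the public form, while the helper itself is illustrative and not part of the Actor:

```python
# Validate the two visible input fields (pdfUrl, mode) before calling the Actor.
# VALID_MODES mirrors the documented modes; this helper is a sketch, not Actor code.
VALID_MODES = {"no_llm", "llm_cheap", "llm_premium"}

def validate_input(run_input: dict) -> dict:
    pdf_url = run_input.get("pdfUrl", "")
    if not pdf_url.lower().startswith(("http://", "https://")):
        raise ValueError("pdfUrl must be an http(s) URL")
    mode = run_input.get("mode", "no_llm")
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    return {"pdfUrl": pdf_url, "mode": mode}
```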

Modes

  • no_llm: fast PDF extraction with no LLM, OCR, or table extraction. This is the lowest-cost mode.
  • llm_cheap: AI-ready extraction with RAG chunks, table extraction, OCR fallback, and OpenRouter cheap-model cleanup.
  • llm_premium: AI-ready extraction with RAG chunks, table extraction, OCR fallback, and OpenRouter premium-model cleanup.
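The mode matrix above, together with the memory defaults from the Features list, can be summarized as a lookup table. A sketch; the key names mirror the documented behavior, not the Actor's internal configuration:

```python
# Documented per-mode feature set and default memory, per the README.
# Key names are illustrative, not the Actor's internal config names.
MODE_FEATURES = {
    "no_llm":      {"tables": False, "ocr": False, "llm_cleanup": False, "rag_chunks": False, "memory_mb": 512},
    "llm_cheap":   {"tables": True,  "ocr": True,  "llm_cleanup": True,  "rag_chunks": True,  "memory_mb": 1024},
    "llm_premium": {"tables": True,  "ocr": True,  "llm_cleanup": True,  "rag_chunks": True,  "memory_mb": 2048},
}
```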

LLM Configuration

LLM usage is optional and off by default for cost control. In production, AI features use the OpenRouter API key configured on the Actor.

OpenRouter is the native provider used by the existing Actor path:

  • OPENROUTER_API_KEY
  • OPENROUTER_CHEAP_MODEL
  • OPENROUTER_PREMIUM_MODEL
  • OPENROUTER_MODEL
  • OPENROUTER_INPUT_COST_PER_MILLION
  • OPENROUTER_OUTPUT_COST_PER_MILLION
  • PDFPLUMBER_ENABLE_TEXT_TABLES for aggressive whitespace-based table detection, disabled by default to avoid false positives.

Do not ask users to paste API keys into the input form. Configure keys as Actor environment variables or Apify secrets.
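A minimal sketch of reading the documented environment variables, assuming they are set as Actor environment variables or secrets; the placeholder default model ids are assumptions, not the Actor's actual defaults:

```python
import os

# Read the documented OpenRouter / pdfplumber variables.
# Placeholder model ids below are illustrative defaults, not the Actor's.
def load_llm_config() -> dict:
    return {
        "api_key": os.environ.get("OPENROUTER_API_KEY", ""),
        "cheap_model": os.environ.get("OPENROUTER_CHEAP_MODEL", "openrouter/cheap-model-placeholder"),
        "premium_model": os.environ.get("OPENROUTER_PREMIUM_MODEL", "openrouter/premium-model-placeholder"),
        "input_cost_per_million": float(os.environ.get("OPENROUTER_INPUT_COST_PER_MILLION", "0")),
        "output_cost_per_million": float(os.environ.get("OPENROUTER_OUTPUT_COST_PER_MILLION", "0")),
        "text_tables": os.environ.get("PDFPLUMBER_ENABLE_TEXT_TABLES", "false").lower() == "true",
    }
```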

Output Format

The Actor pushes one dataset item per processed page, so Apify's apify-default-dataset-item result event effectively bills per page for successful PDFs. Failed PDFs still push one failure row.

Document-level Markdown and optional artifacts are saved in the key-value store. Each page row includes document metadata plus the current page's text, Markdown, tables, links, and matching RAG chunks.

{
  "sourceUrl": "https://example.com/document.pdf",
  "finalUrl": "https://example.com/document.pdf",
  "status": "success",
  "recordType": "page",
  "fileName": "document.pdf",
  "fileSizeBytes": 842193,
  "contentHash": "sha256-hash-here",
  "title": "Document title",
  "author": "Author",
  "subject": "Subject",
  "createdDate": "2026-01-01T00:00:00Z",
  "modifiedDate": "2026-02-01T00:00:00Z",
  "pageCount": 12,
  "processedPageCount": 12,
  "language": "en",
  "processedAt": "2026-05-07T00:00:00.000Z",
  "processingDurationMs": 1842,
  "mode": "llm_cheap",
  "inputMode": "llm_cheap",
  "processingMode": "ai_ready",
  "llmPreset": "llm_cheap",
  "page": 1,
  "pageNumber": 1,
  "pageIndex": 0,
  "isFirstPage": true,
  "isLastProcessedPage": false,
  "markdownText": "Markdown for this page",
  "markdown": "Markdown for this page",
  "text": "Raw page text...",
  "pageMarkdownText": "Markdown for this page",
  "pageMarkdown": "Markdown for this page",
  "pageText": "Raw page text...",
  "pages": [
    {
      "page": 1,
      "text": "Raw page text...",
      "markdown": "Markdown for this page",
      "tables": [],
      "links": [],
      "textCharCount": 1234,
      "markdownCharCount": 1250,
      "tableCount": 0,
      "linkCount": 0,
      "source": "native",
      "qualityScore": 260
    }
  ],
  "tables": [
    {
      "tableIndex": 0,
      "page": 1,
      "markdown": "| Item | Price |\n| --- | --- |\n| Example | R120 |",
      "rows": [
        {
          "Item": "Example",
          "Price": "R120"
        }
      ],
      "rowCount": 2,
      "columnCount": 2,
      "confidence": 0.82,
      "extractionMethod": "pdfplumber"
    }
  ],
  "ragChunks": [
    {
      "chunkId": "stable-short-id",
      "chunkIndex": 0,
      "pageStart": 1,
      "pageEnd": 2,
      "text": "Chunk text...",
      "markdown": "Chunk markdown...",
      "charCount": 842,
      "tokenEstimate": 211,
      "headings": ["Document heading"],
      "sourceUrl": "https://example.com/document.pdf"
    }
  ],
  "summary": "Optional summary.",
  "keywords": ["optional", "keywords"],
  "extractedData": null,
  "documentStats": {
    "markdownCharCount": 58214,
    "rawTextCharCount": 54008,
    "tableCount": 3,
    "ragChunkCount": 49,
    "emptyPageCount": 0,
    "ocrUsed": false,
    "llmCleanupUsed": false
  },
  "download": {
    "attempts": 1,
    "usedProxy": false,
    "contentType": "application/pdf"
  },
  "outputKeys": {
    "markdown": "OUTPUT_MARKDOWN"
  },
  "documentMarkdownKey": "OUTPUT_MARKDOWN",
  "warnings": [],
  "errors": []
}
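Downstream consumers can reassemble per-page rows into one document. A minimal sketch in Python; the field names (sourceUrl, pageNumber, pageMarkdown, status, recordType) come from the example item above, and the joining strategy is an assumption:

```python
from collections import defaultdict

# Group successful page rows by source URL and rejoin their Markdown in
# page order. Field names follow the example dataset item in this README.
def assemble_documents(items: list) -> dict:
    pages = defaultdict(list)
    for item in items:
        if item.get("status") == "success" and item.get("recordType") == "page":
            pages[item["sourceUrl"]].append((item["pageNumber"], item["pageMarkdown"]))
    return {
        url: "\n\n".join(md for _, md in sorted(page_list))
        for url, page_list in pages.items()
    }
```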

Failed items are still pushed:

{
  "sourceUrl": "https://example.com/broken.pdf",
  "status": "failed",
  "recordType": "failure",
  "processedAt": "2026-05-07T00:00:00.000Z",
  "errors": [
    {
      "step": "download",
      "message": "Failed to download PDF after retries"
    }
  ],
  "warnings": []
}

The full document Markdown is stored in the key-value store under OUTPUT_MARKDOWN for single-PDF runs, or OUTPUT_MARKDOWN_001, OUTPUT_MARKDOWN_002, and so on for batches. The Actor does not build one combined Markdown file for all PDFs, which keeps batch memory usage lower. Dataset items include documentStats, download, and outputKeys objects for monitoring and downstream automation.
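The key naming above is easy to reproduce when fetching batch results from the key-value store; a small sketch of the zero-padded scheme described here:

```python
# Build the key-value store key for the full document Markdown:
# single runs use OUTPUT_MARKDOWN, batch runs use a zero-padded suffix.
def markdown_key(index=None) -> str:
    if index is None:
        return "OUTPUT_MARKDOWN"
    return f"OUTPUT_MARKDOWN_{index:03d}"
```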

Use Cases

  • Convert PDFs to Markdown for AI prompts and agents.
  • Prepare PDFs for RAG ingestion and vector databases.
  • Extract page-level text with source references.
  • Extract tables for finance, procurement, research, and compliance workflows.
  • Clean messy PDF text with optional LLM cleanup.
  • Process scanned PDFs with OCR fallback.
  • Feed downstream Apify Actors with consistent document JSON.

Cost Notes

  • no_llm is the cheapest mode.
  • llm_cheap uses the cheaper OpenRouter model.
  • llm_premium uses the premium OpenRouter model for harder PDFs.
  • The Actor uses 512 MB for no_llm, 1024 MB for llm_cheap, and 2048 MB for llm_premium by default.
  • The default run timeout on Apify is 3600 seconds, so large LLM PDFs have room to finish.
  • OCR and table extraction are off in no_llm mode to keep runs cheap.
  • OCR fallback and table extraction are enabled in llm_cheap and llm_premium because those modes carry the higher paid feature set.
  • Large text PDFs use a fast native extraction path before heavier cleanup, which keeps llm_cheap more efficient.
  • Table extraction uses conservative pdfplumber strategies by default. Enable PDFPLUMBER_ENABLE_TEXT_TABLES=true only when whitespace-based tables are more important than speed/noise control.
  • Long documents are compacted before document-level LLM tasks.
  • Page-image export, source PDF saving, diagnostics, OCR, and LLM tasks can increase compute or storage costs.
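The per-million-token prices exposed through OPENROUTER_INPUT_COST_PER_MILLION and OPENROUTER_OUTPUT_COST_PER_MILLION make a rough LLM cost estimate straightforward. A sketch of that arithmetic, not the Actor's billing logic:

```python
# Rough LLM cost estimate: tokens / 1e6 * price-per-million, summed over
# input and output. Illustrative only; not the Actor's billing logic.
def estimate_llm_cost(input_tokens: int, output_tokens: int,
                      input_cost_per_million: float,
                      output_cost_per_million: float) -> float:
    return (input_tokens / 1_000_000 * input_cost_per_million
            + output_tokens / 1_000_000 * output_cost_per_million)
```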

Limitations

  • Some scanned PDFs require OCR, and OCR quality depends on installed Tesseract language packs.
  • Complex, nested, or visually designed tables may not extract perfectly.
  • LLM cleanup can improve formatting but may introduce interpretation.
  • Very large PDFs may take longer or need advanced page limits for testing.
  • Password-protected or encrypted PDFs are not supported.
  • Full embedded image extraction is not implemented yet; page PNG export is available for review.

Local Checks

Install dependencies and run tests:

python3 -m venv .venv-local
.venv-local/bin/pip install -r requirements.txt
.venv-local/bin/python -m unittest discover

Run locally through the Apify runtime or CLI with an input similar to:

{
  "pdfUrl": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
  "mode": "no_llm"
}

FAQ

Does it use an LLM by default?

No. The default no_llm mode does not use the LLM.

Can it process multiple PDFs?

The public form is single-URL by design. Advanced/API batch input with pdfUrls is still accepted for automation.

Does it support RAG?

Yes. llm_cheap and llm_premium create source-aware chunks by default. Advanced API users can also enable chunks in other modes.
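Because each chunk carries page references and the source URL, mapping it to a vector-store record is mechanical. A sketch; the payload shape is a generic assumption, and the field names come from the ragChunks example in Output Format:

```python
# Map one ragChunk (schema shown in Output Format) to a generic
# vector-store payload. The payload shape here is illustrative.
def chunk_to_payload(chunk: dict) -> dict:
    return {
        "id": chunk["chunkId"],
        "text": chunk["text"],
        "metadata": {
            "sourceUrl": chunk["sourceUrl"],
            "pageStart": chunk["pageStart"],
            "pageEnd": chunk["pageEnd"],
            "headings": chunk.get("headings", []),
        },
    }
```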

Does it extract tables?

no_llm skips table extraction for speed and cost. llm_cheap and llm_premium enable table extraction by default using pdfplumber heuristics. Complex tables may still need review.
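The rows array in each table item can be turned back into a Markdown table for prompts or review. A minimal sketch over the list-of-dicts shape shown in Output Format:

```python
# Rebuild a Markdown table from the "rows" list-of-dicts shape in the
# example output; column order follows the first row's keys.
def rows_to_markdown(rows: list) -> str:
    if not rows:
        return ""
    headers = list(rows[0].keys())
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(str(row.get(h, "")) for h in headers) + " |")
    return "\n".join(lines)
```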

What happens on broken URLs?

The Actor pushes a failed dataset item with status: "failed" and an errors array describing the failed step.

Why are there only two inputs?

The Apify form shows only the options clients actually need: a PDF URL and an LLM mode. Advanced controls remain available through JSON/API input for power users and integrations.