PDF URL to Markdown, Tables & RAG Extractor
Pricing
from $1.50 / 1,000 results
Extract clean Markdown, page text, tables, metadata, summaries, and AI-ready RAG chunks from PDF URLs.
Developer
Inus Grobler
Maintained by Community
PDF to Markdown & AI-Ready Document Extractor
Convert PDF URLs into clean Markdown and structured JSON for AI agents, RAG pipelines, document processing workflows, scraping pipelines, and downstream Apify Actors.
This Actor takes a PDF URL, extracts the document into clean Markdown, and returns page-level results that are useful for AI workflows, RAG pipelines, and structured document processing.
Features
- Convert PDF URLs to clean Markdown.
- Extract page-level text and page-level Markdown.
- Extract PDF metadata such as title, author, subject, creator, producer, dates, page count, file size, hash, and final URL.
- LLM modes enable table extraction and OCR fallback by default.
- Optional LLM cleanup with either the cheap or premium model.
- RAG-ready chunks with page references and source URL.
- Dynamic memory defaults: 512 MB for no_llm, 1024 MB for llm_cheap, and 2048 MB for llm_premium.
- Robust download logic with redirects, realistic headers, retries, PDF signature checks, size limits, and proxy fallback only when needed.
- One dataset item per processed page, so Apify's default result event can be used as per-page pricing.
Input Options
The public Apify input form has two fields: one PDF URL and one mode.
{
  "pdfUrl": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
  "mode": "no_llm"
}
LLM cleanup example:
{
  "pdfUrl": "https://example.com/document.pdf",
  "mode": "llm_cheap"
}
Visible fields:
- pdfUrl: one PDF URL.
- mode: no_llm, llm_cheap, or llm_premium.
Advanced JSON/API fields are still supported for automation and legacy integrations, but they are not shown in the public form.
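As a minimal sketch, the two public fields can be checked locally before a run is started. The helper name `build_run_input` and the specific checks are assumptions for illustration, not part of the Actor; only the field names and the three mode values come from the form described above.

```python
from urllib.parse import urlparse

# Mode names taken from the input form described above.
ALLOWED_MODES = ("no_llm", "llm_cheap", "llm_premium")


def build_run_input(pdf_url: str, mode: str = "no_llm") -> dict:
    """Return an input dict with the two public fields, or raise ValueError.

    Hypothetical helper: the Actor may apply its own, different validation.
    """
    parsed = urlparse(pdf_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"pdfUrl must be an http(s) URL, got: {pdf_url!r}")
    if mode not in ALLOWED_MODES:
        raise ValueError(f"mode must be one of {ALLOWED_MODES}, got: {mode!r}")
    return {"pdfUrl": pdf_url, "mode": mode}
```

Catching a typo in the mode name locally avoids starting a run that would fail or fall back to unexpected behavior.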
Run In Apify
- Open the Apify Store page.
- Click Try for free.
- Paste your PDF link into pdfUrl.
- Choose a mode:
  - no_llm for the cheapest and fastest run.
  - llm_cheap for AI-ready extraction with lower LLM cost.
  - llm_premium for harder PDFs where you want the best cleanup.
- Start the run.
- When the run finishes:
  - Open the Dataset tab for page-by-page results.
  - Open the Key-value store tab for the full document Markdown saved under OUTPUT_MARKDOWN.
Modes
- no_llm: fast PDF extraction with no LLM, OCR, or table extraction. This is the lowest-cost mode.
- llm_cheap: AI-ready extraction with RAG chunks, table extraction, OCR fallback, and selective cheap-model cleanup on pages that need help most.
- llm_premium: AI-ready extraction with RAG chunks, table extraction, OCR fallback, and selective premium-model cleanup on pages that need the strongest repair.
For most users, the published Actor is already configured and you only need to choose the mode in the input form.
Output Format
The Actor pushes one dataset item per processed page. This means Apify's apify-default-dataset-item result event acts as per-page pricing for successful PDFs. Failed PDFs still push one failure row.
If every processed page has empty Markdown, the Actor suppresses dataset output for that PDF so users are not charged for empty page rows.
Document-level Markdown and optional artifacts are saved in the key-value store. Each page row includes document metadata plus the current page's text, Markdown, tables, links, and matching RAG chunks.
In practice, most users only need two outputs:
- Dataset: page-by-page rows with page Markdown, page text, tables, warnings, and metadata.
- Key-value store: the full combined Markdown document under OUTPUT_MARKDOWN.
{
  "sourceUrl": "https://example.com/document.pdf",
  "finalUrl": "https://example.com/document.pdf",
  "status": "success",
  "recordType": "page",
  "fileName": "document.pdf",
  "fileSizeBytes": 842193,
  "contentHash": "sha256-hash-here",
  "title": "Document title",
  "author": "Author",
  "subject": "Subject",
  "createdDate": "2026-01-01T00:00:00Z",
  "modifiedDate": "2026-02-01T00:00:00Z",
  "pageCount": 12,
  "processedPageCount": 12,
  "language": "en",
  "processedAt": "2026-05-07T00:00:00.000Z",
  "processingDurationMs": 1842,
  "mode": "llm_cheap",
  "inputMode": "llm_cheap",
  "processingMode": "ai_ready",
  "llmPreset": "llm_cheap",
  "page": 1,
  "pageNumber": 1,
  "pageIndex": 0,
  "isFirstPage": true,
  "isLastProcessedPage": false,
  "markdownText": "Markdown for this page",
  "markdown": "Markdown for this page",
  "text": "Raw page text...",
  "pageMarkdownText": "Markdown for this page",
  "pageMarkdown": "Markdown for this page",
  "pageText": "Raw page text...",
  "pages": [
    {
      "page": 1,
      "text": "Raw page text...",
      "markdown": "Markdown for this page",
      "tables": [],
      "links": [],
      "textCharCount": 1234,
      "markdownCharCount": 1250,
      "tableCount": 0,
      "linkCount": 0,
      "source": "native",
      "qualityScore": 260
    }
  ],
  "tables": [
    {
      "tableIndex": 0,
      "page": 1,
      "markdown": "| Item | Price |\\n| --- | --- |\\n| Example | R120 |",
      "rows": [{"Item": "Example", "Price": "R120"}],
      "rowCount": 2,
      "columnCount": 2,
      "confidence": 0.82,
      "extractionMethod": "pdfplumber"
    }
  ],
  "ragChunks": [
    {
      "chunkId": "stable-short-id",
      "chunkIndex": 0,
      "pageStart": 1,
      "pageEnd": 2,
      "text": "Chunk text...",
      "markdown": "Chunk markdown...",
      "charCount": 842,
      "tokenEstimate": 211,
      "headings": ["Document heading"],
      "sourceUrl": "https://example.com/document.pdf"
    }
  ],
  "summary": "Optional summary.",
  "keywords": ["optional", "keywords"],
  "extractedData": null,
  "documentStats": {
    "markdownCharCount": 58214,
    "rawTextCharCount": 54008,
    "tableCount": 3,
    "ragChunkCount": 49,
    "emptyPageCount": 0,
    "ocrUsed": false,
    "llmCleanupUsed": false
  },
  "download": {
    "attempts": 1,
    "usedProxy": false,
    "contentType": "application/pdf"
  },
  "outputKeys": {"markdown": "OUTPUT_MARKDOWN"},
  "documentMarkdownKey": "OUTPUT_MARKDOWN",
  "warnings": [],
  "errors": []
}
Failed items are still pushed:
{
  "sourceUrl": "https://example.com/broken.pdf",
  "status": "failed",
  "recordType": "failure",
  "processedAt": "2026-05-07T00:00:00.000Z",
  "errors": [
    {"step": "download", "message": "Failed to download PDF after retries"}
  ],
  "warnings": []
}
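Because success and failure rows share one dataset, a consumer usually needs to separate them first. The sketch below assumes the items have already been downloaded as a list of dicts shaped like the examples above. The function name `assemble_documents` is hypothetical, and joining pages with a blank line is an illustrative choice, not necessarily how the Actor builds OUTPUT_MARKDOWN.

```python
def assemble_documents(items):
    """Split dataset items into per-document Markdown and failure rows.

    Returns (documents, failures), where documents maps sourceUrl to the
    page Markdown joined in page order.
    """
    pages_by_url = {}
    failures = []
    for item in items:
        # Failure rows carry status "failed" and recordType "failure".
        if item.get("status") == "failed" or item.get("recordType") == "failure":
            failures.append(item)
            continue
        pages_by_url.setdefault(item["sourceUrl"], []).append(item)

    documents = {}
    for url, pages in pages_by_url.items():
        # Dataset order is not assumed to be page order, so sort explicitly.
        ordered = sorted(pages, key=lambda row: row["pageNumber"])
        documents[url] = "\n\n".join(row.get("markdown", "") for row in ordered)
    return documents, failures
```

Each entry in `failures` keeps its `errors` array, so the failed step can be logged or retried separately.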
The full document Markdown is stored in the key-value store under OUTPUT_MARKDOWN for single-PDF runs, or OUTPUT_MARKDOWN_001, OUTPUT_MARKDOWN_002, and so on for batches. The Actor does not build one combined Markdown file for all PDFs, which keeps batch memory usage lower. Dataset items include documentStats, download, and outputKeys objects for monitoring and downstream automation.
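The key naming described above can be sketched as a small helper. The three-digit, 1-based suffix is inferred from the OUTPUT_MARKDOWN_001 and OUTPUT_MARKDOWN_002 examples, and how the Actor names keys beyond 999 is not stated, so treat that part as an assumption. The record URL follows the Apify API's key-value store record endpoint; the function names are hypothetical.

```python
from urllib.parse import quote


def markdown_key(batch_index=None):
    """Return the key-value store key for a document's full Markdown.

    batch_index=None means a single-PDF run; otherwise a 1-based position
    in the batch (assumed from the _001, _002 examples).
    """
    if batch_index is None:
        return "OUTPUT_MARKDOWN"
    if batch_index < 1:
        raise ValueError("batch_index is 1-based")
    return f"OUTPUT_MARKDOWN_{batch_index:03d}"


def record_url(store_id, key):
    """Build the API URL for one record in a key-value store."""
    return (
        "https://api.apify.com/v2/key-value-stores/"
        f"{quote(store_id, safe='')}/records/{quote(key, safe='')}"
    )
```

Where available, reading the key from each dataset item's `documentMarkdownKey` or `outputKeys` field is safer than recomputing it.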
Use Cases
- Convert PDFs to Markdown for AI prompts and agents.
- Prepare PDFs for RAG ingestion and vector databases.
- Extract page-level text with source references.
- Extract tables for finance, procurement, research, and compliance workflows.
- Clean messy PDF text with optional LLM cleanup.
- Process scanned PDFs with OCR fallback.
- Feed downstream Apify Actors with consistent document JSON.
Cost Notes
- no_llm is the cheapest mode.
- llm_cheap uses the cheaper LLM model.
- llm_premium uses the premium LLM model for harder PDFs.
- The Actor uses 512 MB for no_llm, 1024 MB for llm_cheap, and 2048 MB for llm_premium by default.
- The default run timeout is 3600 seconds on Apify so large LLM PDFs have room to finish.
- OCR and table extraction are off in no_llm mode to keep runs cheap.
- OCR fallback and table extraction are enabled in llm_cheap and llm_premium because those modes carry the higher paid feature set.
- Large text PDFs use a fast native extraction path before heavier cleanup, which keeps llm_cheap more efficient.
- Very large PDFs may skip some heavier extraction steps, including structured table extraction, to avoid timeout and memory failures.
- Long documents are compacted before document-level LLM tasks.
- Page-image export, source PDF saving, diagnostics, OCR, and LLM tasks can increase compute or storage costs.
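Since one dataset item is pushed per processed page, a rough lower bound on the per-result charge can be estimated from the page count. The sketch below uses the listed starting price of $1.50 per 1,000 results as its default. Actual pricing may differ by mode or plan, and compute, storage, and LLM costs are not included, so this is an assumption-laden estimate only.

```python
def estimate_result_cost(page_count, price_per_1000=1.50):
    """Estimate the per-result charge in USD for a run.

    Assumes one billed result per processed page and a flat price per
    1,000 results; both are simplifications.
    """
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return page_count * price_per_1000 / 1000
```

For the 12-page example document this gives roughly $0.018 at the starting price.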
Limitations
- Some scanned PDFs require OCR, and OCR quality varies by document quality and scan clarity.
- Complex, nested, or visually designed tables may not extract perfectly.
- LLM cleanup can improve formatting but may introduce interpretation.
- Very large PDFs may take longer or need advanced page limits for testing.
- Password-protected or encrypted PDFs are not supported.
- Full embedded image extraction is not implemented yet; page PNG export is available for review.
Run From Python
Set APIFY_TOKEN in your environment, then use a script like this:
import json
import os
import urllib.parse
import urllib.request

# Read the API token from the environment rather than hard-coding it.
token = os.environ["APIFY_TOKEN"]
actor_id = "thescrapelab/Apify-PDF-url-scraper"

run_input = {
    "pdfUrl": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
    "mode": "llm_premium",
}

# The API path uses "username~actor-name", so replace "/" with "~".
url = (
    f"https://api.apify.com/v2/acts/{actor_id.replace('/', '~')}/runs"
    + "?"
    + urllib.parse.urlencode({
        "token": token,
        # Seconds the server waits before responding; the Apify API caps this at 60.
        "waitForFinish": 60,
    })
)

request = urllib.request.Request(
    url,
    data=json.dumps(run_input).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

with urllib.request.urlopen(request) as response:
    run = json.load(response)["data"]

# The status can still be RUNNING if the run outlasts the wait window.
print("Run status:", run["status"])
print("Run ID:", run["id"])
print("Dataset URL:", f"https://console.apify.com/storage/datasets/{run['defaultDatasetId']}")
print("Key-value store URL:", f"https://console.apify.com/storage/key-value-stores/{run['defaultKeyValueStoreId']}")
What this script does:
- Starts the Actor and waits for the run to finish, up to the waitForFinish limit.
- Returns the run metadata, including the run status at the time of the response.
- Gives you the dataset and key-value store IDs for the output.
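For runs that take longer than a single request can wait, one option is to poll the run and then download the dataset items. The sketch below uses the Apify API's run and dataset item endpoints with only the standard library. The helper names, the polling interval, and the poll limit are illustrative assumptions, and the 60-second value reflects the documented maximum for waitForFinish.

```python
import json
import time
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
# Statuses after which a run will not change any more.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}


def run_url(run_id, token, wait_secs=60):
    """URL that returns run metadata, waiting up to wait_secs server-side."""
    query = urllib.parse.urlencode({"token": token, "waitForFinish": wait_secs})
    return f"{API_BASE}/actor-runs/{run_id}?{query}"


def dataset_items_url(dataset_id, token):
    """URL that returns all dataset items as a JSON array."""
    query = urllib.parse.urlencode({"token": token, "format": "json", "clean": "true"})
    return f"{API_BASE}/datasets/{dataset_id}/items?{query}"


def get_json(url):
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def wait_and_fetch(run_id, token, max_polls=60):
    """Poll until the run reaches a terminal status, then return its items."""
    for _ in range(max_polls):
        run = get_json(run_url(run_id, token))["data"]
        if run["status"] in TERMINAL_STATUSES:
            break
        time.sleep(1)  # brief pause between long-poll requests
    else:
        raise TimeoutError(f"Run {run_id} did not finish after {max_polls} polls")
    if run["status"] != "SUCCEEDED":
        raise RuntimeError(f"Run {run_id} ended with status {run['status']}")
    return get_json(dataset_items_url(run["defaultDatasetId"], token))
```

Calling `wait_and_fetch(run["id"], token)` with the run ID printed by the script returns the page rows as a list of dicts.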
FAQ
Does it use an LLM by default?
No. The default no_llm mode does not use an LLM.
Can it process multiple PDFs?
The public form is single-URL by design. API users can still run batch workflows for automation.
Does it support RAG?
Yes. llm_cheap and llm_premium create source-aware chunks by default. Advanced API users can also enable chunks in other modes.
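Each page row carries the chunks that match that page, so a chunk spanning two pages may plausibly appear on more than one row. Assuming that is the case, the sketch below deduplicates chunks by source URL and chunkId and reshapes them into generic records for a vector store. The function name and the record layout are hypothetical, and no particular vector database API is implied.

```python
def chunks_to_records(items):
    """Collect unique RAG chunks from page rows as id/text/metadata records."""
    unique = {}
    for item in items:
        # Failure rows and no_llm rows may have no ragChunks at all.
        for chunk in item.get("ragChunks") or []:
            key = (chunk["sourceUrl"], chunk["chunkId"])
            unique.setdefault(key, chunk)  # keep the first copy seen

    ordered = sorted(unique.values(), key=lambda c: (c["sourceUrl"], c["chunkIndex"]))
    return [
        {
            "id": c["chunkId"],
            "text": c["text"],
            "metadata": {
                "sourceUrl": c["sourceUrl"],
                "pageStart": c["pageStart"],
                "pageEnd": c["pageEnd"],
            },
        }
        for c in ordered
    ]
```

Keeping pageStart, pageEnd, and sourceUrl in the metadata lets a retrieval step cite the page range a chunk came from.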
Does it extract tables?
no_llm skips table extraction for speed and cost. llm_cheap and llm_premium enable table extraction by default. Very large PDFs may skip structured table extraction to keep runs stable. Complex tables may still need review.
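When tables are extracted, each entry in a page row's tables array includes a rows list of column-to-value objects, as in the output example. A minimal sketch for turning one such table into CSV text follows. The function name is hypothetical, and taking the column order from the first row is an assumption that may not hold for irregular tables.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one extracted table's rows to CSV text."""
    rows = table.get("rows") or []
    if not rows:
        return ""
    # Use the first row's keys as the header; extra keys in later rows are ignored.
    fieldnames = list(rows[0].keys())
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Checking each table's confidence value before exporting is a reasonable way to flag complex tables for manual review.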
Does premium clean every page?
No. llm_cheap and llm_premium are selective cleanup modes. The Actor sends only the pages that look weak or messy enough to benefit from LLM repair, which keeps cost lower on large PDFs.
What happens on broken URLs?
The Actor pushes a failed dataset item with status: "failed" and an errors array describing the failed step.
Why are there only two inputs?
The Apify form shows only the options most users actually need: one PDF URL and the mode. Advanced controls remain available through JSON/API input for power users and integrations.