PDF URL to Markdown, Tables & RAG Extractor

Pricing

from $1.50 / 1,000 results

Extract clean Markdown, page text, tables, metadata, summaries, and AI-ready RAG chunks from PDF URLs.

Rating

0.0

(0)

Developer

Inus Grobler

Maintained by Community

Actor stats

  • Bookmarked: 1
  • Total users: 6
  • Monthly active users: 4
  • Last modified: 14 hours ago

PDF URL Scraper: PDF to Markdown and AI-Ready Document Extractor

PDF URL Scraper converts public PDF URLs into clean Markdown, page-level text, metadata, tables, and AI-ready JSON for RAG pipelines, document automation, research workflows, and downstream Apify Actors.

At a glance: the input is one or more public PDF URLs. The output is page-level dataset rows, Markdown records, metadata, tables, and optional AI-ready chunks. Typical use cases are RAG and document automation. Limitations, troubleshooting, and cost notes are covered below.

What this Actor does

Give the Actor one PDF URL or a list of PDF URLs. It downloads each PDF, extracts readable content, stores the full document Markdown in the key-value store, and pushes one dataset row per useful processed page.

The default mode does not use an LLM, which keeps small tests and bulk text extraction inexpensive. Optional LLM modes can improve messy pages, extract RAG chunks, and handle harder documents when quality matters more than minimum cost.

Main use cases

  • Convert PDF URLs to Markdown for AI prompts and agents.
  • Prepare documents for RAG ingestion and vector databases.
  • Extract page-level text with source URL and page references.
  • Extract tables from financial reports, forms, manuals, procurement documents, and research PDFs.
  • Process batches of public PDFs from web scraping or document monitoring workflows.
  • Store full-document Markdown and page-level JSON for downstream automation.

Simple input

Most users only need two fields.

{
    "pdfUrls": [
        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
    ],
    "mode": "no_llm"
}

Basic fields

  • pdfUrls: One or more public PDF URLs. Put one PDF URL per row. Duplicate URLs are processed once to avoid duplicate output and unnecessary cost.
  • mode: Choose no_llm, llm_cheap, or llm_premium.

Mode guide

  • no_llm: Fastest and cheapest. Best for normal text PDFs and high-volume extraction.
  • llm_cheap: Adds AI-ready cleanup, RAG chunks, table extraction, and OCR fallback at lower LLM cost.
  • llm_premium: Uses the premium cleanup path for harder PDFs where output quality matters more than cost.
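The two basic fields can be assembled and checked locally before a run is started. The following is a minimal sketch, not part of the Actor: `build_run_input` is a hypothetical helper, and the local de-duplication only mirrors what the Actor is described as doing on its side.

```python
VALID_MODES = ("no_llm", "llm_cheap", "llm_premium")


def build_run_input(pdf_urls, mode="no_llm"):
    """Return an input dict with a validated mode and de-duplicated URLs."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    # dict.fromkeys keeps the first occurrence of each URL and preserves order.
    cleaned = (url.strip() for url in pdf_urls)
    unique_urls = list(dict.fromkeys(url for url in cleaned if url))
    if not unique_urls:
        raise ValueError("at least one PDF URL is required")
    return {"pdfUrls": unique_urls, "mode": mode}


example = build_run_input(
    [
        "https://example.com/report-1.pdf",
        "https://example.com/report-2.pdf",
        "https://example.com/report-1.pdf",  # duplicate, dropped locally
    ]
)
print(example)
```

Catching a misspelled mode locally avoids starting a run that would fail input validation.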

Legacy API calls using pdfUrl still work. Advanced API users can also use lower-level fields such as advancedMode, maxPages, includeRawText, saveDiagnostics, savePageMarkdown, savePageImages, proxyConfiguration, and custom request headers. These are optional and are not needed for normal runs.

Example batch input

{
    "pdfUrls": [
        "https://example.com/report-1.pdf",
        "https://example.com/report-2.pdf",
        "https://example.com/report-3.pdf"
    ],
    "mode": "no_llm"
}

What data you get

The Actor pushes one dataset item per processed page. Each row can include:

  • Source URL and final URL after redirects.
  • Status and failure details, if a PDF could not be processed.
  • File name, file size, content hash, title, author, and PDF dates when available.
  • Page number, page text, and page Markdown.
  • Tables and table metadata when table extraction is enabled.
  • RAG chunks when AI-ready chunking is enabled.
  • Language estimate, processing duration, warnings, and download details.
  • Key-value store keys for the full Markdown document and optional artifacts.

Full-document Markdown is saved in the key-value store as OUTPUT_MARKDOWN for a single PDF, or OUTPUT_MARKDOWN_001, OUTPUT_MARKDOWN_002, and so on for batches.
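The key naming above can be sketched in a few lines. This is an illustration under assumptions: `expected_markdown_keys` and `fetch_markdown` are hypothetical helpers, the three-digit padding is inferred from the two numbered examples, and the mapping of numbers to input order is a guess. The `client` argument is expected to be an `apify_client.ApifyClient` instance.

```python
def expected_markdown_keys(pdf_count):
    """Return the key-value store keys expected for a run with pdf_count PDFs."""
    if pdf_count < 1:
        return []
    if pdf_count == 1:
        return ["OUTPUT_MARKDOWN"]
    # Assumption: three-digit zero padding, numbered from 001.
    return [f"OUTPUT_MARKDOWN_{i:03d}" for i in range(1, pdf_count + 1)]


def fetch_markdown(client, store_id, key):
    """Read one Markdown record from a key-value store, or None if it is missing."""
    record = client.key_value_store(store_id).get_record(key)
    # get_record returns None when the key does not exist.
    return None if record is None else record["value"]


print(expected_markdown_keys(1))
print(expected_markdown_keys(3))
```

The store ID for a finished run is available as `run["defaultKeyValueStoreId"]`. When in doubt about the numbering, the keys reported in each dataset row are the safer source.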

Example output row

{
    "sourceUrl": "https://example.com/document.pdf",
    "finalUrl": "https://example.com/document.pdf",
    "status": "success",
    "recordType": "page",
    "fileName": "document.pdf",
    "pageCount": 12,
    "processedPageCount": 12,
    "pageNumber": 1,
    "mode": "no_llm",
    "processingMode": "fast",
    "markdownText": "Markdown for this page...",
    "pageText": "Raw page text...",
    "tables": [],
    "ragChunks": [],
    "download": {
        "attempts": 1,
        "usedProxy": false,
        "contentType": "application/pdf"
    },
    "outputKeys": {
        "markdown": "OUTPUT_MARKDOWN"
    },
    "warnings": [],
    "errors": []
}

Failed PDFs still produce a clear failure row when the Actor starts successfully:

{
    "sourceUrl": "https://example.com/not-a-pdf",
    "status": "failed",
    "recordType": "failure",
    "errors": [
        {
            "step": "download",
            "message": "Failed to download PDF after retries"
        }
    ],
    "warnings": []
}
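Because page rows and failure rows share one dataset, downstream code usually needs to separate them. The sketch below shows one way to do that with the field names from the two example rows. `split_rows` is a hypothetical helper, and joining pages with a blank line is an assumption about how you want the Markdown reassembled.

```python
from collections import defaultdict


def split_rows(rows):
    """Separate page rows from failure rows and rebuild per-document Markdown."""
    pages_by_source = defaultdict(list)
    failures = []
    for row in rows:
        if row.get("status") == "failed" or row.get("recordType") == "failure":
            failures.append(row)
        else:
            pages_by_source[row["sourceUrl"]].append(row)

    documents = {}
    for source_url, pages in pages_by_source.items():
        # Sort by pageNumber so the joined Markdown follows the PDF's page order.
        pages.sort(key=lambda page: page.get("pageNumber", 0))
        documents[source_url] = "\n\n".join(
            page.get("markdownText", "") for page in pages
        )
    return documents, failures


sample_rows = [
    {"sourceUrl": "https://example.com/document.pdf", "status": "success",
     "recordType": "page", "pageNumber": 2, "markdownText": "Page two"},
    {"sourceUrl": "https://example.com/document.pdf", "status": "success",
     "recordType": "page", "pageNumber": 1, "markdownText": "Page one"},
    {"sourceUrl": "https://example.com/not-a-pdf", "status": "failed",
     "recordType": "failure",
     "errors": [{"step": "download",
                 "message": "Failed to download PDF after retries"}]},
]
documents, failures = split_rows(sample_rows)
print(documents)
for failure in failures:
    messages = [error["message"] for error in failure.get("errors", [])]
    print(failure["sourceUrl"], messages)
```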

How to run on Apify

  1. Open the Actor page on Apify.
  2. Paste one or more PDF URLs into pdfUrls.
  3. Keep mode as no_llm for the cheapest run, or choose an LLM mode when you need cleanup, OCR fallback, tables, or RAG chunks.
  4. Start the run.
  5. Open the Dataset tab for page-by-page JSON results.
  6. Open the Key-value store tab to download full-document Markdown files.

Exporting results

You can export dataset rows from Apify as JSON, CSV, Excel, XML, or RSS. For document-level Markdown, open the run's key-value store and download the OUTPUT_MARKDOWN record or the numbered batch records.

Python API example

from apify_client import ApifyClient

# Authenticate with your Apify API token.
client = ApifyClient("YOUR_APIFY_TOKEN")

run_input = {
    "pdfUrls": [
        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
    ],
    "mode": "no_llm",
}

# Start the Actor and wait for the run to finish.
run = client.actor("thescrapelab/Apify-PDF-url-scraper").call(run_input=run_input)

# Print the status, page number, and a short Markdown preview for each row.
dataset_id = run["defaultDatasetId"]
for item in client.dataset(dataset_id).iterate_items():
    print(item["status"], item.get("pageNumber"), item.get("markdownText", "")[:120])

Advanced options

Advanced options are available through JSON or API input for automation workflows. Use them only when you need tighter control.

  • maxPages: Process only the first N pages of each PDF. Useful for samples and cost control.
  • includeRawText: Include full raw page text in dataset rows.
  • saveDiagnostics: Save page-level diagnostics to the key-value store.
  • savePageMarkdown: Save page Markdown records separately.
  • savePageImages: Save selected page PNGs in a ZIP file for review.
  • maxDownloadMb: Reject PDFs above a configured download size.
  • maxRetries: Limit retry attempts for unreliable URLs.
  • skipHeadPreflight: Skip the initial HEAD request for servers that block HEAD.
  • proxyConfiguration: Use custom proxy settings for sources that block direct requests.
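As an illustration, the options above could be combined into one input like this. The field names come from the list, but the value types and the specific values are assumptions, since the input schema is not reproduced here. The `proxyConfiguration` shape is the usual Apify proxy input object and is assumed to apply to this Actor.

```python
import json

# Hypothetical sampling run: first pages only, no optional artifacts.
advanced_input = {
    "pdfUrls": ["https://example.com/report-1.pdf"],
    "mode": "no_llm",
    "maxPages": 5,               # assumed integer: first 5 pages of each PDF
    "includeRawText": False,     # assumed boolean
    "saveDiagnostics": False,    # assumed boolean
    "savePageMarkdown": False,   # assumed boolean
    "savePageImages": False,     # assumed boolean
    "maxDownloadMb": 50,         # assumed to be megabytes; illustrative value
    "maxRetries": 2,             # illustrative value
    "skipHeadPreflight": True,   # for servers that block HEAD requests
    "proxyConfiguration": {"useApifyProxy": True},  # assumed shape
}

print(json.dumps(advanced_input, indent=2))
```

Check the Actor's input tab for the actual types and defaults before relying on any of these values.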

Cost and pricing notes

Cost is mainly driven by memory, runtime, page count, storage writes, and whether LLM/OCR features are used.

  • Use no_llm for high-volume PDF-to-Markdown extraction.
  • Use maxPages when testing large PDFs.
  • Avoid savePageImages unless you need visual review artifacts.
  • Use LLM modes only when the output quality gain is worth the extra cost.
  • The recommended Store pricing model is pay per successful page result, with optional separate LLM page events if monetization is enabled.
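For a rough budget, the listed starting price of $1.50 per 1,000 results can be turned into a lower-bound estimate. This sketch assumes one billed result per successful page row and ignores any LLM events or platform usage. Whether failure rows are billed is not stated, so they are left out. `estimate_result_cost` is a hypothetical helper.

```python
def estimate_result_cost(page_results, usd_per_1000=1.50):
    """Lower-bound estimate in USD from the listed per-result starting price."""
    return page_results * usd_per_1000 / 1000


# 2,500 page rows at the listed starting price.
print(f"${estimate_result_cost(2_500):.2f}")
```

Combining this with maxPages gives a simple way to cap a test run before processing a full batch.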

Limits and caveats

  • The Actor works with public HTTP and HTTPS PDF URLs.
  • Password-protected or encrypted PDFs are not supported.
  • Some scanned PDFs require OCR, and OCR quality depends on scan quality.
  • Complex, nested, or visually designed tables may need review.
  • LLM cleanup can improve formatting but may introduce interpretation.
  • Very large PDFs can take longer; use maxPages for sampling or testing.
  • Duplicate input URLs are processed only once, so repeated entries do not produce duplicate results.

Troubleshooting

  • If a URL fails, confirm it opens directly in a browser and returns a PDF, not an HTML landing page.
  • If a server blocks downloads, try skipHeadPreflight or a proxy configuration.
  • If a run is expensive, switch to no_llm, add maxPages, and disable optional artifacts.
  • If output is empty, the PDF may be scanned, image-only, encrypted, or blocked by the source server.
  • If tables look imperfect, try an LLM mode and review the warnings field.
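The first check above can be automated before a run. A PDF file begins with the signature `%PDF-`, so inspecting the first bytes of the response is more reliable than trusting the Content-Type header alone. The helpers below are a hypothetical sketch using only the standard library; `probe` makes a real HTTP request, and its User-Agent string is arbitrary.

```python
import urllib.request


def describe_response(content_type, first_bytes):
    """Classify a response as 'pdf', 'html landing page', or 'unknown'."""
    head = first_bytes.lstrip()
    if head.startswith(b"%PDF-"):
        return "pdf"
    is_html_header = "html" in (content_type or "").lower()
    is_html_body = head.lower().startswith((b"<!doctype", b"<html"))
    if is_html_header or is_html_body:
        return "html landing page"
    return "unknown"


def probe(url, timeout=30):
    """Download the first kilobyte of url and classify it."""
    request = urllib.request.Request(url, headers={"User-Agent": "pdf-probe/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        content_type = response.headers.get("Content-Type")
        return describe_response(content_type, response.read(1024))


print(describe_response("application/pdf", b"%PDF-1.7\n"))
print(describe_response("text/html; charset=utf-8", b"<!DOCTYPE html><html>"))
```

A URL that classifies as an HTML landing page usually needs the direct download link instead.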

FAQ

Can this Actor scrape PDF URLs from a website?

This Actor processes PDF URLs you provide. If you need to discover PDF links from web pages first, run a web crawler or link scraper before this Actor.
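For a single page you have already fetched, link discovery can be as small as the sketch below. It is not part of this Actor: `PdfLinkParser` and `find_pdf_links` are hypothetical names, and treating only paths that end in `.pdf` as PDFs is a simplifying assumption that misses PDFs served from other URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a href> links whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)
        # Compare the path only, so query strings such as ?v=2 are ignored.
        is_pdf = urlparse(absolute).path.lower().endswith(".pdf")
        if is_pdf and absolute not in self.pdf_urls:
            self.pdf_urls.append(absolute)


def find_pdf_links(html, base_url):
    """Return de-duplicated PDF links found in html, resolved against base_url."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_urls


page = (
    '<a href="/files/report.pdf">Report</a> '
    '<a href="about.html">About</a> '
    '<a href="https://cdn.example.com/a.PDF?v=2">A</a>'
)
print(find_pdf_links(page, "https://example.com/docs/"))
```

The resulting list can be passed directly as pdfUrls.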

Does it convert PDF to Markdown?

Yes. It saves full-document Markdown in the key-value store and page-level Markdown in the dataset.

Does it use an LLM by default?

No. The default no_llm mode avoids LLM calls for lower cost.

Can it process multiple PDFs in one run?

Yes. Add multiple URLs to pdfUrls. Duplicate URLs are processed once.

Does it support RAG?

Yes. llm_cheap and llm_premium create source-aware RAG chunks by default. Advanced users can also enable RAG chunks through API input.

Does it extract tables?

Table extraction is enabled in the LLM modes and can be controlled by advanced options. Complex tables may still need manual review.

What happens if one PDF fails in a batch?

The Actor pushes a failure row for that PDF and continues with the remaining URLs.

What is the best setting for large batches?

Use mode: "no_llm", keep optional artifacts disabled, and use maxPages when you only need a sample.