PDF to Markdown & JSON (RAG-Ready)

Pricing

from $2.00 / 1,000 pages processed

Convert PDFs to clean Markdown and structured JSON (text + tables) for RAG, LLMs, and vector DBs. Batch URLs, pay per page.

Developer

BasisWeb

Maintained by Community

Convert PDFs into clean Markdown and structured JSON (text + tables) you can drop straight into a RAG pipeline, an LLM prompt, or a vector database. Give it a list of PDF URLs; it returns one record per page.

Think of it as the PDF companion to web crawlers like Website Content Crawler and RAG Web Browser: point it at the PDFs your crawler discovers and get clean, page-level text + tables back.

What it does

  • Downloads each PDF by URL.
  • Extracts text using the PDF's character layout (natural reading order for standard single-column pages) and detects tables, rendering them as Markdown tables and structured rows.
  • Returns one dataset item per page. url, page, totalPages, tableCount, and ok are always present; markdown and/or text + tables are included depending on outputFormat (the default, both, returns all of them).

Use cases

  • RAG ingestion: turn reports, manuals, and whitepapers into clean, page-level Markdown chunks for a vector database.
  • LLM document Q&A: feed structured text and tables to an LLM without copy-paste cleanup.
  • Extract tables from PDF: pull tables out as both Markdown and structured rows.
  • Agent pipelines: chain it after a web crawler so an AI agent can read the PDFs it finds.
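As a rough illustration of the RAG ingestion use case, the sketch below turns per-page items into chunks ready for embedding. It assumes items shaped like the example output in this README; the `page_items_to_chunks` helper, the chunk layout, and the `id` scheme are hypothetical and not part of the Actor.

```python
from typing import Any


def page_items_to_chunks(items: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Turn per-page dataset items into chunks ready for embedding.

    Assumes each item follows the per-page schema described in this README
    (url, page, totalPages, ok, and optionally markdown/text).
    """
    chunks = []
    for item in items:
        # Skip failed URLs and failed pages (ok is false on those).
        if not item.get("ok"):
            continue
        # Prefer Markdown, fall back to plain text if only that is present.
        body = item.get("markdown") or item.get("text") or ""
        if not body.strip():
            # Pages with no extractable content carry nothing to embed.
            continue
        chunks.append(
            {
                # Hypothetical ID scheme: source URL plus page number.
                "id": f"{item['url']}#page={item['page']}",
                "text": body,
                "metadata": {
                    "source": item["url"],
                    "page": item["page"],
                    "totalPages": item["totalPages"],
                },
            }
        )
    return chunks
```

Keeping the source URL and page number in the metadata lets a retriever cite the exact page a chunk came from.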

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| pdfUrls | array of URLs | (required) | Direct links to the PDFs to convert. |
| extractTables | boolean | true | Detect tables and render them as Markdown + structured rows. |
| outputFormat | markdown, json, or both | both | What each result includes. |

Example input

{
  "pdfUrls": ["https://example.com/report.pdf"],
  "extractTables": true,
  "outputFormat": "both"
}
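The same input can be built and submitted from Python. This is a minimal sketch, assuming the third-party `apify-client` package is installed and an `APIFY_TOKEN` environment variable is set. The helper names `build_run_input` and `run_actor` are hypothetical, and the empty-list check is a local sanity check rather than documented Actor behavior. The client calls follow the `apify-client` documentation as I understand it, so check them against the version you install.

```python
import os
from typing import Any


def build_run_input(
    pdf_urls: list[str],
    extract_tables: bool = True,
    output_format: str = "both",
) -> dict[str, Any]:
    """Build the Actor input from the three fields in the Input table."""
    if not pdf_urls:
        # Local sanity check (assumption): pdfUrls is the only required field.
        raise ValueError("pdfUrls is required and must not be empty")
    if output_format not in {"markdown", "json", "both"}:
        raise ValueError(f"unsupported outputFormat: {output_format!r}")
    return {
        "pdfUrls": list(pdf_urls),
        "extractTables": extract_tables,
        "outputFormat": output_format,
    }


def run_actor(run_input: dict[str, Any]) -> list[dict[str, Any]]:
    """Run the Actor and return its per-page dataset items.

    Assumes `pip install apify-client` and an APIFY_TOKEN environment variable.
    """
    # Imported lazily so build_run_input works without the package installed.
    from apify_client import ApifyClient

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor("basisweb/pdf-to-markdown-rag").call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Calling `run_actor(build_run_input(["https://example.com/report.pdf"]))` would start a billable run, so try it with a small PDF first.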

Example output (one item per page)

{
  "url": "https://example.com/report.pdf",
  "page": 1,
  "totalPages": 12,
  "markdown": "Q3 Report\nRevenue grew 18% YoY...\n\n| Region | Revenue |\n| --- | --- |\n| NA | $4.1M |",
  "text": "Q3 Report\nRevenue grew 18% YoY...",
  "tables": [[["Region", "Revenue"], ["NA", "$4.1M"]]],
  "tableCount": 1,
  "ok": true
}

The example above uses the default outputFormat: "both", so it includes every field. Each item also includes ok (set to false on a failed URL or page, with an error field explaining why) and, on pages with no extractable text or tables, a note flag.
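One way to act on the ok, error, and note fields is to sort items into three buckets before further processing. The sketch below also converts a table into keyed records. It assumes the field semantics described above and that the first row of each table is a header, as in the example output. Both helper names are hypothetical.

```python
from typing import Any


def triage_items(items: list[dict[str, Any]]) -> dict[str, list[dict[str, Any]]]:
    """Split items into content pages, empty pages, and failures."""
    result: dict[str, list[dict[str, Any]]] = {"content": [], "empty": [], "failed": []}
    for item in items:
        if not item.get("ok", False):
            # Failed URL or page; the item carries an error field.
            result["failed"].append(item)
        elif item.get("note"):
            # No extractable text or tables on this page.
            result["empty"].append(item)
        else:
            result["content"].append(item)
    return result


def table_to_records(table: list[list[str]]) -> list[dict[str, str]]:
    """Convert one table (a list of rows) into dicts keyed by the first row.

    Assumes the first row is the header, as in the example output.
    """
    if not table:
        return []
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]
```

For the example item above, `table_to_records(item["tables"][0])` yields a single record mapping Region to NA and Revenue to $4.1M.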

Pricing (pay-per-event)

  • Run start: a small flat fee per run (Apify's built-in start event).
  • Page processed: charged per page that returns real content (text and/or tables).

Pages with no extractable text or tables are returned with a note and are NOT charged. Failed URLs and failed pages are reported with an error and are never charged.

Your spending limit is always respected: set a max cost per run and the Actor stops once it's reached.
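The pricing rules above reduce to simple arithmetic. The sketch below is an estimate only: the default rate comes from the listed starting price of $2.00 per 1,000 pages, while the run-start fee is not stated in this README, so `start_fee` defaults to a placeholder of 0.0 that you should replace with the real value. Both function names are hypothetical.

```python
def estimate_run_cost(
    billable_pages: int,
    price_per_1000_pages: float = 2.00,  # listed "from" price; your rate may differ
    start_fee: float = 0.0,              # placeholder: the actual fee is not stated here
) -> float:
    """Estimate the cost of one run in USD.

    Only pages that return real content are billable; blank, scanned,
    and failed pages should not be counted in billable_pages.
    """
    if billable_pages < 0:
        raise ValueError("billable_pages must be non-negative")
    return start_fee + billable_pages * price_per_1000_pages / 1000


def pages_within_budget(
    max_cost: float,
    price_per_1000_pages: float = 2.00,
    start_fee: float = 0.0,
) -> int:
    """Rough count of billable pages that fit under a max cost per run."""
    remaining = max_cost - start_fee
    if remaining <= 0:
        return 0
    return int(remaining * 1000 / price_per_1000_pages)
```

Under these assumptions, a run with 1,200 billable pages costs about $2.40 plus the start fee, and a $1.00 limit covers roughly 500 billable pages.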

Use it with AI agents (Apify MCP)

This Actor is available as a tool for AI agents through Apify's MCP server (mcp.apify.com). An agent can call basisweb/pdf-to-markdown-rag to convert a PDF to Markdown mid-task, then chain the page-level output into the next step. The only required input is pdfUrls, so an agent can invoke it in one shot, and the output schema tells the agent exactly which fields come back (markdown, text, tables, tableCount) before it spends a credit.

Honest notes

  • This handles digital, text-based PDFs. Scanned PDFs (image-only, no text layer) are not OCR'd in this version; those pages come back with a note instead of text and are not charged. OCR is planned for a future version.
  • Each PDF must be under 50 MB. Very large or table-heavy PDFs run best at 2 GB memory or higher (the 50 MB limit caps file size, not parsing memory).
  • You can also parse PDFs locally for free with open-source libraries. This is the no-setup, hosted, pay-per-page version for pipelines that just want it as an API.
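To avoid submitting files over the 50 MB limit, a pre-flight size check can be run on each URL. This is a sketch under two assumptions: "50 MB" is read as 50,000,000 bytes (the stricter interpretation, since the exact unit is not stated), and the server reports a Content-Length header on a HEAD request, which not all servers do. The helper names are hypothetical.

```python
import urllib.request
from typing import Optional

# Assumption: "50 MB" read as decimal megabytes, the stricter interpretation.
MAX_PDF_BYTES = 50 * 1000 * 1000


def fits_size_limit(content_length: Optional[str], limit: int = MAX_PDF_BYTES) -> Optional[bool]:
    """Return True or False when the size is known, None when it is not."""
    if content_length is None or not content_length.strip().isdigit():
        return None
    return int(content_length) < limit


def remote_pdf_fits(url: str, timeout: float = 10.0) -> Optional[bool]:
    """Issue a HEAD request and compare Content-Length to the limit."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return fits_size_limit(response.headers.get("Content-Length"))
```

A result of None means the size could not be determined, so the URL can still be submitted and the Actor's own limit applies.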

FAQ

Does it work on scanned PDFs? Not in this version. Image-only pages with no text layer come back with a note and are not charged. OCR is planned for a future version.

What does it return per page? One item per page with markdown and/or text + tables (depending on outputFormat), plus url, page, totalPages, tableCount, and ok.

How is it priced? A small per-run start fee plus a per-page fee, charged only for pages with real content. Blank, scanned, and failed pages are never charged.

Can I use it for RAG? Yes, that is the point. The Markdown is clean and page-scoped, so you can chunk and embed it directly.

How is this different from parsing PDFs locally? You can parse PDFs locally for free with open-source libraries. This is the no-setup, hosted, pay-per-page version for pipelines that just want it as an API.

Run locally

apify run

Deploy

apify login
apify push