Pricing

from $2.00 / 1,000 pages processed

PDF to Markdown & JSON (RAG-Ready)

Convert PDFs to clean Markdown and structured JSON (text + tables) for RAG, LLMs, and vector DBs. Batch URLs, pay per page.

Developer

BasisWeb

Last modified: 24 days ago

What it does

Downloads each PDF by URL.
Extracts text using the PDF's character layout (natural reading order for standard single-column pages) and detects tables, rendering them as Markdown tables and structured rows.
Returns one dataset item per page. url, page, totalPages, tableCount, and ok are always present; markdown and/or text + tables are included depending on outputFormat (the default, both, returns all of them).
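
To make the relationship between "structured rows" and "Markdown tables" concrete, here is a minimal Python sketch that renders one table's rows as a Markdown pipe table. It is an illustration only, not the Actor's actual renderer: the function name rows_to_markdown is hypothetical, and treating the first row as the header is an assumption based on the example output in this README.

```python
def rows_to_markdown(rows):
    """Render one table (a list of rows, header row first) as a Markdown pipe table.

    Assumption: the first row is the header, as in this README's example output.
    """
    if not rows:
        return ""
    header, *body = rows
    width = len(header)

    def fmt(cells):
        # Escape literal pipes, then pad or truncate ragged rows to the header width.
        cells = [str(c).replace("|", "\\|") for c in cells]
        cells = (cells + [""] * width)[:width]
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)


table = [["Region", "Revenue"], ["NA", "$4.1M"]]
print(rows_to_markdown(table))
```

For the two-row table above, this prints the same three-line Markdown table that appears in the markdown field of the example output.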

Use cases

RAG ingestion: turn reports, manuals, and whitepapers into clean, page-level Markdown chunks for a vector database.
LLM document Q&A: feed structured text and tables to an LLM without copy-paste cleanup.
Extract tables from PDF: pull tables out as both Markdown and structured rows.
Agent pipelines: chain it after a web crawler so an AI agent can read the PDFs it finds.
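
For the RAG ingestion case, the page-level items can be turned into chunk records before embedding. The sketch below is one possible approach, not part of the Actor: pages_to_chunks, the max_chars budget, and the id format are all hypothetical choices. It assumes the items include the markdown field (outputFormat set to markdown or both) and splits on blank lines so a table is never cut in half.

```python
def pages_to_chunks(items, max_chars=1500):
    """Group each page's Markdown blocks into chunks of roughly max_chars characters.

    A single block longer than max_chars is kept whole rather than split.
    """
    chunks = []
    for item in items:
        markdown = item.get("markdown")
        # Skip failed pages and pages that carry no Markdown.
        if not item.get("ok") or not markdown:
            continue
        blocks = [b.strip() for b in markdown.split("\n\n") if b.strip()]
        groups, buf = [], ""
        for block in blocks:
            if buf and len(buf) + 2 + len(block) > max_chars:
                groups.append(buf)
                buf = block
            else:
                buf = f"{buf}\n\n{block}" if buf else block
        if buf:
            groups.append(buf)
        for n, text in enumerate(groups):
            chunks.append({
                # Hypothetical id scheme: source URL plus page and chunk index.
                "id": f"{item['url']}#page={item['page']}&chunk={n}",
                "text": text,
                "metadata": {
                    "url": item["url"],
                    "page": item["page"],
                    "totalPages": item.get("totalPages"),
                },
            })
    return chunks
```

Each record keeps the url and page alongside the text, so a retrieved chunk can be traced back to the exact page it came from.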

Input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `pdfUrls` | array of URLs | (required) | Direct links to the PDFs to convert. |
| `extractTables` | boolean | `true` | Detect tables and render them as Markdown + structured rows. |
| `outputFormat` | `markdown` \| `json` \| `both` | `both` | What each result includes. |

Example input

{
  "pdfUrls": ["https://example.com/report.pdf"],
  "extractTables": true,
  "outputFormat": "both"
}
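
The same input can be built and sent from code. The following is a hedged sketch using the apify-client Python package: build_run_input and run_actor are hypothetical helper names, the validation rules simply mirror the input table above, and an Apify API token is assumed to be available. The Actor ID is the one named in this README.

```python
ALLOWED_FORMATS = {"markdown", "json", "both"}


def build_run_input(pdf_urls, extract_tables=True, output_format="both"):
    """Build the Actor input, checking it against the documented fields."""
    urls = [u.strip() for u in pdf_urls if u and u.strip()]
    if not urls:
        raise ValueError("pdfUrls is required and must contain at least one URL")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}")
    return {
        "pdfUrls": urls,
        "extractTables": bool(extract_tables),
        "outputFormat": output_format,
    }


def run_actor(run_input, token):
    """Run the Actor and return its dataset items (one per page).

    Requires `pip install apify-client`; imported lazily so that
    build_run_input works without the package installed.
    """
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor("basisweb/pdf-to-markdown-rag").call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Usage (needs network access and a valid token):
# items = run_actor(build_run_input(["https://example.com/report.pdf"]), "<APIFY_TOKEN>")
```

Validating locally before the call keeps a malformed input from starting a run at all.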

Example output (one item per page)

{
  "url": "https://example.com/report.pdf",
  "page": 1,
  "totalPages": 12,
  "markdown": "Q3 Report\nRevenue grew 18% YoY...\n\n| Region | Revenue |\n| --- | --- |\n| NA | $4.1M |",
  "text": "Q3 Report\nRevenue grew 18% YoY...",
  "tables": [[["Region", "Revenue"], ["NA", "$4.1M"]]],
  "tableCount": 1,
  "ok": true
}

The example above uses the default outputFormat: "both", so it includes every content field. ok is set to false on a failed URL or page, and the item then carries an error field explaining why. Pages with no extractable text or tables carry a note flag instead.
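
Downstream code usually needs to separate the three kinds of item described above. A minimal sketch follows, with some assumptions: classify_item and split_items are hypothetical names, and because the README does not specify the type of note or the value of ok on empty pages, the code checks note first and only relies on its truthiness.

```python
def classify_item(item):
    """Return 'empty', 'failed', or 'content' for one dataset item."""
    # Assumption: any truthy `note` marks a page with no extractable text or tables.
    if item.get("note"):
        return "empty"
    if not item.get("ok", False):
        return "failed"
    return "content"


def split_items(items):
    """Group dataset items by classification."""
    groups = {"content": [], "empty": [], "failed": []}
    for item in items:
        groups[classify_item(item)].append(item)
    return groups
```

A caller might embed only the content group, log the error field of each failed item, and queue the empty group for OCR elsewhere.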

Pricing (pay-per-event)

Run start: a small flat fee per run (Apify's built-in start event).
Page processed: charged per page that returns real content (text and/or tables).

Pages with no extractable text or tables are returned with a note and are NOT charged. Failed URLs and failed pages are reported with an error and are never charged.
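
The billing rule above can be checked against a run's output. The sketch below is a rough estimate, not Apify's billing logic: billable_pages and estimate_cost are hypothetical names, the per-page rate of $0.002 is taken from the listed "from $2.00 / 1,000 pages" and may differ on your plan, and the run start fee is not stated in this README, so it is a parameter that defaults to zero.

```python
from decimal import Decimal


def billable_pages(items):
    """Count pages that returned real content: ok is true and there is no note."""
    return sum(1 for it in items if it.get("ok") and not it.get("note"))


def estimate_cost(items, per_page_usd=Decimal("0.002"), start_fee_usd=Decimal("0")):
    """Estimate the run cost in USD from its dataset items.

    per_page_usd and start_fee_usd are assumptions; check the Actor's
    pricing tab for the actual event prices.
    """
    return start_fee_usd + per_page_usd * billable_pages(items)


items = [
    {"ok": True, "markdown": "page one"},
    {"ok": True, "markdown": "page two"},
    {"ok": True, "note": True},         # no extractable text: not charged
    {"ok": False, "error": "HTTP 404"}, # failed: never charged
]
print(billable_pages(items), estimate_cost(items))
```

Decimal is used instead of float so that small per-page amounts add up without rounding drift.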

Your spending limit is always respected: set a max cost per run and the Actor stops once it's reached.

Use it with AI agents (Apify MCP)

This Actor is available as a tool for AI agents through Apify's MCP server (mcp.apify.com). An agent can call basisweb/pdf-to-markdown-rag to convert a PDF to Markdown mid-task, then chain the page-level output into the next step. The only required input is pdfUrls, so an agent can invoke it in one shot, and the output schema tells the agent exactly which fields come back (markdown, text, tables, tableCount) before it spends a credit.

Honest notes

This handles digital, text-based PDFs. Scanned PDFs (image-only, no text layer) are not OCR'd in this version; those pages come back with a note instead of text and are not charged. OCR is planned for a future version.
Each PDF must be under 50 MB. Very large or table-heavy PDFs run best at 2 GB memory or higher (the 50 MB limit caps file size, not parsing memory).
You can also parse PDFs locally for free with open-source libraries. This is the no-setup, hosted, pay-per-page version for pipelines that just want it as an API.
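
Since each PDF must be under 50 MB, a pipeline may want to check file size before submitting a URL. This is a sketch under stated assumptions: the helper names are hypothetical, "50 MB" is read as 50,000,000 bytes (the stricter of the two common readings), and the server is assumed to answer HEAD requests with a Content-Length header, which not every host does.

```python
import urllib.request

# Assumption: 50 MB interpreted as 50 * 1000 * 1000 bytes (the stricter reading).
MAX_PDF_BYTES = 50 * 1000 * 1000


def under_size_limit(content_length, limit=MAX_PDF_BYTES):
    """Return True or False when the size is known, None when it is not."""
    if content_length is None:
        return None
    try:
        size = int(content_length)
    except (TypeError, ValueError):
        return None
    if size < 0:
        return None
    return size < limit


def head_content_length(url, timeout=10.0):
    """Fetch the Content-Length header with a HEAD request (may be absent)."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.headers.get("Content-Length")


# Usage (needs network access):
# ok = under_size_limit(head_content_length("https://example.com/report.pdf"))
```

A None result means the size could not be determined; in that case it is reasonable to submit the URL anyway and rely on the Actor's own error reporting.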

FAQ

Does it work on scanned PDFs? Not in this version. Image-only pages with no text layer come back with a note and are not charged. OCR is planned for a future version.

What does it return per page? One item per page with markdown and/or text + tables (depending on outputFormat), plus url, page, totalPages, tableCount, and ok.

How is it priced? A small per-run start fee plus a per-page fee, charged only for pages with real content. Blank, scanned, and failed pages are never charged.

Can I use it for RAG? Yes, that is the point. The Markdown is clean and page-scoped, so you can chunk and embed it directly.

How is this different from parsing PDFs locally? You can parse PDFs locally for free with open-source libraries. This is the no-setup, hosted, pay-per-page version for pipelines that just want it as an API.

Run locally

apify run

Deploy

apify login
apify push
