Pricing

$1.00 / 1,000 web or PDF sources processed

Website & PDF to RAG JSONL Crawler

Paste webpage and PDF URLs and get Markdown, JSONL chunks, PDF inventory, source warnings, and RAG-ready records.

Developer

Orbiscribe Labs

Last modified

2 months ago

Why use this instead of a generic crawler?

Generic website crawlers often stop at HTML or hide PDF extraction failures. This Actor makes PDFs first-class sources, keeps a PDF inventory, and emits warnings when a file has no machine-readable text.

Paste webpage and PDF URLs.
Keep the first crawl small with low live defaults.
Filter web paths with includeUrlPatterns.
Export MIXED_RAG_CHUNKS_JSONL for vector pipelines.
Inspect PDF_INVENTORY and PDF_WARNINGS before trusting the corpus.
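
As a rough illustration of how include and exclude patterns narrow a crawl, here is a minimal sketch. It assumes patterns are matched as plain substrings of the URL path, which this listing does not confirm; the Actor may use globs or regular expressions instead. The function name url_allowed is hypothetical.

```python
from urllib.parse import urlparse


def url_allowed(url, include_patterns, exclude_patterns=()):
    """Return True if the URL path passes the include/exclude filters.

    Assumption: patterns are plain substrings of the path. The Actor's
    real matching rules (glob, regex, ...) are not documented here.
    """
    path = urlparse(url).path
    if any(pattern in path for pattern in exclude_patterns):
        return False
    # An empty include list is treated as "allow everything" (assumption).
    if not include_patterns:
        return True
    return any(pattern in path for pattern in include_patterns)


candidates = [
    "https://docs.apify.com/academy/getting-started",
    "https://docs.apify.com/platform/actors",
]
kept = [u for u in candidates if url_allowed(u, ["/academy/"])]
print(kept)
```

Checking a handful of URLs this way before a live run is a cheap way to confirm that a pattern keeps the pages you expect.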

What you get

Dataset rows for web pages, PDF documents, and chunks.
Source type, discovered-from URL, Markdown, main text, content hash, word count, and extraction warnings.
Key-value outputs: RAG_CHUNKS_JSONL, MIXED_RAG_CHUNKS_JSONL, DOCUMENTS_JSONL, PDF_INVENTORY, PDF_WARNINGS, SOURCE_INVENTORY, MARKDOWN_BUNDLE, BUYER_BRIEF, and RUN_SUMMARY.

Common workflows

Build a knowledge base from product docs plus linked PDF manuals.
Convert vendor compliance pages and policy PDFs into one dataset.
Audit which PDFs were discovered and which lacked machine-readable text.
Export mixed-source JSONL for retrieval with source-type filtering.
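
The source-type filtering in the last workflow could look roughly like this once MIXED_RAG_CHUNKS_JSONL has been downloaded. The field names sourceType and text, and the values "web" and "pdf", are assumptions for illustration only; check a real record for the actual keys.

```python
import json


def filter_chunks(jsonl_text, source_type):
    """Parse JSONL text and keep records whose source type matches.

    Assumption: each record carries a "sourceType" key. The real key
    name in the Actor's output may differ.
    """
    records = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if record.get("sourceType") == source_type:
            records.append(record)
    return records


# Hypothetical sample in the assumed shape, not real Actor output.
sample = "\n".join([
    json.dumps({"sourceType": "web", "text": "Getting started guide"}),
    json.dumps({"sourceType": "pdf", "text": "Dummy PDF file"}),
])
pdf_chunks = filter_chunks(sample, "pdf")
print(len(pdf_chunks))
```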

Input

Provide startUrls, direct pdfUrls, or both. Keep discoverLinkedPdfs enabled to follow PDF links from fetched pages. maxPdfs is enforced globally across direct and discovered PDFs.
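
One way to picture the global maxPdfs limit is that direct and discovered PDF URLs share a single budget. The sketch below assumes direct URLs are taken first and duplicates are dropped; the Actor's actual ordering and deduplication are not stated here. select_pdfs is a hypothetical helper.

```python
def select_pdfs(direct_urls, discovered_urls, max_pdfs):
    """Apply one shared cap across direct and discovered PDF URLs."""
    selected = []
    seen = set()
    # Assumption: direct URLs are considered before discovered ones.
    for url in list(direct_urls) + list(discovered_urls):
        if len(selected) >= max_pdfs:
            break
        if url in seen:
            continue
        seen.add(url)
        selected.append(url)
    return selected


direct = ["https://example.com/a.pdf"]
discovered = [
    "https://example.com/a.pdf",
    "https://example.com/b.pdf",
    "https://example.com/c.pdf",
]
print(select_pdfs(direct, discovered, max_pdfs=2))
```

Under these assumptions, raising maxPdfs is the only way to reach PDFs that were discovered after the budget was spent.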

The default input runs a tiny live webpage and PDF sample:

{
  "startUrls": [{ "url": "https://docs.apify.com/academy/getting-started" }],
  "pdfUrls": [
    "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
  ],
  "includeUrlPatterns": ["/academy/"],
  "excludeUrlPatterns": [],
  "discoverLinkedPdfs": true,
  "maxPages": 1,
  "maxPdfs": 1,
  "dryRun": false
}

Set dryRun to true to get bundled demo records without fetching live sources or incurring custom pay-per-event charges.
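
For callers who assemble the input programmatically, a minimal sketch follows. It uses only field names from the sample input above; build_input is a hypothetical helper, and both the check that a live run needs at least one source list and the idea that a dry run needs none are assumptions.

```python
import json


def build_input(start_urls=(), pdf_urls=(), dry_run=False,
                max_pages=1, max_pdfs=1):
    """Build an input object using the fields from the sample input."""
    if not dry_run and not start_urls and not pdf_urls:
        # Assumption: a live run needs at least one source list.
        raise ValueError("Provide startUrls, pdfUrls, or both")
    return {
        "startUrls": [{"url": url} for url in start_urls],
        "pdfUrls": list(pdf_urls),
        "discoverLinkedPdfs": True,
        "maxPages": max_pages,
        "maxPdfs": max_pdfs,
        "dryRun": dry_run,
    }


# Dry run: bundled demo records, no live fetches.
demo_input = build_input(dry_run=True)
print(json.dumps(demo_input, indent=2))
```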

Pricing

Recommended monetization: Pay per Event at $0.001 per web-pdf-rag-source.

That is $1 per 1,000 processed webpages or PDFs, plus normal Apify platform usage. When pay-per-event pricing is enabled, dry runs are uncharged and free-plan callers get the first 25 processed sources without this Actor's custom event charge. Users should still set Apify spending limits before large crawls.
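
The arithmetic above can be sketched as follows. The sketch covers only this Actor's custom event charge, not Apify platform usage, and it assumes the 25-source free allowance is applied once per estimate, which the listing does not specify. estimate_event_charge is a hypothetical helper.

```python
from decimal import Decimal

PRICE_PER_SOURCE = Decimal("0.001")  # USD per web-pdf-rag-source event
FREE_PLAN_ALLOWANCE = 25             # sources exempt for free-plan callers


def estimate_event_charge(sources, free_plan=False, dry_run=False):
    """Estimate the custom pay-per-event charge in USD."""
    if dry_run:
        return Decimal("0")  # dry runs are uncharged
    billable = sources
    if free_plan:
        # Assumption: the allowance is applied once per estimate.
        billable = max(0, sources - FREE_PLAN_ALLOWANCE)
    return PRICE_PER_SOURCE * billable


print(estimate_event_charge(1000))                 # 1.000
print(estimate_event_charge(100, free_plan=True))  # 0.075
```

Decimal is used instead of float so that fractions of a cent add up exactly.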

Limits and compliance

Public webpages and PDFs only. This Actor does not bypass logins, paywalls, robots policies, or access controls. PDF extraction is for machine-readable text; OCR is not included in this MVP.
