Website & PDF to RAG JSONL Crawler

Pricing

$1.00 / 1,000 web or PDF sources processed

Paste webpage and PDF URLs and get Markdown, JSONL chunks, PDF inventory, source warnings, and RAG-ready records.

Rating: 0.0 (0)

Developer: Orbiscribe Labs (maintained by Community)

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 2 days ago

Use this Actor when the useful knowledge is split between webpages and linked PDF manuals, whitepapers, policy documents, reports, or help files.

It fetches public pages, optionally discovers linked PDFs, extracts machine-readable PDF text, and writes one mixed-source RAG dataset in which each record carries its source type and any extraction warnings.

Why use this instead of a generic crawler?

Generic website crawlers often stop at HTML or hide PDF extraction failures. This Actor makes PDFs first-class sources, keeps a PDF inventory, and emits warnings when a file has no machine-readable text.

  • Paste webpage and PDF URLs.
  • Keep the first crawl small: the default input caps the run at maxPages: 1 and maxPdfs: 1.
  • Filter web paths with includeUrlPatterns.
  • Export MIXED_RAG_CHUNKS_JSONL for vector pipelines.
  • Inspect PDF_INVENTORY and PDF_WARNINGS before trusting the corpus.

What you get

  • Dataset rows for web pages, PDF documents, and chunks.
  • Source type, discovered-from URL, Markdown, main text, content hash, word count, and extraction warnings.
  • Key-value outputs: RAG_CHUNKS_JSONL, MIXED_RAG_CHUNKS_JSONL, DOCUMENTS_JSONL, PDF_INVENTORY, PDF_WARNINGS, SOURCE_INVENTORY, MARKDOWN_BUNDLE, BUYER_BRIEF, and RUN_SUMMARY.

Common workflows

  • Build a knowledge base from product docs plus linked PDF manuals.
  • Convert vendor compliance pages and policy PDFs into one dataset.
  • Audit which PDFs were discovered and which lacked machine-readable text.
  • Export mixed-source JSONL for retrieval with source-type filtering.
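
As a rough illustration of the source-type filtering workflow, the sketch below parses JSONL text and groups chunks by source type. The record layout is an assumption: the field names sourceType, url, text, and warnings, and the warning value shown, are hypothetical and should be checked against a real MIXED_RAG_CHUNKS_JSONL export before relying on them.

```python
import json
from collections import defaultdict


def load_chunks(jsonl_text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def group_by_source_type(chunks, key="sourceType"):
    """Group chunk records by source type ("sourceType" is an assumed field name)."""
    grouped = defaultdict(list)
    for chunk in chunks:
        grouped[chunk.get(key, "unknown")].append(chunk)
    return dict(grouped)


# Hypothetical sample records standing in for a downloaded export.
SAMPLE = "\n".join([
    json.dumps({"sourceType": "web", "url": "https://example.com/docs",
                "text": "Install the tool.", "warnings": []}),
    json.dumps({"sourceType": "pdf", "url": "https://example.com/manual.pdf",
                "text": "Chapter 1.", "warnings": []}),
    json.dumps({"sourceType": "pdf", "url": "https://example.com/scan.pdf",
                "text": "", "warnings": ["no_machine_readable_text"]}),
])

chunks = load_chunks(SAMPLE)
grouped = group_by_source_type(chunks)

# Keep only PDF chunks that carry no extraction warnings.
clean_pdf_chunks = [c for c in grouped.get("pdf", []) if not c.get("warnings")]

for chunk in clean_pdf_chunks:
    print(chunk["url"])
```

Dropping records that carry warnings before embedding is one way to keep scanned PDFs without machine-readable text out of a vector index.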

Input

Provide startUrls, direct pdfUrls, or both. Keep discoverLinkedPdfs enabled to follow PDF links from fetched pages. maxPdfs is enforced globally across direct and discovered PDFs.

The default input runs a tiny live webpage and PDF sample:

{
  "startUrls": [{ "url": "https://docs.apify.com/academy/getting-started" }],
  "pdfUrls": [
    "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"
  ],
  "includeUrlPatterns": ["/academy/"],
  "excludeUrlPatterns": [],
  "discoverLinkedPdfs": true,
  "maxPages": 1,
  "maxPdfs": 1,
  "dryRun": false
}

Use dryRun: true when you want bundled demo records without fetching live sources or incurring this Actor's pay-per-event charges.
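
When the input is generated from a script, a small helper can enforce the "startUrls, pdfUrls, or both" rule before a run is started. This is a minimal sketch: the field names come from the default input above, but build_run_input and its parameters are hypothetical, and the Actor may apply further validation of its own.

```python
import json


def build_run_input(start_urls=(), pdf_urls=(), include_patterns=(),
                    max_pages=1, max_pdfs=1, dry_run=False):
    """Build an input dict shaped like the default input (hypothetical helper)."""
    if not start_urls and not pdf_urls:
        raise ValueError("Provide startUrls, pdfUrls, or both.")
    return {
        "startUrls": [{"url": url} for url in start_urls],
        "pdfUrls": list(pdf_urls),
        "includeUrlPatterns": list(include_patterns),
        "excludeUrlPatterns": [],
        "discoverLinkedPdfs": True,
        "maxPages": max_pages,
        "maxPdfs": max_pdfs,
        "dryRun": dry_run,
    }


# A dry-run input that keeps the small default limits.
run_input = build_run_input(
    start_urls=["https://docs.apify.com/academy/getting-started"],
    include_patterns=["/academy/"],
    dry_run=True,
)
print(json.dumps(run_input, indent=2))
```

The printed JSON can be pasted into the Actor's input editor; raising maxPages and maxPdfs only after a small run looks right keeps the first crawl cheap.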

Pricing

Recommended monetization: Pay per Event at $0.001 per web-pdf-rag-source.

That is $1 per 1,000 processed webpages or PDFs, plus normal Apify platform usage. When pay-per-event pricing is enabled, dry runs are uncharged and free-plan callers get the first 25 processed sources without this Actor's custom event charge. Users should still set Apify spending limits before large crawls.
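
The arithmetic above can be sketched as a small estimator. It covers only this Actor's custom event charge, not Apify platform usage, and it assumes the 25 free sources are subtracted once from the total passed in; how that allowance is actually counted (per run, per month, or per account) is not stated here. The function and constant names are hypothetical.

```python
PRICE_PER_SOURCE_USD = 0.001   # $0.001 per web-pdf-rag-source event
FREE_PLAN_FREE_SOURCES = 25    # assumed to be subtracted once from the total


def estimate_event_charge(sources_processed, free_plan=False, dry_run=False):
    """Estimate the custom event charge in USD (excludes platform usage)."""
    if dry_run:
        return 0.0  # dry runs are uncharged
    billable = sources_processed
    if free_plan:
        billable = max(0, sources_processed - FREE_PLAN_FREE_SOURCES)
    return round(billable * PRICE_PER_SOURCE_USD, 6)


print(estimate_event_charge(1000))                 # paid plan, 1,000 sources
print(estimate_event_charge(100, free_plan=True))  # free plan, 75 billable sources
print(estimate_event_charge(500, dry_run=True))    # dry run
```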

Limits and compliance

Public webpages and PDFs only. This Actor does not bypass logins, paywalls, robots policies, or access controls. PDF extraction is for machine-readable text; OCR is not included in this MVP.