Website & PDF to RAG JSONL Crawler
Pricing
$1.00 / 1,000 web or PDF sources processed
Paste webpage and PDF URLs and get Markdown, JSONL chunks, PDF inventory, source warnings, and RAG-ready records.
Developer
Orbiscribe Labs
Maintained by Community
Use this Actor when the useful knowledge is split between webpages and linked PDF manuals, whitepapers, policy documents, reports, or help files.
It fetches public pages, optionally discovers linked PDFs, extracts machine-readable PDF text, and writes one mixed-source RAG dataset with source type and warnings.
Why use this instead of a generic crawler?
Generic website crawlers often stop at HTML or hide PDF extraction failures. This Actor makes PDFs first-class sources, keeps a PDF inventory, and emits warnings when a file has no machine-readable text.
- paste webpage and PDF URLs
- keep the first crawl small with low live defaults
- filter web paths with includeUrlPatterns
- export MIXED_RAG_CHUNKS_JSONL for vector pipelines
- inspect PDF_INVENTORY and PDF_WARNINGS before trusting the corpus
What you get
- Dataset rows for web pages, PDF documents, and chunks.
- Source type, discovered-from URL, Markdown, main text, content hash, word count, and extraction warnings.
- Key-value outputs: RAG_CHUNKS_JSONL, MIXED_RAG_CHUNKS_JSONL, DOCUMENTS_JSONL, PDF_INVENTORY, PDF_WARNINGS, SOURCE_INVENTORY, MARKDOWN_BUNDLE, BUYER_BRIEF, and RUN_SUMMARY.
Common workflows
- Build a knowledge base from product docs plus linked PDF manuals.
- Convert vendor compliance pages and policy PDFs into one dataset.
- Audit which PDFs were discovered and which lacked machine-readable text.
- Export mixed-source JSONL for retrieval with source-type filtering.
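The last two workflows can be sketched in a few lines of Python once MIXED_RAG_CHUNKS_JSONL has been downloaded. This is a minimal sketch, not the Actor's documented schema: the field names sourceType and warnings, the values "web" and "pdf", and the two sample records are all assumptions made up for illustration. Check a real export for the actual keys before relying on them.

```python
import json


def load_records(jsonl_text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def filter_by_source_type(records, source_type):
    """Keep records whose (assumed) 'sourceType' field matches."""
    return [r for r in records if r.get("sourceType") == source_type]


def records_with_warnings(records):
    """Keep records whose (assumed) 'warnings' field is non-empty."""
    return [r for r in records if r.get("warnings")]


# Made-up sample rows standing in for a downloaded MIXED_RAG_CHUNKS_JSONL file.
sample_jsonl = "\n".join([
    json.dumps({"sourceType": "web", "url": "https://example.com/docs",
                "text": "Getting started", "warnings": []}),
    json.dumps({"sourceType": "pdf", "url": "https://example.com/manual.pdf",
                "text": "", "warnings": ["no machine-readable text"]}),
])

records = load_records(sample_jsonl)
pdf_records = filter_by_source_type(records, "pdf")
flagged = records_with_warnings(records)

print(len(pdf_records), "PDF record(s)")
for record in flagged:
    print("check before indexing:", record["url"])
```

Filtering on the source type before embedding keeps scanned or empty PDFs out of the vector store until their warnings have been reviewed.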
Input
Provide startUrls, direct pdfUrls, or both. Keep discoverLinkedPdfs enabled to follow PDF links from fetched pages. maxPdfs is enforced globally across direct and discovered PDFs.
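To make "enforced globally" concrete, the following is a rough sketch of one way a single cap can apply across both PDF sources. It is not the Actor's implementation. The function name select_pdfs, the choice to process direct URLs before discovered ones, and the de-duplication step are assumptions for illustration only.

```python
def select_pdfs(direct_pdf_urls, discovered_pdf_urls, max_pdfs):
    """Apply one shared cap to direct and discovered PDF URLs.

    Assumption: direct URLs are considered first and duplicates are
    counted once. The Actor's real ordering is not documented here.
    """
    selected = []
    seen = set()
    for url in list(direct_pdf_urls) + list(discovered_pdf_urls):
        if len(selected) >= max_pdfs:
            break  # the cap is shared, so discovered PDFs cannot exceed it
        if url in seen:
            continue
        seen.add(url)
        selected.append(url)
    return selected


direct = ["https://example.com/a.pdf"]
discovered = ["https://example.com/a.pdf", "https://example.com/b.pdf",
              "https://example.com/c.pdf"]
print(select_pdfs(direct, discovered, max_pdfs=2))
```

Under this reading, a run with maxPdfs set to 2 and one direct PDF leaves room for only one discovered PDF.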
The default input runs a tiny live webpage and PDF sample:
{
  "startUrls": [{ "url": "https://docs.apify.com/academy/getting-started" }],
  "pdfUrls": ["https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"],
  "includeUrlPatterns": ["/academy/"],
  "excludeUrlPatterns": [],
  "discoverLinkedPdfs": true,
  "maxPages": 1,
  "maxPdfs": 1,
  "dryRun": false
}
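The same input can be assembled and submitted from Python. The sketch below assumes the third-party apify-client package is installed. The helper names build_run_input and run_actor are hypothetical, and the Actor ID and API token must be supplied by the caller, since neither appears in this listing. Only the input field names come from the default input above.

```python
import json


def build_run_input(start_urls=(), pdf_urls=(), include_patterns=(),
                    exclude_patterns=(), discover_linked_pdfs=True,
                    max_pages=1, max_pdfs=1, dry_run=False):
    """Build an input dict with the same fields as the default input."""
    if not start_urls and not pdf_urls:
        raise ValueError("Provide startUrls, pdfUrls, or both")
    return {
        "startUrls": [{"url": url} for url in start_urls],
        "pdfUrls": list(pdf_urls),
        "includeUrlPatterns": list(include_patterns),
        "excludeUrlPatterns": list(exclude_patterns),
        "discoverLinkedPdfs": discover_linked_pdfs,
        "maxPages": max_pages,
        "maxPdfs": max_pdfs,
        "dryRun": dry_run,
    }


def run_actor(token, actor_id, run_input):
    """Start the Actor, wait for it to finish, and return its dataset rows.

    Assumes `pip install apify-client`; actor_id is a placeholder the
    caller must fill in from the Actor's store page.
    """
    from apify_client import ApifyClient  # imported lazily: optional dependency

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


default_input = build_run_input(
    start_urls=["https://docs.apify.com/academy/getting-started"],
    pdf_urls=["https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf"],
    include_patterns=["/academy/"],
)
print(json.dumps(default_input, indent=2))
```

Keeping maxPages and maxPdfs at 1 for the first call mirrors the small defaults recommended above. Raise them only after inspecting the first run's output.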
Use dryRun: true when you want bundled demo records without fetching live sources or incurring custom pay-per-event charges.
Pricing
Recommended monetization: Pay per Event at $0.001 per web-pdf-rag-source.
That is $1 per 1,000 processed webpages or PDFs, plus normal Apify platform usage. When pay-per-event pricing is enabled, dry runs are uncharged and free-plan callers get the first 25 processed sources without this Actor's custom event charge. Users should still set Apify spending limits before large crawls.
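The custom event charge can be estimated with a small Python helper. This is a sketch under stated assumptions: the function name estimate_event_charge is hypothetical, the 25-source free-plan allowance is treated as a single pool because the text does not say whether it applies per run or per account, and normal Apify platform usage is not included.

```python
from decimal import Decimal

PRICE_PER_SOURCE = Decimal("0.001")  # $0.001 per processed webpage or PDF
FREE_PLAN_ALLOWANCE = 25             # uncharged sources for free-plan callers


def estimate_event_charge(sources, free_plan=False, dry_run=False):
    """Estimate the Actor's custom event charge in USD.

    Excludes normal Apify platform usage, which is billed separately.
    """
    if sources < 0:
        raise ValueError("sources must be non-negative")
    if dry_run:
        return Decimal("0")  # dry runs are uncharged
    billable = sources - FREE_PLAN_ALLOWANCE if free_plan else sources
    return max(billable, 0) * PRICE_PER_SOURCE


print(estimate_event_charge(1000))                  # paid plan, 1,000 sources
print(estimate_event_charge(1000, free_plan=True))  # free plan, 975 billable
```

Decimal is used instead of float so that fractions of a cent add up exactly.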
Limits and compliance
Public webpages and PDFs only. This Actor does not bypass logins, paywalls, robots policies, or access controls. PDF extraction is for machine-readable text; OCR is not included in this MVP.