Pricing

$1.00 / 1,000 processed docs pages

Docs & Help Center to RAG JSONL

Paste a docs or help center URL and get clean Markdown, breadcrumbs, page records, and JSONL chunks for RAG.

Developer

Orbiscribe Labs

Last modified

2 months ago

Why use this instead of a generic crawler?

Generic website crawlers are useful when you need every reachable page. This Actor is for docs and help center ingestion, where the useful output is clean article text, stable chunks, breadcrumbs, and a quick page inventory.

Paste a docs or help center URL.
Keep the crawl focused with includeUrlPatterns.
Start with a small live default run.
Export DOCS_RAG_CHUNKS_JSONL directly to a vector pipeline.
Pay a low flat page price instead of guessing compute-unit cost.

What you get

Dataset rows for docs pages and chunks.
Clean Markdown, main text, headings, links, canonical URL, content hash, platform hint, and inferred breadcrumbs.
Key-value outputs: RAG_CHUNKS_JSONL, DOCS_RAG_CHUNKS_JSONL, DOC_PAGES, BREADCRUMB_INDEX, DOCS_URL_INVENTORY, DOCS_MARKDOWN_BUNDLE, BUYER_BRIEF, and RUN_SUMMARY.
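
As a rough sketch of consuming the DOCS_RAG_CHUNKS_JSONL output, the snippet below parses JSONL text into records, one JSON object per line. The field names in the sample (url, text) and the helper name iter_jsonl are assumptions for illustration only; the actual chunk schema is not specified here, so inspect a real export before relying on any key.

```python
import json
from typing import Iterator


def iter_jsonl(text: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of JSONL text."""
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        yield record


# Hypothetical sample; real chunk records may use different field names.
sample = (
    '{"url": "https://example.com/docs/a", "text": "First chunk"}\n'
    "\n"
    '{"url": "https://example.com/docs/b", "text": "Second chunk"}\n'
)
chunks = list(iter_jsonl(sample))
print(len(chunks))
```

Parsing line by line keeps memory use flat for large exports and makes a malformed record easy to locate by line number.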

Common workflows

Build a support bot from public help center articles.
Snapshot developer docs before loading them into a vector database.
Export docs pages with breadcrumb metadata for citations.
Schedule repeat docs crawls and compare artifacts downstream.

Input

Provide one or more docsUrls. Use maxDepth and sameDomainOnly to control crawl breadth. Use platformHint when you know the docs platform and want that stored with the output.
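
To make the scoping options concrete, here is a minimal sketch of how sameDomainOnly, includeUrlPatterns, and excludeUrlPatterns could combine to decide whether a discovered link is crawled. It assumes plain substring matching on the URL and exact hostname comparison; the Actor's real matching rules are not documented here and may differ. The function name should_crawl is hypothetical.

```python
from urllib.parse import urlparse


def should_crawl(url, start_url, include_patterns=(), exclude_patterns=(),
                 same_domain_only=True):
    """Return True if `url` is in scope under the assumed rules."""
    # Assumed rule: same domain means an identical hostname.
    if same_domain_only and urlparse(url).hostname != urlparse(start_url).hostname:
        return False
    # Assumed rule: any matching exclude pattern rejects the URL.
    if any(pattern in url for pattern in exclude_patterns):
        return False
    # Assumed rule: an empty include list means no include restriction.
    if include_patterns and not any(pattern in url for pattern in include_patterns):
        return False
    return True


start = "https://docs.apify.com/academy/getting-started"
print(should_crawl("https://docs.apify.com/academy/x", start, ["/academy/"]))
```

Under these assumptions, a narrow include pattern such as "/academy/" does most of the work of keeping a crawl inside one docs section, while maxDepth limits how far links are followed.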

The default input runs a tiny live Apify docs sample:

{
  "docsUrls": [
    { "url": "https://docs.apify.com/academy/getting-started" },
    { "url": "https://docs.apify.com/academy/web-scraping-for-beginners" },
    { "url": "https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/actor-description" }
  ],
  "includeUrlPatterns": ["/academy/"],
  "excludeUrlPatterns": [],
  "maxPages": 5,
  "maxDepth": 1,
  "sameDomainOnly": true,
  "dryRun": false
}

Use dryRun: true when you want bundled demo records without crawling live pages or incurring this Actor's custom pay-per-event charges.
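
For readers who start runs programmatically, the following sketch builds the same kind of input as a Python dict and shows where dryRun fits. The helper build_run_input is hypothetical, and the commented call at the end assumes the apify-client package with a placeholder token and Actor ID, neither of which is specified in this listing.

```python
def build_run_input(urls, include_patterns=None, max_pages=5, max_depth=1,
                    dry_run=False):
    """Assemble an input dict using the field names from the default input."""
    if not urls:
        raise ValueError("docsUrls needs at least one URL")
    return {
        "docsUrls": [{"url": u} for u in urls],
        "includeUrlPatterns": list(include_patterns or []),
        "excludeUrlPatterns": [],
        "maxPages": max_pages,
        "maxDepth": max_depth,
        "sameDomainOnly": True,
        "dryRun": dry_run,
    }


run_input = build_run_input(
    ["https://docs.apify.com/academy/getting-started"],
    include_patterns=["/academy/"],
    dry_run=True,  # bundled demo records, no live crawl
)

# Assumed usage with the apify-client package (not executed here):
# from apify_client import ApifyClient
# client = ApifyClient("<YOUR_APIFY_TOKEN>")
# run = client.actor("<ACTOR_ID>").call(run_input=run_input)
```

Starting with dry_run=True is a cheap way to check the output shape before switching to a live crawl with a small maxPages value.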

Pricing

Recommended monetization: Pay per Event at $0.001 per docs-rag-page.

That is $1 per 1,000 processed docs pages, plus normal Apify platform usage. When pay-per-event pricing is enabled, dry runs are uncharged and free-plan callers get the first 25 processed sources without this Actor's custom event charge. Users should still set Apify spending limits before large crawls.
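
The arithmetic above can be sketched as a small estimator. It covers only this Actor's custom event charge under the stated numbers ($0.001 per page, 25 free pages for free-plan callers, nothing for dry runs) and ignores normal Apify platform usage, which is billed separately. The function name is hypothetical, and treating "processed sources" and "processed docs pages" as the same count is an assumption.

```python
from decimal import Decimal

PRICE_PER_PAGE = Decimal("0.001")  # USD per docs-rag-page event, from the pricing text
FREE_PLAN_FREE_PAGES = 25          # free-plan allowance, from the pricing text


def estimate_event_charge(pages, free_plan=False, dry_run=False):
    """Estimate the custom event charge in USD (platform usage excluded)."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if dry_run:
        return Decimal("0")  # dry runs are not charged
    # Assumption: the free allowance is subtracted from the page count.
    billable = pages - FREE_PLAN_FREE_PAGES if free_plan else pages
    return PRICE_PER_PAGE * max(billable, 0)


print(estimate_event_charge(1000))
print(estimate_event_charge(1000, free_plan=True))
```

Decimal is used instead of float so that per-page prices add up without binary rounding drift.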

Limits and compliance

Public docs only. This Actor does not bypass logins, paywalls, robots policies, or access controls. Output quality depends on the site HTML structure and navigation.
