Docs & Help Center to RAG JSONL

Pricing

$1.00 / 1,000 processed docs pages

Paste a docs or help center URL and get clean Markdown, breadcrumbs, page records, and JSONL chunks for RAG.

Developer: Orbiscribe Labs (maintained by Community)

Last modified: 2 days ago

Use this Actor when you need a chatbot-ready snapshot of public documentation, a help center, or a developer docs site.

It crawls from known docs URLs, extracts readable article content, keeps breadcrumb-style context, detects common documentation platforms, and writes Markdown plus JSONL chunks.

Why use this instead of a generic crawler?

Generic website crawlers are useful when you need every reachable page. This Actor is for docs and help center ingestion, where the useful output is clean article text, stable chunks, breadcrumbs, and a quick page inventory.

  • Paste a docs or help center URL.
  • Keep the crawl focused with includeUrlPatterns.
  • Start with the small live default run.
  • Export DOCS_RAG_CHUNKS_JSONL directly to a vector pipeline.
  • Pay a low flat per-page price instead of guessing compute-unit cost.

What you get

  • Dataset rows for docs pages and chunks.
  • Clean Markdown, main text, headings, links, canonical URL, content hash, platform hint, and inferred breadcrumbs.
  • Key-value outputs: RAG_CHUNKS_JSONL, DOCS_RAG_CHUNKS_JSONL, DOC_PAGES, BREADCRUMB_INDEX, DOCS_URL_INVENTORY, DOCS_MARKDOWN_BUNDLE, BUYER_BRIEF, and RUN_SUMMARY.
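Once a JSONL output such as DOCS_RAG_CHUNKS_JSONL has been downloaded as text, it can be read one record per line. The following is a minimal sketch, not the Actor's own loader. The field names in the sample (url, text) are hypothetical placeholders, since the exact chunk schema is not described here.

```python
import json
from typing import Iterator


def iter_jsonl(text: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL document."""
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number} is not a JSON object")
        yield record


# Hypothetical sample data; real chunk records may use different field names.
SAMPLE = (
    '{"url": "https://example.com/docs/a", "text": "First chunk"}\n'
    "\n"
    '{"url": "https://example.com/docs/b", "text": "Second chunk"}\n'
)

chunks = list(iter_jsonl(SAMPLE))
print(len(chunks), "chunks loaded")
```

Inspect the keys of the first real record before mapping fields into a vector store, rather than relying on the placeholder names used here.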

Common workflows

  • Build a support bot from public help center articles.
  • Snapshot developer docs before loading them into a vector database.
  • Export docs pages with breadcrumb metadata for citations.
  • Schedule repeat docs crawls and compare artifacts downstream.
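For the repeat-crawl workflow, two page snapshots can be compared by URL and content hash. This is a sketch under assumptions: the key names canonicalUrl and contentHash are hypothetical stand-ins for the canonical URL and content hash fields mentioned under "What you get", and the sample records are invented for illustration.

```python
def diff_snapshots(old_pages, new_pages,
                   url_key="canonicalUrl", hash_key="contentHash"):
    """Classify pages as added, removed, or changed between two snapshots.

    url_key and hash_key are assumed field names; adjust them to match
    the actual page records.
    """
    old = {page[url_key]: page[hash_key] for page in old_pages}
    new = {page[url_key]: page[hash_key] for page in new_pages}
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(url for url in shared if old[url] != new[url]),
    }


# Invented sample records for illustration only.
previous = [
    {"canonicalUrl": "https://example.com/docs/a", "contentHash": "h1"},
    {"canonicalUrl": "https://example.com/docs/b", "contentHash": "h2"},
]
current = [
    {"canonicalUrl": "https://example.com/docs/b", "contentHash": "h2-new"},
    {"canonicalUrl": "https://example.com/docs/c", "contentHash": "h3"},
]

report = diff_snapshots(previous, current)
print(report)
```

Only the pages listed under "added" and "changed" would need re-embedding downstream, which is the main reason to compare hashes instead of re-indexing every page.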

Input

Provide one or more docsUrls. Use maxDepth and sameDomainOnly to control crawl breadth. Use platformHint when you know the docs platform and want that stored with the output.

The default input runs a tiny live Apify docs sample:

{
  "docsUrls": [
    { "url": "https://docs.apify.com/academy/getting-started" },
    { "url": "https://docs.apify.com/academy/web-scraping-for-beginners" },
    { "url": "https://docs.apify.com/academy/actor-marketing-playbook/actor-basics/actor-description" }
  ],
  "includeUrlPatterns": ["/academy/"],
  "excludeUrlPatterns": [],
  "maxPages": 5,
  "maxDepth": 1,
  "sameDomainOnly": true,
  "dryRun": false
}
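The same input can be assembled in code before starting a run. The sketch below builds a dict with the keys from the default input above and adds a few sanity checks that are assumptions of this example, not documented Actor rules. The run_actor helper uses the third-party apify-client package; the token and actor_id arguments are placeholders that the caller must supply.

```python
from urllib.parse import urlparse


def build_run_input(docs_urls, include_patterns=(), exclude_patterns=(),
                    max_pages=5, max_depth=1, same_domain_only=True,
                    dry_run=False):
    """Return a run-input dict shaped like the default input shown above."""
    if not docs_urls:
        raise ValueError("provide at least one docs URL")
    for url in docs_urls:
        parsed = urlparse(url)
        # Assumed check: only absolute http(s) URLs make sense as crawl starts.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
    if max_pages < 1 or max_depth < 0:
        raise ValueError("max_pages must be >= 1 and max_depth must be >= 0")
    return {
        "docsUrls": [{"url": url} for url in docs_urls],
        "includeUrlPatterns": list(include_patterns),
        "excludeUrlPatterns": list(exclude_patterns),
        "maxPages": max_pages,
        "maxDepth": max_depth,
        "sameDomainOnly": same_domain_only,
        "dryRun": dry_run,
    }


def run_actor(token, actor_id, run_input):
    """Start a run and wait for it to finish (requires `pip install apify-client`)."""
    from apify_client import ApifyClient  # imported lazily; third-party package

    client = ApifyClient(token)
    return client.actor(actor_id).call(run_input=run_input)


run_input = build_run_input(
    ["https://docs.apify.com/academy/getting-started"],
    include_patterns=["/academy/"],
)
print(run_input["maxPages"], run_input["docsUrls"])
```

Keeping the input builder separate from the network call makes it easy to validate a configuration locally, or to flip dry_run on, before spending anything on a live crawl.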

Use dryRun: true when you want bundled demo records without crawling live pages or triggering custom pay-per-event charges.

Pricing

Recommended monetization: Pay per Event at $0.001 per docs-rag-page.

That is $1 per 1,000 processed docs pages, plus normal Apify platform usage. When pay-per-event pricing is enabled, dry runs are uncharged and free-plan callers get the first 25 processed sources without this Actor's custom event charge. Users should still set Apify spending limits before large crawls.
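The custom event charge can be estimated ahead of a crawl. This sketch uses the figures stated above ($0.001 per page, dry runs uncharged, first 25 uncharged for free-plan callers). It assumes that "processed sources" means processed pages and that the 25-page allowance applies to the run being estimated, neither of which is confirmed here. Normal Apify platform usage is not included.

```python
from decimal import Decimal

PRICE_PER_PAGE = Decimal("0.001")  # USD per docs-rag-page event, as stated above
FREE_PLAN_UNCHARGED = 25           # assumed to apply to the run being estimated


def estimate_event_charge(pages_processed, free_plan=False, dry_run=False):
    """Estimate the custom event charge in USD, excluding platform usage."""
    if pages_processed < 0:
        raise ValueError("pages_processed must not be negative")
    if dry_run:
        return Decimal("0")  # dry runs are uncharged
    billable = pages_processed
    if free_plan:
        billable = max(0, pages_processed - FREE_PLAN_UNCHARGED)
    return PRICE_PER_PAGE * billable


print(estimate_event_charge(1000))
print(estimate_event_charge(1000, free_plan=True))
```

Decimal is used instead of float so that per-page prices add up without binary rounding drift.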

Limits and compliance

Public docs only. This Actor does not bypass logins, paywalls, robots policies, or access controls. Output quality depends on the site HTML structure and navigation.