Docs To Rag

Transform documentation websites into RAG-ready chunks with semantic understanding, quality scoring, and direct vector database integration.

Pricing: Pay per usage
Rating: 0.0 (0 reviews)
Developer: Gabriel Antony Xaviour (Maintained by Community)
Actor stats: 0 bookmarks · 3 total users · 1 monthly active user · last modified 19 days ago

📚 DocsToRAG

Transform any documentation site into RAG-ready chunks with semantic understanding, quality scoring, and direct vector database integration.

Features · Quick Start · Configuration · Output · Vector DBs · Use Cases


What is DocsToRAG?

DocsToRAG crawls documentation websites and converts them into high-quality chunks optimized for Retrieval-Augmented Generation (RAG) systems. Unlike simple text splitters, it uses semantic chunking that preserves code blocks, lists, and document structure.

Key Benefits

| Feature | Description |
| --- | --- |
| 🧠 Semantic Chunking | Preserves code blocks with explanations, keeps lists intact, respects document hierarchy |
| Quality Scoring | Automatically filters boilerplate, navigation, and low-value content |
| 🔗 Vector DB Integration | Push directly to Pinecone, Supabase, or Qdrant |
| 📊 Rich Metadata | Content classification, complexity levels, topic extraction, chunk relationships |
| Embeddings Ready | Generate OpenAI embeddings in the same run |

Features

Semantic Chunking

Traditional chunking splits text by character count, breaking code blocks and sentences mid-way. DocsToRAG understands document structure:

  • Preserves code blocks with their surrounding explanations
  • Keeps lists and paragraphs intact as logical units
  • Respects document hierarchy with H1/H2/H3 sections
  • Smart overlap using complete semantic blocks, not raw tokens
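To make the idea concrete, here is a minimal sketch of structure-aware splitting in Python. It is not the Actor's implementation, only an illustration of the principle: fenced code blocks are never split, while blank lines separate the remaining prose into logical units.

```python
def semantic_blocks(markdown: str) -> list[str]:
    """Split markdown into logical blocks (paragraphs, lists, headings),
    keeping fenced code blocks whole instead of cutting them mid-way."""
    blocks: list[str] = []
    current: list[str] = []
    in_code = False
    for line in markdown.splitlines():
        if line.strip().startswith("```"):
            current.append(line)
            if in_code:  # closing fence: emit the whole code block as one unit
                blocks.append("\n".join(current))
                current = []
            in_code = not in_code
        elif in_code:
            current.append(line)  # never break inside a code block
        elif line.strip() == "":
            if current:  # blank line ends the current prose block
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks
```

A real semantic chunker would additionally attach each code block to its preceding explanation and merge blocks up to the target token budget.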

Quality Scoring

Every chunk receives a quality score (0-100) based on:

| Dimension | What It Measures |
| --- | --- |
| Information Density | Ratio of unique meaningful terms |
| Completeness | Proper sentence structure, punctuation |
| Code Quality | Language specified, meaningful length |
| Readability | Sentence length, clarity |

Quality Flags identify issues like `boilerplate`, `low_content`, and `navigation_text`.
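The sample chunk later in this README (overall 78 from dimensions 72, 85, 80, 75) is consistent with the overall score being a plain average of the four dimensions. A minimal sketch under that assumption follows; the Actor's actual weighting and flag thresholds are not published, so both are illustrative.

```python
def quality_score(density: int, completeness: int,
                  code_quality: int, readability: int) -> dict:
    """Combine the four documented dimensions into an overall 0-100 score.
    Equal weighting is an assumption, not the Actor's published formula."""
    dims = {
        "informationDensity": density,
        "completeness": completeness,
        "codeQuality": code_quality,
        "readability": readability,
    }
    overall = round(sum(dims.values()) / len(dims))
    flags = []
    if density < 30:  # hypothetical threshold for the low_content flag
        flags.append("low_content")
    return {"overall": overall, "dimensions": dims, "flags": flags}
```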

Enhanced Metadata

Each chunk includes rich metadata for better retrieval:

```json
{
  "contentType": "tutorial",
  "complexity": "beginner",
  "topics": ["CheerioCrawler", "RequestQueue"],
  "headingPath": "Quick Start > Installation",
  "prevChunkId": "chunk_abc123",
  "nextChunkId": "chunk_def456"
}
```

Quick Start

Basic Usage

Crawl a documentation site and output semantic chunks:

```json
{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "maxPages": 50,
  "chunkingStrategy": "semantic",
  "outputFormat": "jsonl"
}
```

With Quality Filter

Only output chunks scoring above 50:

```json
{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "maxPages": 100,
  "chunkingStrategy": "semantic",
  "enableQualityScoring": true,
  "minQualityScore": 50
}
```

With Embeddings

Generate OpenAI embeddings for each chunk:

```json
{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "chunkingStrategy": "semantic",
  "generateEmbeddings": true,
  "openaiApiKey": "sk-...",
  "embeddingModel": "text-embedding-3-small"
}
```

Full Pipeline (Crawl → Chunk → Embed → Store)

```json
{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "maxPages": 100,
  "chunkingStrategy": "semantic",
  "enableQualityScoring": true,
  "minQualityScore": 40,
  "generateEmbeddings": true,
  "openaiApiKey": "sk-...",
  "vectorDbProvider": "pinecone",
  "vectorDbConfig": {
    "apiKey": "your-pinecone-api-key",
    "indexName": "your-index-name"
  },
  "vectorDbNamespace": "docs-v1"
}
```

Input Configuration

Crawling Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `startUrls` | array | required | Documentation URLs to crawl |
| `maxDepth` | integer | 10 | How many levels deep to crawl |
| `maxPages` | integer | 1000 | Maximum pages to process |
| `includeGlobs` | array | | Only crawl URLs matching these patterns |
| `excludeGlobs` | array | | Skip URLs matching these patterns |

Chunking Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `chunkingStrategy` | string | `semantic` | `semantic` or `simple` |
| `chunkSize` | integer | 500 | Target chunk size in tokens |
| `chunkOverlap` | integer | 50 | Overlap between chunks |
| `splitByHeaders` | boolean | true | Create new chunks at H1/H2 (simple mode) |
| `includeCodeBlocks` | boolean | true | Include code snippets |

Quality Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `enableQualityScoring` | boolean | true | Enable quality scoring |
| `minQualityScore` | integer | 40 | Minimum quality score (0-100) |
| `includeQualityInMetadata` | boolean | true | Include scores in output |
| `enrichMetadata` | boolean | true | Add content classification |

Embedding Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `generateEmbeddings` | boolean | false | Generate OpenAI embeddings |
| `openaiApiKey` | string | | Your OpenAI API key |
| `embeddingModel` | string | `text-embedding-3-small` | Model to use |

Vector Database Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `vectorDbProvider` | string | `none` | `none`, `pinecone`, `supabase`, or `qdrant` |
| `vectorDbConfig` | object | | Provider-specific configuration |
| `vectorDbNamespace` | string | | Namespace/collection for vectors |
| `upsertBatchSize` | integer | 100 | Batch size for upserts |

Output Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `outputFormat` | string | `jsonl` | `json`, `jsonl`, or `csv` |

Output Schema

Chunk Structure

```json
{
  "id": "chunk_a1b2c3d4",
  "text": "## Installation\n\nTo install the package, run:\n\n```bash\nnpm install crawlee\n```",
  "tokenCount": 24,
  "metadata": {
    "sourceUrl": "https://crawlee.dev/docs/quick-start",
    "title": "Quick Start",
    "section": "Installation",
    "breadcrumbs": ["Docs", "Quick Start"],
    "chunkIndex": 2,
    "totalChunks": 8,
    "hierarchy": ["Quick Start", "Installation"],
    "headingLevel": 2,
    "hasCode": true,
    "codeLanguages": ["bash"],
    "contentType": "mixed",
    "prevChunkId": "chunk_x1y2z3w4",
    "nextChunkId": "chunk_e5f6g7h8",
    "quality": {
      "overall": 78,
      "dimensions": {
        "informationDensity": 72,
        "completeness": 85,
        "codeQuality": 80,
        "readability": 75
      },
      "flags": []
    }
  },
  "embedding": [0.123, -0.456, ...]
}
```

Run Summary (OUTPUT)

```json
{
  "summary": {
    "totalPages": 45,
    "totalChunks": 312,
    "uniqueChunks": 287,
    "avgChunkSize": 423,
    "embeddingsGenerated": true,
    "crawlDurationSec": 67,
    "vectorDb": {
      "provider": "pinecone",
      "namespace": "docs-v1",
      "upsertedCount": 287
    },
    "qualityStats": {
      "avgScore": 71,
      "scoreDistribution": {
        "excellent": 89,
        "good": 142,
        "fair": 56,
        "filtered": 25
      }
    }
  }
}
```

Vector Databases

Pinecone

```json
{
  "vectorDbProvider": "pinecone",
  "vectorDbConfig": {
    "apiKey": "your-pinecone-api-key",
    "indexName": "your-index-name"
  },
  "vectorDbNamespace": "docs-v1"
}
```

Supabase

```json
{
  "vectorDbProvider": "supabase",
  "vectorDbConfig": {
    "url": "https://your-project.supabase.co",
    "anonKey": "your-anon-key",
    "tableName": "documents"
  }
}
```

Required Supabase table schema:

```sql
CREATE TABLE documents (
  id TEXT PRIMARY KEY,
  content TEXT,
  metadata JSONB,
  embedding VECTOR(1536)
);
```

Qdrant

```json
{
  "vectorDbProvider": "qdrant",
  "vectorDbConfig": {
    "url": "https://your-cluster.qdrant.io",
    "apiKey": "your-qdrant-api-key",
    "collectionName": "docs"
  }
}
```

Environment Variables

Store API keys securely in Actor environment variables instead of input:

| Variable | Description |
| --- | --- |
| `OPENAI_API_KEY` | OpenAI API key for embeddings |
| `PINECONE_API_KEY` | Pinecone API key |
| `PINECONE_INDEX_NAME` | Default Pinecone index |
| `SUPABASE_URL` | Supabase project URL |
| `SUPABASE_ANON_KEY` | Supabase anonymous key |
| `QDRANT_URL` | Qdrant cluster URL |
| `QDRANT_API_KEY` | Qdrant API key |

With environment variables set, input simplifies to:

```json
{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "generateEmbeddings": true,
  "vectorDbProvider": "pinecone"
}
```

Use Cases

| Use Case | Description |
| --- | --- |
| RAG Applications | Build knowledge bases for AI assistants and chatbots |
| Documentation Search | Create semantic search indexes for docs sites |
| Training Data | Prepare high-quality documentation for fine-tuning |
| Content Analysis | Analyze documentation quality across projects |
| Knowledge Graphs | Extract structured information from docs |

Cost Estimation

| Component | Cost |
| --- | --- |
| Crawling | ~$0.001 per page (Apify compute) |
| Embeddings | ~$0.02 per 1M tokens (`text-embedding-3-small`) |
| Vector DB | Varies by provider |

Example: 100 pages → ~500 chunks → ~50K tokens → ~$0.10 total
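The arithmetic behind that example can be checked directly with the rates in the table above; at these prices, crawling dominates and the embedding cost is negligible:

```python
def estimate_cost(pages: int, embed_tokens: int,
                  page_cost: float = 0.001,
                  embed_cost_per_1m: float = 0.02) -> float:
    """Rough run cost: Apify compute per page plus OpenAI embedding tokens."""
    return pages * page_cost + embed_tokens / 1_000_000 * embed_cost_per_1m

# 100 pages and ~50K embedded tokens: $0.10 crawl + $0.001 embeddings ≈ $0.10
```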


FAQ

Q: What's the difference between semantic and simple chunking?

Simple chunking splits by character count with optional header breaks. Semantic chunking understands document structure—it keeps code blocks intact, preserves list items, and maintains paragraph coherence.

Q: How does quality scoring work?

Each chunk is scored 0-100 based on information density, completeness, code quality, and readability. Low-scoring content (boilerplate, navigation, cookie notices) is automatically filtered.

Q: Can I use my own embedding model?

Currently, only OpenAI embedding models are supported; the `embeddingModel` parameter accepts any OpenAI embedding model ID.

Q: How do I handle large documentation sites?

Use includeGlobs and excludeGlobs to target specific sections. Set appropriate maxPages limits. Consider running multiple times with different namespaces for different doc sections.
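For example, an input along these lines (the URL paths and glob patterns are illustrative) would restrict a crawl to one section of a large site:

```json
{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "includeGlobs": ["https://docs.example.com/api/**"],
  "excludeGlobs": ["**/changelog/**"],
  "maxPages": 200,
  "vectorDbNamespace": "docs-api-v1"
}
```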


Support

  • Issues: Report bugs or request features on GitHub
  • Documentation: See the full README in the Actor source
  • API: Use the Apify API to run this Actor programmatically
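As a sketch of the programmatic route, the official `apify-client` package for Python (`pip install apify-client`) can start an Actor run and read its dataset. The Actor ID and input below are placeholders mirroring the Quick Start example:

```python
def run_docs_to_rag(token: str, actor_id: str, run_input: dict) -> list:
    """Start an Actor run via the Apify API and return its dataset items."""
    from apify_client import ApifyClient  # pip install apify-client
    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)  # blocks until the run finishes
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())

# Illustrative input, same shape as the Quick Start example
run_input = {
    "startUrls": [{"url": "https://docs.example.com"}],
    "maxPages": 50,
    "chunkingStrategy": "semantic",
    "outputFormat": "jsonl",
}
```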

License

ISC