Docs To Rag

Transform documentation websites into RAG-ready chunks with semantic understanding, quality scoring, and direct vector database integration.

Pricing: Pay per usage
Rating: 0.0 (0 reviews)
Developer: Gabriel Antony Xaviour (Maintained by Community)
Actor stats: 0 bookmarks · 3 total users · 1 monthly active user · last modified 19 days ago

📚 DocsToRAG

Transform any documentation site into RAG-ready chunks with semantic understanding, quality scoring, and direct vector database integration.

Features · Quick Start · Configuration · Output · Vector DBs · Use Cases


What is DocsToRAG?

DocsToRAG crawls documentation websites and converts them into high-quality chunks optimized for Retrieval-Augmented Generation (RAG) systems. Unlike simple text splitters, it uses semantic chunking that preserves code blocks, lists, and document structure.

Key Benefits

| Feature | Description |
| --- | --- |
| 🧠 Semantic Chunking | Preserves code blocks with explanations, keeps lists intact, respects document hierarchy |
| Quality Scoring | Automatically filters boilerplate, navigation, and low-value content |
| 🔗 Vector DB Integration | Push directly to Pinecone, Supabase, or Qdrant |
| 📊 Rich Metadata | Content classification, complexity levels, topic extraction, chunk relationships |
| Embeddings Ready | Generate OpenAI embeddings in the same run |

Features

Semantic Chunking

Traditional chunking splits text by character count, breaking code blocks and sentences mid-way. DocsToRAG understands document structure:

  • Preserves code blocks with their surrounding explanations
  • Keeps lists and paragraphs intact as logical units
  • Respects document hierarchy with H1/H2/H3 sections
  • Smart overlap using complete semantic blocks, not raw tokens
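To make the idea concrete, here is a minimal sketch of structure-aware splitting in Python. It is not the Actor's implementation, only an illustration of the principle: fenced code blocks are never split, while blank lines separate the remaining prose into logical units.

```python
def semantic_blocks(markdown: str) -> list[str]:
    """Split markdown into logical blocks (paragraphs, lists, headings),
    keeping fenced code blocks whole instead of cutting them mid-way."""
    blocks: list[str] = []
    current: list[str] = []
    in_code = False
    for line in markdown.splitlines():
        if line.strip().startswith("```"):
            current.append(line)
            if in_code:  # closing fence: emit the whole code block as one unit
                blocks.append("\n".join(current))
                current = []
            in_code = not in_code
        elif in_code:
            current.append(line)  # never break inside a code block
        elif line.strip() == "":
            if current:  # blank line ends the current prose block
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks
```

A real semantic chunker would additionally attach each code block to its preceding explanation and merge blocks up to the target token budget.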

Quality Scoring

Every chunk receives a quality score (0-100) based on:

| Dimension | What It Measures |
| --- | --- |
| Information Density | Ratio of unique meaningful terms |
| Completeness | Proper sentence structure, punctuation |
| Code Quality | Language specified, meaningful length |
| Readability | Sentence length, clarity |

Quality Flags identify issues like `boilerplate`, `low_content`, and `navigation_text`.
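The sample chunk later in this README (overall 78 from dimensions 72, 85, 80, 75) is consistent with the overall score being a plain average of the four dimensions. A minimal sketch under that assumption follows; the Actor's actual weighting and flag thresholds are not published, so both are illustrative.

```python
def quality_score(density: int, completeness: int,
                  code_quality: int, readability: int) -> dict:
    """Combine the four documented dimensions into an overall 0-100 score.
    Equal weighting is an assumption, not the Actor's published formula."""
    dims = {
        "informationDensity": density,
        "completeness": completeness,
        "codeQuality": code_quality,
        "readability": readability,
    }
    overall = round(sum(dims.values()) / len(dims))
    flags = []
    if density < 30:  # hypothetical threshold for the low_content flag
        flags.append("low_content")
    return {"overall": overall, "dimensions": dims, "flags": flags}
```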

Enhanced Metadata

Each chunk includes rich metadata for better retrieval:

```json
{
  "contentType": "tutorial",
  "complexity": "beginner",
  "topics": ["CheerioCrawler", "RequestQueue"],
  "headingPath": "Quick Start > Installation",
  "prevChunkId": "chunk_abc123",
  "nextChunkId": "chunk_def456"
}
```

Quick Start

Basic Usage

Crawl a documentation site and output semantic chunks:

```json
{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "maxPages": 50,
  "chunkingStrategy": "semantic",
  "outputFormat": "jsonl"
}
```

With Quality Filter

Only output chunks scoring above 50:

```json
{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "maxPages": 100,
  "chunkingStrategy": "semantic",
  "enableQualityScoring": true,
  "minQualityScore": 50
}
```

With Embeddings

Generate OpenAI embeddings for each chunk:

```json
{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "chunkingStrategy": "semantic",
  "generateEmbeddings": true,
  "openaiApiKey": "sk-...",
  "embeddingModel": "text-embedding-3-small"
}
```

Full Pipeline (Crawl → Chunk → Embed → Store)

```json
{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "maxPages": 100,
  "chunkingStrategy": "semantic",
  "enableQualityScoring": true,
  "minQualityScore": 40,
  "generateEmbeddings": true,
  "openaiApiKey": "sk-...",
  "vectorDbProvider": "pinecone",
  "vectorDbConfig": {
    "apiKey": "your-pinecone-api-key",
    "indexName": "your-index-name"
  },
  "vectorDbNamespace": "docs-v1"
}
```

Input Configuration

Crawling Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `startUrls` | array | required | Documentation URLs to crawl |
| `maxDepth` | integer | 10 | How many levels deep to crawl |
| `maxPages` | integer | 1000 | Maximum pages to process |
| `includeGlobs` | array | | Only crawl URLs matching these patterns |
| `excludeGlobs` | array | | Skip URLs matching these patterns |

Chunking Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `chunkingStrategy` | string | `semantic` | `semantic` or `simple` |
| `chunkSize` | integer | 500 | Target chunk size in tokens |
| `chunkOverlap` | integer | 50 | Overlap between chunks |
| `splitByHeaders` | boolean | true | Create new chunks at H1/H2 (simple mode) |
| `includeCodeBlocks` | boolean | true | Include code snippets |

Quality Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `enableQualityScoring` | boolean | true | Enable quality scoring |
| `minQualityScore` | integer | 40 | Minimum quality score (0-100) |
| `includeQualityInMetadata` | boolean | true | Include scores in output |
| `enrichMetadata` | boolean | true | Add content classification |

Embedding Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `generateEmbeddings` | boolean | false | Generate OpenAI embeddings |
| `openaiApiKey` | string | | Your OpenAI API key |
| `embeddingModel` | string | `text-embedding-3-small` | Model to use |

Vector Database Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `vectorDbProvider` | string | `none` | `none`, `pinecone`, `supabase`, or `qdrant` |
| `vectorDbConfig` | object | | Provider-specific configuration |
| `vectorDbNamespace` | string | | Namespace/collection for vectors |
| `upsertBatchSize` | integer | 100 | Batch size for upserts |

Output Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `outputFormat` | string | `jsonl` | `json`, `jsonl`, or `csv` |

Output Schema

Chunk Structure

```json
{
  "id": "chunk_a1b2c3d4",
  "text": "## Installation\n\nTo install the package, run:\n\n```bash\nnpm install crawlee\n```",
  "tokenCount": 24,
  "metadata": {
    "sourceUrl": "https://crawlee.dev/docs/quick-start",
    "title": "Quick Start",
    "section": "Installation",
    "breadcrumbs": ["Docs", "Quick Start"],
    "chunkIndex": 2,
    "totalChunks": 8,
    "hierarchy": ["Quick Start", "Installation"],
    "headingLevel": 2,
    "hasCode": true,
    "codeLanguages": ["bash"],
    "contentType": "mixed",
    "prevChunkId": "chunk_x1y2z3w4",
    "nextChunkId": "chunk_e5f6g7h8",
    "quality": {
      "overall": 78,
      "dimensions": {
        "informationDensity": 72,
        "completeness": 85,
        "codeQuality": 80,
        "readability": 75
      },
      "flags": []
    }
  },
  "embedding": [0.123, -0.456, ...]
}
```

Run Summary (OUTPUT)

```json
{
  "summary": {
    "totalPages": 45,
    "totalChunks": 312,
    "uniqueChunks": 287,
    "avgChunkSize": 423,
    "embeddingsGenerated": true,
    "crawlDurationSec": 67,
    "vectorDb": {
      "provider": "pinecone",
      "namespace": "docs-v1",
      "upsertedCount": 287
    },
    "qualityStats": {
      "avgScore": 71,
      "scoreDistribution": {
        "excellent": 89,
        "good": 142,
        "fair": 56,
        "filtered": 25
      }
    }
  }
}
```

Vector Databases

Pinecone

```json
{
  "vectorDbProvider": "pinecone",
  "vectorDbConfig": {
    "apiKey": "your-pinecone-api-key",
    "indexName": "your-index-name"
  },
  "vectorDbNamespace": "docs-v1"
}
```

Supabase

```json
{
  "vectorDbProvider": "supabase",
  "vectorDbConfig": {
    "url": "https://your-project.supabase.co",
    "anonKey": "your-anon-key",
    "tableName": "documents"
  }
}
```

Required Supabase table schema:

```sql
CREATE TABLE documents (
  id TEXT PRIMARY KEY,
  content TEXT,
  metadata JSONB,
  embedding VECTOR(1536)
);
```

Qdrant

```json
{
  "vectorDbProvider": "qdrant",
  "vectorDbConfig": {
    "url": "https://your-cluster.qdrant.io",
    "apiKey": "your-qdrant-api-key",
    "collectionName": "docs"
  }
}
```

Environment Variables

Store API keys securely in Actor environment variables instead of input:

| Variable | Description |
| --- | --- |
| `OPENAI_API_KEY` | OpenAI API key for embeddings |
| `PINECONE_API_KEY` | Pinecone API key |
| `PINECONE_INDEX_NAME` | Default Pinecone index |
| `SUPABASE_URL` | Supabase project URL |
| `SUPABASE_ANON_KEY` | Supabase anonymous key |
| `QDRANT_URL` | Qdrant cluster URL |
| `QDRANT_API_KEY` | Qdrant API key |

With environment variables set, input simplifies to:

```json
{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "generateEmbeddings": true,
  "vectorDbProvider": "pinecone"
}
```

Use Cases

| Use Case | Description |
| --- | --- |
| RAG Applications | Build knowledge bases for AI assistants and chatbots |
| Documentation Search | Create semantic search indexes for docs sites |
| Training Data | Prepare high-quality documentation for fine-tuning |
| Content Analysis | Analyze documentation quality across projects |
| Knowledge Graphs | Extract structured information from docs |

Cost Estimation

| Component | Cost |
| --- | --- |
| Crawling | ~$0.001 per page (Apify compute) |
| Embeddings | ~$0.02 per 1M tokens (`text-embedding-3-small`) |
| Vector DB | Varies by provider |

Example: 100 pages → ~500 chunks → ~50K tokens → ~$0.10 total
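The arithmetic behind that example can be checked directly with the rates in the table above; at these prices, crawling dominates and the embedding cost is negligible:

```python
def estimate_cost(pages: int, embed_tokens: int,
                  page_cost: float = 0.001,
                  embed_cost_per_1m: float = 0.02) -> float:
    """Rough run cost: Apify compute per page plus OpenAI embedding tokens."""
    return pages * page_cost + embed_tokens / 1_000_000 * embed_cost_per_1m

# 100 pages and ~50K embedded tokens: $0.10 crawl + $0.001 embeddings ≈ $0.10
```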


FAQ

Q: What's the difference between semantic and simple chunking?

Simple chunking splits by character count with optional header breaks. Semantic chunking understands document structure—it keeps code blocks intact, preserves list items, and maintains paragraph coherence.

Q: How does quality scoring work?

Each chunk is scored 0-100 based on information density, completeness, code quality, and readability. Low-scoring content (boilerplate, navigation, cookie notices) is automatically filtered.

Q: Can I use my own embedding model?

Currently, only OpenAI embedding models are supported; the `embeddingModel` parameter accepts any OpenAI embedding model ID.

Q: How do I handle large documentation sites?

Use includeGlobs and excludeGlobs to target specific sections. Set appropriate maxPages limits. Consider running multiple times with different namespaces for different doc sections.
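For example, an input along these lines (the URL paths and glob patterns are illustrative) would restrict a crawl to one section of a large site:

```json
{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "includeGlobs": ["https://docs.example.com/api/**"],
  "excludeGlobs": ["**/changelog/**"],
  "maxPages": 200,
  "vectorDbNamespace": "docs-api-v1"
}
```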


Support

  • Issues: Report bugs or request features on GitHub
  • Documentation: See the full README in the Actor source
  • API: Use the Apify API to run this Actor programmatically
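As a sketch of the programmatic route, the official `apify-client` package for Python (`pip install apify-client`) can start an Actor run and read its dataset. The Actor ID and input below are placeholders mirroring the Quick Start example:

```python
def run_docs_to_rag(token: str, actor_id: str, run_input: dict) -> list:
    """Start an Actor run via the Apify API and return its dataset items."""
    from apify_client import ApifyClient  # pip install apify-client
    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)  # blocks until the run finishes
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())

# Illustrative input, same shape as the Quick Start example
run_input = {
    "startUrls": [{"url": "https://docs.example.com"}],
    "maxPages": 50,
    "chunkingStrategy": "semantic",
    "outputFormat": "jsonl",
}
```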

License

ISC