Docs To Rag
Pricing: Pay per usage
Developer: Gabriel Antony Xaviour
Last modified: 19 days ago
📚 DocsToRAG
Transform any documentation site into RAG-ready chunks with semantic understanding, quality scoring, and direct vector database integration.
Features • Quick Start • Configuration • Output • Vector DBs • Use Cases
What is DocsToRAG?
DocsToRAG crawls documentation websites and converts them into high-quality chunks optimized for Retrieval-Augmented Generation (RAG) systems. Unlike simple text splitters, it uses semantic chunking that preserves code blocks, lists, and document structure.
Key Benefits
| Feature | Description |
|---|---|
| 🧠 Semantic Chunking | Preserves code blocks with explanations, keeps lists intact, respects document hierarchy |
| ⭐ Quality Scoring | Automatically filters boilerplate, navigation, and low-value content |
| 🔗 Vector DB Integration | Push directly to Pinecone, Supabase, or Qdrant |
| 📊 Rich Metadata | Content classification, complexity levels, topic extraction, chunk relationships |
| ⚡ Embeddings Ready | Generate OpenAI embeddings in the same run |
Features
Semantic Chunking
Traditional chunking splits text by character count, breaking code blocks and sentences mid-way. DocsToRAG understands document structure:
- Preserves code blocks with their surrounding explanations
- Keeps lists and paragraphs intact as logical units
- Respects document hierarchy with H1/H2/H3 sections
- Smart overlap using complete semantic blocks, not raw tokens
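The behaviour described above can be sketched as a two-pass algorithm: first split the markdown into semantic blocks (fenced code stays whole, prose breaks on blank lines), then greedily pack whole blocks into chunks up to a token budget. This is a minimal illustration, not the Actor's actual implementation; the word-count token estimate is a deliberate simplification.

```python
def split_blocks(markdown: str) -> list[str]:
    """Split markdown into semantic blocks: fenced code stays whole,
    everything else breaks on blank lines."""
    blocks, buf, in_code = [], [], False
    for line in markdown.splitlines():
        if line.startswith("```"):
            buf.append(line)
            if in_code:  # closing fence ends the code block
                blocks.append("\n".join(buf))
                buf = []
            in_code = not in_code
        elif not in_code and line.strip() == "":
            if buf:
                blocks.append("\n".join(buf))
                buf = []
        else:
            buf.append(line)
    if buf:
        blocks.append("\n".join(buf))
    return blocks

def pack_chunks(blocks: list[str], max_tokens: int = 500) -> list[str]:
    """Greedily pack whole blocks into chunks; a block is never split."""
    chunks, current, size = [], [], 0
    for block in blocks:
        tokens = len(block.split())  # crude token estimate
        if current and size + tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(block)
        size += tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because packing works on whole blocks, a code fence can never be cut in half the way a character-count splitter would cut it.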
Quality Scoring
Every chunk receives a quality score (0-100) based on:
| Dimension | What It Measures |
|---|---|
| Information Density | Ratio of unique meaningful terms |
| Completeness | Proper sentence structure, punctuation |
| Code Quality | Language specified, meaningful length |
| Readability | Sentence length, clarity |
Quality Flags identify issues like boilerplate, low_content, navigation_text.
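The Actor does not publish its exact formulas, but two of the dimensions above can be approximated with a toy scorer like the following. The heuristics and the unweighted average are guesses for illustration only, not the real scoring code:

```python
def information_density(text: str) -> int:
    """Score 0-100: ratio of unique meaningful terms to total terms."""
    words = [w.lower().strip(".,:;!?") for w in text.split()]
    words = [w for w in words if len(w) > 2]  # drop very short tokens
    if not words:
        return 0
    return round(100 * len(set(words)) / len(words))

def completeness(text: str) -> int:
    """Score 0-100: reward sentences that start with a capital letter."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if not sentences:
        return 0
    capitalised = sum(1 for s in sentences if s[0].isupper())
    return round(100 * capitalised / len(sentences))

def overall_score(text: str) -> int:
    """Unweighted mean of the dimension scores (real weights unknown)."""
    return round((information_density(text) + completeness(text)) / 2)
```

Even this crude version separates content from filler: repetitive navigation text scores low on information density, while a complete instructional sentence scores high on both dimensions.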
Enhanced Metadata
Each chunk includes rich metadata for better retrieval:
```json
{
  "contentType": "tutorial",
  "complexity": "beginner",
  "topics": ["CheerioCrawler", "RequestQueue"],
  "headingPath": "Quick Start > Installation",
  "prevChunkId": "chunk_abc123",
  "nextChunkId": "chunk_def456"
}
```

Quick Start
Basic Usage
Crawl a documentation site and output semantic chunks:
```json
{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "maxPages": 50,
  "chunkingStrategy": "semantic",
  "outputFormat": "jsonl"
}
```
With Quality Filter
Only output chunks scoring above 50:
```json
{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "maxPages": 100,
  "chunkingStrategy": "semantic",
  "enableQualityScoring": true,
  "minQualityScore": 50
}
```
With Embeddings
Generate OpenAI embeddings for each chunk:
```json
{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "chunkingStrategy": "semantic",
  "generateEmbeddings": true,
  "openaiApiKey": "sk-...",
  "embeddingModel": "text-embedding-3-small"
}
```
Full Pipeline (Crawl → Chunk → Embed → Store)
```json
{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "maxPages": 100,
  "chunkingStrategy": "semantic",
  "enableQualityScoring": true,
  "minQualityScore": 40,
  "generateEmbeddings": true,
  "openaiApiKey": "sk-...",
  "vectorDbProvider": "pinecone",
  "vectorDbConfig": {
    "apiKey": "your-pinecone-api-key",
    "indexName": "your-index-name"
  },
  "vectorDbNamespace": "docs-v1"
}
```
Input Configuration
Crawling Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrls | array | required | Documentation URLs to crawl |
| maxDepth | integer | 10 | How many levels deep to crawl |
| maxPages | integer | 1000 | Maximum pages to process |
| includeGlobs | array | — | Only crawl URLs matching these patterns |
| excludeGlobs | array | — | Skip URLs matching these patterns |
Chunking Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| chunkingStrategy | string | semantic | semantic or simple |
| chunkSize | integer | 500 | Target chunk size in tokens |
| chunkOverlap | integer | 50 | Overlap between chunks |
| splitByHeaders | boolean | true | Create new chunks at H1/H2 (simple mode) |
| includeCodeBlocks | boolean | true | Include code snippets |
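To make the chunkSize/chunkOverlap interaction concrete, here is a sketch of what simple (fixed-size) chunking does: slide a window of chunkSize tokens, stepping chunkSize minus chunkOverlap tokens each time. The word-based tokenisation is a simplification; the Actor's real tokeniser may differ.

```python
def simple_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size chunking with overlap: each chunk repeats the last
    `overlap` tokens of the previous one."""
    tokens = text.split()  # crude tokenisation
    step = chunk_size - overlap
    chunks = []
    for i in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[i:i + chunk_size]))
        if i + chunk_size >= len(tokens):  # last window reached the end
            break
    return chunks
```

With the defaults (500/50), consecutive chunks share 50 tokens of context, which helps retrieval when a relevant sentence falls near a chunk boundary.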
Quality Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| enableQualityScoring | boolean | true | Enable quality scoring |
| minQualityScore | integer | 40 | Minimum quality score (0-100) |
| includeQualityInMetadata | boolean | true | Include scores in output |
| enrichMetadata | boolean | true | Add content classification |
Embedding Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| generateEmbeddings | boolean | false | Generate OpenAI embeddings |
| openaiApiKey | string | — | Your OpenAI API key |
| embeddingModel | string | text-embedding-3-small | Model to use |
Vector Database Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| vectorDbProvider | string | none | none, pinecone, supabase, or qdrant |
| vectorDbConfig | object | — | Provider-specific configuration |
| vectorDbNamespace | string | — | Namespace/collection for vectors |
| upsertBatchSize | integer | 100 | Batch size for upserts |
Output Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| outputFormat | string | jsonl | json, jsonl, or csv |
Output Schema
Chunk Structure
```json
{
  "id": "chunk_a1b2c3d4",
  "text": "## Installation\n\nTo install the package, run:\n\n```bash\nnpm install crawlee\n```",
  "tokenCount": 24,
  "metadata": {
    "sourceUrl": "https://crawlee.dev/docs/quick-start",
    "title": "Quick Start",
    "section": "Installation",
    "breadcrumbs": ["Docs", "Quick Start"],
    "chunkIndex": 2,
    "totalChunks": 8,
    "hierarchy": ["Quick Start", "Installation"],
    "headingLevel": 2,
    "hasCode": true,
    "codeLanguages": ["bash"],
    "contentType": "mixed",
    "prevChunkId": "chunk_x1y2z3w4",
    "nextChunkId": "chunk_e5f6g7h8",
    "quality": {
      "overall": 78,
      "dimensions": {
        "informationDensity": 72,
        "completeness": 85,
        "codeQuality": 80,
        "readability": 75
      },
      "flags": []
    }
  },
  "embedding": [0.123, -0.456, ...]
}
```
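A downstream consumer can post-process the JSONL dataset using the field names above. Here is a minimal sketch that keeps only chunks above a quality threshold, assuming one JSON record per line:

```python
import json

def load_chunks(jsonl_text: str, min_quality: int = 50) -> list[dict]:
    """Parse JSONL chunk records and keep those above a quality threshold."""
    chunks = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        chunk = json.loads(line)
        # Quality lives under metadata.quality.overall in the schema above
        quality = chunk.get("metadata", {}).get("quality", {}).get("overall", 0)
        if quality >= min_quality:
            chunks.append(chunk)
    return chunks
```

The same pattern extends to filtering by hasCode, contentType, or any other metadata field.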
Run Summary (OUTPUT)
```json
{
  "summary": {
    "totalPages": 45,
    "totalChunks": 312,
    "uniqueChunks": 287,
    "avgChunkSize": 423,
    "embeddingsGenerated": true,
    "crawlDurationSec": 67,
    "vectorDb": {
      "provider": "pinecone",
      "namespace": "docs-v1",
      "upsertedCount": 287
    },
    "qualityStats": {
      "avgScore": 71,
      "scoreDistribution": {
        "excellent": 89,
        "good": 142,
        "fair": 56,
        "filtered": 25
      }
    }
  }
}
```
Vector Databases
Pinecone
```json
{
  "vectorDbProvider": "pinecone",
  "vectorDbConfig": {
    "apiKey": "your-pinecone-api-key",
    "indexName": "your-index-name"
  },
  "vectorDbNamespace": "docs-v1"
}
```
Supabase
```json
{
  "vectorDbProvider": "supabase",
  "vectorDbConfig": {
    "url": "https://your-project.supabase.co",
    "anonKey": "your-anon-key",
    "tableName": "documents"
  }
}
```
Required Supabase table schema:
```sql
CREATE TABLE documents (
  id TEXT PRIMARY KEY,
  content TEXT,
  metadata JSONB,
  embedding VECTOR(1536)
);
```
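With that table in place, a nearest-neighbour lookup might look like the following (assuming the pgvector extension is enabled; `<=>` is pgvector's cosine-distance operator, and the vector literal is an illustrative placeholder, not a real embedding):

```sql
-- Return the 5 chunks closest to a query embedding
SELECT id, content, metadata
FROM documents
ORDER BY embedding <=> '[0.12, -0.45, ...]'::vector
LIMIT 5;
```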
Qdrant
```json
{
  "vectorDbProvider": "qdrant",
  "vectorDbConfig": {
    "url": "https://your-cluster.qdrant.io",
    "apiKey": "your-qdrant-api-key",
    "collectionName": "docs"
  }
}
```
Environment Variables
Store API keys securely in Actor environment variables instead of input:
| Variable | Description |
|---|---|
| OPENAI_API_KEY | OpenAI API key for embeddings |
| PINECONE_API_KEY | Pinecone API key |
| PINECONE_INDEX_NAME | Default Pinecone index |
| SUPABASE_URL | Supabase project URL |
| SUPABASE_ANON_KEY | Supabase anonymous key |
| QDRANT_URL | Qdrant cluster URL |
| QDRANT_API_KEY | Qdrant API key |
With environment variables set, input simplifies to:
```json
{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "generateEmbeddings": true,
  "vectorDbProvider": "pinecone"
}
```
Use Cases
| Use Case | Description |
|---|---|
| RAG Applications | Build knowledge bases for AI assistants and chatbots |
| Documentation Search | Create semantic search indexes for docs sites |
| Training Data | Prepare high-quality documentation for fine-tuning |
| Content Analysis | Analyze documentation quality across projects |
| Knowledge Graphs | Extract structured information from docs |
Cost Estimation
| Component | Cost |
|---|---|
| Crawling | ~$0.001 per page (Apify compute) |
| Embeddings | ~$0.02 per 1M tokens (text-embedding-3-small) |
| Vector DB | Varies by provider |
Example: 100 pages → ~500 chunks → ~50K tokens → ~$0.10 total
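The example figure follows from quick arithmetic using the rates above; the chunk and token counts are the example's rough averages, not guarantees:

```python
pages = 100
chunks_per_page = 5        # ~500 chunks for 100 pages, per the example
tokens_per_chunk = 100     # ~50K tokens total across those chunks

crawl_cost = pages * 0.001                    # ~$0.001 per page (compute)
embed_tokens = pages * chunks_per_page * tokens_per_chunk
embed_cost = embed_tokens / 1_000_000 * 0.02  # ~$0.02 per 1M tokens

total = crawl_cost + embed_cost
print(f"${total:.3f}")
```

Note that crawling dominates the bill at this scale; embedding 50K tokens costs only about a tenth of a cent.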
FAQ
Q: What's the difference between semantic and simple chunking?
Simple chunking splits by character count with optional header breaks. Semantic chunking understands document structure—it keeps code blocks intact, preserves list items, and maintains paragraph coherence.
Q: How does quality scoring work?
Each chunk is scored 0-100 based on information density, completeness, code quality, and readability. Low-scoring content (boilerplate, navigation, cookie notices) is automatically filtered.
Q: Can I use my own embedding model?
Currently supports OpenAI embedding models. The embeddingModel parameter accepts any OpenAI embedding model ID.
Q: How do I handle large documentation sites?
Use includeGlobs and excludeGlobs to target specific sections. Set appropriate maxPages limits. Consider running multiple times with different namespaces for different doc sections.
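For example, to index only a site's API reference while skipping changelog and blog pages (the URLs and glob patterns below are illustrative):

```json
{
  "startUrls": [{ "url": "https://docs.example.com" }],
  "maxPages": 300,
  "includeGlobs": ["https://docs.example.com/api/**"],
  "excludeGlobs": ["**/changelog/**", "**/blog/**"],
  "vectorDbNamespace": "docs-api-v1"
}
```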
Support
- Issues: Report bugs or request features on GitHub
- Documentation: See the full README in the Actor source
- API: Use the Apify API to run this Actor programmatically
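As a sketch of programmatic use with the `apify-client` Python package, the run could be triggered as follows. The Actor ID string is a placeholder; copy the real one from this Actor's page:

```python
def build_run_input(start_url: str, max_pages: int = 50) -> dict:
    """Assemble the Actor input (parameter names from the tables above)."""
    return {
        "startUrls": [{"url": start_url}],
        "maxPages": max_pages,
        "chunkingStrategy": "semantic",
        "enableQualityScoring": True,
        "minQualityScore": 40,
    }

def run_docs_to_rag(token: str, start_url: str) -> list[dict]:
    """Start a run and collect the resulting chunk records.
    Requires `pip install apify-client`; needs an Apify API token."""
    # Imported lazily so build_run_input works without the package installed
    from apify_client import ApifyClient

    client = ApifyClient(token)
    # "<username>/docs-to-rag" is a placeholder Actor ID, not the real one
    run = client.actor("<username>/docs-to-rag").call(
        run_input=build_run_input(start_url)
    )
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

The returned items follow the chunk schema shown in the Output section, so they can be filtered or re-embedded downstream.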
License
ISC