ChunkHive - Advanced Code Chunker

Transform Git repositories and code files into structured, semantic chunks for AI training. AST-first parsing preserves module→class→function relationships with byte-level precision for RAG, embeddings, and agentic AI systems.

Pricing: from $0.01 / 1,000 results
Rating: 5.0 (1)
Developer: Ahmedullah
Last modified: 10 days ago
ChunkHive – Advanced Code Chunker for AI Systems
Semantic, hierarchical code chunking for embeddings, RAG, and agentic AI workflows
ChunkHive is a production-grade code chunking Actor that transforms raw source code into clean, structured, semantically accurate chunks designed for modern AI systems.
It uses AST-first parsing with structural enrichment to preserve real code meaning — not just text boundaries.
🚀 What problem does ChunkHive solve?
Most code chunking tools split files by:
- line count
- token limits
- naive text heuristics
This breaks:
- function boundaries
- class ownership
- parent–child relationships
- semantic correctness
ChunkHive fixes this by treating code structure as the source of truth.
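To make the failure mode concrete, here is a minimal line-count chunker (illustrative code, not part of ChunkHive) cutting a small class in half:

```python
def naive_line_chunks(source: str, lines_per_chunk: int = 3) -> list:
    """Naive chunker: split every N lines, ignoring code structure."""
    lines = source.splitlines(keepends=True)
    return ["".join(lines[i:i + lines_per_chunk])
            for i in range(0, len(lines), lines_per_chunk)]

src = (
    "class Greeter:\n"
    "    def greet(self):\n"
    "        name = 'world'\n"
    "        print(f'hello {name}')\n"
)
chunks = naive_line_chunks(src)
# The 4-line class is cut after line 3: the second chunk is a dangling
# statement with no function, no class, and no context.
```

The final `print` line lands in its own chunk, severed from both its function and its class, which is exactly the structural breakage described above.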
✨ What does ChunkHive do?
ChunkHive analyzes source code repositories or single files and produces a structured dataset of semantic code chunks with:
- Correct module → class → function hierarchy
- Explicit parent–child relationships
- Precise byte-level and line-level spans
- Clean AST metadata for symbols
Each chunk is emitted as a dataset row, ready for downstream AI pipelines.
🧠 How ChunkHive works (AST-first by design)
AST is the authority. Tree-sitter is enrichment.
- **Primary parser: language AST.** Ensures semantic correctness and exact symbol boundaries.
- **Enrichment & fallback: Tree-sitter.** Improves robustness across diverse, real-world repositories.
- **Result:** structurally correct, production-grade chunks at scale.
This hybrid approach combines correctness and resilience — something most chunkers lack.
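As an illustration of the AST-first idea (a simplified sketch, not ChunkHive's actual implementation), Python's built-in `ast` module can already recover the module → class → function hierarchy with exact line spans:

```python
import ast

def chunk_source(source: str, file_path: str = "<paste>") -> list:
    """Emit one chunk per class/function, preserving parent links,
    nesting depth, and line spans taken from the AST."""
    tree = ast.parse(source)
    chunks = []

    def visit(node, parents, depth):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                chunks.append({
                    "file_path": file_path,
                    "chunk_type": "class" if isinstance(child, ast.ClassDef) else "function",
                    "name": child.name,
                    "parent": parents[-1] if parents else None,
                    # lineno/end_lineno come straight from the parser,
                    # so spans match the real symbol boundaries.
                    "span": {"start_line": child.lineno, "end_line": child.end_lineno},
                    "depth": depth,
                })
                visit(child, parents + [child.name], depth + 1)
            else:
                visit(child, parents, depth)

    visit(tree, [], 1)
    return chunks
```

A real chunker layers Tree-sitter on top for languages and files the native parser rejects, which is the enrichment/fallback role described above.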
🆚 ChunkHive vs typical code chunking approaches
| Capability | Typical Chunkers | ChunkHive |
|---|---|---|
| Chunking basis | Line / token splits | AST semantic nodes |
| Structural correctness | ❌ Often broken | ✅ Guaranteed |
| Module → class → function hierarchy | ❌ Flattened | ✅ Preserved |
| Parent–child relationships | ❌ Lost | ✅ Explicit |
| Symbol metadata | ❌ Heuristic / missing | ✅ AST-derived |
| Byte-level spans | ❌ Rare | ✅ Yes |
| Line-level spans | ⚠️ Partial | ✅ Yes |
| Repo-scale robustness | ⚠️ Fragile | ✅ Production-grade |
| Identical schema across input modes | ❌ No | ✅ Yes |
| RAG-ready without post-processing | ❌ No | ✅ Yes |
| Agent training suitability | ❌ Weak | ✅ Strong |
🎯 Who is ChunkHive for?
- AI engineers building code-aware RAG systems
- Agent developers (LangChain, CrewAI, AutoGen)
- ML teams creating high-quality code training datasets
- Platform teams preparing large repositories for semantic search
- Researchers working on code understanding and embeddings
🔌 Input Modes (Choose One)
ChunkHive supports two mutually exclusive input modes.
You must use exactly one.
🗂️ Mode 1: Repository Chunking (Recommended)
Best for:
- Full GitHub / GitLab repositories
- RAG datasets
- Agentic frameworks (LangChain, CrewAI, AutoGen)
- Production pipelines
How to use:
- Set **Input mode** to `repo`
- Provide a public Git repository URL
- (Optional) Configure extensions and limits

Example:

```
repo_url: https://github.com/crewAIInc/crewAI
extensions: .py,.md
max_files: 500
```
What happens:
- Repository is cloned
- Files are parsed using AST-first logic
- Semantic chunks are generated and stored in a dataset
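Because ChunkHive runs as an Apify Actor, repo mode can also be driven programmatically with the `apify-client` package. The input field names and the Actor ID below are assumptions inferred from this README, so verify them against the Actor's input schema first:

```python
def build_repo_input(repo_url: str, extensions: str = ".py,.md",
                     max_files: int = 500) -> dict:
    # Field names are assumptions based on the options described above.
    return {
        "input_mode": "repo",
        "repo_url": repo_url,
        "extensions": extensions,
        "max_files": max_files,
    }

def run_chunkhive(token: str, actor_id: str, run_input: dict) -> list:
    # Requires: pip install apify-client
    from apify_client import ApifyClient
    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    # Each dataset item is one semantic code chunk.
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

`run_chunkhive` blocks until the run finishes and then streams the chunk dataset, which suits batch pipelines; long-running repos may warrant the client's asynchronous start-and-poll pattern instead.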
✍️ Mode 2: Paste Code (Single File)
Best for:
- Quick testing
- Demos
- Chunking a single file
How to use:
- Set **Input mode** to `paste`
- Paste the contents of one file into `code_text`
- Leave `repo_url` empty

Example:

```python
def hello_world():
    print("Hello, World!")
```
Rules:
- Paste only one file
- Do not mix multiple files
- File is treated as a virtual local file
Result:
- Parsed exactly like a real repository file
- Output schema is identical to repo mode
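For paste mode the equivalent run input is small enough to write inline (again, the field names are assumptions based on this README):

```python
# Hypothetical paste-mode run input; check the Actor's input schema.
paste_input = {
    "input_mode": "paste",  # assumed field name for the mode selector
    "code_text": 'def hello_world():\n    print("Hello, World!")\n',
    "repo_url": "",         # must stay empty in paste mode
}
```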
📤 Output
ChunkHive produces a dataset of semantic code chunks.
Each dataset row includes:
| Field | Description |
|---|---|
| chunk_id | Unique chunk identifier |
| file_path | Source file path |
| chunk_type | Module / class / function |
| language | Detected language |
| code | Chunk content |
| span.start_line | Start line |
| span.end_line | End line |
| ast.name | Symbol name |
| ast.symbol_type | Symbol type |
| hierarchy.depth | Nesting depth |
The dataset can be downloaded in JSON, JSONL, CSV, or Excel formats.
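A downloaded JSONL row can be consumed directly in a RAG pipeline. The example below uses invented values that follow the field table above:

```python
import json

# One JSONL line with invented values matching the schema above.
line = (
    '{"chunk_id": "c1", "file_path": "src/app.py", "chunk_type": "function",'
    ' "language": "python", "code": "def hello():\\n    return 42",'
    ' "span": {"start_line": 10, "end_line": 11},'
    ' "ast": {"name": "hello", "symbol_type": "function"},'
    ' "hierarchy": {"depth": 1}}'
)
row = json.loads(line)

# A common embedding input: qualified symbol name plus the chunk body.
embed_text = f'{row["file_path"]}::{row["ast"]["name"]}\n{row["code"]}'
```

Prefixing the chunk body with its file path and symbol name this way keeps retrieval results traceable back to exact source locations via the span fields.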
🎯 Use Cases
- Building code embedding datasets
- Powering RAG systems
- Training AI agents on real code structure
- Static analysis pipelines
- Documentation and code alignment
💰 Pricing
ChunkHive uses Apify’s Compute Unit (CU) pricing model.
Costs depend on:
- Repository size
- Number of files
- Parsing complexity
For small repositories and single-file inputs, usage is minimal.
🆘 Support & Feedback
Found a bug or edge case? Open an issue on the Actor page.
Have a feature request or custom use case? Feedback is welcome.
ChunkHive is actively maintained and optimized for real-world AI workloads.
