ChunkHive - Advanced Code Chunker

Transform Git repositories and code files into structured, semantic chunks for AI training. AST-first parsing preserves module→class→function relationships with byte-level precision for RAG, embeddings, and agentic AI systems.

Pricing: from $0.01 / 1,000 results
Rating: 5.0 (1)
Developer: Ahmedullah (Maintained by Community)
Actor stats: 0 bookmarked · 4 total users · 1 monthly active user · last modified 10 days ago

ChunkHive – Advanced Code Chunker for AI Systems

Semantic, hierarchical code chunking for embeddings, RAG, and agentic AI workflows

ChunkHive is a production-grade code chunking Actor that transforms raw source code into clean, structured, semantically accurate chunks designed for modern AI systems.

It uses AST-first parsing with structural enrichment to preserve real code meaning — not just text boundaries.


🚀 What problem does ChunkHive solve?

Most code chunking tools split files by:

  • line count
  • token limits
  • naive text heuristics

This breaks:

  • function boundaries
  • class ownership
  • parent–child relationships
  • semantic correctness

ChunkHive fixes this by treating code structure as the source of truth.


✨ What does ChunkHive do?

ChunkHive analyzes source code repositories or single files and produces a structured dataset of semantic code chunks with:

  • Correct module → class → function hierarchy
  • Explicit parent–child relationships
  • Precise byte-level and line-level spans
  • Clean AST metadata for symbols

Each chunk is emitted as a dataset row, ready for downstream AI pipelines.


🧠 How ChunkHive works (AST-first by design)

AST is the authority. Tree-sitter is enrichment.

  • Primary parser: Language AST
    Ensures semantic correctness and exact symbol boundaries
  • Enrichment & fallback: Tree-sitter
    Improves robustness across diverse, real-world repositories
  • Result: Structurally correct, production-grade chunks at scale

This hybrid approach combines correctness and resilience — something most chunkers lack.
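The AST-first idea can be illustrated with a minimal Python sketch (an illustration of the technique, not ChunkHive's internal code): a language AST gives exact symbol boundaries, so a chunker can walk it and emit module → class → function chunks with explicit parent links and line spans.

```python
import ast

def chunk_source(source: str, file_path: str = "<paste>"):
    """Walk the AST and emit one chunk per module/class/function,
    recording parent-child links and exact line spans."""
    tree = ast.parse(source)
    chunks = [{"chunk_id": 0, "chunk_type": "module", "name": file_path,
               "parent_id": None, "start_line": 1,
               "end_line": len(source.splitlines()), "depth": 0}]

    def walk(node, parent_id, depth):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.ClassDef, ast.FunctionDef,
                                  ast.AsyncFunctionDef)):
                chunk = {
                    "chunk_id": len(chunks),
                    "chunk_type": ("class" if isinstance(child, ast.ClassDef)
                                   else "function"),
                    "name": child.name,
                    "parent_id": parent_id,
                    "start_line": child.lineno,   # exact AST boundaries
                    "end_line": child.end_lineno,
                    "depth": depth,
                }
                chunks.append(chunk)
                walk(child, chunk["chunk_id"], depth + 1)

    walk(tree, 0, 1)
    return chunks

src = """class Greeter:
    def hello(self):
        print("hi")
"""
rows = chunk_source(src)
```

Because the spans come from the parser rather than from line or token counts, a class is never split mid-body and every function chunk knows which class owns it.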


🆚 ChunkHive vs typical code chunking approaches

| Capability | Typical Chunkers | ChunkHive |
| --- | --- | --- |
| Chunking basis | Line / token splits | AST semantic nodes |
| Structural correctness | ❌ Often broken | ✅ Guaranteed |
| Module → class → function hierarchy | ❌ Flattened | ✅ Preserved |
| Parent–child relationships | ❌ Lost | ✅ Explicit |
| Symbol metadata | ❌ Heuristic / missing | ✅ AST-derived |
| Byte-level spans | ❌ Rare | ✅ Yes |
| Line-level spans | ⚠️ Partial | ✅ Yes |
| Repo-scale robustness | ⚠️ Fragile | ✅ Production-grade |
| Identical schema across input modes | ❌ No | ✅ Yes |
| RAG-ready without post-processing | ❌ No | ✅ Yes |
| Agent training suitability | ❌ Weak | ✅ Strong |

🎯 Who is ChunkHive for?

  • AI engineers building code-aware RAG systems
  • Agent developers (LangChain, CrewAI, AutoGen)
  • ML teams creating high-quality code training datasets
  • Platform teams preparing large repositories for semantic search
  • Researchers working on code understanding and embeddings

🔌 Input Modes (Choose One)

ChunkHive supports two mutually exclusive input modes.
You must use exactly one.


🔗 Mode 1: Git Repository

Best for:

  • Full GitHub / GitLab repositories
  • RAG datasets
  • Agentic frameworks (LangChain, CrewAI, AutoGen)
  • Production pipelines

How to use:

  1. Set Input mode to repo
  2. Provide a public Git repository URL
  3. (Optional) Configure extensions and limits

Example:

repo_url: https://github.com/crewAIInc/crewAI
extensions: .py,.md
max_files: 500

What happens:

  • Repository is cloned
  • Files are parsed using AST-first logic
  • Semantic chunks are generated and stored in a dataset
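The repo-mode example above maps onto a run input roughly like the following sketch. The key names `input_mode`, `repo_url`, `extensions`, and `max_files` are assumptions based on the field labels in this README; check the Actor's Input tab for the exact schema.

```python
# Run input for repo mode (key names assumed from the example above;
# verify against the Actor's Input tab before use).
run_input = {
    "input_mode": "repo",
    "repo_url": "https://github.com/crewAIInc/crewAI",
    "extensions": ".py,.md",   # only chunk Python and Markdown files
    "max_files": 500,          # cap repository size for the run
}

# With the official Apify Python client (pip install apify-client),
# starting a run would look roughly like:
#   from apify_client import ApifyClient
#   client = ApifyClient("<APIFY_TOKEN>")
#   run = client.actor("<username>/<actor-name>").call(run_input=run_input)
#   items = client.dataset(run["defaultDatasetId"]).list_items().items
```

The client call is commented out because the Actor identifier and token are placeholders; the dataset items it returns are the semantic chunks described under Output.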

✍️ Mode 2: Paste Code (Single File)

Best for:

  • Quick testing
  • Demos
  • Chunking a single file

How to use:

  1. Set Input mode to paste
  2. Paste the contents of one file into code_text
  3. Leave repo_url empty

Example:

def hello_world():
    print("Hello, World!")

Rules:

  • Paste only one file
  • Do not mix multiple files
  • File is treated as a virtual local file

Result:

  • Parsed exactly like a real repository file
  • Output schema is identical to repo mode
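The steps above correspond to a run input roughly like this sketch; the key names `input_mode` and `code_text` follow the step labels in this README but are assumptions, so verify them on the Actor's Input tab.

```python
# Paste-mode run input (key names assumed from the steps above).
run_input = {
    "input_mode": "paste",
    "code_text": 'def hello_world():\n    print("Hello, World!")',
    # repo_url is intentionally omitted: the two modes are
    # mutually exclusive and paste mode requires it to be empty.
}
```

Because the pasted text is treated as a virtual local file, the resulting dataset rows use the same schema as repo mode.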

📤 Output

ChunkHive produces a dataset of semantic code chunks.

Each dataset row includes:

| Field | Description |
| --- | --- |
| chunk_id | Unique chunk identifier |
| file_path | Source file path |
| chunk_type | Module / class / function |
| language | Detected language |
| code | Chunk content |
| span.start_line | Start line |
| span.end_line | End line |
| ast.name | Symbol name |
| ast.symbol_type | Symbol type |
| hierarchy.depth | Nesting depth |

The dataset can be downloaded in JSON, JSONL, CSV, or Excel formats.
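A minimal sketch of consuming the exported JSONL downstream, using the field names above. The rows here are illustrative, and `hierarchy.parent_id` is an assumed key for the explicit parent–child links (the exact key may differ in real exports):

```python
import json

# Two illustrative JSONL rows in the flattened "span.start_line" key
# style; "hierarchy.parent_id" is an assumed field name.
jsonl = """\
{"chunk_id": "c1", "file_path": "app.py", "chunk_type": "class", "language": "python", "code": "class Greeter: ...", "span.start_line": 1, "span.end_line": 4, "ast.name": "Greeter", "ast.symbol_type": "class", "hierarchy.depth": 1, "hierarchy.parent_id": null}
{"chunk_id": "c2", "file_path": "app.py", "chunk_type": "function", "language": "python", "code": "def hello(self): ...", "span.start_line": 2, "span.end_line": 3, "ast.name": "hello", "ast.symbol_type": "function", "hierarchy.depth": 2, "hierarchy.parent_id": "c1"}
"""

chunks = {row["chunk_id"]: row for line in jsonl.splitlines()
          if (row := json.loads(line))}

def qualified_name(chunk_id):
    """Follow parent links upward to build Class.method-style names."""
    parts = []
    while chunk_id is not None:
        chunk = chunks[chunk_id]
        parts.append(chunk["ast.name"])
        chunk_id = chunk["hierarchy.parent_id"]
    return ".".join(reversed(parts))

# Text a RAG pipeline might embed: qualified name + source span + code.
docs = [f'{qualified_name(cid)} ({c["file_path"]}:{c["span.start_line"]}-'
        f'{c["span.end_line"]})\n{c["code"]}'
        for cid, c in chunks.items()]
```

Prefixing each chunk with its qualified name and span is one common way to exploit the preserved hierarchy: the embedding text then carries the ownership context that flat line-based chunks lose.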

🎯 Use Cases

  • Building code embedding datasets
  • Powering RAG systems
  • Training AI agents on real code structure
  • Static analysis pipelines
  • Documentation + code alignment

💰 Pricing

ChunkHive uses Apify’s Compute Unit (CU) pricing model.

Costs depend on:

  • Repository size
  • Number of files
  • Parsing complexity

For small repositories and single-file inputs, usage is minimal.

🆘 Support & Feedback

Found a bug or edge case? Open an issue on the Actor page.

Have a feature request or custom use case? Feedback is welcome.

ChunkHive is actively maintained and optimized for real-world AI workloads.