ChunkHive - Advanced Code Chunker

Transform Git repositories and code files into structured, semantic chunks for AI training. AST-first parsing preserves module→class→function relationships with byte-level precision for RAG, embeddings, and agentic AI systems.

Pricing: from $0.01 / 1,000 results
Rating: 5.0 (1)
Developer: Ahmedullah (Maintained by Community)
Actor stats: 0 bookmarked · 4 total users · 1 monthly active user · last modified 10 days ago

ChunkHive – Advanced Code Chunker for AI Systems

Semantic, hierarchical code chunking for embeddings, RAG, and agentic AI workflows

ChunkHive is a production-grade code chunking Actor that transforms raw source code into clean, structured, semantically accurate chunks designed for modern AI systems.

It uses AST-first parsing with structural enrichment to preserve real code meaning — not just text boundaries.


🚀 What problem does ChunkHive solve?

Most code chunking tools split files by:

  • line count
  • token limits
  • naive text heuristics

This breaks:

  • function boundaries
  • class ownership
  • parent–child relationships
  • semantic correctness

ChunkHive fixes this by treating code structure as the source of truth.


✨ What does ChunkHive do?

ChunkHive analyzes source code repositories or single files and produces a structured dataset of semantic code chunks with:

  • Correct module → class → function hierarchy
  • Explicit parent–child relationships
  • Precise byte-level and line-level spans
  • Clean AST metadata for symbols

Each chunk is emitted as a dataset row, ready for downstream AI pipelines.


🧠 How ChunkHive works (AST-first by design)

AST is the authority. Tree-sitter is enrichment.

  • Primary parser: Language AST
    Ensures semantic correctness and exact symbol boundaries
  • Enrichment & fallback: Tree-sitter
    Improves robustness across diverse, real-world repositories
  • Result: Structurally correct, production-grade chunks at scale

This hybrid approach combines correctness and resilience — something most chunkers lack.
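The AST-first idea can be illustrated with a minimal Python sketch (an illustration of the technique, not ChunkHive's internal code): a language AST gives exact symbol boundaries, so a chunker can walk it and emit module → class → function chunks with explicit parent links and line spans.

```python
import ast

def chunk_source(source: str, file_path: str = "<paste>"):
    """Walk the AST and emit one chunk per module/class/function,
    recording parent-child links and exact line spans."""
    tree = ast.parse(source)
    chunks = [{"chunk_id": 0, "chunk_type": "module", "name": file_path,
               "parent_id": None, "start_line": 1,
               "end_line": len(source.splitlines()), "depth": 0}]

    def walk(node, parent_id, depth):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.ClassDef, ast.FunctionDef,
                                  ast.AsyncFunctionDef)):
                chunk = {
                    "chunk_id": len(chunks),
                    "chunk_type": ("class" if isinstance(child, ast.ClassDef)
                                   else "function"),
                    "name": child.name,
                    "parent_id": parent_id,
                    "start_line": child.lineno,   # exact AST boundaries
                    "end_line": child.end_lineno,
                    "depth": depth,
                }
                chunks.append(chunk)
                walk(child, chunk["chunk_id"], depth + 1)

    walk(tree, 0, 1)
    return chunks

src = """class Greeter:
    def hello(self):
        print("hi")
"""
rows = chunk_source(src)
```

Because the spans come from the parser rather than from line or token counts, a class is never split mid-body and every function chunk knows which class owns it.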


🆚 ChunkHive vs typical code chunking approaches

| Capability | Typical Chunkers | ChunkHive |
| --- | --- | --- |
| Chunking basis | Line / token splits | AST semantic nodes |
| Structural correctness | ❌ Often broken | ✅ Guaranteed |
| Module → class → function hierarchy | ❌ Flattened | ✅ Preserved |
| Parent–child relationships | ❌ Lost | ✅ Explicit |
| Symbol metadata | ❌ Heuristic / missing | ✅ AST-derived |
| Byte-level spans | ❌ Rare | ✅ Yes |
| Line-level spans | ⚠️ Partial | ✅ Yes |
| Repo-scale robustness | ⚠️ Fragile | ✅ Production-grade |
| Identical schema across input modes | ❌ No | ✅ Yes |
| RAG-ready without post-processing | ❌ No | ✅ Yes |
| Agent training suitability | ❌ Weak | ✅ Strong |

🎯 Who is ChunkHive for?

  • AI engineers building code-aware RAG systems
  • Agent developers (LangChain, CrewAI, AutoGen)
  • ML teams creating high-quality code training datasets
  • Platform teams preparing large repositories for semantic search
  • Researchers working on code understanding and embeddings

🔌 Input Modes (Choose One)

ChunkHive supports two mutually exclusive input modes.
You must use exactly one.


🔗 Mode 1: Git Repository

Best for:

  • Full GitHub / GitLab repositories
  • RAG datasets
  • Agentic frameworks (LangChain, CrewAI, AutoGen)
  • Production pipelines

How to use:

  1. Set Input mode to repo
  2. Provide a public Git repository URL
  3. (Optional) Configure extensions and limits

Example:

repo_url: https://github.com/crewAIInc/crewAI
extensions: .py,.md
max_files: 500

What happens:

  • Repository is cloned
  • Files are parsed using AST-first logic
  • Semantic chunks are generated and stored in a dataset
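The repo-mode example above maps onto a run input roughly like the following sketch. The key names `input_mode`, `repo_url`, `extensions`, and `max_files` are assumptions based on the field labels in this README; check the Actor's Input tab for the exact schema.

```python
# Run input for repo mode (key names assumed from the example above;
# verify against the Actor's Input tab before use).
run_input = {
    "input_mode": "repo",
    "repo_url": "https://github.com/crewAIInc/crewAI",
    "extensions": ".py,.md",   # only chunk Python and Markdown files
    "max_files": 500,          # cap repository size for the run
}

# With the official Apify Python client (pip install apify-client),
# starting a run would look roughly like:
#   from apify_client import ApifyClient
#   client = ApifyClient("<APIFY_TOKEN>")
#   run = client.actor("<username>/<actor-name>").call(run_input=run_input)
#   items = client.dataset(run["defaultDatasetId"]).list_items().items
```

The client call is commented out because the Actor identifier and token are placeholders; the dataset items it returns are the semantic chunks described under Output.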

✍️ Mode 2: Paste Code (Single File)

Best for:

  • Quick testing
  • Demos
  • Chunking a single file

How to use:

  1. Set Input mode to paste
  2. Paste the contents of one file into code_text
  3. Leave repo_url empty

Example:

def hello_world():
    print("Hello, World!")

Rules:

  • Paste only one file
  • Do not mix multiple files
  • File is treated as a virtual local file

Result:

  • Parsed exactly like a real repository file
  • Output schema is identical to repo mode
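The steps above correspond to a run input roughly like this sketch; the key names `input_mode` and `code_text` follow the step labels in this README but are assumptions, so verify them on the Actor's Input tab.

```python
# Paste-mode run input (key names assumed from the steps above).
run_input = {
    "input_mode": "paste",
    "code_text": 'def hello_world():\n    print("Hello, World!")',
    # repo_url is intentionally omitted: the two modes are
    # mutually exclusive and paste mode requires it to be empty.
}
```

Because the pasted text is treated as a virtual local file, the resulting dataset rows use the same schema as repo mode.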

📤 Output

ChunkHive produces a dataset of semantic code chunks.

Each dataset row includes:

| Field | Description |
| --- | --- |
| chunk_id | Unique chunk identifier |
| file_path | Source file path |
| chunk_type | Module / class / function |
| language | Detected language |
| code | Chunk content |
| span.start_line | Start line |
| span.end_line | End line |
| ast.name | Symbol name |
| ast.symbol_type | Symbol type |
| hierarchy.depth | Nesting depth |

The dataset can be downloaded in JSON, JSONL, CSV, or Excel formats.
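A minimal sketch of consuming the exported JSONL downstream, using the field names above. The rows here are illustrative, and `hierarchy.parent_id` is an assumed key for the explicit parent–child links (the exact key may differ in real exports):

```python
import json

# Two illustrative JSONL rows in the flattened "span.start_line" key
# style; "hierarchy.parent_id" is an assumed field name.
jsonl = """\
{"chunk_id": "c1", "file_path": "app.py", "chunk_type": "class", "language": "python", "code": "class Greeter: ...", "span.start_line": 1, "span.end_line": 4, "ast.name": "Greeter", "ast.symbol_type": "class", "hierarchy.depth": 1, "hierarchy.parent_id": null}
{"chunk_id": "c2", "file_path": "app.py", "chunk_type": "function", "language": "python", "code": "def hello(self): ...", "span.start_line": 2, "span.end_line": 3, "ast.name": "hello", "ast.symbol_type": "function", "hierarchy.depth": 2, "hierarchy.parent_id": "c1"}
"""

chunks = {row["chunk_id"]: row for line in jsonl.splitlines()
          if (row := json.loads(line))}

def qualified_name(chunk_id):
    """Follow parent links upward to build Class.method-style names."""
    parts = []
    while chunk_id is not None:
        chunk = chunks[chunk_id]
        parts.append(chunk["ast.name"])
        chunk_id = chunk["hierarchy.parent_id"]
    return ".".join(reversed(parts))

# Text a RAG pipeline might embed: qualified name + source span + code.
docs = [f'{qualified_name(cid)} ({c["file_path"]}:{c["span.start_line"]}-'
        f'{c["span.end_line"]})\n{c["code"]}'
        for cid, c in chunks.items()]
```

Prefixing each chunk with its qualified name and span is one common way to exploit the preserved hierarchy: the embedding text then carries the ownership context that flat line-based chunks lose.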

🎯 Use Cases

  • Building code embedding datasets
  • Powering RAG systems
  • Training AI agents on real code structure
  • Static analysis pipelines
  • Documentation + code alignment

💰 Pricing

ChunkHive uses Apify’s Compute Unit (CU) pricing model.

Costs depend on:

  • Repository size
  • Number of files
  • Parsing complexity

For small repositories and single-file inputs, usage is minimal.

🆘 Support & Feedback

Found a bug or edge case? Open an issue on the Actor page.

Have a feature request or custom use case? Feedback is welcome.

ChunkHive is actively maintained and optimized for real-world AI workloads.