# Code Indexing
AST-based code indexing and semantic retrieval for project-aware context. The `zeph-index` crate parses source files with tree-sitter, chunks them by AST structure, embeds the chunks and stores the vectors in Qdrant, and retrieves relevant code via hybrid search (semantic + grep routing) for injection into the agent's context window.
`zeph-index` is always on; no feature flag is required. Enable indexing at runtime with `[index] enabled = true` in the config.
## Why Code RAG
Cloud models with 200K token windows can afford multi-round agentic grep. Local models with 8K-32K windows cannot: a single grep cycle costs ~2K tokens (25% of an 8K budget), while 5 rounds would exceed the entire context. RAG retrieves 6-8 relevant chunks in ~3K tokens, preserving budget for history and response.
For cloud models, code RAG serves as pre-fill context alongside agentic search. For local models, it is the primary code retrieval mechanism.
## Setup

1. Start Qdrant (required for vector storage):

   ```bash
   docker compose up -d qdrant
   ```

2. Enable indexing in config:

   ```toml
   [index]
   enabled = true
   ```

3. Index your project:

   ```bash
   zeph index
   ```

   Or let auto-indexing handle it on startup when `auto_index = true` (the default).
## Architecture
The zeph-index crate contains 7 modules:
| Module | Purpose |
|---|---|
| `languages` | Language detection from file extensions, tree-sitter grammar registry |
| `chunker` | AST-based chunking with greedy sibling merge (cAST-inspired algorithm) |
| `context` | Contextualized embedding text generation (file path + scope + imports + code) |
| `store` | Dual-write storage: Qdrant vectors + SQLite chunk metadata |
| `indexer` | Orchestrator: walk project tree, chunk files, embed, store with incremental change detection |
| `retriever` | Query classification, semantic search, budget-aware chunk packing |
| `repo_map` | Compact structural map of the project (signatures only, no function bodies) |
### Pipeline

```text
Source files
      |
      v
[languages.rs]  detect language, load grammar
      |
      v
[chunker.rs]    parse AST, split into chunks (target: ~600 non-ws chars)
      |
      v
[context.rs]    prepend file path, scope chain, imports, language tag
      |
      v
[indexer.rs]    embed via LlmProvider, skip unchanged (content hash)
      |
      v
[store.rs]      upsert to Qdrant (vectors) + SQLite (metadata)
```
### Retrieval

```text
User query
      |
      v
[retriever.rs] classify_query()
      |
      +--> Semantic --> embed query --> Qdrant search --> budget pack --> inject
      |
      +--> Grep     --> return empty (agent uses bash tools)
      |
      +--> Hybrid   --> semantic search + hint to agent
```
## Query Classification
The retriever classifies each query to route it to the appropriate search strategy:
| Strategy | Trigger | Action |
|---|---|---|
| Grep | Exact symbols: `::`, `fn `, `struct `, CamelCase, snake_case identifiers | Agent handles via shell grep/ripgrep |
| Semantic | Conceptual queries: “how”, “where”, “why”, “explain” | Vector similarity search in Qdrant |
| Hybrid | Both symbol patterns and conceptual words | Semantic search + hint that grep may also help |
Default (no pattern match): Semantic.
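The routing heuristic above can be sketched as a small classifier. This is an illustrative sketch, not the crate's actual API: the `Strategy` enum, `classify_query`, and the exact pattern checks are assumptions based on the table.

```rust
/// Illustrative query-routing sketch; names and heuristics are assumptions.
#[derive(Debug, PartialEq)]
enum Strategy {
    Grep,
    Semantic,
    Hybrid,
}

/// Exact-symbol cues: paths like `foo::bar`, `fn `/`struct ` keywords,
/// CamelCase or snake_case identifiers.
fn looks_like_symbol(query: &str) -> bool {
    query.contains("::")
        || query.contains("fn ")
        || query.contains("struct ")
        || query.split_whitespace().any(|w| {
            let camel = w.chars().next().map_or(false, |c| c.is_ascii_uppercase())
                && w.chars().any(|c| c.is_ascii_lowercase())
                && w.chars().all(|c| c.is_ascii_alphanumeric());
            let snake = w.contains('_')
                && w.chars().all(|c| c.is_ascii_alphanumeric() || c == '_');
            camel || snake
        })
}

/// Conceptual cues: question-style words such as "how" or "explain".
fn looks_conceptual(query: &str) -> bool {
    let lowered = query.to_lowercase();
    ["how", "where", "why", "explain"]
        .iter()
        .any(|kw| lowered.split_whitespace().any(|w| w == *kw))
}

fn classify_query(query: &str) -> Strategy {
    match (looks_like_symbol(query), looks_conceptual(query)) {
        (true, true) => Strategy::Hybrid,
        (true, false) => Strategy::Grep,
        _ => Strategy::Semantic, // default when no pattern matches
    }
}
```

For example, `Agent::run` routes to Grep, while "where is auth handled?" routes to Semantic.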
## AST-Based Chunking
Files are parsed via tree-sitter into an AST, then chunked by entity boundaries (functions, structs, classes, impl blocks). The algorithm uses a greedy sibling merge:
- Target size: 600 non-whitespace characters (~300-400 tokens)
- Max size: 1200 non-ws chars (forced recursive split)
- Min size: 100 non-ws chars (merge with adjacent sibling)
Config files (TOML, JSON, Markdown, Bash) are indexed as single file-level chunks since they lack named entities.
Each chunk carries rich metadata: file path, language, AST node type, entity name, line range, scope chain (e.g. MyStruct > impl MyStruct > my_method), imports, and a BLAKE3 content hash for change detection.
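The size rules above can be sketched on plain character counts. This is a simplified sketch: `merge_siblings` is a hypothetical name, and the real chunker walks tree-sitter sibling nodes and recursively splits oversized ones rather than operating on bare sizes.

```rust
const TARGET: usize = 600;  // target non-whitespace chars per chunk
const MAX: usize = 1200;    // hard ceiling; above this a node is split
const MIN: usize = 100;     // below this a chunk merges with a sibling

/// Greedy sibling merge sketch over a sequence of entity sizes:
/// merge an adjacent sibling while the current chunk is undersized
/// (or the combined size still fits the target), never exceeding MAX.
fn merge_siblings(sizes: &[usize]) -> Vec<usize> {
    let mut out: Vec<usize> = Vec::new();
    for &size in sizes {
        let merge = match out.last() {
            Some(&last) => last + size <= MAX && (last < MIN || last + size <= TARGET),
            None => false,
        };
        if merge {
            *out.last_mut().unwrap() += size;
        } else {
            out.push(size);
        }
    }
    out
}
```

For sizes `[50, 80, 400, 700, 1000]` this yields `[530, 700, 1000]`: the two tiny entities fold into the following one, while the larger entities stand alone.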
## Contextualized Embeddings
Embedding raw code alone yields poor retrieval quality for conceptual queries. Before embedding, each chunk is prepended with:
- File path (`# src/agent.rs`)
- Scope chain (`# Scope: Agent > prepare_context`)
- Language tag (`# Language: rust`)
- First 5 import/use statements
This contextualized form improves retrieval for queries like “where is auth handled?” where the code alone might not contain the word “auth”.
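Assembling the contextualized text can be sketched as simple string concatenation. The function name and exact header format below are illustrative assumptions, following the comment-style headers shown above.

```rust
/// Sketch of contextualized embedding text: header comments carrying
/// file path, scope chain, and language, then up to 5 imports, then code.
fn contextualize(
    path: &str,
    scope: &[&str],
    language: &str,
    imports: &[&str],
    code: &str,
) -> String {
    let mut text = String::new();
    text.push_str(&format!("# {path}\n"));
    if !scope.is_empty() {
        text.push_str(&format!("# Scope: {}\n", scope.join(" > ")));
    }
    text.push_str(&format!("# Language: {language}\n"));
    for imp in imports.iter().take(5) {
        // only the first 5 import/use statements are included
        text.push_str(imp);
        text.push('\n');
    }
    text.push_str(code);
    text
}
```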
## Storage
Chunks are dual-written to two stores:
| Store | Data | Purpose |
|---|---|---|
| Qdrant (`zeph_code_chunks`) | Embedding vectors + payload (code, metadata) | Semantic similarity search |
| SQLite (`chunk_metadata`) | File path, content hash, line range, language, node type | Change detection, cleanup of deleted files |
The Qdrant collection uses INT8 scalar quantization for ~4x memory reduction with minimal accuracy loss. Payload indexes on language, file_path, and node_type enable filtered search.
## Incremental Indexing
On subsequent runs, the indexer skips unchanged chunks by checking BLAKE3 content hashes in SQLite. Only modified or new files are re-embedded. Deleted files are detected by comparing the current file set against the SQLite index, and their chunks are removed from both stores.
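The hash-based skip can be sketched as follows. The crate uses BLAKE3 hashes persisted in SQLite; here, a std `DefaultHasher` and an in-memory `HashMap` stand in, and both function names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in content hash (the real indexer uses BLAKE3).
fn content_hash(content: &str) -> u64 {
    let mut h = DefaultHasher::new();
    content.hash(&mut h);
    h.finish()
}

/// Returns true when the file must be (re-)embedded, updating the
/// stored hash as a side effect. New files and changed content both
/// trigger re-indexing; unchanged content is skipped.
fn needs_reindex(index: &mut HashMap<String, u64>, path: &str, content: &str) -> bool {
    let hash = content_hash(content);
    match index.insert(path.to_string(), hash) {
        Some(prev) => prev != hash, // changed since last run
        None => true,               // new file
    }
}
```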
## File Watcher
When watch = true (default), an IndexWatcher monitors project files for changes during the session. On file modification, the changed file is automatically re-indexed via reindex_file() without rebuilding the entire index. The watcher uses 1-second debounce to batch rapid changes and only processes files with indexable extensions.
Disable with:

```toml
[index]
watch = false
```
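The debounce behaviour can be sketched on plain timestamps: events within the window accumulate into one batch, and a gap larger than the window flushes it. The `debounce` function and millisecond-timestamp representation are illustrative; the real watcher works on wall-clock file events with a 1-second window.

```rust
/// Debounce sketch: batch file events that arrive within `window_ms`
/// of the previous event, flushing a batch once the gap exceeds the
/// window. Rapid saves of the same file are deduplicated per batch.
fn debounce(events: &[(u64, &str)], window_ms: u64) -> Vec<Vec<String>> {
    let mut batches: Vec<Vec<String>> = Vec::new();
    let mut current: Vec<String> = Vec::new();
    let mut last: Option<u64> = None;
    for &(t, path) in events {
        if let Some(prev) = last {
            if t - prev > window_ms {
                // quiet period elapsed: flush the accumulated batch
                batches.push(std::mem::take(&mut current));
            }
        }
        if !current.contains(&path.to_string()) {
            current.push(path.to_string());
        }
        last = Some(t);
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}
```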
## Repo Map
A lightweight structural map of the project, generated via tree-sitter queries. It is included in the system prompt and cached with a configurable TTL (default: 5 minutes) to avoid per-message filesystem traversal.
For each supported language, tree-sitter queries extract SymbolInfo records — name, kind (function, struct, class, impl, etc.), visibility (pub/private), and line number — directly from the AST. This replaces the previous heuristic regex approach and adds accurate multi-language support.
The repo map is injected unconditionally for all providers (Claude, OpenAI, Ollama, and others). Qdrant semantic retrieval remains provider-dependent and only runs when embeddings are available.
Example output:

```text
<repo_map>
src/agent.rs :: pub struct Agent (line 12), pub fn new (line 45), pub fn run (line 78), fn prepare_context (line 110)
src/config.rs :: pub struct Config (line 5), pub fn load (line 30)
src/main.rs :: pub fn main (line 1), fn setup_logging (line 15)
... and 12 more files
</repo_map>
```
The map is budget-constrained (default: 1024 tokens) and sorted by symbol count (files with more symbols appear first). It gives the model a structural overview of the project without consuming significant context.
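The budget-and-sort behaviour can be sketched as follows. This is an illustrative sketch: `build_repo_map`, the 4-chars-per-token estimate, and the line format are assumptions, not the crate's implementation.

```rust
/// Repo-map assembly sketch: pre-rendered file lines sorted by symbol
/// count (descending), appended until a rough token budget is spent,
/// with the remainder summarized as "... and N more files".
fn build_repo_map(mut files: Vec<(String, usize)>, budget_tokens: usize) -> String {
    // (rendered_line, symbol_count); ~4 chars per token as a rough estimate.
    files.sort_by(|a, b| b.1.cmp(&a.1));
    let mut out = String::from("<repo_map>\n");
    let mut used = 0;
    let mut skipped = 0;
    for (line, _) in &files {
        let cost = line.len() / 4 + 1;
        if used + cost <= budget_tokens {
            used += cost;
            out.push_str(line);
            out.push('\n');
        } else {
            skipped += 1;
        }
    }
    if skipped > 0 {
        out.push_str(&format!("... and {skipped} more files\n"));
    }
    out.push_str("</repo_map>");
    out
}
```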
## LSP Hover Pre-filter
When the lsp-context feature is enabled, zeph-index pre-filters hover requests before forwarding them to the language server. Previously this filter used a Rust-only regex; it now uses tree-sitter to identify the symbol under the cursor for all supported languages (Rust, Python, JavaScript, TypeScript, Go).
The tree-sitter hover pre-filter:
- Parses the file with the appropriate grammar.
- Finds the AST node at the cursor position.
- Walks up the tree to the nearest named symbol (identifier, field expression, call expression, etc.).
- Passes the resolved symbol to the MCP LSP server for a hover lookup.
This makes hover-based context injection accurate across all indexed languages, not just Rust.
## Budget-Aware Retrieval
Retrieved chunks are packed into a token budget (default: 40% of available context for code). Chunks are sorted by similarity score and greedily packed until the budget is exhausted. A minimum score threshold (default: 0.25) filters low-relevance results.
Retrieved code is injected as a transient `<code_context>` XML block before the conversation history. It is regenerated on every turn and never persisted.
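The packing step can be sketched as follows. `pack_chunks` and the `(score, token_cost)` representation are illustrative assumptions; the real retriever packs full chunk records.

```rust
/// Budget-aware packing sketch: sort hits by similarity (best first),
/// drop anything below the score threshold, then greedily take chunks
/// until the next one would overflow the token budget.
fn pack_chunks(
    mut chunks: Vec<(f32, usize)>, // (similarity score, token cost)
    budget: usize,
    threshold: f32,
) -> Vec<(f32, usize)> {
    chunks.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap());
    let mut used = 0;
    chunks
        .into_iter()
        .filter(|&(score, _)| score >= threshold)
        .take_while(|&(_, cost)| {
            if used + cost <= budget {
                used += cost;
                true
            } else {
                false
            }
        })
        .collect()
}
```

With the defaults (`score_threshold = 0.25`), a 0.2-scoring chunk is dropped regardless of budget.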
## Context Window Layout (with Code RAG)
When code indexing is enabled, the context window includes two additional sections:
```text
+---------------------------------------------------+
| System prompt + environment + ZEPH.md             |
+---------------------------------------------------+
| <repo_map> (structural overview, cached)          |  <= 1024 tokens
+---------------------------------------------------+
| <available_skills>                                |
+---------------------------------------------------+
| <code_context> (per-query RAG chunks, transient)  |  <= 40% available
+---------------------------------------------------+
| [semantic recall] past messages                   |  <= 10% available
+---------------------------------------------------+
| Recent message history                            |  <= 50% available
+---------------------------------------------------+
| [response reserve]                                |  20% of total
+---------------------------------------------------+
```
## Configuration

```toml
[index]
# Enable codebase indexing for semantic code search.
# Requires Qdrant running (uses separate collection "zeph_code_chunks").
enabled = false

# Auto-index on startup and re-index changed files during session.
auto_index = true

# Directories to index (relative to cwd).
paths = ["."]

# Patterns to exclude (in addition to .gitignore).
exclude = ["target", "node_modules", ".git", "vendor", "dist", "build", "__pycache__"]

# Token budget for repo map in system prompt (0 = no repo map).
repo_map_budget = 1024

# Cache TTL for repo map in seconds (avoids per-message regeneration).
repo_map_ttl_secs = 300

[index.chunker]
# Target chunk size in non-whitespace characters (~300-400 tokens).
target_size = 600
# Maximum chunk size before forced split.
max_size = 1200
# Minimum chunk size; smaller chunks merge with siblings.
min_size = 100

[index.retrieval]
# Maximum chunks to fetch from Qdrant (before budget packing).
max_chunks = 12
# Minimum cosine similarity score to accept.
score_threshold = 0.25
# Maximum fraction of available context budget for code chunks.
budget_ratio = 0.40
```
## Supported Languages
All tree-sitter grammars are compiled into every build. Language sub-features on zeph-index (lang-rust, lang-python, lang-js, lang-go, lang-config) are all enabled by default and cannot be individually disabled in the standard build.
| Language | Feature | Extensions |
|---|---|---|
| Rust | lang-rust | .rs |
| Python | lang-python | .py, .pyi |
| JavaScript | lang-js | .js, .jsx, .mjs, .cjs |
| TypeScript | lang-js | .ts, .tsx, .mts, .cts |
| Go | lang-go | .go |
| Bash | lang-config | .sh, .bash, .zsh |
| TOML | lang-config | .toml |
| JSON | lang-config | .json, .jsonc |
| Markdown | lang-config | .md, .markdown |
## Environment Variables
| Variable | Description | Default |
|---|---|---|
| `ZEPH_INDEX_ENABLED` | Enable code indexing | `false` |
| `ZEPH_INDEX_AUTO_INDEX` | Auto-index on startup | `true` |
| `ZEPH_INDEX_REPO_MAP_BUDGET` | Token budget for repo map | `1024` |
| `ZEPH_INDEX_REPO_MAP_TTL_SECS` | Cache TTL for repo map in seconds | `300` |
## Code Index as MCP Tools
When index.mcp_enabled = true, the code index is exposed as an in-process MCP server (IndexMcpServer) that registers four navigation tools directly into the tool executor pipeline. No JSON-RPC transport is involved — the tools run in-process alongside external MCP servers.
### Exposed Tools
| Tool | Input | Description |
|---|---|---|
| `symbol_definition` | `name: String` | Returns file path and line number for all definitions of a symbol (function, struct, enum, trait, etc.) found via tree-sitter AST |
| `find_text_references` | `name: String` | Textual search for references to a symbol across all indexed files; may include false positives from comments and strings |
| `call_graph` | `fn_name: String` | Returns a heuristic call graph rooted at the given function, derived from child symbol relationships in the AST |
| `module_summary` | `path: String` | Lists all symbols (name, kind, visibility, line number) defined in a given source file |
### How This Differs from Repo Map Injection
The repo map (repo_map_budget) is a static overview injected once per system prompt. It lists symbol names and locations but does not answer specific queries. The MCP tools are dynamic: the LLM calls them on demand to answer precise navigation questions, similar to IDE “go to definition” or “find references”. This is more token-efficient for targeted lookups and avoids injecting an entire structural overview when only one symbol matters.
| Capability | Repo Map | MCP Tools |
|---|---|---|
| Always present in context | Yes | No (on-demand) |
| Find definition of one symbol | No | Yes (symbol_definition) |
| List all symbols in a file | No | Yes (module_summary) |
| Find all usages of a symbol | No | Yes (find_text_references) |
| Call chain from a function | No | Yes (call_graph) |
### Configuration

```toml
[index]
enabled = true
mcp_enabled = true  # expose index as MCP tools
```
mcp_enabled defaults to false. Enabling it does not require Qdrant — the tool index is built directly from tree-sitter AST parsing and held in memory.
### When to Use
Enable mcp_enabled for IDE-like workflows where the LLM needs to navigate the codebase interactively: tracing a call chain, checking where a struct is defined, or listing all symbols in a module. For large codebases where a full repo map would exceed the context budget, MCP tools provide targeted lookups without the token overhead.
The two mechanisms complement each other: repo map gives the model a high-level structural overview, and MCP tools let it drill into specific locations on demand.
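A `symbol_definition`-style lookup can be sketched over an in-memory symbol table built at index time. The `SymbolIndex` struct and its layout are illustrative assumptions, not the crate's actual `IndexMcpServer` types.

```rust
use std::collections::HashMap;

/// In-memory lookup sketch backing a `symbol_definition`-style tool:
/// a map from symbol name to its (file, line) definition sites,
/// populated from tree-sitter AST parsing at index time.
struct SymbolIndex {
    defs: HashMap<String, Vec<(String, u32)>>,
}

impl SymbolIndex {
    /// Return every known definition site for `name`; a symbol may be
    /// defined in multiple files (e.g. trait impls), hence the Vec.
    fn symbol_definition(&self, name: &str) -> Vec<(String, u32)> {
        self.defs.get(name).cloned().unwrap_or_default()
    }
}
```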
## Embedding Model Recommendations
The indexer uses the same LlmProvider.embed() as semantic memory. Any embedding model works. For code-heavy workloads:
| Model | Dims | Notes |
|---|---|---|
| `qwen3-embedding` | 1024 | Current Zeph default, good general performance |
| `nomic-embed-text` | 768 | Lightweight universal model |
| `nomic-embed-code` | 768 | Optimized for code, higher RAM (~7.5 GB) |