
Code Indexing

AST-based code indexing and semantic retrieval for project-aware context. The zeph-index crate parses source files via tree-sitter, chunks them by AST structure, embeds the chunks in Qdrant, and retrieves relevant code via hybrid search (semantic + grep routing) for injection into the agent context window.

zeph-index is always-on: the crate is compiled into every build, with no Cargo feature flag. Enable indexing at runtime via [index] enabled = true in config.

Why Code RAG

Cloud models with 200K token windows can afford multi-round agentic grep. Local models with 8K-32K windows cannot: a single grep cycle costs ~2K tokens (25% of an 8K budget), while 5 rounds would exceed the entire context. RAG retrieves 6-8 relevant chunks in ~3K tokens, preserving budget for history and response.

For cloud models, code RAG serves as pre-fill context alongside agentic search. For local models, it is the primary code retrieval mechanism.

Setup

  1. Start Qdrant (required for vector storage):

    docker compose up -d qdrant
    
  2. Enable indexing in config:

    [index]
    enabled = true
    
  3. Index your project:

    zeph index
    

    Or let auto-indexing handle it on startup when auto_index = true (default).

Architecture

The zeph-index crate contains 7 modules:

Module     Purpose
---------  ---------------------------------------------------------------------------------------------
languages  Language detection from file extensions, tree-sitter grammar registry
chunker    AST-based chunking with greedy sibling merge (cAST-inspired algorithm)
context    Contextualized embedding text generation (file path + scope + imports + code)
store      Dual-write storage: Qdrant vectors + SQLite chunk metadata
indexer    Orchestrator: walk project tree, chunk files, embed, store with incremental change detection
retriever  Query classification, semantic search, budget-aware chunk packing
repo_map   Compact structural map of the project (signatures only, no function bodies)

Pipeline

Source files
    |
    v
[languages.rs] detect language, load grammar
    |
    v
[chunker.rs] parse AST, split into chunks (target: ~600 non-ws chars)
    |
    v
[context.rs] prepend file path, scope chain, imports, language tag
    |
    v
[indexer.rs] embed via LlmProvider, skip unchanged (content hash)
    |
    v
[store.rs] upsert to Qdrant (vectors) + SQLite (metadata)

Retrieval

User query
    |
    v
[retriever.rs] classify_query()
    |
    +--> Semantic  --> embed query --> Qdrant search --> budget pack --> inject
    |
    +--> Grep      --> return empty (agent uses bash tools)
    |
    +--> Hybrid    --> semantic search + hint to agent

Query Classification

The retriever classifies each query to route it to the appropriate search strategy:

Strategy  Trigger                                                     Action
--------  ----------------------------------------------------------  ----------------------------------------------
Grep      Exact symbols: ::, "fn ", "struct ", CamelCase, snake_case  Agent handles via shell grep/ripgrep
Semantic  Conceptual queries: "how", "where", "why", "explain"        Vector similarity search in Qdrant
Hybrid    Both symbol patterns and conceptual words                   Semantic search + hint that grep may also help

Default (no pattern match): Semantic.
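The routing heuristic above can be sketched roughly as follows. This is an illustrative reimplementation, not the actual classify_query() in retriever.rs, whose pattern matching may differ in detail:

```rust
// Sketch of the query-routing heuristic. All function names besides the
// documented classify_query() are illustrative.
#[derive(Debug, PartialEq)]
enum Strategy {
    Grep,
    Semantic,
    Hybrid,
}

// CamelCase: leading uppercase plus at least one more uppercase and a lowercase.
fn is_camel_case(w: &str) -> bool {
    let mut cs = w.chars();
    matches!(cs.next(), Some(c) if c.is_ascii_uppercase())
        && cs.clone().any(|c| c.is_ascii_lowercase())
        && cs.any(|c| c.is_ascii_uppercase())
}

fn is_snake_case(w: &str) -> bool {
    w.contains('_') && w.chars().all(|c| c.is_ascii_alphanumeric() || c == '_')
}

fn has_symbol_pattern(q: &str) -> bool {
    q.contains("::")
        || q.contains("fn ")
        || q.contains("struct ")
        || q.split_whitespace().any(|w| is_snake_case(w) || is_camel_case(w))
}

fn has_conceptual_words(q: &str) -> bool {
    let lower = q.to_lowercase();
    ["how", "where", "why", "explain"]
        .iter()
        .any(|kw| lower.split_whitespace().any(|t| t == *kw))
}

fn classify_query(q: &str) -> Strategy {
    match (has_symbol_pattern(q), has_conceptual_words(q)) {
        (true, true) => Strategy::Hybrid,
        (true, false) => Strategy::Grep,
        // Conceptual-only queries and queries matching nothing both go semantic.
        _ => Strategy::Semantic,
    }
}

fn main() {
    assert_eq!(classify_query("fn prepare_context"), Strategy::Grep);
    assert_eq!(classify_query("how does retrieval work"), Strategy::Semantic);
    assert_eq!(classify_query("where is MyStruct defined"), Strategy::Hybrid);
}
```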

AST-Based Chunking

Files are parsed via tree-sitter into AST, then chunked by entity boundaries (functions, structs, classes, impl blocks). The algorithm uses greedy sibling merge:

  • Target size: 600 non-whitespace characters (~300-400 tokens)
  • Max size: 1200 non-ws chars (forced recursive split)
  • Min size: 100 non-ws chars (merge with adjacent sibling)
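The size rules can be sketched as a greedy merge over sibling sizes. The real chunker walks AST sibling nodes; this toy version operates on bare character counts, and assumes nodes above the max size were already split recursively:

```rust
// Illustrative greedy sibling merge over chunk sizes (non-whitespace chars).
// Mirrors the documented defaults; not the crate's actual implementation.
const TARGET: usize = 600;
const MIN: usize = 100;

fn merge_siblings(sizes: &[usize]) -> Vec<usize> {
    let mut out = Vec::new();
    let mut acc = 0;
    for &s in sizes {
        // Emit the current group once adding the next sibling would overshoot.
        if acc > 0 && acc + s > TARGET {
            out.push(acc);
            acc = 0;
        }
        acc += s;
    }
    if acc > 0 {
        out.push(acc);
    }
    // An undersized trailing group merges back into its left neighbor.
    if out.len() >= 2 && *out.last().unwrap() < MIN {
        let tail = out.pop().unwrap();
        *out.last_mut().unwrap() += tail;
    }
    out
}

fn main() {
    assert_eq!(merge_siblings(&[250, 250, 250, 50]), vec![500, 300]);
    assert_eq!(merge_siblings(&[700, 50]), vec![750]);
}
```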

Config files (TOML, JSON, Markdown, Bash) are indexed as single file-level chunks since they lack named entities.

Each chunk carries rich metadata: file path, language, AST node type, entity name, line range, scope chain (e.g. MyStruct > impl MyStruct > my_method), imports, and a BLAKE3 content hash for change detection.

Contextualized Embeddings

Embedding raw code alone yields poor retrieval quality for conceptual queries. Before embedding, each chunk is prepended with:

  • File path (# src/agent.rs)
  • Scope chain (# Scope: Agent > prepare_context)
  • Language tag (# Language: rust)
  • First 5 import/use statements

This contextualized form improves retrieval for queries like “where is auth handled?” where the code alone might not contain the word “auth”.
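The contextualization step amounts to prepending metadata header lines to the raw code. A minimal sketch, with field and function names that are illustrative rather than the crate's actual API:

```rust
// Hedged sketch of contextualized embedding text generation.
struct Chunk {
    file_path: String,
    scope: Vec<String>,
    language: String,
    imports: Vec<String>,
    code: String,
}

fn embedding_text(c: &Chunk) -> String {
    let mut out = format!("# {}\n", c.file_path);
    if !c.scope.is_empty() {
        out.push_str(&format!("# Scope: {}\n", c.scope.join(" > ")));
    }
    out.push_str(&format!("# Language: {}\n", c.language));
    // Only the first 5 import/use statements are included.
    for imp in c.imports.iter().take(5) {
        out.push_str(imp);
        out.push('\n');
    }
    out.push_str(&c.code);
    out
}

fn main() {
    let c = Chunk {
        file_path: "src/agent.rs".into(),
        scope: vec!["Agent".into(), "prepare_context".into()],
        language: "rust".into(),
        imports: vec!["use std::fmt;".into()],
        code: "fn prepare_context() {}".into(),
    };
    let t = embedding_text(&c);
    assert!(t.starts_with("# src/agent.rs\n# Scope: Agent > prepare_context\n# Language: rust\n"));
    assert!(t.ends_with("fn prepare_context() {}"));
}
```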

Storage

Chunks are dual-written to two stores:

Store                      Data                                                      Purpose
-------------------------  --------------------------------------------------------  ------------------------------------------
Qdrant (zeph_code_chunks)  Embedding vectors + payload (code, metadata)              Semantic similarity search
SQLite (chunk_metadata)    File path, content hash, line range, language, node type  Change detection, cleanup of deleted files

The Qdrant collection uses INT8 scalar quantization for ~4x memory reduction with minimal accuracy loss. Payload indexes on language, file_path, and node_type enable filtered search.

Incremental Indexing

On subsequent runs, the indexer skips unchanged chunks by checking BLAKE3 content hashes in SQLite. Only modified or new files are re-embedded. Deleted files are detected by comparing the current file set against the SQLite index, and their chunks are removed from both stores.
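The skip logic reduces to comparing stored and freshly computed content hashes. In this dependency-free sketch, std's DefaultHasher stands in for BLAKE3, and the in-memory map stands in for the SQLite table; both substitutions are assumptions of the example:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// DefaultHasher stands in for BLAKE3 here; the crate persists real BLAKE3
// hashes in SQLite.
fn content_hash(content: &str) -> u64 {
    let mut h = DefaultHasher::new();
    content.hash(&mut h);
    h.finish()
}

/// Returns paths that need re-embedding and updates the stored hash map.
/// Paths present in `stored` but absent from `current` would be the deleted
/// files whose chunks get removed from both stores.
fn changed_files<'a>(
    stored: &mut HashMap<String, u64>,
    current: &[(&'a str, &str)],
) -> Vec<&'a str> {
    let mut dirty = Vec::new();
    for &(path, content) in current {
        let h = content_hash(content);
        if stored.get(path) != Some(&h) {
            stored.insert(path.to_string(), h);
            dirty.push(path);
        }
    }
    dirty
}

fn main() {
    let mut stored = HashMap::new();
    let files = [("a.rs", "fn a() {}"), ("b.rs", "fn b() {}")];
    assert_eq!(changed_files(&mut stored, &files), vec!["a.rs", "b.rs"]);
    // Second pass with identical content: nothing to re-embed.
    assert!(changed_files(&mut stored, &files).is_empty());
}
```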

File Watcher

When watch = true (default), an IndexWatcher monitors project files for changes during the session. On file modification, the changed file is automatically re-indexed via reindex_file() without rebuilding the entire index. The watcher uses 1-second debounce to batch rapid changes and only processes files with indexable extensions.
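The batching behavior can be modeled as a debounce buffer that flushes only after one second of quiet. This is a toy model; the real IndexWatcher is driven by filesystem events, and all names here are illustrative:

```rust
use std::collections::BTreeSet;
use std::time::{Duration, Instant};

// Toy debounce buffer mirroring the watcher's 1-second batching.
struct Debouncer {
    pending: BTreeSet<String>,
    last_event: Option<Instant>,
    window: Duration,
}

impl Debouncer {
    fn new(window: Duration) -> Self {
        Self { pending: BTreeSet::new(), last_event: None, window }
    }

    /// Record a file-change event; restarts the quiet window.
    fn record(&mut self, path: &str, now: Instant) {
        self.pending.insert(path.to_string());
        self.last_event = Some(now);
    }

    /// Once the quiet window has elapsed, return the batch to re-index.
    fn flush(&mut self, now: Instant) -> Option<Vec<String>> {
        match self.last_event {
            Some(t) if now.duration_since(t) >= self.window => {
                self.last_event = None;
                let batch: Vec<String> = self.pending.iter().cloned().collect();
                self.pending.clear();
                Some(batch)
            }
            _ => None,
        }
    }
}

fn main() {
    let t0 = Instant::now();
    let mut d = Debouncer::new(Duration::from_secs(1));
    d.record("src/agent.rs", t0);
    d.record("src/agent.rs", t0 + Duration::from_millis(300)); // rapid re-save, batched
    assert!(d.flush(t0 + Duration::from_millis(500)).is_none()); // still inside window
    let batch = d.flush(t0 + Duration::from_secs(2)).unwrap();
    assert_eq!(batch, vec!["src/agent.rs".to_string()]);
}
```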

Disable with:

[index]
watch = false

Repo Map

A lightweight structural map of the project generated via tree-sitter queries. Included in the system prompt and cached with a configurable TTL (default: 5 minutes) to avoid per-message filesystem traversal.

For each supported language, tree-sitter queries extract SymbolInfo records — name, kind (function, struct, class, impl, etc.), visibility (pub/private), and line number — directly from the AST. This replaces the previous heuristic regex approach and adds accurate multi-language support.

The repo map is injected unconditionally for all providers (Claude, OpenAI, Ollama, and others). Qdrant semantic retrieval remains provider-dependent and only runs when embeddings are available.

Example output:

<repo_map>
  src/agent.rs :: pub struct Agent (line 12), pub fn new (line 45), pub fn run (line 78), fn prepare_context (line 110)
  src/config.rs :: pub struct Config (line 5), pub fn load (line 30)
  src/main.rs :: pub fn main (line 1), fn setup_logging (line 15)
  ... and 12 more files
</repo_map>

The map is budget-constrained (default: 1024 tokens) and sorted by symbol count (files with more symbols appear first). It gives the model a structural overview of the project without consuming significant context.

LSP Hover Pre-filter

When the lsp-context feature is enabled, zeph-index pre-filters hover requests before forwarding them to the language server. Previously this filter used a Rust-only regex; it now uses tree-sitter to identify the symbol under the cursor for all supported languages (Rust, Python, JavaScript, TypeScript, Go).

The tree-sitter hover pre-filter:

  1. Parses the file with the appropriate grammar.
  2. Finds the AST node at the cursor position.
  3. Walks up the tree to the nearest named symbol (identifier, field expression, call expression, etc.).
  4. Passes the resolved symbol to the MCP LSP server for a hover lookup.

This makes hover-based context injection accurate across all indexed languages, not just Rust.

Budget-Aware Retrieval

Retrieved chunks are packed into a token budget (default: 40% of available context for code). Chunks are sorted by similarity score and greedily packed until the budget is exhausted. A minimum score threshold (default: 0.25) filters low-relevance results.
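Greedy packing under a score threshold can be sketched as below. Types and token counts are illustrative; the threshold default mirrors the text:

```rust
// Sketch of budget-aware chunk packing: filter by similarity threshold, sort
// by score descending, pack greedily until the token budget is exhausted.
struct Scored {
    score: f32,
    tokens: usize,
    text: String,
}

fn pack(mut chunks: Vec<Scored>, budget_tokens: usize, threshold: f32) -> Vec<String> {
    chunks.retain(|c| c.score >= threshold);
    chunks.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap());
    let mut used = 0;
    let mut out = Vec::new();
    for c in chunks {
        if used + c.tokens > budget_tokens {
            continue; // too big for what's left; a smaller chunk may still fit
        }
        used += c.tokens;
        out.push(c.text);
    }
    out
}

fn main() {
    let chunks = vec![
        Scored { score: 0.9, tokens: 500, text: "a".into() },
        Scored { score: 0.5, tokens: 800, text: "b".into() },
        Scored { score: 0.3, tokens: 200, text: "c".into() },
        Scored { score: 0.1, tokens: 100, text: "d".into() }, // below threshold
    ];
    assert_eq!(pack(chunks, 1000, 0.25), vec!["a", "c"]);
}
```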

Retrieved code is injected as a transient <code_context> XML block before the conversation history. It is re-generated on every turn and never persisted.

Context Window Layout (with Code RAG)

When code indexing is enabled, the context window includes two additional sections:

+---------------------------------------------------+
| System prompt + environment + ZEPH.md             |
+---------------------------------------------------+
| <repo_map> (structural overview, cached)          |  <= 1024 tokens
+---------------------------------------------------+
| <available_skills>                                |
+---------------------------------------------------+
| <code_context> (per-query RAG chunks, transient)  |  <= 40% available
+---------------------------------------------------+
| [semantic recall] past messages                   |  <= 10% available
+---------------------------------------------------+
| Recent message history                            |  <= 50% available
+---------------------------------------------------+
| [response reserve]                                |  20% of total
+---------------------------------------------------+

Configuration

[index]
# Enable codebase indexing for semantic code search.
# Requires Qdrant running (uses separate collection "zeph_code_chunks").
enabled = false

# Auto-index on startup and re-index changed files during session.
auto_index = true

# Directories to index (relative to cwd).
paths = ["."]

# Patterns to exclude (in addition to .gitignore).
exclude = ["target", "node_modules", ".git", "vendor", "dist", "build", "__pycache__"]

# Token budget for repo map in system prompt (0 = no repo map).
repo_map_budget = 1024

# Cache TTL for repo map in seconds (avoids per-message regeneration).
repo_map_ttl_secs = 300

[index.chunker]
# Target chunk size in non-whitespace characters (~300-400 tokens).
target_size = 600
# Maximum chunk size before forced split.
max_size = 1200
# Minimum chunk size — smaller chunks merge with siblings.
min_size = 100

[index.retrieval]
# Maximum chunks to fetch from Qdrant (before budget packing).
max_chunks = 12
# Minimum cosine similarity score to accept.
score_threshold = 0.25
# Maximum fraction of available context budget for code chunks.
budget_ratio = 0.40

Supported Languages

All tree-sitter grammars are compiled into every build. Language sub-features on zeph-index (lang-rust, lang-python, lang-js, lang-go, lang-config) are all enabled by default and cannot be individually disabled in the standard build.

Language    Feature      Extensions
----------  -----------  ---------------------
Rust        lang-rust    .rs
Python      lang-python  .py, .pyi
JavaScript  lang-js      .js, .jsx, .mjs, .cjs
TypeScript  lang-js      .ts, .tsx, .mts, .cts
Go          lang-go      .go
Bash        lang-config  .sh, .bash, .zsh
TOML        lang-config  .toml
JSON        lang-config  .json, .jsonc
Markdown    lang-config  .md, .markdown

Environment Variables

Variable                      Description                        Default
----------------------------  ---------------------------------  -------
ZEPH_INDEX_ENABLED            Enable code indexing               false
ZEPH_INDEX_AUTO_INDEX         Auto-index on startup              true
ZEPH_INDEX_REPO_MAP_BUDGET    Token budget for repo map          1024
ZEPH_INDEX_REPO_MAP_TTL_SECS  Cache TTL for repo map (seconds)   300

Code Index as MCP Tools

When index.mcp_enabled = true, the code index is exposed as an in-process MCP server (IndexMcpServer) that registers four navigation tools directly into the tool executor pipeline. No JSON-RPC transport is involved — the tools run in-process alongside external MCP servers.

Exposed Tools

Tool                  Input            Description
--------------------  ---------------  ------------------------------------------------------------------------
symbol_definition     name: String     File path and line number for all definitions of a symbol (function,
                                       struct, enum, trait, etc.) found via tree-sitter AST
find_text_references  name: String     Textual search for references to a symbol across all indexed files; may
                                       include false positives from comments and strings
call_graph            fn_name: String  Heuristic call graph rooted at the given function, derived from child
                                       symbol relationships in the AST
module_summary        path: String     All symbols (name, kind, visibility, line number) defined in a given
                                       source file

How This Differs from Repo Map Injection

The repo map (repo_map_budget) is a static overview injected once per system prompt. It lists symbol names and locations but does not answer specific queries. The MCP tools are dynamic: the LLM calls them on demand to answer precise navigation questions, similar to IDE “go to definition” or “find references”. This is more token-efficient for targeted lookups and avoids injecting an entire structural overview when only one symbol matters.

Capability                     Repo Map  MCP Tools
-----------------------------  --------  -------------------------
Always present in context      Yes       No (on-demand)
Find definition of one symbol  No        Yes (symbol_definition)
List all symbols in a file     No        Yes (module_summary)
Find all usages of a symbol    No        Yes (find_text_references)
Call chain from a function     No        Yes (call_graph)

Configuration

[index]
enabled     = true
mcp_enabled = true   # expose index as MCP tools

mcp_enabled defaults to false. Enabling it does not require Qdrant — the tool index is built directly from tree-sitter AST parsing and held in memory.

When to Use

Enable mcp_enabled for IDE-like workflows where the LLM needs to navigate the codebase interactively: tracing a call chain, checking where a struct is defined, or listing all symbols in a module. For large codebases where a full repo map would exceed the context budget, MCP tools provide targeted lookups without the token overhead.

The two mechanisms complement each other: repo map gives the model a high-level structural overview, and MCP tools let it drill into specific locations on demand.

Embedding Model Recommendations

The indexer uses the same LlmProvider.embed() as semantic memory. Any embedding model works. For code-heavy workloads:

Model             Dims  Notes
----------------  ----  ----------------------------------------------
qwen3-embedding   1024  Current Zeph default, good general performance
nomic-embed-text  768   Lightweight universal model
nomic-embed-code  768   Optimized for code, higher RAM (~7.5GB)