
Code Indexing

AST-based code indexing and semantic retrieval for project-aware context. The zeph-index crate parses source files via tree-sitter, chunks them by AST structure, embeds the chunks in Qdrant, and retrieves relevant code via hybrid search (semantic + grep routing) for injection into the agent context window.

zeph-index is always-on: the crate is compiled into every build, with no Cargo feature flag. Enable indexing at runtime via [index] enabled = true in config.

Why Code RAG

Cloud models with 200K token windows can afford multi-round agentic grep. Local models with 8K-32K windows cannot: a single grep cycle costs ~2K tokens (25% of an 8K budget), while 5 rounds would exceed the entire context. RAG retrieves 6-8 relevant chunks in ~3K tokens, preserving budget for history and response.

For cloud models, code RAG serves as pre-fill context alongside agentic search. For local models, it is the primary code retrieval mechanism.

Setup

  1. Start Qdrant (required for vector storage):

    docker compose up -d qdrant
    
  2. Enable indexing in config:

    [index]
    enabled = true
    
  3. Index your project:

    zeph index
    

    Or let auto-indexing handle it on startup when auto_index = true (default).

Architecture

The zeph-index crate contains 7 modules:

Module     Purpose
---------  ---------------------------------------------------------------------------------------------
languages  Language detection from file extensions, tree-sitter grammar registry
chunker    AST-based chunking with greedy sibling merge (cAST-inspired algorithm)
context    Contextualized embedding text generation (file path + scope + imports + code)
store      Dual-write storage: Qdrant vectors + SQLite chunk metadata
indexer    Orchestrator: walk project tree, chunk files, embed, store with incremental change detection
retriever  Query classification, semantic search, budget-aware chunk packing
repo_map   Compact structural map of the project (signatures only, no function bodies)

Pipeline

Source files
    |
    v
[languages.rs] detect language, load grammar
    |
    v
[chunker.rs] parse AST, split into chunks (target: ~600 non-ws chars)
    |
    v
[context.rs] prepend file path, scope chain, imports, language tag
    |
    v
[indexer.rs] embed via LlmProvider, skip unchanged (content hash)
    |
    v
[store.rs] upsert to Qdrant (vectors) + SQLite (metadata)

Retrieval

User query
    |
    v
[retriever.rs] classify_query()
    |
    +--> Semantic  --> embed query --> Qdrant search --> budget pack --> inject
    |
    +--> Grep      --> return empty (agent uses bash tools)
    |
    +--> Hybrid    --> semantic search + hint to agent

Query Classification

The retriever classifies each query to route it to the appropriate search strategy:

Strategy  Trigger                                                     Action
--------  ----------------------------------------------------------  ----------------------------------------------
Grep      Exact symbols: ::, "fn ", "struct ", CamelCase, snake_case  Agent handles via shell grep/ripgrep
Semantic  Conceptual queries: "how", "where", "why", "explain"        Vector similarity search in Qdrant
Hybrid    Both symbol patterns and conceptual words                   Semantic search + hint that grep may also help

Default (no pattern match): Semantic.
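The routing heuristic above can be sketched roughly as follows. This is an illustrative reimplementation, not the actual classify_query() in retriever.rs, whose pattern matching may differ in detail:

```rust
// Sketch of the query-routing heuristic. All function names besides the
// documented classify_query() are illustrative.
#[derive(Debug, PartialEq)]
enum Strategy {
    Grep,
    Semantic,
    Hybrid,
}

// CamelCase: leading uppercase plus at least one more uppercase and a lowercase.
fn is_camel_case(w: &str) -> bool {
    let mut cs = w.chars();
    matches!(cs.next(), Some(c) if c.is_ascii_uppercase())
        && cs.clone().any(|c| c.is_ascii_lowercase())
        && cs.any(|c| c.is_ascii_uppercase())
}

fn is_snake_case(w: &str) -> bool {
    w.contains('_') && w.chars().all(|c| c.is_ascii_alphanumeric() || c == '_')
}

fn has_symbol_pattern(q: &str) -> bool {
    q.contains("::")
        || q.contains("fn ")
        || q.contains("struct ")
        || q.split_whitespace().any(|w| is_snake_case(w) || is_camel_case(w))
}

fn has_conceptual_words(q: &str) -> bool {
    let lower = q.to_lowercase();
    ["how", "where", "why", "explain"]
        .iter()
        .any(|kw| lower.split_whitespace().any(|t| t == *kw))
}

fn classify_query(q: &str) -> Strategy {
    match (has_symbol_pattern(q), has_conceptual_words(q)) {
        (true, true) => Strategy::Hybrid,
        (true, false) => Strategy::Grep,
        // Conceptual-only queries and queries matching nothing both go semantic.
        _ => Strategy::Semantic,
    }
}

fn main() {
    assert_eq!(classify_query("fn prepare_context"), Strategy::Grep);
    assert_eq!(classify_query("how does retrieval work"), Strategy::Semantic);
    assert_eq!(classify_query("where is MyStruct defined"), Strategy::Hybrid);
}
```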

AST-Based Chunking

Files are parsed via tree-sitter into AST, then chunked by entity boundaries (functions, structs, classes, impl blocks). The algorithm uses greedy sibling merge:

  • Target size: 600 non-whitespace characters (~300-400 tokens)
  • Max size: 1200 non-ws chars (forced recursive split)
  • Min size: 100 non-ws chars (merge with adjacent sibling)
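The size rules can be sketched as a greedy merge over sibling sizes. The real chunker walks AST sibling nodes; this toy version operates on bare character counts, and assumes nodes above the max size were already split recursively:

```rust
// Illustrative greedy sibling merge over chunk sizes (non-whitespace chars).
// Mirrors the documented defaults; not the crate's actual implementation.
const TARGET: usize = 600;
const MIN: usize = 100;

fn merge_siblings(sizes: &[usize]) -> Vec<usize> {
    let mut out = Vec::new();
    let mut acc = 0;
    for &s in sizes {
        // Emit the current group once adding the next sibling would overshoot.
        if acc > 0 && acc + s > TARGET {
            out.push(acc);
            acc = 0;
        }
        acc += s;
    }
    if acc > 0 {
        out.push(acc);
    }
    // An undersized trailing group merges back into its left neighbor.
    if out.len() >= 2 && *out.last().unwrap() < MIN {
        let tail = out.pop().unwrap();
        *out.last_mut().unwrap() += tail;
    }
    out
}

fn main() {
    assert_eq!(merge_siblings(&[250, 250, 250, 50]), vec![500, 300]);
    assert_eq!(merge_siblings(&[700, 50]), vec![750]);
}
```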

Config files (TOML, JSON, Markdown, Bash) are indexed as single file-level chunks since they lack named entities.

Each chunk carries rich metadata: file path, language, AST node type, entity name, line range, scope chain (e.g. MyStruct > impl MyStruct > my_method), imports, and a BLAKE3 content hash for change detection.

Contextualized Embeddings

Embedding raw code alone yields poor retrieval quality for conceptual queries. Before embedding, each chunk is prepended with:

  • File path (# src/agent.rs)
  • Scope chain (# Scope: Agent > prepare_context)
  • Language tag (# Language: rust)
  • First 5 import/use statements

This contextualized form improves retrieval for queries like “where is auth handled?” where the code alone might not contain the word “auth”.
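The contextualization step amounts to prepending metadata header lines to the raw code. A minimal sketch, with field and function names that are illustrative rather than the crate's actual API:

```rust
// Hedged sketch of contextualized embedding text generation.
struct Chunk {
    file_path: String,
    scope: Vec<String>,
    language: String,
    imports: Vec<String>,
    code: String,
}

fn embedding_text(c: &Chunk) -> String {
    let mut out = format!("# {}\n", c.file_path);
    if !c.scope.is_empty() {
        out.push_str(&format!("# Scope: {}\n", c.scope.join(" > ")));
    }
    out.push_str(&format!("# Language: {}\n", c.language));
    // Only the first 5 import/use statements are included.
    for imp in c.imports.iter().take(5) {
        out.push_str(imp);
        out.push('\n');
    }
    out.push_str(&c.code);
    out
}

fn main() {
    let c = Chunk {
        file_path: "src/agent.rs".into(),
        scope: vec!["Agent".into(), "prepare_context".into()],
        language: "rust".into(),
        imports: vec!["use std::fmt;".into()],
        code: "fn prepare_context() {}".into(),
    };
    let t = embedding_text(&c);
    assert!(t.starts_with("# src/agent.rs\n# Scope: Agent > prepare_context\n# Language: rust\n"));
    assert!(t.ends_with("fn prepare_context() {}"));
}
```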

Storage

Chunks are dual-written to two stores:

Store                      Data                                                      Purpose
-------------------------  --------------------------------------------------------  ------------------------------------------
Qdrant (zeph_code_chunks)  Embedding vectors + payload (code, metadata)              Semantic similarity search
SQLite (chunk_metadata)    File path, content hash, line range, language, node type  Change detection, cleanup of deleted files

The Qdrant collection uses INT8 scalar quantization for ~4x memory reduction with minimal accuracy loss. Payload indexes on language, file_path, and node_type enable filtered search.

Incremental Indexing

On subsequent runs, the indexer skips unchanged chunks by checking BLAKE3 content hashes in SQLite. Only modified or new files are re-embedded. Deleted files are detected by comparing the current file set against the SQLite index, and their chunks are removed from both stores.
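The skip logic reduces to comparing stored and freshly computed content hashes. In this dependency-free sketch, std's DefaultHasher stands in for BLAKE3, and the in-memory map stands in for the SQLite table; both substitutions are assumptions of the example:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// DefaultHasher stands in for BLAKE3 here; the crate persists real BLAKE3
// hashes in SQLite.
fn content_hash(content: &str) -> u64 {
    let mut h = DefaultHasher::new();
    content.hash(&mut h);
    h.finish()
}

/// Returns paths that need re-embedding and updates the stored hash map.
/// Paths present in `stored` but absent from `current` would be the deleted
/// files whose chunks get removed from both stores.
fn changed_files<'a>(
    stored: &mut HashMap<String, u64>,
    current: &[(&'a str, &str)],
) -> Vec<&'a str> {
    let mut dirty = Vec::new();
    for &(path, content) in current {
        let h = content_hash(content);
        if stored.get(path) != Some(&h) {
            stored.insert(path.to_string(), h);
            dirty.push(path);
        }
    }
    dirty
}

fn main() {
    let mut stored = HashMap::new();
    let files = [("a.rs", "fn a() {}"), ("b.rs", "fn b() {}")];
    assert_eq!(changed_files(&mut stored, &files), vec!["a.rs", "b.rs"]);
    // Second pass with identical content: nothing to re-embed.
    assert!(changed_files(&mut stored, &files).is_empty());
}
```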

File Watcher

When watch = true (default), an IndexWatcher monitors project files for changes during the session. On file modification, the changed file is automatically re-indexed via reindex_file() without rebuilding the entire index. The watcher uses 1-second debounce to batch rapid changes and only processes files with indexable extensions.
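The batching behavior can be modeled as a debounce buffer that flushes only after one second of quiet. This is a toy model; the real IndexWatcher is driven by filesystem events, and all names here are illustrative:

```rust
use std::collections::BTreeSet;
use std::time::{Duration, Instant};

// Toy debounce buffer mirroring the watcher's 1-second batching.
struct Debouncer {
    pending: BTreeSet<String>,
    last_event: Option<Instant>,
    window: Duration,
}

impl Debouncer {
    fn new(window: Duration) -> Self {
        Self { pending: BTreeSet::new(), last_event: None, window }
    }

    /// Record a file-change event; restarts the quiet window.
    fn record(&mut self, path: &str, now: Instant) {
        self.pending.insert(path.to_string());
        self.last_event = Some(now);
    }

    /// Once the quiet window has elapsed, return the batch to re-index.
    fn flush(&mut self, now: Instant) -> Option<Vec<String>> {
        match self.last_event {
            Some(t) if now.duration_since(t) >= self.window => {
                self.last_event = None;
                let batch: Vec<String> = self.pending.iter().cloned().collect();
                self.pending.clear();
                Some(batch)
            }
            _ => None,
        }
    }
}

fn main() {
    let t0 = Instant::now();
    let mut d = Debouncer::new(Duration::from_secs(1));
    d.record("src/agent.rs", t0);
    d.record("src/agent.rs", t0 + Duration::from_millis(300)); // rapid re-save, batched
    assert!(d.flush(t0 + Duration::from_millis(500)).is_none()); // still inside window
    let batch = d.flush(t0 + Duration::from_secs(2)).unwrap();
    assert_eq!(batch, vec!["src/agent.rs".to_string()]);
}
```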

Disable with:

[index]
watch = false

Repo Map

A lightweight structural map of the project generated via tree-sitter queries. Included in the system prompt and cached with a configurable TTL (default: 5 minutes) to avoid per-message filesystem traversal.

For each supported language, tree-sitter queries extract SymbolInfo records — name, kind (function, struct, class, impl, etc.), visibility (pub/private), and line number — directly from the AST. This replaces the previous heuristic regex approach and adds accurate multi-language support.

The repo map is injected unconditionally for all providers (Claude, OpenAI, Ollama, and others). Qdrant semantic retrieval remains provider-dependent and only runs when embeddings are available.

Example output:

<repo_map>
  src/agent.rs :: pub struct Agent (line 12), pub fn new (line 45), pub fn run (line 78), fn prepare_context (line 110)
  src/config.rs :: pub struct Config (line 5), pub fn load (line 30)
  src/main.rs :: pub fn main (line 1), fn setup_logging (line 15)
  ... and 12 more files
</repo_map>

The map is budget-constrained (default: 1024 tokens) and sorted by symbol count (files with more symbols appear first). It gives the model a structural overview of the project without consuming significant context.

LSP Hover Pre-filter

When the lsp-context feature is enabled, zeph-index pre-filters hover requests before forwarding them to the language server. Previously this filter used a Rust-only regex; it now uses tree-sitter to identify the symbol under the cursor for all supported languages (Rust, Python, JavaScript, TypeScript, Go).

The tree-sitter hover pre-filter:

  1. Parses the file with the appropriate grammar.
  2. Finds the AST node at the cursor position.
  3. Walks up the tree to the nearest named symbol (identifier, field expression, call expression, etc.).
  4. Passes the resolved symbol to the MCP LSP server for a hover lookup.

This makes hover-based context injection accurate across all indexed languages, not just Rust.

Budget-Aware Retrieval

Retrieved chunks are packed into a token budget (default: 40% of available context for code). Chunks are sorted by similarity score and greedily packed until the budget is exhausted. A minimum score threshold (default: 0.25) filters low-relevance results.
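Greedy packing under a score threshold can be sketched as below. Types and token counts are illustrative; the threshold default mirrors the text:

```rust
// Sketch of budget-aware chunk packing: filter by similarity threshold, sort
// by score descending, pack greedily until the token budget is exhausted.
struct Scored {
    score: f32,
    tokens: usize,
    text: String,
}

fn pack(mut chunks: Vec<Scored>, budget_tokens: usize, threshold: f32) -> Vec<String> {
    chunks.retain(|c| c.score >= threshold);
    chunks.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap());
    let mut used = 0;
    let mut out = Vec::new();
    for c in chunks {
        if used + c.tokens > budget_tokens {
            continue; // too big for what's left; a smaller chunk may still fit
        }
        used += c.tokens;
        out.push(c.text);
    }
    out
}

fn main() {
    let chunks = vec![
        Scored { score: 0.9, tokens: 500, text: "a".into() },
        Scored { score: 0.5, tokens: 800, text: "b".into() },
        Scored { score: 0.3, tokens: 200, text: "c".into() },
        Scored { score: 0.1, tokens: 100, text: "d".into() }, // below threshold
    ];
    assert_eq!(pack(chunks, 1000, 0.25), vec!["a", "c"]);
}
```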

Retrieved code is injected as a transient <code_context> XML block before the conversation history. It is re-generated on every turn and never persisted.

Context Window Layout (with Code RAG)

When code indexing is enabled, the context window includes two additional sections:

+---------------------------------------------------+
| System prompt + environment + ZEPH.md             |
+---------------------------------------------------+
| <repo_map> (structural overview, cached)          |  <= 1024 tokens
+---------------------------------------------------+
| <available_skills>                                |
+---------------------------------------------------+
| <code_context> (per-query RAG chunks, transient)  |  <= 40% available
+---------------------------------------------------+
| [semantic recall] past messages                   |  <= 10% available
+---------------------------------------------------+
| Recent message history                            |  <= 50% available
+---------------------------------------------------+
| [response reserve]                                |  20% of total
+---------------------------------------------------+

Configuration

[index]
# Enable codebase indexing for semantic code search.
# Requires Qdrant running (uses separate collection "zeph_code_chunks").
enabled = false

# Auto-index on startup and re-index changed files during session.
auto_index = true

# Directories to index (relative to cwd).
paths = ["."]

# Patterns to exclude (in addition to .gitignore).
exclude = ["target", "node_modules", ".git", "vendor", "dist", "build", "__pycache__"]

# Token budget for repo map in system prompt (0 = no repo map).
repo_map_budget = 1024

# Cache TTL for repo map in seconds (avoids per-message regeneration).
repo_map_ttl_secs = 300

[index.chunker]
# Target chunk size in non-whitespace characters (~300-400 tokens).
target_size = 600
# Maximum chunk size before forced split.
max_size = 1200
# Minimum chunk size — smaller chunks merge with siblings.
min_size = 100

[index.retrieval]
# Maximum chunks to fetch from Qdrant (before budget packing).
max_chunks = 12
# Minimum cosine similarity score to accept.
score_threshold = 0.25
# Maximum fraction of available context budget for code chunks.
budget_ratio = 0.40

Supported Languages

All tree-sitter grammars are compiled into every build. Language sub-features on zeph-index (lang-rust, lang-python, lang-js, lang-go, lang-config) are all enabled by default and cannot be individually disabled in the standard build.

Language    Feature      Extensions
----------  -----------  ---------------------
Rust        lang-rust    .rs
Python      lang-python  .py, .pyi
JavaScript  lang-js      .js, .jsx, .mjs, .cjs
TypeScript  lang-js      .ts, .tsx, .mts, .cts
Go          lang-go      .go
Bash        lang-config  .sh, .bash, .zsh
TOML        lang-config  .toml
JSON        lang-config  .json, .jsonc
Markdown    lang-config  .md, .markdown

Environment Variables

Variable                      Description                        Default
----------------------------  ---------------------------------  -------
ZEPH_INDEX_ENABLED            Enable code indexing               false
ZEPH_INDEX_AUTO_INDEX         Auto-index on startup              true
ZEPH_INDEX_REPO_MAP_BUDGET    Token budget for repo map          1024
ZEPH_INDEX_REPO_MAP_TTL_SECS  Cache TTL for repo map (seconds)   300

Code Index as MCP Tools

When index.mcp_enabled = true, the code index is exposed as an in-process MCP server (IndexMcpServer) that registers four navigation tools directly into the tool executor pipeline. No JSON-RPC transport is involved — the tools run in-process alongside external MCP servers.

Exposed Tools

Tool                  Input            Description
--------------------  ---------------  ------------------------------------------------------------------------
symbol_definition     name: String     File path and line number for all definitions of a symbol (function,
                                       struct, enum, trait, etc.) found via tree-sitter AST
find_text_references  name: String     Textual search for references to a symbol across all indexed files; may
                                       include false positives from comments and strings
call_graph            fn_name: String  Heuristic call graph rooted at the given function, derived from child
                                       symbol relationships in the AST
module_summary        path: String     All symbols (name, kind, visibility, line number) defined in a given
                                       source file

How This Differs from Repo Map Injection

The repo map (repo_map_budget) is a static overview injected once per system prompt. It lists symbol names and locations but does not answer specific queries. The MCP tools are dynamic: the LLM calls them on demand to answer precise navigation questions, similar to IDE “go to definition” or “find references”. This is more token-efficient for targeted lookups and avoids injecting an entire structural overview when only one symbol matters.

Capability                     Repo Map  MCP Tools
-----------------------------  --------  -------------------------
Always present in context      Yes       No (on-demand)
Find definition of one symbol  No        Yes (symbol_definition)
List all symbols in a file     No        Yes (module_summary)
Find all usages of a symbol    No        Yes (find_text_references)
Call chain from a function     No        Yes (call_graph)

Configuration

[index]
enabled     = true
mcp_enabled = true   # expose index as MCP tools

mcp_enabled defaults to false. Enabling it does not require Qdrant — the tool index is built directly from tree-sitter AST parsing and held in memory.

When to Use

Enable mcp_enabled for IDE-like workflows where the LLM needs to navigate the codebase interactively: tracing a call chain, checking where a struct is defined, or listing all symbols in a module. For large codebases where a full repo map would exceed the context budget, MCP tools provide targeted lookups without the token overhead.

The two mechanisms complement each other: repo map gives the model a high-level structural overview, and MCP tools let it drill into specific locations on demand.

Embedding Model Recommendations

The indexer uses the same LlmProvider.embed() as semantic memory. Any embedding model works. For code-heavy workloads:

Model             Dims  Notes
----------------  ----  ----------------------------------------------
qwen3-embedding   1024  Current Zeph default, good general performance
nomic-embed-text  768   Lightweight universal model
nomic-embed-code  768   Optimized for code, higher RAM (~7.5GB)