Zeph
Lightweight AI agent with hybrid inference (Ollama / Claude / OpenAI / HuggingFace via candle), skills-first architecture, semantic memory with Qdrant, MCP client, A2A protocol support, multi-model orchestration, self-learning skill evolution, and multi-channel I/O.
Only relevant skills and MCP tools are injected into each prompt via vector similarity — keeping token usage minimal regardless of how many are installed.
Cross-platform: Linux, macOS, Windows (x86_64 + ARM64).
Key Features
- Hybrid inference — Ollama (local), Claude (Anthropic), OpenAI (GPT + compatible APIs), Candle (HuggingFace GGUF)
- Skills-first architecture — embedding-based skill matching selects only top-K relevant skills per query, not all
- Semantic memory — SQLite for structured data + Qdrant for vector similarity search
- MCP client — connect external tool servers via Model Context Protocol (stdio + HTTP transport)
- A2A protocol — agent-to-agent communication via JSON-RPC 2.0 with SSE streaming
- Model orchestrator — route tasks to different providers with automatic fallback chains
- Self-learning — skills evolve through failure detection, self-reflection, and LLM-generated improvements
- Code indexing — AST-based code RAG with tree-sitter, hybrid retrieval (semantic + grep routing), repo map
- Context engineering — proportional budget allocation, semantic recall injection, runtime compaction, smart tool output summarization, ZEPH.md project config
- Multi-channel I/O — CLI, Telegram, and TUI with streaming support
- Token-efficient — prompt size is O(K) not O(N), where K is max active skills and N is total installed
Quick Start
git clone https://github.com/bug-ops/zeph
cd zeph
cargo build --release
./target/release/zeph
See Installation for pre-built binaries and Docker options.
Requirements
- Rust 1.88+ (Edition 2024)
- Ollama (for local inference and embeddings) or cloud API key (Claude / OpenAI)
- Docker (optional, for Qdrant semantic memory and containerized deployment)
Installation
Install Zeph from source, pre-built binaries, or Docker.
From Source
git clone https://github.com/bug-ops/zeph
cd zeph
cargo build --release
The binary is produced at target/release/zeph.
Pre-built Binaries
Download from GitHub Releases:
| Platform | Architecture | Download |
|---|---|---|
| Linux | x86_64 | zeph-x86_64-unknown-linux-gnu.tar.gz |
| Linux | aarch64 | zeph-aarch64-unknown-linux-gnu.tar.gz |
| macOS | x86_64 | zeph-x86_64-apple-darwin.tar.gz |
| macOS | aarch64 | zeph-aarch64-apple-darwin.tar.gz |
| Windows | x86_64 | zeph-x86_64-pc-windows-msvc.zip |
Docker
Pull the latest image from GitHub Container Registry:
docker pull ghcr.io/bug-ops/zeph:latest
Or use a specific version:
docker pull ghcr.io/bug-ops/zeph:v0.9.5
Images are scanned with Trivy in CI/CD and use Oracle Linux 9 Slim base with 0 HIGH/CRITICAL CVEs. Multi-platform: linux/amd64, linux/arm64.
See Docker Deployment for full deployment options including GPU support and age vault.
Quick Start
Run Zeph after building and interact via CLI, Telegram, or a cloud provider.
CLI Mode (default)
Unix (Linux/macOS):
./target/release/zeph
Windows:
.\target\release\zeph.exe
Type messages at the You: prompt. Type exit, quit, or press Ctrl-D to stop.
Telegram Mode
Unix (Linux/macOS):
ZEPH_TELEGRAM_TOKEN="123:ABC" ./target/release/zeph
Windows:
$env:ZEPH_TELEGRAM_TOKEN="123:ABC"; .\target\release\zeph.exe
Restrict access by setting telegram.allowed_users in the config file:
[telegram]
allowed_users = ["your_username"]
Ollama Setup
When using Ollama (default provider), ensure both the LLM model and embedding model are pulled:
ollama pull mistral:7b
ollama pull qwen3-embedding
The default configuration uses mistral:7b for text generation and qwen3-embedding for vector embeddings.
Cloud Providers
For Claude:
ZEPH_CLAUDE_API_KEY=sk-ant-... ./target/release/zeph
For OpenAI:
ZEPH_LLM_PROVIDER=openai ZEPH_OPENAI_API_KEY=sk-... ./target/release/zeph
See Configuration for the full reference.
Configuration
Zeph loads config/default.toml at startup and applies environment variable overrides.
The config path can be overridden via CLI argument or environment variable:
# CLI argument (highest priority)
zeph --config /path/to/custom.toml
# Environment variable
ZEPH_CONFIG=/path/to/custom.toml zeph
# Default (fallback)
# config/default.toml
Priority: --config > ZEPH_CONFIG > config/default.toml.
Hot-Reload
Zeph watches the config file for changes and applies runtime-safe fields without restart. The file watcher uses 500ms debounce to avoid redundant reloads.
Reloadable fields (applied immediately):
| Section | Fields |
|---|---|
[security] | redact_secrets |
[timeouts] | llm_seconds, embedding_seconds, a2a_seconds |
[memory] | history_limit, summarization_threshold, context_budget_tokens, compaction_threshold, compaction_preserve_tail, prune_protect_tokens, cross_session_score_threshold |
[memory.semantic] | recall_limit |
[index] | repo_map_ttl_secs, watch |
[agent] | max_tool_iterations |
[skills] | max_active_skills |
Not reloadable (require restart): LLM provider/model, SQLite path, Qdrant URL, Telegram token, MCP servers, A2A config, skill paths.
Check for config reloaded in the log to confirm a successful reload.
Configuration File
[agent]
name = "Zeph"
max_tool_iterations = 10 # Max tool loop iterations per response (default: 10)
[llm]
provider = "ollama"
base_url = "http://localhost:11434"
model = "mistral:7b"
embedding_model = "qwen3-embedding" # Model for text embeddings
[llm.cloud]
model = "claude-sonnet-4-5-20250929"
max_tokens = 4096
# [llm.openai]
# base_url = "https://api.openai.com/v1"
# model = "gpt-5.2"
# max_tokens = 4096
# embedding_model = "text-embedding-3-small"
# reasoning_effort = "medium" # low, medium, high (for reasoning models)
[skills]
paths = ["./skills"]
max_active_skills = 5 # Top-K skills per query via embedding similarity
[memory]
sqlite_path = "./data/zeph.db"
history_limit = 50
summarization_threshold = 100 # Trigger summarization after N messages
context_budget_tokens = 0 # 0 = unlimited (proportional split: 15% summaries, 25% recall, 60% recent)
compaction_threshold = 0.75 # Compact when context usage exceeds this fraction
compaction_preserve_tail = 4 # Keep last N messages during compaction
prune_protect_tokens = 40000 # Protect recent N tokens from tool output pruning
cross_session_score_threshold = 0.35 # Minimum relevance for cross-session results
[memory.semantic]
enabled = false # Enable semantic search via Qdrant
recall_limit = 5 # Number of semantically relevant messages to inject
[tools]
enabled = true
summarize_output = false # LLM-based summarization for long tool outputs
[tools.shell]
timeout = 30
blocked_commands = []
allowed_commands = []
allowed_paths = [] # Directories shell can access (empty = cwd only)
allow_network = true # false blocks curl/wget/nc
confirm_patterns = ["rm ", "git push -f", "git push --force", "drop table", "drop database", "truncate "]
[tools.file]
allowed_paths = [] # Directories file tools can access (empty = cwd only)
# Pattern-based permissions per tool (optional; overrides legacy blocked_commands/confirm_patterns)
# [tools.permissions.bash]
# [[tools.permissions.bash]]
# pattern = "*sudo*"
# action = "deny"
# [[tools.permissions.bash]]
# pattern = "cargo *"
# action = "allow"
# [[tools.permissions.bash]]
# pattern = "*"
# action = "ask"
[tools.scrape]
timeout = 15
max_body_bytes = 1048576 # 1MB
[tools.audit]
enabled = false # Structured JSON audit log for tool executions
destination = "stdout" # "stdout" or file path
[security]
redact_secrets = true # Redact API keys/tokens in LLM responses
[timeouts]
llm_seconds = 120 # LLM chat completion timeout
embedding_seconds = 30 # Embedding generation timeout
a2a_seconds = 30 # A2A remote call timeout
[vault]
backend = "env" # "env" (default) or "age"; CLI --vault overrides this
[a2a]
enabled = false
host = "0.0.0.0"
port = 8080
# public_url = "https://agent.example.com"
# auth_token = "secret"
rate_limit = 60
Shell commands are sandboxed with path restrictions, network control, and destructive command confirmation. See Security for details.
Environment Variables
| Variable | Description |
|---|---|
ZEPH_LLM_PROVIDER | ollama, claude, openai, candle, or orchestrator |
ZEPH_LLM_BASE_URL | Ollama API endpoint |
ZEPH_LLM_MODEL | Model name for Ollama |
ZEPH_LLM_EMBEDDING_MODEL | Embedding model for Ollama (default: qwen3-embedding) |
ZEPH_CLAUDE_API_KEY | Anthropic API key (required for Claude) |
ZEPH_OPENAI_API_KEY | OpenAI API key (required for OpenAI provider) |
ZEPH_TELEGRAM_TOKEN | Telegram bot token (enables Telegram mode) |
ZEPH_SQLITE_PATH | SQLite database path |
ZEPH_QDRANT_URL | Qdrant server URL (default: http://localhost:6334) |
ZEPH_MEMORY_SUMMARIZATION_THRESHOLD | Trigger summarization after N messages (default: 100) |
ZEPH_MEMORY_CONTEXT_BUDGET_TOKENS | Context budget for proportional token allocation (default: 0 = unlimited) |
ZEPH_MEMORY_COMPACTION_THRESHOLD | Compaction trigger threshold as fraction of context budget (default: 0.75) |
ZEPH_MEMORY_COMPACTION_PRESERVE_TAIL | Messages preserved during compaction (default: 4) |
ZEPH_MEMORY_PRUNE_PROTECT_TOKENS | Tokens protected from Tier 1 tool output pruning (default: 40000) |
ZEPH_MEMORY_CROSS_SESSION_SCORE_THRESHOLD | Minimum relevance score for cross-session memory (default: 0.35) |
ZEPH_MEMORY_SEMANTIC_ENABLED | Enable semantic memory with Qdrant (default: false) |
ZEPH_MEMORY_RECALL_LIMIT | Max semantically relevant messages to recall (default: 5) |
ZEPH_SKILLS_MAX_ACTIVE | Max skills per query via embedding match (default: 5) |
ZEPH_AGENT_MAX_TOOL_ITERATIONS | Max tool loop iterations per response (default: 10) |
ZEPH_TOOLS_SUMMARIZE_OUTPUT | Enable LLM-based tool output summarization (default: false) |
ZEPH_TOOLS_TIMEOUT | Shell command timeout in seconds (default: 30) |
ZEPH_TOOLS_SCRAPE_TIMEOUT | Web scrape request timeout in seconds (default: 15) |
ZEPH_TOOLS_SCRAPE_MAX_BODY | Max response body size in bytes (default: 1048576) |
ZEPH_A2A_ENABLED | Enable A2A server (default: false) |
ZEPH_A2A_HOST | A2A server bind address (default: 0.0.0.0) |
ZEPH_A2A_PORT | A2A server port (default: 8080) |
ZEPH_A2A_PUBLIC_URL | Public URL for agent card discovery |
ZEPH_A2A_AUTH_TOKEN | Bearer token for A2A server authentication |
ZEPH_A2A_RATE_LIMIT | Max requests per IP per minute (default: 60) |
ZEPH_A2A_REQUIRE_TLS | Require HTTPS for outbound A2A connections (default: true) |
ZEPH_A2A_SSRF_PROTECTION | Block private/loopback IPs in A2A client (default: true) |
ZEPH_A2A_MAX_BODY_SIZE | Max request body size in bytes (default: 1048576) |
ZEPH_TOOLS_FILE_ALLOWED_PATHS | Comma-separated directories file tools can access (empty = cwd) |
ZEPH_TOOLS_SHELL_ALLOWED_PATHS | Comma-separated directories shell can access (empty = cwd) |
ZEPH_TOOLS_SHELL_ALLOW_NETWORK | Allow network commands from shell (default: true) |
ZEPH_TOOLS_AUDIT_ENABLED | Enable audit logging for tool executions (default: false) |
ZEPH_TOOLS_AUDIT_DESTINATION | Audit log destination: stdout or file path |
ZEPH_SECURITY_REDACT_SECRETS | Redact secrets in LLM responses (default: true) |
ZEPH_TIMEOUT_LLM | LLM call timeout in seconds (default: 120) |
ZEPH_TIMEOUT_EMBEDDING | Embedding generation timeout in seconds (default: 30) |
ZEPH_TIMEOUT_A2A | A2A remote call timeout in seconds (default: 30) |
ZEPH_CONFIG | Path to config file (default: config/default.toml) |
ZEPH_TUI | Enable TUI dashboard: true or 1 (requires tui feature) |
Skills
Zeph uses an embedding-based skill system that dramatically reduces token consumption: instead of injecting all skills into every prompt, only the top-K most relevant (default: 5) are selected per query via cosine similarity of vector embeddings. Combined with progressive loading (metadata at startup, bodies on activation, resources on demand), this keeps prompt size constant regardless of how many skills are installed.
How It Works
- You send a message — for example, “check disk usage on this server”
- Zeph embeds your query using the configured embedding model
- Top matching skills are selected — by default, the 5 most relevant ones ranked by vector similarity
- Selected skills are injected into the system prompt, giving Zeph specific instructions and examples for the task
- Zeph responds using the knowledge from matched skills
This happens automatically on every message. You don’t need to activate skills manually.
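For intuition, here is a minimal, self-contained sketch of the top-K selection step. The types and function names are illustrative, not Zeph's internal API.

/// Illustrative skill record: the description embedding is what gets matched.
struct Skill {
    name: String,
    embedding: Vec<f32>,
}

/// Cosine similarity between two embedding vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 { 0.0 } else { dot / (norm_a * norm_b) }
}

/// Rank all skills against the query embedding and keep the top K.
fn top_k_skills<'a>(query: &[f32], skills: &'a [Skill], k: usize) -> Vec<&'a Skill> {
    let mut scored: Vec<(f32, &Skill)> = skills
        .iter()
        .map(|s| (cosine(query, &s.embedding), s))
        .collect();
    scored.sort_by(|a, b| b.0.total_cmp(&a.0));
    scored.into_iter().take(k).map(|(_, s)| s).collect()
}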
Matching Backends
Zeph supports two skill matching backends:
- In-memory (default) — embeddings are computed on startup and matched via cosine similarity. No external dependencies required.
- Qdrant — when semantic memory is enabled and Qdrant is reachable, skill embeddings are persisted in a zeph_skills collection. On startup, only changed skills are re-embedded, using BLAKE3 content-hash comparison. If Qdrant becomes unavailable, Zeph falls back to in-memory matching automatically.
The Qdrant backend significantly reduces startup time when you have many skills, since unchanged skills skip the embedding step entirely.
Bundled Skills
| Skill | Description |
|---|---|
api-request | HTTP API requests using curl — GET, POST, PUT, DELETE with headers and JSON |
docker | Docker container operations — build, run, ps, logs, compose |
file-ops | File system operations — list, search, read, and analyze files |
git | Git version control — status, log, diff, commit, branch management |
mcp-generate | Generate MCP-to-skill bridges for external tool servers |
setup-guide | Configuration reference — LLM providers, memory, tools, and operating modes |
skill-audit | Spec compliance and security review of installed skills |
skill-creator | Create new skills following the agentskills.io specification |
system-info | System diagnostics — OS, disk, memory, processes, uptime |
web-scrape | Extract structured data from web pages using CSS selectors |
web-search | Search the internet for current information |
Use /skills in chat to see all available skills and their usage statistics.
Creating Custom Skills
A skill is a single SKILL.md file inside a named directory:
skills/
└── my-skill/
└── SKILL.md
SKILL.md Format
Each file has two parts: a YAML header and a markdown body.
---
name: my-skill
description: Short description of what this skill does.
---
# My Skill
Instructions and examples go here.
Header fields:
| Field | Required | Description |
|---|---|---|
name | Yes | Unique identifier (1-64 chars, lowercase, hyphens allowed) |
description | Yes | Used for embedding-based matching against user queries |
compatibility | No | Runtime requirements (e.g., “requires curl”) |
license | No | Skill license |
allowed-tools | No | Comma-separated tool names this skill can use |
metadata | No | Arbitrary key-value pairs for forward compatibility |
Body: markdown with instructions, code examples, or reference material. Injected verbatim into the LLM context when the skill is selected.
Skill Resources
Skills can include additional resource directories:
skills/
└── system-info/
├── SKILL.md
└── references/
├── linux.md
├── macos.md
└── windows.md
Resources in scripts/, references/, and assets/ are loaded on demand with path traversal protection. OS-specific reference files (named linux.md, macos.md, windows.md) are automatically filtered by the current platform.
Name Validation
Skill names must be 1-64 characters, lowercase letters/numbers/hyphens only, no leading/trailing/consecutive hyphens, and must match the directory name.
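A hedged sketch of those rules as a simple predicate (illustrative only; the directory-name check is omitted):

/// Sketch of the skill-name rules described above; not Zeph's actual validator.
fn is_valid_skill_name(name: &str) -> bool {
    let length_ok = (1..=64).contains(&name.len());
    let charset_ok = name
        .chars()
        .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-');
    let hyphens_ok =
        !name.starts_with('-') && !name.ends_with('-') && !name.contains("--");
    length_ok && charset_ok && hyphens_ok
}

fn main() {
    assert!(is_valid_skill_name("system-info"));
    assert!(!is_valid_skill_name("-bad-name"));
    assert!(!is_valid_skill_name("Bad_Name"));
}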
Configuration
Skill Paths
By default, Zeph scans ./skills for skill directories. Add more paths in config:
[skills]
paths = ["./skills", "/home/user/my-skills"]
If a skill with the same name appears in multiple paths, the first one found takes priority.
Max Active Skills
Control how many skills are injected per query:
[skills]
max_active_skills = 5
Or via environment variable:
export ZEPH_SKILLS_MAX_ACTIVE=5
Lower values reduce prompt size but may miss relevant skills. Higher values include more context but use more tokens.
Progressive Loading
Only metadata (~100 tokens per skill) is loaded at startup for embedding and matching. Full body (<5000 tokens) is loaded lazily on first activation and cached via OnceLock. Resource files are loaded on demand.
With 50+ skills installed, a typical prompt still contains only 5 — saving thousands of tokens per request compared to naive full-injection approaches.
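A minimal sketch of lazy body loading with OnceLock (hypothetical types; the real loader also handles resources and cache invalidation):

use std::path::PathBuf;
use std::sync::OnceLock;

/// Metadata stays in memory from startup; the full SKILL.md body is read
/// from disk only on first activation and then cached.
struct LoadedSkill {
    description: String,    // embedded at startup for matching
    dir: PathBuf,           // skill directory containing SKILL.md
    body: OnceLock<String>, // full markdown body, loaded on demand
}

impl LoadedSkill {
    fn body(&self) -> &str {
        self.body.get_or_init(|| {
            std::fs::read_to_string(self.dir.join("SKILL.md")).unwrap_or_default()
        })
    }
}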
Hot Reload
SKILL.md file changes are detected via filesystem watcher (500ms debounce) and re-embedded without restart. Cached bodies are invalidated on reload.
With the Qdrant backend, hot-reload triggers a delta sync — only modified skills are re-embedded and updated in the collection.
Semantic Memory
Enable semantic search to retrieve contextually relevant messages from conversation history using vector similarity.
Requires an embedding model. Ollama with qwen3-embedding is the default. Claude API does not support embeddings natively — use the orchestrator to route embeddings through Ollama while using Claude for chat.
Setup
- Start Qdrant:
docker compose up -d qdrant
- Enable semantic memory in config:
[memory.semantic]
enabled = true
recall_limit = 5
- Automatic setup: the Qdrant collection (zeph_conversations) is created automatically on first use with the correct vector dimensions (1024 for qwen3-embedding) and Cosine distance metric. No manual initialization required.
How It Works
- Automatic embedding: Messages are embedded asynchronously using the configured embedding_model and stored in Qdrant alongside SQLite.
- Semantic recall: Context builder injects semantically relevant messages from full history, not just recent messages.
- Graceful degradation: If Qdrant is unavailable, Zeph falls back to SQLite-only mode (recency-based history).
- Startup backfill: On startup, if Qdrant is available, Zeph calls embed_missing() to backfill embeddings for any messages stored while Qdrant was offline. This ensures the vector index stays in sync with SQLite without manual intervention.
Storage Architecture
| Store | Purpose |
|---|---|
| SQLite | Source of truth for message text, conversations, summaries, skill usage |
| Qdrant | Vector index for semantic similarity search (embeddings only) |
Both stores work together: SQLite holds the data, Qdrant enables vector search over it. The embeddings_metadata table in SQLite maps message IDs to Qdrant point IDs.
Context Engineering
Zeph’s context engineering pipeline manages how information flows into the LLM context window. It combines semantic recall, proportional budget allocation, message trimming, environment injection, tool output management, and runtime compaction into a unified system.
All context engineering features are disabled by default (context_budget_tokens = 0). Set a non-zero budget or enable auto_budget = true to activate the pipeline.
Configuration
[memory]
context_budget_tokens = 128000 # Set to your model's context window size (0 = unlimited)
compaction_threshold = 0.75 # Compact when usage exceeds this fraction
compaction_preserve_tail = 4 # Keep last N messages during compaction
prune_protect_tokens = 40000 # Protect recent N tokens from Tier 1 tool output pruning
cross_session_score_threshold = 0.35 # Minimum relevance for cross-session results (0.0-1.0)
[memory.semantic]
enabled = true # Required for semantic recall
recall_limit = 5 # Max semantically relevant messages to inject
[tools]
summarize_output = false # Enable LLM-based tool output summarization
Context Window Layout
When context_budget_tokens > 0, the context window is structured as:
┌─────────────────────────────────────────────────┐
│ BASE_PROMPT (identity + guidelines + security) │ ~300 tokens
├─────────────────────────────────────────────────┤
│ <environment> cwd, git branch, os, model │ ~50 tokens
├─────────────────────────────────────────────────┤
│ <project_context> ZEPH.md contents │ 0-500 tokens
├─────────────────────────────────────────────────┤
│ <repo_map> structural overview (if index on) │ 0-1024 tokens
├─────────────────────────────────────────────────┤
│ <available_skills> matched skills (full body) │ 200-2000 tokens
│ <other_skills> remaining (description-only) │ 50-200 tokens
├─────────────────────────────────────────────────┤
│ <code_context> RAG chunks (if index on) │ 30% of available
├─────────────────────────────────────────────────┤
│ [semantic recall] relevant past messages │ 10-25% of available
├─────────────────────────────────────────────────┤
│ [compaction summary] if compacted │ 200-500 tokens
├─────────────────────────────────────────────────┤
│ Recent message history │ 50-60% of available
├─────────────────────────────────────────────────┤
│ [reserved for response generation] │ 20% of total
└─────────────────────────────────────────────────┘
Proportional Budget Allocation
Available tokens (after reserving 20% for response) are split proportionally. When code indexing is enabled, the code context slot takes a share from summaries, recall, and history:
| Allocation | Without code index | With code index | Purpose |
|---|---|---|---|
| Summaries | 15% | 10% | Conversation summaries from SQLite |
| Semantic recall | 25% | 10% | Relevant messages from past conversations via Qdrant |
| Code context | – | 30% | Retrieved code chunks from project index |
| Recent history | 60% | 50% | Most recent messages in current conversation |
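A worked example of the split for a 128,000-token budget without code indexing (the integer arithmetic here is illustrative; Zeph's exact rounding may differ):

fn main() {
    let budget: u32 = 128_000;
    let reserved = budget * 20 / 100;     // 25,600 tokens reserved for the response
    let available = budget - reserved;    // 102,400 tokens left to allocate
    let summaries = available * 15 / 100; // 15,360
    let recall = available * 25 / 100;    // 25,600
    let history = available * 60 / 100;   // 61,440
    println!("summaries={summaries} recall={recall} history={history}");
}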
Semantic Recall Injection
When semantic memory is enabled, the agent queries Qdrant for messages relevant to the current user query. Results are injected as transient system messages (prefixed with [semantic recall]) that are:
- Removed and re-injected on every turn (never stale)
- Not persisted to SQLite
- Bounded by the allocated token budget (25%, or 10% when code indexing is enabled)
Requires Qdrant and memory.semantic.enabled = true.
Message History Trimming
When recent messages exceed the 60% budget allocation, the oldest non-system messages are evicted. The system prompt and most recent messages are always preserved.
Environment Context
Every system prompt rebuild injects an <environment> block with:
- Working directory
- OS (linux, macos, windows)
- Current git branch (if in a git repo)
- Active model name
Two-Tier Context Pruning
When total message tokens exceed compaction_threshold (default: 75%) of the context budget, a two-tier pruning strategy activates:
Tier 1: Selective Tool Output Pruning
Before invoking the LLM for compaction, Zeph scans messages outside the protected tail for ToolOutput parts and replaces their content with a short placeholder. This is a cheap, synchronous operation that often frees enough tokens to stay under the threshold without an LLM call.
- Only tool outputs in messages older than the protected tail are pruned
- The most recent prune_protect_tokens tokens (default: 40,000) worth of messages are never pruned, preserving recent tool context
- Pruned parts have their compacted_at timestamp set, their body is cleared from memory to reclaim heap, and they are not pruned again
- Pruned parts are persisted to SQLite before clearing, so pruning state survives session restarts
- The tool_output_prunes metric tracks how many parts were pruned
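A simplified sketch of the Tier 1 pass, assuming a flat message list and the chars/4 token estimate (types and placeholder text are hypothetical):

/// Hypothetical message shape for illustration.
struct Message {
    is_tool_output: bool,
    content: String,
}

/// chars/4 heuristic used for budget accounting.
fn estimate_tokens(text: &str) -> usize {
    text.chars().count() / 4
}

/// Replace tool outputs outside the protected tail with a placeholder.
/// Returns how many parts were pruned.
fn prune_tool_outputs(messages: &mut [Message], protect_tokens: usize) -> usize {
    // Walk backwards to find where the protected tail begins.
    let mut budget = protect_tokens;
    let mut tail_start = messages.len();
    for (i, msg) in messages.iter().enumerate().rev() {
        let cost = estimate_tokens(&msg.content);
        if cost > budget {
            break;
        }
        budget -= cost;
        tail_start = i;
    }
    // Prune tool outputs strictly before the protected tail.
    let mut pruned = 0;
    for msg in &mut messages[..tail_start] {
        if msg.is_tool_output && msg.content != "[tool output pruned]" {
            msg.content = "[tool output pruned]".to_string();
            pruned += 1;
        }
    }
    pruned
}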
Tier 2: LLM Compaction (Fallback)
If Tier 1 does not free enough tokens, the standard LLM compaction runs:
- Middle messages (between system prompt and last N recent) are extracted
- Sent to the LLM with a structured summarization prompt
- Replaced with a single summary message
- Last compaction_preserve_tail messages (default: 4) are always preserved
Both tiers are idempotent and run automatically during the agent loop.
Tool Output Management
Truncation
Tool outputs exceeding 30,000 characters are automatically truncated using a head+tail split with UTF-8 safe boundaries. Both the first and last ~15K chars are preserved.
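A sketch of a UTF-8-safe head+tail split (operating on chars rather than bytes guarantees valid boundaries; the placeholder text is an assumption):

/// Keep the first and last `max_chars / 2` characters and elide the middle.
fn truncate_head_tail(s: &str, max_chars: usize) -> String {
    if s.chars().count() <= max_chars {
        return s.to_string();
    }
    let half = max_chars / 2;
    let head: String = s.chars().take(half).collect();
    let tail_rev: Vec<char> = s.chars().rev().take(half).collect();
    let tail: String = tail_rev.into_iter().rev().collect();
    format!("{head}\n... [output truncated] ...\n{tail}")
}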
Smart Summarization
When tools.summarize_output = true, long tool outputs are sent through the LLM with a prompt that preserves file paths, error messages, and numeric values. On LLM failure, falls back to truncation.
export ZEPH_TOOLS_SUMMARIZE_OUTPUT=true
Progressive Skill Loading
Skills matched by embedding similarity (top-K) are injected with their full body. Remaining skills are listed in a description-only <other_skills> catalog — giving the model awareness of all capabilities while consuming minimal tokens.
ZEPH.md Project Config
Zeph walks up the directory tree from the current working directory looking for:
- ZEPH.md
- ZEPH.local.md
- .zeph/config.md
Found configs are concatenated (global first, then ancestors from root to cwd) and injected into the system prompt as a <project_context> block. Use this to provide project-specific instructions.
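A hedged sketch of the walk-up search (file names from the list above; ordering and merge details are simplified):

use std::path::{Path, PathBuf};

/// Collect project config files from the cwd up to the filesystem root,
/// returning them root-first so ancestors are concatenated before the cwd.
fn find_project_configs(cwd: &Path) -> Vec<PathBuf> {
    let names = ["ZEPH.md", "ZEPH.local.md", ".zeph/config.md"];
    let mut found = Vec::new();
    let mut dir = Some(cwd);
    while let Some(d) = dir {
        for name in names {
            let candidate = d.join(name);
            if candidate.is_file() {
                found.push(candidate);
            }
        }
        dir = d.parent();
    }
    found.reverse(); // walk-up collects cwd-first; callers want root-to-cwd order
    found
}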
Environment Variables
| Variable | Description | Default |
|---|---|---|
ZEPH_MEMORY_CONTEXT_BUDGET_TOKENS | Context budget in tokens | 0 (unlimited) |
ZEPH_MEMORY_COMPACTION_THRESHOLD | Compaction trigger threshold | 0.75 |
ZEPH_MEMORY_COMPACTION_PRESERVE_TAIL | Messages preserved during compaction | 4 |
ZEPH_MEMORY_PRUNE_PROTECT_TOKENS | Tokens protected from Tier 1 tool output pruning | 40000 |
ZEPH_MEMORY_CROSS_SESSION_SCORE_THRESHOLD | Minimum relevance score for cross-session memory results | 0.35 |
ZEPH_TOOLS_SUMMARIZE_OUTPUT | Enable LLM-based tool output summarization | false |
Conversation Summarization
Automatically compress long conversation histories using LLM-based summarization to stay within context budget limits.
Requires an LLM provider (Ollama or Claude). Set context_budget_tokens = 0 to disable proportional allocation and use unlimited context.
For the full context management pipeline (semantic recall, message trimming, compaction, tool output management), see Context Engineering.
Configuration
[memory]
summarization_threshold = 100
context_budget_tokens = 8000 # Set to LLM context window size (0 = unlimited)
How It Works
- Triggered when message count exceeds summarization_threshold (default: 100)
- Summaries stored in SQLite with token estimates
- Batch size = threshold/2 to balance summary quality with LLM call frequency
- Context builder allocates proportional token budget:
- 15% for summaries
- 25% for semantic recall (if enabled)
- 60% for recent message history
Token Estimation
Token counts are estimated using a chars/4 heuristic (100x faster than tiktoken, ±25% accuracy). This is sufficient for proportional budget allocation where exact counts are not critical.
Docker Deployment
Docker Compose automatically pulls the latest image from GitHub Container Registry. To use a specific version, set ZEPH_IMAGE=ghcr.io/bug-ops/zeph:v0.9.5.
Quick Start (Ollama + Qdrant in containers)
# Pull Ollama models first
docker compose --profile cpu run --rm ollama ollama pull mistral:7b
docker compose --profile cpu run --rm ollama ollama pull qwen3-embedding
# Start all services
docker compose --profile cpu up
Apple Silicon (Ollama on host with Metal GPU)
# Use Ollama on macOS host for Metal GPU acceleration
ollama pull mistral:7b
ollama pull qwen3-embedding
ollama serve &
# Start Zeph + Qdrant, connect to host Ollama
ZEPH_LLM_BASE_URL=http://host.docker.internal:11434 docker compose up
Linux with NVIDIA GPU
# Pull models first
docker compose --profile gpu run --rm ollama ollama pull mistral:7b
docker compose --profile gpu run --rm ollama ollama pull qwen3-embedding
# Start all services with GPU
docker compose --profile gpu -f docker-compose.yml -f docker-compose.gpu.yml up
Age Vault (Encrypted Secrets)
# Mount key and vault files into container
docker compose -f docker-compose.yml -f docker-compose.vault.yml up
Override file paths via environment variables:
ZEPH_VAULT_KEY=./my-key.txt ZEPH_VAULT_PATH=./my-secrets.age \
docker compose -f docker-compose.yml -f docker-compose.vault.yml up
The image must be built with the vault-age feature enabled. For local builds, use CARGO_FEATURES=vault-age with docker-compose.dev.yml.
Using Specific Version
# Use a specific release version
ZEPH_IMAGE=ghcr.io/bug-ops/zeph:v0.9.5 docker compose up
# Always pull latest
docker compose pull && docker compose up
Local Development
Full stack with debug tracing (builds from source via Dockerfile.dev, uses host Ollama via host.docker.internal):
# Build and start Qdrant + Zeph with debug logging
docker compose -f docker-compose.dev.yml up --build
# Build with optional features (e.g. vault-age, candle)
CARGO_FEATURES=vault-age docker compose -f docker-compose.dev.yml up --build
# Build with vault-age and mount vault files
CARGO_FEATURES=vault-age \
docker compose -f docker-compose.dev.yml -f docker-compose.vault.yml up --build
Dependencies only (run zeph natively on host):
# Start Qdrant
docker compose -f docker-compose.deps.yml up
# Run zeph natively with debug tracing
RUST_LOG=zeph=debug,zeph_channels=trace cargo run
MCP Integration
Connect external tool servers via Model Context Protocol (MCP). Tools are discovered, embedded, and matched alongside skills using the same cosine similarity pipeline — only relevant MCP tools are injected into the prompt, so adding more servers does not inflate token usage.
Configuration
Stdio Transport (spawn child process)
[[mcp.servers]]
id = "filesystem"
command = "npx"
args = ["-y", "@anthropic/mcp-filesystem"]
HTTP Transport (remote server)
[[mcp.servers]]
id = "remote-tools"
url = "http://localhost:8080/mcp"
Security
[mcp]
allowed_commands = ["npx", "uvx", "node", "python", "python3"]
max_dynamic_servers = 10
allowed_commands restricts which binaries can be spawned as MCP servers. max_dynamic_servers limits the number of servers added at runtime.
Dynamic Management
Add and remove MCP servers at runtime via chat commands:
/mcp add filesystem npx -y @anthropic/mcp-filesystem
/mcp add remote-api http://localhost:8080/mcp
/mcp list
/mcp remove filesystem
After adding or removing a server, the Qdrant registry syncs automatically for semantic tool matching.
How Matching Works
MCP tools are embedded in Qdrant (zeph_mcp_tools collection) with BLAKE3 content-hash delta sync. Unified matching injects both skills and MCP tools into the system prompt by relevance score — keeping prompt size O(K) instead of O(N) where N is total tools across all servers.
OpenAI Provider
Use the OpenAI provider to connect to OpenAI API or any OpenAI-compatible service (Together AI, Groq, Fireworks, Perplexity).
ZEPH_LLM_PROVIDER=openai ZEPH_OPENAI_API_KEY=sk-... ./target/release/zeph
Configuration
[llm]
provider = "openai"
[llm.openai]
base_url = "https://api.openai.com/v1"
model = "gpt-5.2"
max_tokens = 4096
embedding_model = "text-embedding-3-small" # optional, enables vector embeddings
reasoning_effort = "medium" # optional: low, medium, high (for o3, etc.)
Compatible APIs
Change base_url to point to any OpenAI-compatible API:
# Together AI
base_url = "https://api.together.xyz/v1"
# Groq
base_url = "https://api.groq.com/openai/v1"
# Fireworks
base_url = "https://api.fireworks.ai/inference/v1"
Embeddings
When embedding_model is set, Qdrant subsystems automatically use it for skill matching and semantic memory instead of the global llm.embedding_model.
Reasoning Models
Set reasoning_effort to control token budget for reasoning models like o3:
- low — fast responses, less reasoning
- medium — balanced
- high — thorough reasoning, more tokens
Local Inference (Candle)
Run HuggingFace GGUF models locally via candle without external API dependencies. Metal and CUDA GPU acceleration are supported.
cargo build --release --features candle,metal # macOS with Metal GPU
Configuration
[llm]
provider = "candle"
[llm.candle]
source = "huggingface"
repo_id = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
filename = "mistral-7b-instruct-v0.2.Q4_K_M.gguf"
chat_template = "mistral" # llama3, chatml, mistral, phi3, raw
embedding_repo = "sentence-transformers/all-MiniLM-L6-v2" # optional BERT embeddings
[llm.candle.generation]
temperature = 0.7
top_p = 0.9
top_k = 40
max_tokens = 2048
repeat_penalty = 1.1
Chat Templates
| Template | Models |
|---|---|
llama3 | Llama 3, Llama 3.1 |
chatml | Qwen, Yi, OpenHermes |
mistral | Mistral, Mixtral |
phi3 | Phi-3 |
raw | No template (raw completion) |
Device Auto-Detection
- macOS — Metal GPU (requires --features metal)
- Linux with NVIDIA — CUDA (requires --features cuda)
- Fallback — CPU
Model Orchestrator
Route tasks to different LLM providers based on content classification. Each task type maps to a provider chain with automatic fallback. Use the orchestrator to combine local and cloud models — for example, embeddings via Ollama and chat via Claude.
Configuration
[llm]
provider = "orchestrator"
[llm.orchestrator]
default = "claude"
embed = "ollama"
[llm.orchestrator.providers.ollama]
provider_type = "ollama"
[llm.orchestrator.providers.claude]
provider_type = "claude"
[llm.orchestrator.routes]
coding = ["claude", "ollama"] # try Claude first, fallback to Ollama
creative = ["claude"] # cloud only
analysis = ["claude", "ollama"] # prefer cloud
general = ["claude"] # cloud only
Provider Keys
- default — provider for chat when no specific route matches
- embed — provider for all embedding operations (skill matching, semantic memory)
Task Classification
Task types are classified via keyword heuristics:
| Task Type | Keywords |
|---|---|
coding | code, function, debug, refactor, implement |
creative | write, story, poem, creative |
analysis | analyze, compare, evaluate |
translation | translate, convert language |
summarization | summarize, summary, tldr |
general | everything else |
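A sketch of the keyword heuristic using the table above (the matching logic shown is an assumption, not Zeph's exact implementation):

/// Classify a query into a task type via simple keyword matching.
fn classify_task(query: &str) -> &'static str {
    let q = query.to_lowercase();
    let has = |words: &[&str]| words.iter().any(|w| q.contains(*w));
    if has(&["code", "function", "debug", "refactor", "implement"]) {
        "coding"
    } else if has(&["write", "story", "poem", "creative"]) {
        "creative"
    } else if has(&["analyze", "compare", "evaluate"]) {
        "analysis"
    } else if has(&["translate", "convert language"]) {
        "translation"
    } else if has(&["summarize", "summary", "tldr"]) {
        "summarization"
    } else {
        "general"
    }
}

fn main() {
    assert_eq!(classify_task("Refactor this function"), "coding");
    assert_eq!(classify_task("What is on my calendar?"), "general");
}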
Fallback Chains
Routes define provider preference order. If the first provider fails, the next one in the list is tried automatically.
coding = ["local", "cloud"] # try local first, fallback to cloud
Hybrid Setup Example
Embeddings via free local Ollama, chat via paid Claude API:
[llm]
provider = "orchestrator"
[llm.orchestrator]
default = "claude"
embed = "ollama"
[llm.orchestrator.providers.ollama]
provider_type = "ollama"
[llm.orchestrator.providers.claude]
provider_type = "claude"
[llm.orchestrator.routes]
general = ["claude"]
Self-Learning Skills
Automatically improve skills based on execution outcomes. When a skill fails repeatedly, Zeph uses self-reflection and LLM-generated improvements to create better skill versions.
Configuration
[skills.learning]
enabled = true
auto_activate = false # require manual approval for new versions
min_failures = 3 # failures before triggering improvement
improve_threshold = 0.7 # success rate below which improvement starts
rollback_threshold = 0.5 # auto-rollback when success rate drops below this
min_evaluations = 5 # minimum evaluations before rollback decision
max_versions = 10 # max auto-generated versions per skill
cooldown_minutes = 60 # cooldown between improvements for same skill
How It Works
- Each skill invocation is tracked as success or failure
- When a skill’s success rate drops below improve_threshold, Zeph triggers self-reflection
- The agent retries with adjusted context (1 retry per message)
- If failures persist beyond min_failures, the LLM generates an improved skill version
- New versions can be auto-activated or held for manual approval
- If an activated version performs worse than rollback_threshold, automatic rollback occurs
Chat Commands
| Command | Description |
|---|---|
/skill stats | View execution metrics per skill |
/skill versions | List auto-generated versions |
/skill activate <id> | Activate a specific version |
/skill approve <id> | Approve a pending version |
/skill reset <name> | Revert to original version |
/feedback | Provide explicit quality feedback |
Set auto_activate = false (default) to review and manually approve LLM-generated skill improvements before they go live.
Skill versions and outcomes are stored in SQLite (skill_versions and skill_outcomes tables).
A2A Protocol
Zeph includes an embedded A2A protocol server for agent-to-agent communication. When enabled, other agents can discover and interact with Zeph via the standard A2A JSON-RPC 2.0 API.
Quick Start
ZEPH_A2A_ENABLED=true ZEPH_A2A_AUTH_TOKEN=secret ./target/release/zeph
Endpoints
| Endpoint | Description | Auth |
|---|---|---|
/.well-known/agent-card.json | Agent discovery | Public (no auth) |
/a2a | JSON-RPC endpoint (message/send, tasks/get, tasks/cancel) | Bearer token |
/a2a/stream | SSE streaming endpoint | Bearer token |
Set ZEPH_A2A_AUTH_TOKEN to secure the server with bearer token authentication. The agent card endpoint remains public per the A2A spec.
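For orientation, a sketch of a message/send call built with serde_json. The JSON-RPC 2.0 envelope is standard; the params payload shown is illustrative only, so consult the A2A specification for the exact Message schema.

use serde_json::json;

fn main() {
    // JSON-RPC 2.0 envelope for the message/send method.
    // POST this to /a2a with an Authorization: Bearer <ZEPH_A2A_AUTH_TOKEN> header.
    let request = json!({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "message/send",
        "params": {
            // Illustrative payload; see the A2A spec for the real Message/Part schema.
            "message": {
                "role": "user",
                "parts": [{ "kind": "text", "text": "Hello from another agent" }]
            }
        }
    });
    println!("{}", serde_json::to_string_pretty(&request).unwrap());
}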
Configuration
[a2a]
enabled = true
host = "0.0.0.0"
port = 8080
public_url = "https://agent.example.com"
auth_token = "secret"
rate_limit = 60
Network Security
- TLS enforcement: a2a.require_tls = true rejects HTTP endpoints (HTTPS only)
- SSRF protection: a2a.ssrf_protection = true blocks private IP ranges (RFC 1918, loopback, link-local) via DNS resolution
- Payload limits: a2a.max_body_size caps request body size (default: 1 MiB)
- Rate limiting: per-IP sliding window (default: 60 requests/minute)
Task Processing
Incoming message/send requests are routed through AgentTaskProcessor, which forwards the message to the configured LLM provider for real inference. The processor creates a task, sends the user message to the LLM, and returns the model response as a completed A2A task artifact.
Current limitation: the A2A task processor runs inference only (no tool execution or memory context).
A2A Client
Zeph can also connect to other A2A agents as a client:
- A2aClient wraps reqwest and uses JSON-RPC 2.0 for all RPC calls
- AgentRegistry with TTL-based cache for agent card discovery
- SSE streaming via eventsource-stream for real-time task updates
- Bearer token auth passed per-call to all client methods
Secrets Management
Zeph resolves secrets (ZEPH_CLAUDE_API_KEY, ZEPH_OPENAI_API_KEY, ZEPH_TELEGRAM_TOKEN, ZEPH_A2A_AUTH_TOKEN) through a pluggable VaultProvider with redacted debug output via the Secret newtype.
Never commit secrets to version control. Use environment variables or age-encrypted vault files.
Backend Selection
The vault backend is determined by the following priority (highest to lowest):
- CLI flag: --vault env or --vault age
- Environment variable: ZEPH_VAULT_BACKEND
- Config file: vault.backend in TOML config
- Default: "env"
Backends
| Backend | Description | Activation |
|---|---|---|
env (default) | Read secrets from environment variables | --vault env or omit |
age | Decrypt age-encrypted JSON vault file at startup | --vault age --vault-key <identity> --vault-path <vault.age> |
Environment Variables (default)
Export secrets as environment variables:
export ZEPH_CLAUDE_API_KEY=sk-ant-...
export ZEPH_TELEGRAM_TOKEN=123:ABC
./target/release/zeph
Age Vault
For production deployments, encrypt secrets with age:
# Generate an age identity key
age-keygen -o key.txt
# Create a JSON secrets file and encrypt it
echo '{"ZEPH_CLAUDE_API_KEY":"sk-...","ZEPH_TELEGRAM_TOKEN":"123:ABC"}' | \
age -r $(grep 'public key' key.txt | awk '{print $NF}') -o secrets.age
# Run with age vault
cargo build --release --features vault-age
./target/release/zeph --vault age --vault-key key.txt --vault-path secrets.age
The vault-age feature flag is enabled by default. When building with --no-default-features, add vault-age explicitly if needed.
Docker
Mount key and vault files into the container:
docker compose -f docker-compose.yml -f docker-compose.vault.yml up
Override paths:
ZEPH_VAULT_KEY=./my-key.txt ZEPH_VAULT_PATH=./my-secrets.age \
docker compose -f docker-compose.yml -f docker-compose.vault.yml up
Channels
Zeph supports multiple I/O channels for interacting with the agent. Each channel implements the Channel trait and can be selected at runtime based on configuration or CLI flags.
Available Channels
| Channel | Activation | Streaming | Confirmation |
|---|---|---|---|
| CLI | Default (no config needed) | Token-by-token to stdout | y/N prompt |
| Telegram | ZEPH_TELEGRAM_TOKEN env var or [telegram] config | Edit-in-place every 10s | Reply “yes” to confirm |
| TUI | --tui flag or ZEPH_TUI=true (requires tui feature) | Real-time in chat panel | Auto-confirm (Phase 1) |
CLI Channel
The default channel. Reads from stdin, writes to stdout with immediate streaming output.
./zeph
No configuration required. Supports all slash commands (/skills, /mcp, /reset).
Telegram Channel
Run Zeph as a Telegram bot with streaming responses, MarkdownV2 formatting, and user whitelisting.
Setup
- Create a bot via @BotFather: send /newbot and follow the prompts, then copy the bot token (e.g., 123456:ABC-DEF1234ghIkl-zyx57W2v1u123ew11).
- Configure the token via environment variable or config file:
# Environment variable
ZEPH_TELEGRAM_TOKEN="123456:ABC-DEF1234ghIkl-zyx57W2v1u123ew11" ./zeph
Or in config/default.toml:
[telegram]
allowed_users = ["your_username"]
The token can also be stored in the age-encrypted vault:
# Store in vault
ZEPH_TELEGRAM_TOKEN=your-token
The token is resolved via the vault provider (ZEPH_TELEGRAM_TOKEN secret). When using the env vault backend (default), set the environment variable directly. With the age backend, store it in the encrypted vault file.
User Whitelisting
Restrict bot access to specific Telegram usernames:
[telegram]
allowed_users = ["alice", "bob"]
When allowed_users is empty, the bot accepts messages from all users. Messages from unauthorized users are rejected without a reply, and a warning is written to the log.
Bot Commands
| Command | Description |
|---|---|
/start | Welcome message |
/reset | Reset conversation context |
/skills | List loaded skills |
Streaming Behavior
Telegram has API rate limits, so streaming works differently from CLI:
- First chunk sends a new message immediately
- Subsequent chunks edit the existing message in-place
- Updates are throttled to one edit per 10 seconds to respect Telegram rate limits
- On flush, a final edit delivers the complete response
- Long messages (>4096 chars) are automatically split into multiple messages
MarkdownV2 Formatting
LLM responses are automatically converted from standard Markdown to Telegram’s MarkdownV2 format. Code blocks, bold, italic, and inline code are preserved. Special characters are escaped to prevent formatting errors.
Confirmation Prompts
When the agent needs user confirmation (e.g., destructive shell commands), Telegram sends a text prompt asking the user to reply “yes” to confirm.
TUI Dashboard
A rich terminal interface based on ratatui with real-time agent metrics. Requires the tui feature flag.
cargo build --release --features tui
./zeph --tui
See TUI Dashboard for full documentation including keybindings, layout, and architecture.
Message Queueing
Zeph maintains a bounded FIFO message queue (maximum 10 messages) to handle user input received during model inference. Queue behavior varies by channel:
CLI Channel
Blocking stdin read — the queue is always empty. CLI users cannot send messages while the agent is responding.
Telegram Channel
New messages are queued via an internal mpsc channel. Consecutive messages arriving within 500ms are automatically merged with a newline separator to reduce context fragmentation.
Use /clear-queue to discard queued messages.
TUI Channel
The input line remains interactive during model inference. Messages are queued in-order and drained after each response completes.
- Queue badge: [+N queued] appears in the input area when messages are pending
- Clear queue: press Ctrl+K to discard all queued messages
- Merging: consecutive messages within 500ms are merged by newline
When the queue is full (10 messages), new input is silently dropped until space becomes available.
Channel Selection Logic
Zeph selects the channel at startup based on the following priority:
- --tui flag or ZEPH_TUI=true → TUI channel (requires tui feature)
- ZEPH_TELEGRAM_TOKEN set → Telegram channel
- Otherwise → CLI channel
Only one channel is active per session.
Tool System
Zeph provides a typed tool system that gives the LLM structured access to file operations, shell commands, and web scraping. Each executor owns its tool definitions with schemas derived from Rust structs via schemars, ensuring a single source of truth between deserialization and prompt generation.
Tool Registry
Each tool executor declares its definitions via tool_definitions(). On every LLM turn the agent collects all definitions into a ToolRegistry and renders them into the system prompt as a <tools> catalog. Tool parameter schemas are auto-generated from Rust structs using #[derive(JsonSchema)] from the schemars crate.
| Tool ID | Description | Invocation | Required Parameters | Optional Parameters |
|---|---|---|---|---|
bash | Execute a shell command | ```bash | command (string) | |
read | Read file contents | ToolCall | path (string) | offset (integer), limit (integer) |
edit | Replace a string in a file | ToolCall | path (string), old_string (string), new_string (string) | |
write | Write content to a file | ToolCall | path (string), content (string) | |
glob | Find files matching a glob pattern | ToolCall | pattern (string) | |
grep | Search file contents with regex | ToolCall | pattern (string) | path (string), case_sensitive (boolean) |
web_scrape | Scrape data from a web page via CSS selectors | ```scrape | url (string), select (string) | extract (string), limit (integer) |
FileExecutor
FileExecutor handles the file-oriented tools (read, write, edit, glob, grep) in a sandboxed environment. All file paths are validated against an allowlist before any I/O operation.
- If allowed_paths is empty, the sandbox defaults to the current working directory.
- Paths are resolved via ancestor-walk canonicalization to prevent traversal attacks on non-existing paths.
- glob results are filtered post-match to exclude files outside the sandbox.
- grep validates the search directory before scanning.
See Security for details on the path validation mechanism.
Dual-Mode Execution
The agent loop supports two tool invocation modes, distinguished by InvocationHint on each ToolDef:
- Fenced block (InvocationHint::FencedBlock("bash") / FencedBlock("scrape")) — the LLM emits a fenced code block with the specified tag. ShellExecutor handles ```bash blocks, WebScrapeExecutor handles ```scrape blocks containing JSON with CSS selectors.
- Structured tool call (InvocationHint::ToolCall) — the LLM emits a ToolCall with tool_id and typed params. CompositeExecutor routes the call to FileExecutor for file tools.
Both modes coexist in the same iteration. The system prompt includes invocation instructions per tool so the LLM knows exactly which format to use.
Iteration Control
The agent loop iterates tool execution until the LLM produces a response with no tool invocations, or one of the safety limits is hit.
Iteration cap
Controlled by max_tool_iterations (default: 10). The previous hardcoded limit of 3 is replaced by this configurable value.
[agent]
max_tool_iterations = 10
Environment variable: ZEPH_AGENT_MAX_TOOL_ITERATIONS.
Doom-loop detection
If 3 consecutive tool iterations produce identical output strings, the loop breaks and the agent notifies the user. This prevents infinite loops where the LLM repeatedly issues the same failing command.
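A minimal sketch of the check (the window size of 3 comes from the description above; the comparison granularity is an assumption):

/// Returns true when the last three tool outputs are identical.
fn is_doom_loop(outputs: &[String]) -> bool {
    if outputs.len() < 3 {
        return false;
    }
    let tail = &outputs[outputs.len() - 3..];
    tail.windows(2).all(|pair| pair[0] == pair[1])
}

fn main() {
    let outputs = vec!["error".to_string(), "error".to_string(), "error".to_string()];
    assert!(is_doom_loop(&outputs));
}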
Context budget check
At the start of each iteration, the agent estimates total token usage. If usage exceeds 80% of the configured context_budget_tokens, the loop stops to avoid exceeding the model’s context window.
Permissions
The [tools.permissions] section defines pattern-based access control per tool. Each tool ID maps to an ordered array of rules. Rules use glob patterns matched case-insensitively against the tool input (command string for bash, file path for file tools). First matching rule wins; if no rule matches, the default action is Ask.
Three actions are available:
| Action | Behavior |
|---|---|
allow | Execute silently without confirmation |
ask | Prompt the user for confirmation before execution |
deny | Block execution; denied tools are hidden from the LLM system prompt |
[tools.permissions.bash]
[[tools.permissions.bash]]
pattern = "*sudo*"
action = "deny"
[[tools.permissions.bash]]
pattern = "cargo *"
action = "allow"
[[tools.permissions.bash]]
pattern = "*"
action = "ask"
When [tools.permissions] is absent, legacy blocked_commands and confirm_patterns from [tools.shell] are automatically converted to equivalent permission rules (deny and ask respectively).
Output Overflow
Tool output exceeding 30,000 characters is truncated (head + tail split) before being sent to the LLM. The full untruncated output is saved to ~/.zeph/data/tool-output/{uuid}.txt, and the truncated message includes the file path so the LLM can read the complete output if needed.
Stale overflow files older than 24 hours are cleaned up automatically on startup.
Configuration
[agent]
max_tool_iterations = 10 # Max tool loop iterations (default: 10)
[tools]
enabled = true
summarize_output = false
[tools.shell]
timeout = 30
allowed_paths = [] # Sandbox directories (empty = cwd only)
[tools.file]
allowed_paths = [] # Sandbox directories for file tools (empty = cwd only)
# Pattern-based permissions (optional; overrides legacy blocked_commands/confirm_patterns)
# [tools.permissions.bash]
# [[tools.permissions.bash]]
# pattern = "cargo *"
# action = "allow"
The tools.file.allowed_paths setting controls which directories FileExecutor can access for read, write, edit, glob, and grep operations. Shell and file sandboxes are configured independently.
| Variable | Description |
|---|---|
ZEPH_AGENT_MAX_TOOL_ITERATIONS | Max tool loop iterations (default: 10) |
TUI Dashboard
Zeph includes an optional ratatui-based Terminal User Interface that replaces the plain CLI with a rich dashboard showing real-time agent metrics, conversation history, and an always-visible input line.
Enabling
The TUI requires the tui feature flag (disabled by default):
cargo build --release --features tui
Running
# Via CLI argument
zeph --tui
# Via environment variable
ZEPH_TUI=true zeph
Layout
+-------------------------------------------------------------+
| Zeph v0.9.5 | Provider: orchestrator | Model: claude-son... |
+----------------------------------------+--------------------+
| | Skills (3/15) |
| | - setup-guide |
| | - git-workflow |
| | |
| [user] Can you check my code? +--------------------+
| | Memory |
| [zeph] Sure, let me look at | SQLite: 142 msgs |
| the code structure... | Qdrant: connected |
| ▲+--------------------+
+----------------------------------------+--------------------+
| You: write a rust function for fibon_ |
+-------------------------------------------------------------+
| [Insert] | Skills: 3 | Tokens: 4.2k | Qdrant: OK | 2m 15s |
+-------------------------------------------------------------+
- Chat panel (left 70%): bottom-up message feed with full markdown rendering (bold, italic, code blocks, lists, headings), scrollbar with proportional thumb, and scroll indicators (▲/▼). Mouse wheel scrolling supported
- Side panels (right 30%): skills, memory, and resources metrics — hidden on terminals < 80 cols
- Input line: always visible, supports multiline input via Shift+Enter. Shows a [+N queued] badge when messages are pending
- Status bar: mode indicator, skill count, token usage, uptime
- Splash screen: colored block-letter “ZEPH” banner on startup
Keybindings
Normal Mode
| Key | Action |
|---|---|
i | Enter Insert mode (focus input) |
q | Quit application |
Ctrl+C | Quit application |
Up / k | Scroll chat up |
Down / j | Scroll chat down |
Page Up/Down | Scroll chat one page |
Home / End | Scroll to top / bottom |
Mouse wheel | Scroll chat up/down (3 lines per tick) |
d | Toggle side panels on/off |
Tab | Cycle side panel focus |
Insert Mode
| Key | Action |
|---|---|
Enter | Submit input to agent |
Shift+Enter | Insert newline (multiline input) |
Escape | Switch to Normal mode |
Ctrl+C | Quit application |
Ctrl+U | Clear input line |
Ctrl+K | Clear message queue |
Confirmation Modal
When a destructive command requires confirmation, a modal overlay appears:
| Key | Action |
|---|---|
Y / Enter | Confirm action |
N / Escape | Cancel action |
All other keys are blocked while the modal is visible.
Markdown Rendering
Chat messages are rendered with full markdown support via pulldown-cmark:
| Element | Rendering |
|---|---|
**bold** | Bold modifier |
*italic* | Italic modifier |
`inline code` | Blue text with dark background glow |
| Code blocks | Green text with dimmed language tag |
# Heading | Bold + underlined |
- list item | Green bullet (•) prefix |
> blockquote | Dimmed vertical bar (│) prefix |
~~strikethrough~~ | Crossed-out modifier |
--- | Horizontal rule (─) |
Thinking Blocks
When using Ollama models that emit reasoning traces (DeepSeek, Qwen), the <think>...</think> segments are rendered in a darker color (DarkGray) to visually separate model reasoning from the final response. Incomplete thinking blocks during streaming are also shown in the darker style.
Conversation History
On startup, the TUI loads the latest conversation from SQLite and displays it in the chat panel. This provides continuity across sessions.
Message Queueing
The TUI input line remains interactive during model inference, allowing you to queue up to 10 messages for sequential processing. This is useful for providing follow-up instructions without waiting for the current response to complete.
Queue Indicator
When messages are pending, a badge appears in the input area:
You: next message here [+3 queued]_
The counter shows how many messages are waiting to be processed. Queued messages are drained automatically after each response completes.
Message Merging
Consecutive messages submitted within 500ms are automatically merged with newline separators. This reduces context fragmentation when you send rapid-fire instructions.
Clearing the Queue
Press Ctrl+K in Insert mode to discard all queued messages. This is useful if you change your mind about pending instructions.
Alternatively, send the /clear-queue command to clear the queue programmatically.
Queue Limits
The queue holds a maximum of 10 messages. When full, new input is silently dropped until the agent drains the queue by processing pending messages.
Responsive Layout
The TUI adapts to terminal width:
| Width | Layout |
|---|---|
| >= 80 cols | Full layout: chat (70%) + side panels (30%) |
| < 80 cols | Side panels hidden, chat takes full width |
Live Metrics
The TUI dashboard displays real-time metrics collected from the agent loop via tokio::sync::watch channel:
| Panel | Metrics |
|---|---|
| Skills | Active/total skill count, matched skill names per query |
| Memory | SQLite message count, conversation ID, Qdrant status, embeddings generated, summaries count, tool output prunes |
| Resources | Prompt/completion/total tokens, API calls, last LLM latency (ms), provider and model name |
Metrics are updated at key instrumentation points in the agent loop:
- After each LLM call (api_calls, latency, prompt tokens)
- After streaming completes (completion tokens)
- After skill matching (active skills, total skills)
- After message persistence (sqlite message count)
- After summarization (summaries count)
Token counts use a chars/4 estimation (sufficient for dashboard display).
Deferred Model Warmup
When running with Ollama (or an orchestrator with Ollama sub-providers), model warmup is deferred until after the TUI interface renders. This means:
- The TUI appears immediately — no blank terminal while the model loads into GPU/CPU memory
- A status indicator (“warming up model…”) appears in the chat panel
- Warmup runs in the background via a spawned tokio task
- Once complete, the status updates to “model ready” and the agent loop begins processing
If you send a message before warmup finishes, it is queued and processed automatically once the model is ready.
Note: In non-TUI modes (CLI, Telegram), warmup still runs synchronously before the agent loop starts.
Architecture
The TUI runs as three concurrent loops:
- Crossterm event reader — dedicated OS thread (std::thread), sends key/tick/resize events via mpsc
- TUI render loop — tokio task, draws frames at 10 FPS via tokio::select!, polls watch::Receiver for the latest metrics before each draw
- Agent loop — existing Agent::run(), communicates via TuiChannel and emits metrics via watch::Sender
TuiChannel implements the Channel trait, so it plugs into the agent with zero changes to the generic signature. MetricsSnapshot and MetricsCollector live in zeph-core to avoid circular dependencies — zeph-tui re-exports them.
Tracing
When TUI is active, tracing output is redirected to zeph.log to avoid corrupting the terminal display.
Docker
Docker images are built without the tui feature by default (headless operation). To build a Docker image with TUI support:
docker build -f Dockerfile.dev --build-arg CARGO_FEATURES=tui -t zeph:tui .
Code Indexing
AST-based code indexing and semantic retrieval for project-aware context. The zeph-index crate parses source files via tree-sitter, chunks them by AST structure, embeds the chunks in Qdrant, and retrieves relevant code via hybrid search (semantic + grep routing) for injection into the agent context window.
Disabled by default. Enable via [index] enabled = true in config.
Why Code RAG
Cloud models with 200K token windows can afford multi-round agentic grep. Local models with 8K-32K windows cannot: a single grep cycle costs ~2K tokens (25% of an 8K budget), while 5 rounds would exceed the entire context. RAG retrieves 6-8 relevant chunks in ~3K tokens, preserving budget for history and response.
For cloud models, code RAG serves as pre-fill context alongside agentic search. For local models, it is the primary code retrieval mechanism.
Setup
- Start Qdrant (required for vector storage): docker compose up -d qdrant
- Enable indexing in config: set enabled = true under [index]
- Index your project: run zeph index, or let auto-indexing handle it on startup when auto_index = true (default)
Architecture
The zeph-index crate contains 7 modules:
| Module | Purpose |
|---|---|
| languages | Language detection from file extensions, tree-sitter grammar registry |
| chunker | AST-based chunking with greedy sibling merge (cAST-inspired algorithm) |
| context | Contextualized embedding text generation (file path + scope + imports + code) |
| store | Dual-write storage: Qdrant vectors + SQLite chunk metadata |
| indexer | Orchestrator: walk project tree, chunk files, embed, store with incremental change detection |
| retriever | Query classification, semantic search, budget-aware chunk packing |
| repo_map | Compact structural map of the project (signatures only, no function bodies) |
Pipeline
Source files
|
v
[languages.rs] detect language, load grammar
|
v
[chunker.rs] parse AST, split into chunks (target: ~600 non-ws chars)
|
v
[context.rs] prepend file path, scope chain, imports, language tag
|
v
[indexer.rs] embed via LlmProvider, skip unchanged (content hash)
|
v
[store.rs] upsert to Qdrant (vectors) + SQLite (metadata)
Retrieval
User query
|
v
[retriever.rs] classify_query()
|
+--> Semantic --> embed query --> Qdrant search --> budget pack --> inject
|
+--> Grep --> return empty (agent uses bash tools)
|
+--> Hybrid --> semantic search + hint to agent
Query Classification
The retriever classifies each query to route it to the appropriate search strategy:
| Strategy | Trigger | Action |
|---|---|---|
| Grep | Exact symbols: ::, fn , struct , CamelCase, snake_case identifiers | Agent handles via shell grep/ripgrep |
| Semantic | Conceptual queries: “how”, “where”, “why”, “explain” | Vector similarity search in Qdrant |
| Hybrid | Both symbol patterns and conceptual words | Semantic search + hint that grep may also help |
Default (no pattern match): Semantic.
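A simplified sketch of this routing heuristic follows; the actual classifier in retriever.rs may use different patterns and weights.

```rust
/// Rough sketch of query routing, not the real retriever.rs implementation.
#[derive(Debug, PartialEq)]
enum SearchStrategy {
    Grep,
    Semantic,
    Hybrid,
}

fn classify_query(query: &str) -> SearchStrategy {
    // Symbol-like patterns suggest the agent should just grep.
    let has_symbols = query.contains("::")
        || query.contains("fn ")
        || query.contains("struct ")
        || query.split_whitespace().any(|w| w.contains('_')) // snake_case
        || query.split_whitespace().any(|w| {
            // rough CamelCase check: starts uppercase and has another uppercase later
            let mut chars = w.chars();
            matches!(chars.next(), Some(c) if c.is_uppercase()) && chars.any(|c| c.is_uppercase())
        });

    // Conceptual wording suggests vector search.
    let lowered = query.to_lowercase();
    let is_conceptual = ["how", "where", "why", "explain"]
        .iter()
        .any(|kw| lowered.contains(*kw));

    match (has_symbols, is_conceptual) {
        (true, true) => SearchStrategy::Hybrid,
        (true, false) => SearchStrategy::Grep,
        _ => SearchStrategy::Semantic, // default
    }
}
```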
AST-Based Chunking
Files are parsed via tree-sitter into AST, then chunked by entity boundaries (functions, structs, classes, impl blocks). The algorithm uses greedy sibling merge:
- Target size: 600 non-whitespace characters (~300-400 tokens)
- Max size: 1200 non-ws chars (forced recursive split)
- Min size: 100 non-ws chars (merge with adjacent sibling)
Config files (TOML, JSON, Markdown, Bash) are indexed as single file-level chunks since they lack named entities.
Each chunk carries rich metadata: file path, language, AST node type, entity name, line range, scope chain (e.g. MyStruct > impl MyStruct > my_method), imports, and a BLAKE3 content hash for change detection.
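A conceptual sketch of the greedy sibling merge, operating on pre-extracted sibling strings rather than real tree-sitter nodes. The actual chunker also splits oversized entities recursively and folds chunks below the 100 non-whitespace-character minimum into an adjacent sibling.

```rust
/// Conceptual sketch of greedy sibling merge; sizes are counted in
/// non-whitespace characters, as described above.
const TARGET: usize = 600;
const MAX: usize = 1200;

fn non_ws_len(s: &str) -> usize {
    s.chars().filter(|c| !c.is_whitespace()).count()
}

fn merge_siblings(siblings: Vec<String>) -> Vec<String> {
    let mut chunks: Vec<String> = Vec::new();
    for node in siblings {
        // Merge into the previous chunk while it is still under the target
        // size and the merge would not blow past the hard maximum.
        let can_merge = chunks.last().is_some_and(|last| {
            non_ws_len(last) < TARGET && non_ws_len(last) + non_ws_len(&node) <= MAX
        });
        if can_merge {
            let last = chunks.last_mut().unwrap(); // non-empty: checked by can_merge
            last.push('\n');
            last.push_str(&node);
        } else {
            chunks.push(node);
        }
    }
    chunks
}
```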
Contextualized Embeddings
Embedding raw code alone yields poor retrieval quality for conceptual queries. Before embedding, each chunk is prepended with:
- File path (# src/agent.rs)
- Scope chain (# Scope: Agent > prepare_context)
- Language tag (# Language: rust)
- First 5 import/use statements
This contextualized form improves retrieval for queries like “where is auth handled?” where the code alone might not contain the word “auth”.
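A sketch of how the contextualized embedding text could be assembled; the struct and formatting details here are assumptions for illustration, not the actual context.rs implementation.

```rust
/// Illustrative only: builds the text that gets embedded instead of raw code.
struct ChunkContext<'a> {
    file_path: &'a str,
    scope_chain: &'a [&'a str],
    language: &'a str,
    imports: &'a [&'a str],
    code: &'a str,
}

fn contextualize(chunk: &ChunkContext) -> String {
    let mut out = String::new();
    out.push_str(&format!("# {}\n", chunk.file_path));
    if !chunk.scope_chain.is_empty() {
        out.push_str(&format!("# Scope: {}\n", chunk.scope_chain.join(" > ")));
    }
    out.push_str(&format!("# Language: {}\n", chunk.language));
    for import in chunk.imports.iter().take(5) {
        out.push_str(import);
        out.push('\n');
    }
    out.push('\n');
    out.push_str(chunk.code);
    out
}
```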
Storage
Chunks are dual-written to two stores:
| Store | Data | Purpose |
|---|---|---|
| Qdrant (zeph_code_chunks) | Embedding vectors + payload (code, metadata) | Semantic similarity search |
| SQLite (chunk_metadata) | File path, content hash, line range, language, node type | Change detection, cleanup of deleted files |
The Qdrant collection uses INT8 scalar quantization for ~4x memory reduction with minimal accuracy loss. Payload indexes on language, file_path, and node_type enable filtered search.
Incremental Indexing
On subsequent runs, the indexer skips unchanged chunks by checking BLAKE3 content hashes in SQLite. Only modified or new files are re-embedded. Deleted files are detected by comparing the current file set against the SQLite index, and their chunks are removed from both stores.
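A minimal sketch of the skip check, assuming the stored hash comes from the SQLite chunk metadata and using the blake3 crate.

```rust
/// Sketch only: a chunk is re-embedded when its BLAKE3 hash differs from the
/// stored one. `stored_hash` stands in for the SQLite lookup.
fn needs_reembedding(chunk_text: &str, stored_hash: Option<&str>) -> (bool, String) {
    let current = blake3::hash(chunk_text.as_bytes()).to_hex().to_string();
    let changed = stored_hash != Some(current.as_str());
    (changed, current) // persist `current` back to SQLite after embedding
}
```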
File Watcher
When watch = true (default), an IndexWatcher monitors project files for changes during the session. On file modification, the changed file is automatically re-indexed via reindex_file() without rebuilding the entire index. The watcher uses 1-second debounce to batch rapid changes and only processes files with indexable extensions.
Disable with:
[index]
watch = false
Repo Map
A lightweight structural map of the project, generated via tree-sitter signature extraction (no function bodies). Included in the system prompt and cached with a configurable TTL (default: 5 minutes) to avoid per-message filesystem traversal.
Example output:
<repo_map>
src/agent.rs :: struct:Agent, impl:Agent, fn:new, fn:run, fn:prepare_context
src/config.rs :: struct:Config, fn:load
src/main.rs :: fn:main, fn:setup_logging
... and 12 more files
</repo_map>
The map is budget-constrained (default: 1024 tokens) and sorted by symbol count (files with more symbols appear first). It gives the model a structural overview of the project without consuming significant context.
Budget-Aware Retrieval
Retrieved chunks are packed into a token budget (default: 40% of available context for code). Chunks are sorted by similarity score and greedily packed until the budget is exhausted. A minimum score threshold (default: 0.25) filters low-relevance results.
Retrieved code is injected as a transient <code_context> XML block before the conversation history. It is re-generated on every turn and never persisted.
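A sketch of the greedy packing step under the chars/4 token estimate used elsewhere in these docs; names are illustrative, not the retriever API.

```rust
/// Sketch of budget packing: filter by score threshold, sort by similarity,
/// then pack chunks until the token budget is exhausted.
struct ScoredChunk {
    score: f32,
    text: String,
}

fn pack_chunks(mut chunks: Vec<ScoredChunk>, budget_tokens: usize, min_score: f32) -> Vec<ScoredChunk> {
    chunks.retain(|c| c.score >= min_score);
    chunks.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap_or(std::cmp::Ordering::Equal));

    let mut used = 0;
    let mut packed = Vec::new();
    for chunk in chunks {
        let cost = chunk.text.chars().count() / 4; // rough token estimate
        if used + cost > budget_tokens {
            break; // budget exhausted
        }
        used += cost;
        packed.push(chunk);
    }
    packed
}
```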
Context Window Layout (with Code RAG)
When code indexing is enabled, the context window includes two additional sections:
+---------------------------------------------------+
| System prompt + environment + ZEPH.md |
+---------------------------------------------------+
| <repo_map> (structural overview, cached) | <= 1024 tokens
+---------------------------------------------------+
| <available_skills> |
+---------------------------------------------------+
| <code_context> (per-query RAG chunks, transient) | <= 30% available
+---------------------------------------------------+
| [semantic recall] past messages | <= 10% available
+---------------------------------------------------+
| Recent message history | <= 50% available
+---------------------------------------------------+
| [response reserve] | 20% of total
+---------------------------------------------------+
Configuration
[index]
# Enable codebase indexing for semantic code search.
# Requires Qdrant running (uses separate collection "zeph_code_chunks").
enabled = false
# Auto-index on startup and re-index changed files during session.
auto_index = true
# Directories to index (relative to cwd).
paths = ["."]
# Patterns to exclude (in addition to .gitignore).
exclude = ["target", "node_modules", ".git", "vendor", "dist", "build", "__pycache__"]
# Token budget for repo map in system prompt (0 = no repo map).
repo_map_budget = 1024
# Cache TTL for repo map in seconds (avoids per-message regeneration).
repo_map_ttl_secs = 300
[index.chunker]
# Target chunk size in non-whitespace characters (~300-400 tokens).
target_size = 600
# Maximum chunk size before forced split.
max_size = 1200
# Minimum chunk size — smaller chunks merge with siblings.
min_size = 100
[index.retrieval]
# Maximum chunks to fetch from Qdrant (before budget packing).
max_chunks = 12
# Minimum cosine similarity score to accept.
score_threshold = 0.25
# Maximum fraction of available context budget for code chunks.
budget_ratio = 0.40
Supported Languages
Language support is controlled by feature flags on the zeph-index crate. All default features are enabled when the index binary feature is active.
| Language | Feature | Extensions |
|---|---|---|
| Rust | lang-rust | .rs |
| Python | lang-python | .py, .pyi |
| JavaScript | lang-js | .js, .jsx, .mjs, .cjs |
| TypeScript | lang-js | .ts, .tsx, .mts, .cts |
| Go | lang-go | .go |
| Bash | lang-config | .sh, .bash, .zsh |
| TOML | lang-config | .toml |
| JSON | lang-config | .json, .jsonc |
| Markdown | lang-config | .md, .markdown |
Environment Variables
| Variable | Description | Default |
|---|---|---|
| ZEPH_INDEX_ENABLED | Enable code indexing | false |
| ZEPH_INDEX_AUTO_INDEX | Auto-index on startup | true |
| ZEPH_INDEX_REPO_MAP_BUDGET | Token budget for repo map | 1024 |
| ZEPH_INDEX_REPO_MAP_TTL_SECS | Cache TTL for repo map in seconds | 300 |
Embedding Model Recommendations
The indexer uses the same LlmProvider::embed() as semantic memory. Any embedding model works. For code-heavy workloads:
| Model | Dims | Notes |
|---|---|---|
| qwen3-embedding | 1024 | Current Zeph default, good general performance |
| nomic-embed-text | 768 | Lightweight universal model |
| nomic-embed-code | 768 | Optimized for code, higher RAM (~7.5GB) |
Architecture Overview
Cargo workspace (Edition 2024, resolver 3) with 10 crates + binary root.
Requires Rust 1.88+. Native async traits are used throughout — no async-trait crate.
Workspace Layout
zeph (binary) — thin bootstrap glue
├── zeph-core Agent loop, config, config hot-reload, channel trait, context builder
├── zeph-llm LlmProvider trait, Ollama + Claude + OpenAI + Candle backends, orchestrator, embeddings
├── zeph-skills SKILL.md parser, registry with lazy body loading, embedding matcher, resource resolver, hot-reload
├── zeph-memory SQLite + Qdrant, SemanticMemory orchestrator, summarization
├── zeph-channels Telegram adapter (teloxide) with streaming
├── zeph-tools ToolExecutor trait, ShellExecutor, WebScrapeExecutor, CompositeExecutor
├── zeph-index AST-based code indexing, hybrid retrieval, repo map (optional)
├── zeph-mcp MCP client via rmcp, multi-server lifecycle, unified tool matching (optional)
├── zeph-a2a A2A protocol client + server, agent discovery, JSON-RPC 2.0 (optional)
└── zeph-tui ratatui TUI dashboard with real-time metrics (optional)
Dependency Graph
zeph (binary)
└── zeph-core (orchestrates everything)
├── zeph-llm (leaf)
├── zeph-skills (leaf)
├── zeph-memory (leaf)
├── zeph-channels (leaf)
├── zeph-tools (leaf)
├── zeph-index (optional, leaf)
├── zeph-mcp (optional, leaf)
├── zeph-a2a (optional, leaf)
└── zeph-tui (optional, leaf)
zeph-core is the only crate that depends on other workspace crates. All leaf crates are independent and can be tested in isolation.
Agent Loop
The agent loop processes user input in a continuous cycle:
1. Read initial user message via channel.recv()
2. Build context from skills, memory, and environment
3. Stream LLM response token-by-token
4. Execute any tool calls in the response
5. Drain queued messages (if any) via channel.try_recv() and repeat from step 2
Queued messages are processed sequentially with full context rebuilding between each. Consecutive messages within 500ms are merged to reduce fragmentation. The queue holds a maximum of 10 messages; older messages are dropped when full.
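A pseudocode-level sketch of this cycle; the helper functions are illustrative stubs, not zeph-core APIs, and the real Agent::run is generic over provider, channel, and tool executor.

```rust
use std::collections::VecDeque;

/// Sketch of the agent cycle. In the real loop, responses are streamed and
/// the queue is fed by the channel rather than passed in up front.
async fn agent_cycle(first_message: String, mut queue: VecDeque<String>) {
    let mut message = first_message;              // 1. initial message from recv()
    loop {
        let context = build_context(&message);    // 2. skills + memory + environment
        let response = call_llm(&context).await;  // 3. LLM response
        execute_tool_calls(&response).await;       // 4. run any tool calls
        match queue.pop_front() {                  // 5. drain queued messages
            Some(next) => message = next,          //    and repeat from step 2
            None => break,
        }
    }
}

// Placeholder helpers so the sketch compiles on its own.
fn build_context(message: &str) -> String {
    format!("<context for: {message}>")
}
async fn call_llm(context: &str) -> String {
    format!("<response to: {context}>")
}
async fn execute_tool_calls(_response: &str) {}
```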
Key Design Decisions
- Generic Agent: Agent<P: LlmProvider + Clone + 'static, C: Channel, T: ToolExecutor> — fully generic over provider, channel, and tool executor
- TLS: rustls everywhere (no openssl-sys)
- Errors: thiserror for library crates, anyhow for application code (zeph-core, main.rs)
- Lints: workspace-level clippy::all + clippy::pedantic + clippy::nursery; unsafe_code = "deny"
- Dependencies: versions only in root [workspace.dependencies]; crates inherit via workspace = true
- Feature gates: optional crates (zeph-index, zeph-mcp, zeph-a2a, zeph-tui) are feature-gated in the binary
- Context engineering: proportional budget allocation, semantic recall injection, message trimming, runtime compaction, environment context injection, progressive skill loading, ZEPH.md project config discovery
Crates
Each workspace crate has a focused responsibility. All leaf crates are independent and testable in isolation; only zeph-core depends on other workspace members.
zeph-core
Agent loop, configuration loading, and context builder.
- Agent<P, C, T> — main agent loop with streaming support, message queue drain, configurable max_tool_iterations (default 10), doom-loop detection, and context budget check (stops at 80% threshold)
- Config — TOML config loading with env var overrides
- Channel trait — abstraction for I/O (CLI, Telegram, TUI) with recv(), try_recv(), send_queue_count() for queue management
- Context builder — assembles system prompt from skills, memory, summaries, environment, and project config
- Context engineering — proportional budget allocation, semantic recall injection, message trimming, runtime compaction
- EnvironmentContext — runtime gathering of cwd, git branch, OS, model name
- project.rs — ZEPH.md config discovery (walk up directory tree)
- VaultProvider trait — pluggable secret resolution
- MetricsSnapshot / MetricsCollector — real-time metrics via tokio::sync::watch for TUI dashboard
zeph-llm
LLM provider abstraction and backend implementations.
- LlmProvider trait — chat(), chat_stream(), embed(), supports_streaming(), supports_embeddings()
- OllamaProvider — local inference via ollama-rs
- ClaudeProvider — Anthropic Messages API with SSE streaming
- OpenAiProvider — OpenAI + compatible APIs (raw reqwest)
- CandleProvider — local GGUF model inference via candle
- AnyProvider — enum dispatch for runtime provider selection
- ModelOrchestrator — task-based multi-model routing with fallback chains
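For reference, a rough sketch of what the provider contract looks like with native async fn in traits. Signatures and the associated error type are assumptions; the streaming method is elided because its return type depends on the implementation. See zeph-llm for the real trait.

```rust
/// Illustrative sketch of the provider contract, not the zeph-llm definition.
pub struct Message {
    pub role: String,
    pub content: String,
}

pub trait LlmProvider {
    type Error;

    async fn chat(&self, messages: &[Message]) -> Result<String, Self::Error>;
    async fn embed(&self, text: &str) -> Result<Vec<f32>, Self::Error>;
    fn supports_streaming(&self) -> bool;
    fn supports_embeddings(&self) -> bool;
    // chat_stream() elided: streaming return types vary by backend.
}
```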
zeph-skills
SKILL.md loader, skill registry, and prompt formatter.
- SkillMeta / Skill — metadata + lazy body loading via OnceLock
- SkillRegistry — manages skill lifecycle, lazy body access
- SkillMatcher — in-memory cosine similarity matching
- QdrantSkillMatcher — persistent embeddings with BLAKE3 delta sync
- format_skills_prompt() — assembles prompt with OS-filtered resources
- format_skills_catalog() — description-only entries for non-matched skills
- resource.rs — discover_resources() + load_resource() with path traversal protection
- Filesystem watcher for hot-reload (500ms debounce)
zeph-memory
SQLite-backed conversation persistence with Qdrant vector search.
- SqliteStore — conversations, messages, summaries, skill usage, skill versions
- QdrantStore — vector storage and cosine similarity search
- SemanticMemory<P> — orchestrator coordinating SQLite + Qdrant + LlmProvider
- Automatic collection creation, graceful degradation without Qdrant
zeph-channels
Channel implementations for the Zeph agent.
- CliChannel — stdin/stdout with immediate streaming output, blocking recv (queue always empty)
- TelegramChannel — teloxide adapter with MarkdownV2 rendering, streaming via edit-in-place, user whitelisting, inline confirmation keyboards, mpsc-backed message queue with 500ms merge window
zeph-tools
Tool execution abstraction and shell backend.
- ToolExecutor trait — accepts LLM response or structured ToolCall, returns tool output
- ToolRegistry — typed definitions for 7 built-in tools (bash, read, edit, write, glob, grep, web_scrape), injected into system prompt as <tools> catalog
- ToolCall / execute_tool_call() — structured tool invocation with typed parameters alongside legacy bash extraction (dual-mode)
- FileExecutor — sandboxed file operations (read, write, edit, glob, grep) with ancestor-walk path canonicalization
- ShellExecutor — bash block parser, command safety filter, sandbox validation
- WebScrapeExecutor — HTML scraping with CSS selectors, SSRF protection
- CompositeExecutor<A, B> — generic chaining with first-match-wins dispatch, routes structured tool calls by tool_id to the appropriate backend
- AuditLogger — structured JSON audit trail for all executions
- truncate_tool_output() — head+tail split at 30K chars with UTF-8 safe boundaries
zeph-index
AST-based code indexing, semantic retrieval, and repo map generation (optional, feature-gated).
- Lang enum — supported languages with tree-sitter grammar registry, feature-gated per language group
- chunk_file() — AST-based chunking with greedy sibling merge, scope chains, import extraction
- contextualize_for_embedding() — prepends file path, scope, language, imports to code for better embedding quality
- CodeStore — dual-write storage: Qdrant vectors (zeph_code_chunks collection) + SQLite metadata with BLAKE3 content-hash change detection
- CodeIndexer<P> — project indexer orchestrator: walk, chunk, embed, store with incremental skip of unchanged chunks
- CodeRetriever<P> — hybrid retrieval with query classification (Semantic / Grep / Hybrid), budget-aware chunk packing
- generate_repo_map() — compact structural view via tree-sitter signature extraction, budget-constrained
zeph-mcp
MCP client for external tool servers (optional, feature-gated).
- McpClient / McpManager — multi-server lifecycle management
- McpToolExecutor — tool execution via MCP protocol
- McpToolRegistry — tool embeddings in Qdrant with delta sync
- Dual transport: Stdio (child process) and HTTP (Streamable HTTP)
- Dynamic server management via /mcp add, /mcp remove
zeph-a2a
A2A protocol client and server (optional, feature-gated).
- A2aClient — JSON-RPC 2.0 client with SSE streaming
- AgentRegistry — agent card discovery with TTL cache
- AgentCardBuilder — construct agent cards from runtime config
- A2A Server — axum-based HTTP server with bearer auth, rate limiting, body size limits
- TaskManager — in-memory task lifecycle management
zeph-tui
ratatui-based TUI dashboard (optional, feature-gated).
- TuiChannel — Channel trait implementation bridging agent loop and TUI render loop via mpsc, oneshot-based confirmation dialog, bounded message queue (max 10) with 500ms merge window
- App — TUI state machine with Normal/Insert/Confirm modes, keybindings, scroll, live metrics polling via watch::Receiver, queue badge indicator [+N queued], Ctrl+K to clear queue
- EventReader — crossterm event loop on dedicated OS thread (avoids tokio starvation)
- Side panel widgets: skills (active/total), memory (SQLite, Qdrant, embeddings), resources (tokens, API calls, latency)
- Chat widget with bottom-up message feed, pulldown-cmark markdown rendering, scrollbar with proportional thumb, mouse scroll, thinking block segmentation, and streaming cursor
- Splash screen widget with colored block-letter banner
- Conversation history loading from SQLite on startup
- Confirmation modal overlay widget with Y/N keybindings and focus capture
- Responsive layout: side panels hidden on terminals < 80 cols
- Multiline input via Shift+Enter
- Status bar with mode, skill count, tokens, Qdrant status, uptime
- Panic hook for terminal state restoration
- Re-exports MetricsSnapshot / MetricsCollector from zeph-core
Token Efficiency
Zeph’s prompt construction is designed to minimize token usage regardless of how many skills and MCP tools are installed.
The Problem
Naive AI agent implementations inject all available tools and instructions into every prompt. With 50 skills and 100 MCP tools, this means thousands of tokens consumed on every request — most of which are irrelevant to the user’s query.
Zeph’s Approach
Embedding-Based Selection
Per query, only the top-K most relevant skills (default: 5) are selected via cosine similarity of vector embeddings. The same pipeline handles MCP tools.
User query → embed(query) → cosine_similarity(query, skills) → top-K → inject into prompt
This makes prompt size O(K) instead of O(N), where:
- K = max_active_skills (default: 5, configurable)
- N = total skills + MCP tools installed
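A minimal sketch of the top-K selection step; the struct and function names are illustrative, not the zeph-skills API.

```rust
/// Rank installed skills by cosine similarity to the query embedding and
/// keep only the best K. Sketch only.
struct SkillEmbedding {
    name: String,
    vector: Vec<f32>,
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

fn top_k_skills(query: &[f32], skills: &[SkillEmbedding], k: usize) -> Vec<String> {
    let mut scored: Vec<(f32, &str)> = skills
        .iter()
        .map(|s| (cosine(query, &s.vector), s.name.as_str()))
        .collect();
    scored.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap_or(std::cmp::Ordering::Equal));
    scored.into_iter().take(k).map(|(_, name)| name.to_string()).collect()
}
```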
Progressive Loading
Even selected skills don’t load everything at once:
| Stage | What loads | When | Token cost |
|---|---|---|---|
| Startup | Skill metadata (name, description) | Once | ~100 tokens per skill |
| Query | Skill body (instructions, examples) | On match | <5000 tokens per skill |
| Query | Resource files (references, scripts) | On match + OS filter | Variable |
Metadata is always in memory for matching. Bodies are loaded lazily via OnceLock and cached after first access. Resources are loaded on demand with OS filtering (e.g., linux.md only loads on Linux).
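A sketch of the lazy-body pattern with OnceLock; the field and method names here are assumptions for illustration.

```rust
use std::path::PathBuf;
use std::sync::OnceLock;

/// Illustrative: metadata stays resident, the body is read from disk on
/// first access and cached for subsequent matches.
struct Skill {
    name: String,
    description: String, // metadata, always in memory (~100 tokens)
    body_path: PathBuf,
    body: OnceLock<String>,
}

impl Skill {
    fn body(&self) -> &str {
        self.body
            .get_or_init(|| std::fs::read_to_string(&self.body_path).unwrap_or_default())
    }
}
```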
Two-Tier Skill Catalog
Non-matched skills are listed in a description-only <other_skills> catalog — giving the model awareness of all available capabilities without injecting their full bodies. This means the model can request a specific skill if needed, while consuming only ~20 tokens per unmatched skill instead of thousands.
MCP Tool Matching
MCP tools follow the same pipeline:
- Tools are embedded in Qdrant (zeph_mcp_tools collection) with BLAKE3 content-hash delta sync
- Only re-embedded when tool definitions change
- Unified matching ranks both skills and MCP tools by relevance score
- Prompt contains only the top-K combined results
Practical Impact
| Scenario | Naive approach | Zeph |
|---|---|---|
| 10 skills, no MCP | ~50K tokens/prompt | ~25K tokens/prompt |
| 50 skills, 100 MCP tools | ~250K tokens/prompt | ~25K tokens/prompt |
| 200 skills, 500 MCP tools | ~1M tokens/prompt | ~25K tokens/prompt |
Prompt size stays constant as you add more capabilities. The only cost of more skills is a slightly larger embedding index in Qdrant or memory.
Two-Tier Context Pruning
Long conversations accumulate tool outputs that consume significant context space. Zeph uses a two-tier strategy: Tier 1 selectively prunes old tool outputs (cheap, no LLM call), and Tier 2 falls back to full LLM compaction only when Tier 1 is insufficient. See Context Engineering for details.
Configuration
[skills]
max_active_skills = 5 # Increase for broader context, decrease for faster/cheaper queries
export ZEPH_SKILLS_MAX_ACTIVE=3 # Override via env var
Security
Zeph implements defense-in-depth security for safe AI agent operations in production environments.
Shell Command Filtering
All shell commands from LLM responses pass through a security filter before execution. Commands matching blocked patterns are rejected with detailed error messages.
12 blocked patterns by default:
| Pattern | Risk Category | Examples |
|---|---|---|
| rm -rf /, rm -rf /* | Filesystem destruction | Prevents accidental system wipe |
| sudo, su | Privilege escalation | Blocks unauthorized root access |
| mkfs, fdisk | Filesystem operations | Prevents disk formatting |
| dd if=, dd of= | Low-level disk I/O | Blocks dangerous write operations |
| curl \| bash, wget \| sh | Arbitrary code execution | Prevents remote code injection |
| nc, ncat, netcat | Network backdoors | Blocks reverse shell attempts |
| shutdown, reboot, halt | System control | Prevents service disruption |
Configuration:
[tools.shell]
timeout = 30
blocked_commands = ["custom_pattern"] # Additional patterns (additive to defaults)
allowed_paths = ["/home/user/workspace"] # Restrict filesystem access
allow_network = true # false blocks curl/wget/nc
confirm_patterns = ["rm ", "git push -f"] # Destructive command patterns
Custom blocked patterns are additive — you cannot weaken default security. Matching is case-insensitive.
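A minimal sketch of this kind of case-insensitive, additive check; the real filter may match patterns more precisely.

```rust
/// Sketch only: a command is blocked if it contains any default or
/// user-supplied pattern, compared case-insensitively.
fn is_blocked(command: &str, default_patterns: &[&str], custom_patterns: &[&str]) -> bool {
    let cmd = command.to_lowercase();
    default_patterns
        .iter()
        .chain(custom_patterns.iter())
        .any(|pattern| cmd.contains(&pattern.to_lowercase()))
}

// is_blocked("SUDO rm -rf /", &["sudo", "rm -rf /"], &["custom_pattern"]) == true
```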
Shell Sandbox
Commands are validated against a configurable filesystem allowlist before execution:
- allowed_paths = [] (default) restricts access to the working directory only
- Paths are canonicalized to prevent traversal attacks (../../etc/passwd)
- allow_network = false blocks network tools (curl, wget, nc, ncat, netcat)
Destructive Command Confirmation
Commands matching confirm_patterns trigger an interactive confirmation before execution:
- CLI: y/N prompt on stdin
- Telegram: inline keyboard with Confirm/Cancel buttons
- Default patterns: rm, git push -f, git push --force, drop table, drop database, truncate
- Configurable via tools.shell.confirm_patterns in TOML
File Executor Sandbox
FileExecutor enforces the same allowed_paths sandbox as the shell executor for all file operations (read, write, edit, glob, grep).
Path validation:
- All paths are resolved to absolute form and canonicalized before access
- Non-existing paths (e.g., for write) use ancestor-walk canonicalization: the resolver walks up the path tree to the nearest existing ancestor, canonicalizes it, then re-appends the remaining segments. This prevents symlink and .. traversal on paths that do not yet exist on disk (see the sketch below)
- If the resolved path does not fall under any entry in allowed_paths, the operation is rejected with a SandboxViolation error
Glob and grep enforcement:
- glob results are post-filtered: matched paths outside the sandbox are silently excluded
- grep validates the search root directory before scanning begins
Configuration is shared with the shell sandbox:
[tools.shell]
allowed_paths = ["/home/user/workspace"] # Empty = cwd only
Permission Policy
The [tools.permissions] config section provides fine-grained, pattern-based access control for each tool. Rules are evaluated in order (first match wins) using case-insensitive glob patterns against the tool input. See Tool System — Permissions for configuration details.
Key security properties:
- Tools with all-deny rules are excluded from the LLM system prompt, preventing the model from attempting to use them
- Legacy blocked_commands and confirm_patterns are auto-migrated to equivalent permission rules when [tools.permissions] is absent
- Default action when no rule matches is Ask (confirmation required)
Audit Logging
Structured JSON audit log for all tool executions:
[tools.audit]
enabled = true
destination = "./data/audit.jsonl" # or "stdout"
Each entry includes timestamp, tool name, command, result (success/blocked/error/timeout), and duration in milliseconds.
Secret Redaction
LLM responses are scanned for common secret patterns before display:
- Detected patterns: sk-, AKIA, ghp_, gho_, xoxb-, xoxp-, sk_live_, sk_test_, -----BEGIN
- Secrets replaced with [REDACTED], preserving original whitespace formatting
- Enabled by default (security.redact_secrets = true), applied to both streaming and non-streaming responses
Timeout Policies
Configurable per-operation timeouts prevent hung connections:
[timeouts]
llm_seconds = 120 # LLM chat completion
embedding_seconds = 30 # Embedding generation
a2a_seconds = 30 # A2A remote calls
A2A Network Security
- TLS enforcement: a2a.require_tls = true rejects HTTP endpoints (HTTPS only)
- SSRF protection: a2a.ssrf_protection = true blocks private IP ranges (RFC 1918, loopback, link-local) via DNS resolution
- Payload limits: a2a.max_body_size caps request body (default: 1 MiB)
Safe execution model:
- Commands parsed for blocked patterns, then sandbox-validated, then confirmation-checked
- Timeout enforcement (default: 30s, configurable)
- Full errors logged to system, sanitized messages shown to users
- Audit trail for all tool executions (when enabled)
Container Security
| Security Layer | Implementation | Status |
|---|---|---|
| Base image | Oracle Linux 9 Slim | Production-hardened |
| Vulnerability scanning | Trivy in CI/CD | 0 HIGH/CRITICAL CVEs |
| User privileges | Non-root zeph user (UID 1000) | Enforced |
| Attack surface | Minimal package installation | Distroless-style |
Continuous security:
- Every release scanned with Trivy before publishing
- Automated Dependabot PRs for dependency updates
- cargo-deny checks in CI for license/vulnerability compliance
Code Security
Rust-native memory safety guarantees:
- Minimal unsafe: one audited unsafe block behind the candle feature flag (memory-mapped safetensors loading). Core crates enforce #![deny(unsafe_code)]
- No panic in production: unwrap() and expect() linted via clippy
- Secure dependencies: all crates audited with cargo-deny
- MSRV policy: Rust 1.88+ (Edition 2024) for latest security patches
Reporting Vulnerabilities
Do not open a public issue. Use GitHub Security Advisories to submit a private report.
Include: description, steps to reproduce, potential impact, suggested fix. Expect an initial response within 72 hours.
Feature Flags
Zeph uses Cargo feature flags to control optional functionality. Default features cover common use cases; platform-specific and experimental features are opt-in.
| Feature | Default | Description |
|---|---|---|
| a2a | Enabled | A2A protocol client and server for agent-to-agent communication |
| openai | Enabled | OpenAI-compatible provider (GPT, Together, Groq, Fireworks, etc.) |
| mcp | Enabled | MCP client for external tool servers via stdio/HTTP transport |
| candle | Enabled | Local HuggingFace model inference via candle (GGUF quantized models) |
| orchestrator | Enabled | Multi-model routing with task-based classification and fallback chains |
| self-learning | Enabled | Skill evolution via failure detection, self-reflection, and LLM-generated improvements |
| vault-age | Enabled | Age-encrypted vault backend for file-based secret storage (age) |
| index | Enabled | AST-based code indexing and semantic retrieval via tree-sitter (guide) |
| tui | Disabled | ratatui-based TUI dashboard with real-time agent metrics |
| metal | Disabled | Metal GPU acceleration for candle on macOS (implies candle) |
| cuda | Disabled | CUDA GPU acceleration for candle on Linux (implies candle) |
Build Examples
cargo build --release # all default features
cargo build --release --features metal # macOS with Metal GPU
cargo build --release --features cuda # Linux with NVIDIA GPU
cargo build --release --features tui # with TUI dashboard
cargo build --release --no-default-features # minimal binary
zeph-index Language Features
When index is enabled, tree-sitter grammars are controlled by sub-features on the zeph-index crate. All are enabled by default.
| Feature | Languages |
|---|---|
| lang-rust | Rust |
| lang-python | Python |
| lang-js | JavaScript, TypeScript |
| lang-go | Go |
| lang-config | Bash, TOML, JSON, Markdown |
Contributing
Thank you for considering contributing to Zeph.
Getting Started
- Fork the repository
- Clone your fork and create a branch from main
- Install Rust 1.88+ (Edition 2024 required)
- Run cargo build to verify the setup
Development
Build
cargo build
Test
# Run unit tests only (exclude integration tests)
cargo nextest run --workspace --lib --bins
# Run all tests including integration tests (requires Docker)
cargo nextest run --workspace --profile ci
Nextest profiles (.config/nextest.toml):
- default: Runs all tests (unit + integration)
- ci: CI environment, runs all tests with JUnit XML output for reporting
Integration Tests
Integration tests use testcontainers-rs to automatically spin up Docker containers for external services (Qdrant, etc.).
Prerequisites: Docker must be running on your machine.
# Run only integration tests
cargo nextest run --workspace --test '*integration*'
# Run unit tests only (skip integration tests)
cargo nextest run --workspace --lib --bins
# Run all tests
cargo nextest run --workspace
Integration test files are located in each crate’s tests/ directory and follow the *_integration.rs naming convention.
Lint
cargo +nightly fmt --check
cargo clippy --all-targets
Coverage
cargo llvm-cov --all-features --workspace
Workspace Structure
| Crate | Purpose |
|---|---|
| zeph-core | Agent loop, config, channel trait |
| zeph-llm | LlmProvider trait, Ollama + Claude + OpenAI + Candle backends |
| zeph-skills | SKILL.md parser, registry, prompt formatter |
| zeph-memory | SQLite conversation persistence, Qdrant vector search |
| zeph-channels | Telegram adapter |
| zeph-tools | Tool executor, shell sandbox, web scraper |
| zeph-index | AST-based code indexing, semantic retrieval, repo map |
| zeph-mcp | MCP client, multi-server lifecycle |
| zeph-a2a | A2A protocol client and server |
| zeph-tui | ratatui TUI dashboard with real-time metrics |
Pull Requests
- Create a feature branch: feat/<scope>/<description> or fix/<scope>/<description>
- Keep changes focused — one logical change per PR
- Add tests for new functionality
- Ensure all checks pass: cargo +nightly fmt, cargo clippy, cargo nextest run --lib --bins
- Write a clear PR description following the template
Commit Messages
- Use imperative mood: “Add feature” not “Added feature”
- Keep the first line under 72 characters
- Reference related issues when applicable
Code Style
- Follow workspace clippy lints (pedantic enabled)
- Use cargo +nightly fmt for formatting
- Avoid unnecessary comments — code should be self-explanatory
- Comments are only for cognitively complex blocks
License
By contributing, you agree that your contributions will be licensed under the MIT License.
Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog.
See the full CHANGELOG.md in the repository for the complete version history.