Zeph
Lightweight AI agent with hybrid inference (Ollama / Claude / OpenAI / HuggingFace via candle), skills-first architecture, semantic memory with Qdrant, MCP client, A2A protocol support, multi-model orchestration, self-learning skill evolution, and multi-channel I/O.
Only relevant skills and MCP tools are injected into each prompt via vector similarity — keeping token usage minimal regardless of how many are installed.
Cross-platform: Linux, macOS, Windows (x86_64 + ARM64).
Key Features
- Hybrid inference — Ollama (local), Claude (Anthropic), OpenAI (GPT + compatible APIs), Candle (HuggingFace GGUF)
- Skills-first architecture — embedding-based skill matching selects only top-K relevant skills per query, not all
- Semantic memory — SQLite for structured data + Qdrant for vector similarity search
- MCP client — connect external tool servers via Model Context Protocol (stdio + HTTP transport)
- A2A protocol — agent-to-agent communication via JSON-RPC 2.0 with SSE streaming
- Model orchestrator — route tasks to different providers with automatic fallback chains
- Self-learning — skills evolve through failure detection, self-reflection, and LLM-generated improvements
- Code indexing — AST-based code RAG with tree-sitter, hybrid retrieval (semantic + grep routing), repo map
- Context engineering — proportional budget allocation, semantic recall injection, runtime compaction, smart tool output summarization, ZEPH.md project config
- Multi-channel I/O — CLI, Telegram, and TUI with streaming support
- Token-efficient — prompt size is O(K) not O(N), where K is max active skills and N is total installed
Quick Start
git clone https://github.com/bug-ops/zeph
cd zeph
cargo build --release
./target/release/zeph
See Installation for pre-built binaries and Docker options.
Requirements
- Rust 1.88+ (Edition 2024)
- Ollama (for local inference and embeddings) or cloud API key (Claude / OpenAI)
- Docker (optional, for Qdrant semantic memory and containerized deployment)
Installation
Install Zeph from source, pre-built binaries, or Docker.
From Source
git clone https://github.com/bug-ops/zeph
cd zeph
cargo build --release
The binary is produced at target/release/zeph.
Pre-built Binaries
Download from GitHub Releases:
| Platform | Architecture | Download |
|---|---|---|
| Linux | x86_64 | zeph-x86_64-unknown-linux-gnu.tar.gz |
| Linux | aarch64 | zeph-aarch64-unknown-linux-gnu.tar.gz |
| macOS | x86_64 | zeph-x86_64-apple-darwin.tar.gz |
| macOS | aarch64 | zeph-aarch64-apple-darwin.tar.gz |
| Windows | x86_64 | zeph-x86_64-pc-windows-msvc.zip |
Docker
Pull the latest image from GitHub Container Registry:
docker pull ghcr.io/bug-ops/zeph:latest
Or use a specific version:
docker pull ghcr.io/bug-ops/zeph:v0.9.5
Images are scanned with Trivy in CI/CD and use Oracle Linux 9 Slim base with 0 HIGH/CRITICAL CVEs. Multi-platform: linux/amd64, linux/arm64.
See Docker Deployment for full deployment options including GPU support and age vault.
Quick Start
Run Zeph after building and interact via CLI, Telegram, or a cloud provider.
CLI Mode (default)
Unix (Linux/macOS):
./target/release/zeph
Windows:
.\target\release\zeph.exe
Type messages at the You: prompt. Type exit, quit, or press Ctrl-D to stop.
Telegram Mode
Unix (Linux/macOS):
ZEPH_TELEGRAM_TOKEN="123:ABC" ./target/release/zeph
Windows:
$env:ZEPH_TELEGRAM_TOKEN="123:ABC"; .\target\release\zeph.exe
Restrict access by setting telegram.allowed_users in the config file:
[telegram]
allowed_users = ["your_username"]
Ollama Setup
When using Ollama (default provider), ensure both the LLM model and embedding model are pulled:
ollama pull mistral:7b
ollama pull qwen3-embedding
The default configuration uses mistral:7b for text generation and qwen3-embedding for vector embeddings.
Cloud Providers
For Claude:
ZEPH_CLAUDE_API_KEY=sk-ant-... ./target/release/zeph
For OpenAI:
ZEPH_LLM_PROVIDER=openai ZEPH_OPENAI_API_KEY=sk-... ./target/release/zeph
See Configuration for the full reference.
Configuration
Zeph loads config/default.toml at startup and applies environment variable overrides.
The config path can be overridden via CLI argument or environment variable:
# CLI argument (highest priority)
zeph --config /path/to/custom.toml
# Environment variable
ZEPH_CONFIG=/path/to/custom.toml zeph
# Default (fallback)
# config/default.toml
Priority: --config > ZEPH_CONFIG > config/default.toml.
Hot-Reload
Zeph watches the config file for changes and applies runtime-safe fields without restart. The file watcher uses 500ms debounce to avoid redundant reloads.
Reloadable fields (applied immediately):
| Section | Fields |
|---|---|
[security] | redact_secrets |
[timeouts] | llm_seconds, embedding_seconds, a2a_seconds |
[memory] | history_limit, summarization_threshold, context_budget_tokens, compaction_threshold, compaction_preserve_tail, prune_protect_tokens, cross_session_score_threshold |
[memory.semantic] | recall_limit |
[index] | repo_map_ttl_secs, watch |
[agent] | max_tool_iterations |
[skills] | max_active_skills |
Not reloadable (require restart): LLM provider/model, SQLite path, Qdrant URL, Telegram token, MCP servers, A2A config, skill paths.
Check for config reloaded in the log to confirm a successful reload.
Configuration File
[agent]
name = "Zeph"
max_tool_iterations = 10 # Max tool loop iterations per response (default: 10)
[llm]
provider = "ollama"
base_url = "http://localhost:11434"
model = "mistral:7b"
embedding_model = "qwen3-embedding" # Model for text embeddings
[llm.cloud]
model = "claude-sonnet-4-5-20250929"
max_tokens = 4096
# [llm.openai]
# base_url = "https://api.openai.com/v1"
# model = "gpt-5.2"
# max_tokens = 4096
# embedding_model = "text-embedding-3-small"
# reasoning_effort = "medium" # low, medium, high (for reasoning models)
[skills]
paths = ["./skills"]
max_active_skills = 5 # Top-K skills per query via embedding similarity
[memory]
sqlite_path = "./data/zeph.db"
history_limit = 50
summarization_threshold = 100 # Trigger summarization after N messages
context_budget_tokens = 0 # 0 = unlimited (proportional split: 15% summaries, 25% recall, 60% recent)
compaction_threshold = 0.75 # Compact when context usage exceeds this fraction
compaction_preserve_tail = 4 # Keep last N messages during compaction
prune_protect_tokens = 40000 # Protect recent N tokens from tool output pruning
cross_session_score_threshold = 0.35 # Minimum relevance for cross-session results
[memory.semantic]
enabled = false # Enable semantic search via Qdrant
recall_limit = 5 # Number of semantically relevant messages to inject
[tools]
enabled = true
summarize_output = false # LLM-based summarization for long tool outputs
[tools.shell]
timeout = 30
blocked_commands = []
allowed_commands = []
allowed_paths = [] # Directories shell can access (empty = cwd only)
allow_network = true # false blocks curl/wget/nc
confirm_patterns = ["rm ", "git push -f", "git push --force", "drop table", "drop database", "truncate "]
[tools.file]
allowed_paths = [] # Directories file tools can access (empty = cwd only)
# Pattern-based permissions per tool (optional; overrides legacy blocked_commands/confirm_patterns)
# [tools.permissions.bash]
# [[tools.permissions.bash]]
# pattern = "*sudo*"
# action = "deny"
# [[tools.permissions.bash]]
# pattern = "cargo *"
# action = "allow"
# [[tools.permissions.bash]]
# pattern = "*"
# action = "ask"
[tools.scrape]
timeout = 15
max_body_bytes = 1048576 # 1MB
[tools.audit]
enabled = false # Structured JSON audit log for tool executions
destination = "stdout" # "stdout" or file path
[security]
redact_secrets = true # Redact API keys/tokens in LLM responses
[timeouts]
llm_seconds = 120 # LLM chat completion timeout
embedding_seconds = 30 # Embedding generation timeout
a2a_seconds = 30 # A2A remote call timeout
[vault]
backend = "env" # "env" (default) or "age"; CLI --vault overrides this
[a2a]
enabled = false
host = "0.0.0.0"
port = 8080
# public_url = "https://agent.example.com"
# auth_token = "secret"
rate_limit = 60
Shell commands are sandboxed with path restrictions, network control, and destructive command confirmation. See Security for details.
Environment Variables
| Variable | Description |
|---|---|
ZEPH_LLM_PROVIDER | ollama, claude, openai, candle, or orchestrator |
ZEPH_LLM_BASE_URL | Ollama API endpoint |
ZEPH_LLM_MODEL | Model name for Ollama |
ZEPH_LLM_EMBEDDING_MODEL | Embedding model for Ollama (default: qwen3-embedding) |
ZEPH_CLAUDE_API_KEY | Anthropic API key (required for Claude) |
ZEPH_OPENAI_API_KEY | OpenAI API key (required for OpenAI provider) |
ZEPH_TELEGRAM_TOKEN | Telegram bot token (enables Telegram mode) |
ZEPH_SQLITE_PATH | SQLite database path |
ZEPH_QDRANT_URL | Qdrant server URL (default: http://localhost:6334) |
ZEPH_MEMORY_SUMMARIZATION_THRESHOLD | Trigger summarization after N messages (default: 100) |
ZEPH_MEMORY_CONTEXT_BUDGET_TOKENS | Context budget for proportional token allocation (default: 0 = unlimited) |
ZEPH_MEMORY_COMPACTION_THRESHOLD | Compaction trigger threshold as fraction of context budget (default: 0.75) |
ZEPH_MEMORY_COMPACTION_PRESERVE_TAIL | Messages preserved during compaction (default: 4) |
ZEPH_MEMORY_PRUNE_PROTECT_TOKENS | Tokens protected from Tier 1 tool output pruning (default: 40000) |
ZEPH_MEMORY_CROSS_SESSION_SCORE_THRESHOLD | Minimum relevance score for cross-session memory (default: 0.35) |
ZEPH_MEMORY_SEMANTIC_ENABLED | Enable semantic memory with Qdrant (default: false) |
ZEPH_MEMORY_RECALL_LIMIT | Max semantically relevant messages to recall (default: 5) |
ZEPH_SKILLS_MAX_ACTIVE | Max skills per query via embedding match (default: 5) |
ZEPH_AGENT_MAX_TOOL_ITERATIONS | Max tool loop iterations per response (default: 10) |
ZEPH_TOOLS_SUMMARIZE_OUTPUT | Enable LLM-based tool output summarization (default: false) |
ZEPH_TOOLS_TIMEOUT | Shell command timeout in seconds (default: 30) |
ZEPH_TOOLS_SCRAPE_TIMEOUT | Web scrape request timeout in seconds (default: 15) |
ZEPH_TOOLS_SCRAPE_MAX_BODY | Max response body size in bytes (default: 1048576) |
ZEPH_A2A_ENABLED | Enable A2A server (default: false) |
ZEPH_A2A_HOST | A2A server bind address (default: 0.0.0.0) |
ZEPH_A2A_PORT | A2A server port (default: 8080) |
ZEPH_A2A_PUBLIC_URL | Public URL for agent card discovery |
ZEPH_A2A_AUTH_TOKEN | Bearer token for A2A server authentication |
ZEPH_A2A_RATE_LIMIT | Max requests per IP per minute (default: 60) |
ZEPH_A2A_REQUIRE_TLS | Require HTTPS for outbound A2A connections (default: true) |
ZEPH_A2A_SSRF_PROTECTION | Block private/loopback IPs in A2A client (default: true) |
ZEPH_A2A_MAX_BODY_SIZE | Max request body size in bytes (default: 1048576) |
ZEPH_TOOLS_FILE_ALLOWED_PATHS | Comma-separated directories file tools can access (empty = cwd) |
ZEPH_TOOLS_SHELL_ALLOWED_PATHS | Comma-separated directories shell can access (empty = cwd) |
ZEPH_TOOLS_SHELL_ALLOW_NETWORK | Allow network commands from shell (default: true) |
ZEPH_TOOLS_AUDIT_ENABLED | Enable audit logging for tool executions (default: false) |
ZEPH_TOOLS_AUDIT_DESTINATION | Audit log destination: stdout or file path |
ZEPH_SECURITY_REDACT_SECRETS | Redact secrets in LLM responses (default: true) |
ZEPH_TIMEOUT_LLM | LLM call timeout in seconds (default: 120) |
ZEPH_TIMEOUT_EMBEDDING | Embedding generation timeout in seconds (default: 30) |
ZEPH_TIMEOUT_A2A | A2A remote call timeout in seconds (default: 30) |
ZEPH_CONFIG | Path to config file (default: config/default.toml) |
ZEPH_TUI | Enable TUI dashboard: true or 1 (requires tui feature) |
Skills
Zeph uses an embedding-based skill system that dramatically reduces token consumption: instead of injecting all skills into every prompt, only the top-K most relevant (default: 5) are selected per query via cosine similarity of vector embeddings. Combined with progressive loading (metadata at startup, bodies on activation, resources on demand), this keeps prompt size constant regardless of how many skills are installed.
How It Works
- You send a message — for example, “check disk usage on this server”
- Zeph embeds your query using the configured embedding model
- Top matching skills are selected — by default, the 5 most relevant ones ranked by vector similarity
- Selected skills are injected into the system prompt, giving Zeph specific instructions and examples for the task
- Zeph responds using the knowledge from matched skills
This happens automatically on every message. You don’t need to activate skills manually.
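For intuition, here is a minimal, self-contained sketch of the top-K selection step. The types and function names are illustrative, not Zeph's internal API.

/// Illustrative skill record: the description embedding is what gets matched.
struct Skill {
    name: String,
    embedding: Vec<f32>,
}

/// Cosine similarity between two embedding vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 { 0.0 } else { dot / (norm_a * norm_b) }
}

/// Rank all skills against the query embedding and keep the top K.
fn top_k_skills<'a>(query: &[f32], skills: &'a [Skill], k: usize) -> Vec<&'a Skill> {
    let mut scored: Vec<(f32, &Skill)> = skills
        .iter()
        .map(|s| (cosine(query, &s.embedding), s))
        .collect();
    scored.sort_by(|a, b| b.0.total_cmp(&a.0));
    scored.into_iter().take(k).map(|(_, s)| s).collect()
}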
Matching Backends
Zeph supports two skill matching backends:
- In-memory (default) — embeddings are computed on startup and matched via cosine similarity. No external dependencies required.
- Qdrant — when semantic memory is enabled and Qdrant is reachable, skill embeddings are persisted in a zeph_skills collection. On startup, only changed skills are re-embedded, using BLAKE3 content-hash comparison. If Qdrant becomes unavailable, Zeph falls back to in-memory matching automatically.
The Qdrant backend significantly reduces startup time when you have many skills, since unchanged skills skip the embedding step entirely.
Bundled Skills
| Skill | Description |
|---|---|
api-request | HTTP API requests using curl — GET, POST, PUT, DELETE with headers and JSON |
docker | Docker container operations — build, run, ps, logs, compose |
file-ops | File system operations — list, search, read, and analyze files |
git | Git version control — status, log, diff, commit, branch management |
mcp-generate | Generate MCP-to-skill bridges for external tool servers |
setup-guide | Configuration reference — LLM providers, memory, tools, and operating modes |
skill-audit | Spec compliance and security review of installed skills |
skill-creator | Create new skills following the agentskills.io specification |
system-info | System diagnostics — OS, disk, memory, processes, uptime |
web-scrape | Extract structured data from web pages using CSS selectors |
web-search | Search the internet for current information |
Use /skills in chat to see all available skills and their usage statistics.
Creating Custom Skills
A skill is a single SKILL.md file inside a named directory:
skills/
└── my-skill/
└── SKILL.md
SKILL.md Format
Each file has two parts: a YAML header and a markdown body.
---
name: my-skill
description: Short description of what this skill does.
---
# My Skill
Instructions and examples go here.
Header fields:
| Field | Required | Description |
|---|---|---|
name | Yes | Unique identifier (1-64 chars, lowercase, hyphens allowed) |
description | Yes | Used for embedding-based matching against user queries |
compatibility | No | Runtime requirements (e.g., “requires curl”) |
license | No | Skill license |
allowed-tools | No | Comma-separated tool names this skill can use |
metadata | No | Arbitrary key-value pairs for forward compatibility |
Body: markdown with instructions, code examples, or reference material. Injected verbatim into the LLM context when the skill is selected.
Skill Resources
Skills can include additional resource directories:
skills/
└── system-info/
├── SKILL.md
└── references/
├── linux.md
├── macos.md
└── windows.md
Resources in scripts/, references/, and assets/ are loaded on demand with path traversal protection. OS-specific reference files (named linux.md, macos.md, windows.md) are automatically filtered by the current platform.
Name Validation
Skill names must be 1-64 characters, lowercase letters/numbers/hyphens only, no leading/trailing/consecutive hyphens, and must match the directory name.
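A hedged sketch of those rules as a simple predicate (illustrative only; the directory-name check is omitted):

/// Sketch of the skill-name rules described above; not Zeph's actual validator.
fn is_valid_skill_name(name: &str) -> bool {
    let length_ok = (1..=64).contains(&name.len());
    let charset_ok = name
        .chars()
        .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-');
    let hyphens_ok =
        !name.starts_with('-') && !name.ends_with('-') && !name.contains("--");
    length_ok && charset_ok && hyphens_ok
}

fn main() {
    assert!(is_valid_skill_name("system-info"));
    assert!(!is_valid_skill_name("-bad-name"));
    assert!(!is_valid_skill_name("Bad_Name"));
}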
Configuration
Skill Paths
By default, Zeph scans ./skills for skill directories. Add more paths in config:
[skills]
paths = ["./skills", "/home/user/my-skills"]
If a skill with the same name appears in multiple paths, the first one found takes priority.
Max Active Skills
Control how many skills are injected per query:
[skills]
max_active_skills = 5
Or via environment variable:
export ZEPH_SKILLS_MAX_ACTIVE=5
Lower values reduce prompt size but may miss relevant skills. Higher values include more context but use more tokens.
Progressive Loading
Only metadata (~100 tokens per skill) is loaded at startup for embedding and matching. Full body (<5000 tokens) is loaded lazily on first activation and cached via OnceLock. Resource files are loaded on demand.
With 50+ skills installed, a typical prompt still contains only 5 — saving thousands of tokens per request compared to naive full-injection approaches.
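A minimal sketch of lazy body loading with OnceLock (hypothetical types; the real loader also handles resources and cache invalidation):

use std::path::PathBuf;
use std::sync::OnceLock;

/// Metadata stays in memory from startup; the full SKILL.md body is read
/// from disk only on first activation and then cached.
struct LoadedSkill {
    description: String,    // embedded at startup for matching
    dir: PathBuf,           // skill directory containing SKILL.md
    body: OnceLock<String>, // full markdown body, loaded on demand
}

impl LoadedSkill {
    fn body(&self) -> &str {
        self.body.get_or_init(|| {
            std::fs::read_to_string(self.dir.join("SKILL.md")).unwrap_or_default()
        })
    }
}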
Hot Reload
SKILL.md file changes are detected via filesystem watcher (500ms debounce) and re-embedded without restart. Cached bodies are invalidated on reload.
With the Qdrant backend, hot-reload triggers a delta sync — only modified skills are re-embedded and updated in the collection.
Semantic Memory
Enable semantic search to retrieve contextually relevant messages from conversation history using vector similarity.
Requires an embedding model. Ollama with qwen3-embedding is the default. Claude API does not support embeddings natively — use the orchestrator to route embeddings through Ollama while using Claude for chat.
Setup
- Start Qdrant:
docker compose up -d qdrant
- Enable semantic memory in config:
[memory.semantic]
enabled = true
recall_limit = 5
- Automatic setup: the Qdrant collection (zeph_conversations) is created automatically on first use with the correct vector dimensions (1024 for qwen3-embedding) and Cosine distance metric. No manual initialization required.
How It Works
- Automatic embedding: Messages are embedded asynchronously using the configured embedding_model and stored in Qdrant alongside SQLite.
- Semantic recall: Context builder injects semantically relevant messages from full history, not just recent messages.
- Graceful degradation: If Qdrant is unavailable, Zeph falls back to SQLite-only mode (recency-based history).
- Startup backfill: On startup, if Qdrant is available, Zeph calls embed_missing() to backfill embeddings for any messages stored while Qdrant was offline. This ensures the vector index stays in sync with SQLite without manual intervention.
Storage Architecture
| Store | Purpose |
|---|---|
| SQLite | Source of truth for message text, conversations, summaries, skill usage |
| Qdrant | Vector index for semantic similarity search (embeddings only) |
Both stores work together: SQLite holds the data, Qdrant enables vector search over it. The embeddings_metadata table in SQLite maps message IDs to Qdrant point IDs.
Context Engineering
Zeph’s context engineering pipeline manages how information flows into the LLM context window. It combines semantic recall, proportional budget allocation, message trimming, environment injection, tool output management, and runtime compaction into a unified system.
All context engineering features are disabled by default (context_budget_tokens = 0). Set a non-zero budget or enable auto_budget = true to activate the pipeline.
Configuration
[memory]
context_budget_tokens = 128000 # Set to your model's context window size (0 = unlimited)
compaction_threshold = 0.75 # Compact when usage exceeds this fraction
compaction_preserve_tail = 4 # Keep last N messages during compaction
prune_protect_tokens = 40000 # Protect recent N tokens from Tier 1 tool output pruning
cross_session_score_threshold = 0.35 # Minimum relevance for cross-session results (0.0-1.0)
[memory.semantic]
enabled = true # Required for semantic recall
recall_limit = 5 # Max semantically relevant messages to inject
[tools]
summarize_output = false # Enable LLM-based tool output summarization
Context Window Layout
When context_budget_tokens > 0, the context window is structured as:
┌─────────────────────────────────────────────────┐
│ BASE_PROMPT (identity + guidelines + security) │ ~300 tokens
├─────────────────────────────────────────────────┤
│ <environment> cwd, git branch, os, model │ ~50 tokens
├─────────────────────────────────────────────────┤
│ <project_context> ZEPH.md contents │ 0-500 tokens
├─────────────────────────────────────────────────┤
│ <repo_map> structural overview (if index on) │ 0-1024 tokens
├─────────────────────────────────────────────────┤
│ <available_skills> matched skills (full body) │ 200-2000 tokens
│ <other_skills> remaining (description-only) │ 50-200 tokens
├─────────────────────────────────────────────────┤
│ <code_context> RAG chunks (if index on) │ 30% of available
├─────────────────────────────────────────────────┤
│ [semantic recall] relevant past messages │ 10-25% of available
├─────────────────────────────────────────────────┤
│ [compaction summary] if compacted │ 200-500 tokens
├─────────────────────────────────────────────────┤
│ Recent message history │ 50-60% of available
├─────────────────────────────────────────────────┤
│ [reserved for response generation] │ 20% of total
└─────────────────────────────────────────────────┘
Proportional Budget Allocation
Available tokens (after reserving 20% for response) are split proportionally. When code indexing is enabled, the code context slot takes a share from summaries, recall, and history:
| Allocation | Without code index | With code index | Purpose |
|---|---|---|---|
| Summaries | 15% | 10% | Conversation summaries from SQLite |
| Semantic recall | 25% | 10% | Relevant messages from past conversations via Qdrant |
| Code context | – | 30% | Retrieved code chunks from project index |
| Recent history | 60% | 50% | Most recent messages in current conversation |
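A worked example of the split for a 128,000-token budget without code indexing (the integer arithmetic here is illustrative; Zeph's exact rounding may differ):

fn main() {
    let budget: u32 = 128_000;
    let reserved = budget * 20 / 100;     // 25,600 tokens reserved for the response
    let available = budget - reserved;    // 102,400 tokens left to allocate
    let summaries = available * 15 / 100; // 15,360
    let recall = available * 25 / 100;    // 25,600
    let history = available * 60 / 100;   // 61,440
    println!("summaries={summaries} recall={recall} history={history}");
}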
Semantic Recall Injection
When semantic memory is enabled, the agent queries Qdrant for messages relevant to the current user query. Results are injected as transient system messages (prefixed with [semantic recall]) that are:
- Removed and re-injected on every turn (never stale)
- Not persisted to SQLite
- Bounded by the allocated token budget (25%, or 10% when code indexing is enabled)
Requires Qdrant and memory.semantic.enabled = true.
Message History Trimming
When recent messages exceed the 60% budget allocation, the oldest non-system messages are evicted. The system prompt and most recent messages are always preserved.
Environment Context
Every system prompt rebuild injects an <environment> block with:
- Working directory
- OS (linux, macos, windows)
- Current git branch (if in a git repo)
- Active model name
Two-Tier Context Pruning
When total message tokens exceed compaction_threshold (default: 75%) of the context budget, a two-tier pruning strategy activates:
Tier 1: Selective Tool Output Pruning
Before invoking the LLM for compaction, Zeph scans messages outside the protected tail for ToolOutput parts and replaces their content with a short placeholder. This is a cheap, synchronous operation that often frees enough tokens to stay under the threshold without an LLM call.
- Only tool outputs in messages older than the protected tail are pruned
- The most recent prune_protect_tokens tokens (default: 40,000) worth of messages are never pruned, preserving recent tool context
- Pruned parts have their compacted_at timestamp set, their body is cleared from memory to reclaim heap, and they are not pruned again
- Pruned parts are persisted to SQLite before clearing, so pruning state survives session restarts
- The tool_output_prunes metric tracks how many parts were pruned
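A simplified sketch of the Tier 1 pass, assuming a flat message list and the chars/4 token estimate (types and placeholder text are hypothetical):

/// Hypothetical message shape for illustration.
struct Message {
    is_tool_output: bool,
    content: String,
}

/// chars/4 heuristic used for budget accounting.
fn estimate_tokens(text: &str) -> usize {
    text.chars().count() / 4
}

/// Replace tool outputs outside the protected tail with a placeholder.
/// Returns how many parts were pruned.
fn prune_tool_outputs(messages: &mut [Message], protect_tokens: usize) -> usize {
    // Walk backwards to find where the protected tail begins.
    let mut budget = protect_tokens;
    let mut tail_start = messages.len();
    for (i, msg) in messages.iter().enumerate().rev() {
        let cost = estimate_tokens(&msg.content);
        if cost > budget {
            break;
        }
        budget -= cost;
        tail_start = i;
    }
    // Prune tool outputs strictly before the protected tail.
    let mut pruned = 0;
    for msg in &mut messages[..tail_start] {
        if msg.is_tool_output && msg.content != "[tool output pruned]" {
            msg.content = "[tool output pruned]".to_string();
            pruned += 1;
        }
    }
    pruned
}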
Tier 2: LLM Compaction (Fallback)
If Tier 1 does not free enough tokens, the standard LLM compaction runs:
- Middle messages (between system prompt and last N recent) are extracted
- Sent to the LLM with a structured summarization prompt
- Replaced with a single summary message
- Last compaction_preserve_tail messages (default: 4) are always preserved
Both tiers are idempotent and run automatically during the agent loop.
Tool Output Management
Truncation
Tool outputs exceeding 30,000 characters are automatically truncated using a head+tail split with UTF-8 safe boundaries. Both the first and last ~15K chars are preserved.
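A sketch of a UTF-8-safe head+tail split (operating on chars rather than bytes guarantees valid boundaries; the placeholder text is an assumption):

/// Keep the first and last `max_chars / 2` characters and elide the middle.
fn truncate_head_tail(s: &str, max_chars: usize) -> String {
    if s.chars().count() <= max_chars {
        return s.to_string();
    }
    let half = max_chars / 2;
    let head: String = s.chars().take(half).collect();
    let tail_rev: Vec<char> = s.chars().rev().take(half).collect();
    let tail: String = tail_rev.into_iter().rev().collect();
    format!("{head}\n... [output truncated] ...\n{tail}")
}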
Smart Summarization
When tools.summarize_output = true, long tool outputs are sent through the LLM with a prompt that preserves file paths, error messages, and numeric values. On LLM failure, falls back to truncation.
export ZEPH_TOOLS_SUMMARIZE_OUTPUT=true
Progressive Skill Loading
Skills matched by embedding similarity (top-K) are injected with their full body. Remaining skills are listed in a description-only <other_skills> catalog — giving the model awareness of all capabilities while consuming minimal tokens.
ZEPH.md Project Config
Zeph walks up the directory tree from the current working directory looking for:
- ZEPH.md
- ZEPH.local.md
- .zeph/config.md
Found configs are concatenated (global first, then ancestors from root to cwd) and injected into the system prompt as a <project_context> block. Use this to provide project-specific instructions.
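A hedged sketch of the walk-up search (file names from the list above; ordering and merge details are simplified):

use std::path::{Path, PathBuf};

/// Collect project config files from the cwd up to the filesystem root,
/// returning them root-first so ancestors are concatenated before the cwd.
fn find_project_configs(cwd: &Path) -> Vec<PathBuf> {
    let names = ["ZEPH.md", "ZEPH.local.md", ".zeph/config.md"];
    let mut found = Vec::new();
    let mut dir = Some(cwd);
    while let Some(d) = dir {
        for name in names {
            let candidate = d.join(name);
            if candidate.is_file() {
                found.push(candidate);
            }
        }
        dir = d.parent();
    }
    found.reverse(); // walk-up collects cwd-first; callers want root-to-cwd order
    found
}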
Environment Variables
| Variable | Description | Default |
|---|---|---|
ZEPH_MEMORY_CONTEXT_BUDGET_TOKENS | Context budget in tokens | 0 (unlimited) |
ZEPH_MEMORY_COMPACTION_THRESHOLD | Compaction trigger threshold | 0.75 |
ZEPH_MEMORY_COMPACTION_PRESERVE_TAIL | Messages preserved during compaction | 4 |
ZEPH_MEMORY_PRUNE_PROTECT_TOKENS | Tokens protected from Tier 1 tool output pruning | 40000 |
ZEPH_MEMORY_CROSS_SESSION_SCORE_THRESHOLD | Minimum relevance score for cross-session memory results | 0.35 |
ZEPH_TOOLS_SUMMARIZE_OUTPUT | Enable LLM-based tool output summarization | false |
Conversation Summarization
Automatically compress long conversation histories using LLM-based summarization to stay within context budget limits.
Requires an LLM provider (Ollama or Claude). Set context_budget_tokens = 0 to disable proportional allocation and use unlimited context.
For the full context management pipeline (semantic recall, message trimming, compaction, tool output management), see Context Engineering.
Configuration
[memory]
summarization_threshold = 100
context_budget_tokens = 8000 # Set to LLM context window size (0 = unlimited)
How It Works
- Triggered when message count exceeds summarization_threshold (default: 100)
- Summaries stored in SQLite with token estimates
- Batch size = threshold/2 to balance summary quality with LLM call frequency
- Context builder allocates proportional token budget:
- 15% for summaries
- 25% for semantic recall (if enabled)
- 60% for recent message history
Token Estimation
Token counts are estimated using a chars/4 heuristic (100x faster than tiktoken, ±25% accuracy). This is sufficient for proportional budget allocation where exact counts are not critical.
Docker Deployment
Docker Compose automatically pulls the latest image from GitHub Container Registry. To use a specific version, set ZEPH_IMAGE=ghcr.io/bug-ops/zeph:v0.9.5.
Quick Start (Ollama + Qdrant in containers)
# Pull Ollama models first
docker compose --profile cpu run --rm ollama ollama pull mistral:7b
docker compose --profile cpu run --rm ollama ollama pull qwen3-embedding
# Start all services
docker compose --profile cpu up
Apple Silicon (Ollama on host with Metal GPU)
# Use Ollama on macOS host for Metal GPU acceleration
ollama pull mistral:7b
ollama pull qwen3-embedding
ollama serve &
# Start Zeph + Qdrant, connect to host Ollama
ZEPH_LLM_BASE_URL=http://host.docker.internal:11434 docker compose up
Linux with NVIDIA GPU
# Pull models first
docker compose --profile gpu run --rm ollama ollama pull mistral:7b
docker compose --profile gpu run --rm ollama ollama pull qwen3-embedding
# Start all services with GPU
docker compose --profile gpu -f docker-compose.yml -f docker-compose.gpu.yml up
Age Vault (Encrypted Secrets)
# Mount key and vault files into container
docker compose -f docker-compose.yml -f docker-compose.vault.yml up
Override file paths via environment variables:
ZEPH_VAULT_KEY=./my-key.txt ZEPH_VAULT_PATH=./my-secrets.age \
docker compose -f docker-compose.yml -f docker-compose.vault.yml up
The image must be built with the vault-age feature enabled. For local builds, use CARGO_FEATURES=vault-age with docker-compose.dev.yml.
Using Specific Version
# Use a specific release version
ZEPH_IMAGE=ghcr.io/bug-ops/zeph:v0.9.5 docker compose up
# Always pull latest
docker compose pull && docker compose up
Local Development
Full stack with debug tracing (builds from source via Dockerfile.dev, uses host Ollama via host.docker.internal):
# Build and start Qdrant + Zeph with debug logging
docker compose -f docker-compose.dev.yml up --build
# Build with optional features (e.g. vault-age, candle)
CARGO_FEATURES=vault-age docker compose -f docker-compose.dev.yml up --build
# Build with vault-age and mount vault files
CARGO_FEATURES=vault-age \
docker compose -f docker-compose.dev.yml -f docker-compose.vault.yml up --build
Dependencies only (run zeph natively on host):
# Start Qdrant
docker compose -f docker-compose.deps.yml up
# Run zeph natively with debug tracing
RUST_LOG=zeph=debug,zeph_channels=trace cargo run
MCP Integration
Connect external tool servers via Model Context Protocol (MCP). Tools are discovered, embedded, and matched alongside skills using the same cosine similarity pipeline — only relevant MCP tools are injected into the prompt, so adding more servers does not inflate token usage.
Configuration
Stdio Transport (spawn child process)
[[mcp.servers]]
id = "filesystem"
command = "npx"
args = ["-y", "@anthropic/mcp-filesystem"]
HTTP Transport (remote server)
[[mcp.servers]]
id = "remote-tools"
url = "http://localhost:8080/mcp"
Security
[mcp]
allowed_commands = ["npx", "uvx", "node", "python", "python3"]
max_dynamic_servers = 10
allowed_commands restricts which binaries can be spawned as MCP servers. max_dynamic_servers limits the number of servers added at runtime.
Dynamic Management
Add and remove MCP servers at runtime via chat commands:
/mcp add filesystem npx -y @anthropic/mcp-filesystem
/mcp add remote-api http://localhost:8080/mcp
/mcp list
/mcp remove filesystem
After adding or removing a server, the Qdrant registry syncs automatically for semantic tool matching.
How Matching Works
MCP tools are embedded in Qdrant (zeph_mcp_tools collection) with BLAKE3 content-hash delta sync. Unified matching injects both skills and MCP tools into the system prompt by relevance score — keeping prompt size O(K) instead of O(N) where N is total tools across all servers.
OpenAI Provider
Use the OpenAI provider to connect to OpenAI API or any OpenAI-compatible service (Together AI, Groq, Fireworks, Perplexity).
ZEPH_LLM_PROVIDER=openai ZEPH_OPENAI_API_KEY=sk-... ./target/release/zeph
Configuration
[llm]
provider = "openai"
[llm.openai]
base_url = "https://api.openai.com/v1"
model = "gpt-5.2"
max_tokens = 4096
embedding_model = "text-embedding-3-small" # optional, enables vector embeddings
reasoning_effort = "medium" # optional: low, medium, high (for o3, etc.)
Compatible APIs
Change base_url to point to any OpenAI-compatible API:
# Together AI
base_url = "https://api.together.xyz/v1"
# Groq
base_url = "https://api.groq.com/openai/v1"
# Fireworks
base_url = "https://api.fireworks.ai/inference/v1"
Embeddings
When embedding_model is set, Qdrant subsystems automatically use it for skill matching and semantic memory instead of the global llm.embedding_model.
Reasoning Models
Set reasoning_effort to control token budget for reasoning models like o3:
- low — fast responses, less reasoning
- medium — balanced
- high — thorough reasoning, more tokens
Local Inference (Candle)
Run HuggingFace GGUF models locally via candle without external API dependencies. Metal and CUDA GPU acceleration are supported.
cargo build --release --features candle,metal # macOS with Metal GPU
Configuration
[llm]
provider = "candle"
[llm.candle]
source = "huggingface"
repo_id = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
filename = "mistral-7b-instruct-v0.2.Q4_K_M.gguf"
chat_template = "mistral" # llama3, chatml, mistral, phi3, raw
embedding_repo = "sentence-transformers/all-MiniLM-L6-v2" # optional BERT embeddings
[llm.candle.generation]
temperature = 0.7
top_p = 0.9
top_k = 40
max_tokens = 2048
repeat_penalty = 1.1
Chat Templates
| Template | Models |
|---|---|
llama3 | Llama 3, Llama 3.1 |
chatml | Qwen, Yi, OpenHermes |
mistral | Mistral, Mixtral |
phi3 | Phi-3 |
raw | No template (raw completion) |
Device Auto-Detection
- macOS — Metal GPU (requires --features metal)
- Linux with NVIDIA — CUDA (requires --features cuda)
- Fallback — CPU
Model Orchestrator
Route tasks to different LLM providers based on content classification. Each task type maps to a provider chain with automatic fallback. Use the orchestrator to combine local and cloud models — for example, embeddings via Ollama and chat via Claude.
Configuration
[llm]
provider = "orchestrator"
[llm.orchestrator]
default = "claude"
embed = "ollama"
[llm.orchestrator.providers.ollama]
provider_type = "ollama"
[llm.orchestrator.providers.claude]
provider_type = "claude"
[llm.orchestrator.routes]
coding = ["claude", "ollama"] # try Claude first, fallback to Ollama
creative = ["claude"] # cloud only
analysis = ["claude", "ollama"] # prefer cloud
general = ["claude"] # cloud only
Provider Keys
- default — provider for chat when no specific route matches
- embed — provider for all embedding operations (skill matching, semantic memory)
Task Classification
Task types are classified via keyword heuristics:
| Task Type | Keywords |
|---|---|
coding | code, function, debug, refactor, implement |
creative | write, story, poem, creative |
analysis | analyze, compare, evaluate |
translation | translate, convert language |
summarization | summarize, summary, tldr |
general | everything else |
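A sketch of the keyword heuristic using the table above (the matching logic shown is an assumption, not Zeph's exact implementation):

/// Classify a query into a task type via simple keyword matching.
fn classify_task(query: &str) -> &'static str {
    let q = query.to_lowercase();
    let has = |words: &[&str]| words.iter().any(|w| q.contains(*w));
    if has(&["code", "function", "debug", "refactor", "implement"]) {
        "coding"
    } else if has(&["write", "story", "poem", "creative"]) {
        "creative"
    } else if has(&["analyze", "compare", "evaluate"]) {
        "analysis"
    } else if has(&["translate", "convert language"]) {
        "translation"
    } else if has(&["summarize", "summary", "tldr"]) {
        "summarization"
    } else {
        "general"
    }
}

fn main() {
    assert_eq!(classify_task("Refactor this function"), "coding");
    assert_eq!(classify_task("What is on my calendar?"), "general");
}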
Fallback Chains
Routes define provider preference order. If the first provider fails, the next one in the list is tried automatically.
coding = ["local", "cloud"] # try local first, fallback to cloud
Hybrid Setup Example
Embeddings via free local Ollama, chat via paid Claude API:
[llm]
provider = "orchestrator"
[llm.orchestrator]
default = "claude"
embed = "ollama"
[llm.orchestrator.providers.ollama]
provider_type = "ollama"
[llm.orchestrator.providers.claude]
provider_type = "claude"
[llm.orchestrator.routes]
general = ["claude"]
Self-Learning Skills
Automatically improve skills based on execution outcomes. When a skill fails repeatedly, Zeph uses self-reflection and LLM-generated improvements to create better skill versions.
Configuration
[skills.learning]
enabled = true
auto_activate = false # require manual approval for new versions
min_failures = 3 # failures before triggering improvement
improve_threshold = 0.7 # success rate below which improvement starts
rollback_threshold = 0.5 # auto-rollback when success rate drops below this
min_evaluations = 5 # minimum evaluations before rollback decision
max_versions = 10 # max auto-generated versions per skill
cooldown_minutes = 60 # cooldown between improvements for same skill
How It Works
- Each skill invocation is tracked as success or failure
- When a skill’s success rate drops below improve_threshold, Zeph triggers self-reflection
- The agent retries with adjusted context (1 retry per message)
- If failures persist beyond min_failures, the LLM generates an improved skill version
- New versions can be auto-activated or held for manual approval
- If an activated version performs worse than rollback_threshold, automatic rollback occurs
Chat Commands
| Command | Description |
|---|---|
/skill stats | View execution metrics per skill |
/skill versions | List auto-generated versions |
/skill activate <id> | Activate a specific version |
/skill approve <id> | Approve a pending version |
/skill reset <name> | Revert to original version |
/feedback | Provide explicit quality feedback |
Set auto_activate = false (default) to review and manually approve LLM-generated skill improvements before they go live.
Skill versions and outcomes are stored in SQLite (skill_versions and skill_outcomes tables).
A2A Protocol
Zeph includes an embedded A2A protocol server for agent-to-agent communication. When enabled, other agents can discover and interact with Zeph via the standard A2A JSON-RPC 2.0 API.
Quick Start
ZEPH_A2A_ENABLED=true ZEPH_A2A_AUTH_TOKEN=secret ./target/release/zeph
Endpoints
| Endpoint | Description | Auth |
|---|---|---|
/.well-known/agent-card.json | Agent discovery | Public (no auth) |
/a2a | JSON-RPC endpoint (message/send, tasks/get, tasks/cancel) | Bearer token |
/a2a/stream | SSE streaming endpoint | Bearer token |
Set ZEPH_A2A_AUTH_TOKEN to secure the server with bearer token authentication. The agent card endpoint remains public per the A2A spec.
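For orientation, a sketch of a message/send call built with serde_json. The JSON-RPC 2.0 envelope is standard; the params payload shown is illustrative only, so consult the A2A specification for the exact Message schema.

use serde_json::json;

fn main() {
    // JSON-RPC 2.0 envelope for the message/send method.
    // POST this to /a2a with an Authorization: Bearer <ZEPH_A2A_AUTH_TOKEN> header.
    let request = json!({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "message/send",
        "params": {
            // Illustrative payload; see the A2A spec for the real Message/Part schema.
            "message": {
                "role": "user",
                "parts": [{ "kind": "text", "text": "Hello from another agent" }]
            }
        }
    });
    println!("{}", serde_json::to_string_pretty(&request).unwrap());
}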
Configuration
[a2a]
enabled = true
host = "0.0.0.0"
port = 8080
public_url = "https://agent.example.com"
auth_token = "secret"
rate_limit = 60
Network Security
- TLS enforcement: a2a.require_tls = true rejects HTTP endpoints (HTTPS only)
- SSRF protection: a2a.ssrf_protection = true blocks private IP ranges (RFC 1918, loopback, link-local) via DNS resolution
- Payload limits: a2a.max_body_size caps request body size (default: 1 MiB)
- Rate limiting: per-IP sliding window (default: 60 requests/minute)
Task Processing
Incoming message/send requests are routed through AgentTaskProcessor, which forwards the message to the configured LLM provider for real inference. The processor creates a task, sends the user message to the LLM, and returns the model response as a completed A2A task artifact.
Current limitation: the A2A task processor runs inference only (no tool execution or memory context).
A2A Client
Zeph can also connect to other A2A agents as a client:
- A2aClient wraps reqwest and uses JSON-RPC 2.0 for all RPC calls
- AgentRegistry with TTL-based cache for agent card discovery
- SSE streaming via eventsource-stream for real-time task updates
- Bearer token auth passed per-call to all client methods
Secrets Management
Zeph resolves secrets (ZEPH_CLAUDE_API_KEY, ZEPH_OPENAI_API_KEY, ZEPH_TELEGRAM_TOKEN, ZEPH_A2A_AUTH_TOKEN) through a pluggable VaultProvider with redacted debug output via the Secret newtype.
Never commit secrets to version control. Use environment variables or age-encrypted vault files.
Backend Selection
The vault backend is determined by the following priority (highest to lowest):
- CLI flag: --vault env or --vault age
- Environment variable: ZEPH_VAULT_BACKEND
- Config file: vault.backend in TOML config
- Default: "env"
Backends
| Backend | Description | Activation |
|---|---|---|
env (default) | Read secrets from environment variables | --vault env or omit |
age | Decrypt age-encrypted JSON vault file at startup | --vault age --vault-key <identity> --vault-path <vault.age> |
Environment Variables (default)
Export secrets as environment variables:
export ZEPH_CLAUDE_API_KEY=sk-ant-...
export ZEPH_TELEGRAM_TOKEN=123:ABC
./target/release/zeph
Age Vault
For production deployments, encrypt secrets with age:
# Generate an age identity key
age-keygen -o key.txt
# Create a JSON secrets file and encrypt it
echo '{"ZEPH_CLAUDE_API_KEY":"sk-...","ZEPH_TELEGRAM_TOKEN":"123:ABC"}' | \
age -r $(grep 'public key' key.txt | awk '{print $NF}') -o secrets.age
# Run with age vault
cargo build --release --features vault-age
./target/release/zeph --vault age --vault-key key.txt --vault-path secrets.age
The vault-age feature flag is enabled by default. When building with --no-default-features, add vault-age explicitly if needed.
Docker
Mount key and vault files into the container:
docker compose -f docker-compose.yml -f docker-compose.vault.yml up
Override paths:
ZEPH_VAULT_KEY=./my-key.txt ZEPH_VAULT_PATH=./my-secrets.age \
docker compose -f docker-compose.yml -f docker-compose.vault.yml up
Channels
Zeph supports multiple I/O channels for interacting with the agent. Each channel implements the Channel trait and can be selected at runtime based on configuration or CLI flags.
Available Channels
| Channel | Activation | Streaming | Confirmation |
|---|---|---|---|
| CLI | Default (no config needed) | Token-by-token to stdout | y/N prompt |
| Telegram | ZEPH_TELEGRAM_TOKEN env var or [telegram] config | Edit-in-place every 10s | Reply “yes” to confirm |
| TUI | --tui flag or ZEPH_TUI=true (requires tui feature) | Real-time in chat panel | Auto-confirm (Phase 1) |
CLI Channel
The default channel. Reads from stdin, writes to stdout with immediate streaming output.
./zeph
No configuration required. Supports all slash commands (/skills, /mcp, /reset).
Telegram Channel
Run Zeph as a Telegram bot with streaming responses, MarkdownV2 formatting, and user whitelisting.
Setup
- Create a bot via @BotFather: send /newbot and follow the prompts, then copy the bot token (e.g., 123456:ABC-DEF1234ghIkl-zyx57W2v1u123ew11).
- Configure the token via environment variable or config file:
# Environment variable
ZEPH_TELEGRAM_TOKEN="123456:ABC-DEF1234ghIkl-zyx57W2v1u123ew11" ./zeph
Or in config/default.toml:
[telegram]
allowed_users = ["your_username"]
The token can also be stored in the age-encrypted vault:
# Store in vault
ZEPH_TELEGRAM_TOKEN=your-token
The token is resolved via the vault provider (ZEPH_TELEGRAM_TOKEN secret). When using the env vault backend (default), set the environment variable directly. With the age backend, store it in the encrypted vault file.
User Whitelisting
Restrict bot access to specific Telegram usernames:
[telegram]
allowed_users = ["alice", "bob"]
When allowed_users is empty, the bot accepts messages from all users. Messages from unauthorized users are rejected without a reply, and a warning is written to the log.
Bot Commands
| Command | Description |
|---|---|
/start | Welcome message |
/reset | Reset conversation context |
/skills | List loaded skills |
Streaming Behavior
Telegram has API rate limits, so streaming works differently from CLI:
- First chunk sends a new message immediately
- Subsequent chunks edit the existing message in-place
- Updates are throttled to one edit per 10 seconds to respect Telegram rate limits
- On flush, a final edit delivers the complete response
- Long messages (>4096 chars) are automatically split into multiple messages
MarkdownV2 Formatting
LLM responses are automatically converted from standard Markdown to Telegram’s MarkdownV2 format. Code blocks, bold, italic, and inline code are preserved. Special characters are escaped to prevent formatting errors.
Confirmation Prompts
When the agent needs user confirmation (e.g., destructive shell commands), Telegram sends a text prompt asking the user to reply “yes” to confirm.
TUI Dashboard
A rich terminal interface based on ratatui with real-time agent metrics. Requires the tui feature flag.
cargo build --release --features tui
./zeph --tui
See TUI Dashboard for full documentation including keybindings, layout, and architecture.
Message Queueing
Zeph maintains a bounded FIFO message queue (maximum 10 messages) to handle user input received during model inference. Queue behavior varies by channel:
CLI Channel
Blocking stdin read — the queue is always empty. CLI users cannot send messages while the agent is responding.
Telegram Channel
New messages are queued via an internal mpsc channel. Consecutive messages arriving within 500ms are automatically merged with a newline separator to reduce context fragmentation.
Use /clear-queue to discard queued messages.
TUI Channel
The input line remains interactive during model inference. Messages are queued in-order and drained after each response completes.
- Queue badge: [+N queued] appears in the input area when messages are pending
- Clear queue: press Ctrl+K to discard all queued messages
- Merging: consecutive messages within 500ms are merged by newline
When the queue is full (10 messages), new input is silently dropped until space becomes available.
Channel Selection Logic
Zeph selects the channel at startup based on the following priority:
- --tui flag or ZEPH_TUI=true → TUI channel (requires tui feature)
- ZEPH_TELEGRAM_TOKEN set → Telegram channel
- Otherwise → CLI channel
Only one channel is active per session.
Tool System
Zeph provides a typed tool system that gives the LLM structured access to file operations, shell commands, and web scraping. Each executor owns its tool definitions with schemas derived from Rust structs via schemars, ensuring a single source of truth between deserialization and prompt generation.
Tool Registry
Each tool executor declares its definitions via tool_definitions(). On every LLM turn the agent collects all definitions into a ToolRegistry and renders them into the system prompt as a <tools> catalog. Tool parameter schemas are auto-generated from Rust structs using #[derive(JsonSchema)] from the schemars crate.
| Tool ID | Description | Invocation | Required Parameters | Optional Parameters |
|---|---|---|---|---|
bash | Execute a shell command | ```bash | command (string) | |
read | Read file contents | ToolCall | path (string) | offset (integer), limit (integer) |
edit | Replace a string in a file | ToolCall | path (string), old_string (string), new_string (string) | |
write | Write content to a file | ToolCall | path (string), content (string) | |
glob | Find files matching a glob pattern | ToolCall | pattern (string) | |
grep | Search file contents with regex | ToolCall | pattern (string) | path (string), case_sensitive (boolean) |
web_scrape | Scrape data from a web page via CSS selectors | ```scrape | url (string), select (string) | extract (string), limit (integer) |
FileExecutor
FileExecutor handles the file-oriented tools (read, write, edit, glob, grep) in a sandboxed environment. All file paths are validated against an allowlist before any I/O operation.
- If allowed_paths is empty, the sandbox defaults to the current working directory.
- Paths are resolved via ancestor-walk canonicalization to prevent traversal attacks on non-existing paths.
- glob results are filtered post-match to exclude files outside the sandbox.
- grep validates the search directory before scanning.
See Security for details on the path validation mechanism.
Dual-Mode Execution
The agent loop supports two tool invocation modes, distinguished by InvocationHint on each ToolDef:
- Fenced block (InvocationHint::FencedBlock("bash") / FencedBlock("scrape")) — the LLM emits a fenced code block with the specified tag. ShellExecutor handles ```bash blocks, WebScrapeExecutor handles ```scrape blocks containing JSON with CSS selectors.
- Structured tool call (InvocationHint::ToolCall) — the LLM emits a ToolCall with tool_id and typed params. CompositeExecutor routes the call to FileExecutor for file tools.
Both modes coexist in the same iteration. The system prompt includes invocation instructions per tool so the LLM knows exactly which format to use.
Iteration Control
The agent loop iterates tool execution until the LLM produces a response with no tool invocations, or one of the safety limits is hit.
Iteration cap
Controlled by max_tool_iterations (default: 10). The previous hardcoded limit of 3 is replaced by this configurable value.
[agent]
max_tool_iterations = 10
Environment variable: ZEPH_AGENT_MAX_TOOL_ITERATIONS.
Doom-loop detection
If 3 consecutive tool iterations produce identical output strings, the loop breaks and the agent notifies the user. This prevents infinite loops where the LLM repeatedly issues the same failing command.
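A minimal sketch of the check (the window size of 3 comes from the description above; the comparison granularity is an assumption):

/// Returns true when the last three tool outputs are identical.
fn is_doom_loop(outputs: &[String]) -> bool {
    if outputs.len() < 3 {
        return false;
    }
    let tail = &outputs[outputs.len() - 3..];
    tail.windows(2).all(|pair| pair[0] == pair[1])
}

fn main() {
    let outputs = vec!["error".to_string(), "error".to_string(), "error".to_string()];
    assert!(is_doom_loop(&outputs));
}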
Context budget check
At the start of each iteration, the agent estimates total token usage. If usage exceeds 80% of the configured context_budget_tokens, the loop stops to avoid exceeding the model’s context window.
Permissions
The [tools.permissions] section defines pattern-based access control per tool. Each tool ID maps to an ordered array of rules. Rules use glob patterns matched case-insensitively against the tool input (command string for bash, file path for file tools). First matching rule wins; if no rule matches, the default action is Ask.
Three actions are available:
| Action | Behavior |
|---|---|
allow | Execute silently without confirmation |
ask | Prompt the user for confirmation before execution |
deny | Block execution; denied tools are hidden from the LLM system prompt |
[tools.permissions.bash]
[[tools.permissions.bash]]
pattern = "*sudo*"
action = "deny"
[[tools.permissions.bash]]
pattern = "cargo *"
action = "allow"
[[tools.permissions.bash]]
pattern = "*"
action = "ask"
When [tools.permissions] is absent, legacy blocked_commands and confirm_patterns from [tools.shell] are automatically converted to equivalent permission rules (deny and ask respectively).
Output Overflow
Tool output exceeding 30,000 characters is truncated (head + tail split) before being sent to the LLM. The full untruncated output is saved to ~/.zeph/data/tool-output/{uuid}.txt, and the truncated message includes the file path so the LLM can read the complete output if needed.
Stale overflow files older than 24 hours are cleaned up automatically on startup.
Configuration
[agent]
max_tool_iterations = 10 # Max tool loop iterations (default: 10)
[tools]
enabled = true
summarize_output = false
[tools.shell]
timeout = 30
allowed_paths = [] # Sandbox directories (empty = cwd only)
[tools.file]
allowed_paths = [] # Sandbox directories for file tools (empty = cwd only)
# Pattern-based permissions (optional; overrides legacy blocked_commands/confirm_patterns)
# [tools.permissions.bash]
# [[tools.permissions.bash]]
# pattern = "cargo *"
# action = "allow"
The tools.file.allowed_paths setting controls which directories FileExecutor can access for read, write, edit, glob, and grep operations. Shell and file sandboxes are configured independently.
| Variable | Description |
|---|---|
ZEPH_AGENT_MAX_TOOL_ITERATIONS | Max tool loop iterations (default: 10) |
TUI Dashboard
Zeph includes an optional ratatui-based Terminal User Interface that replaces the plain CLI with a rich dashboard showing real-time agent metrics, conversation history, and an always-visible input line.
Enabling
The TUI requires the tui feature flag (disabled by default):
cargo build --release --features tui
Running
# Via CLI argument
zeph --tui
# Via environment variable
ZEPH_TUI=true zeph
Layout
+-------------------------------------------------------------+
| Zeph v0.9.5 | Provider: orchestrator | Model: claude-son... |
+----------------------------------------+--------------------+
| | Skills (3/15) |
| | - setup-guide |
| | - git-workflow |
| | |
| [user] Can you check my code? +--------------------+
| | Memory |
| [zeph] Sure, let me look at | SQLite: 142 msgs |
| the code structure... | Qdrant: connected |
| ▲+--------------------+
+----------------------------------------+--------------------+
| You: write a rust function for fibon_ |
+-------------------------------------------------------------+
| [Insert] | Skills: 3 | Tokens: 4.2k | Qdrant: OK | 2m 15s |
+-------------------------------------------------------------+
- Chat panel (left 70%): bottom-up message feed with full markdown rendering (bold, italic, code blocks, lists, headings), scrollbar with proportional thumb, and scroll indicators (▲/▼). Mouse wheel scrolling supported
- Side panels (right 30%): skills, memory, and resources metrics — hidden on terminals < 80 cols
- Input line: always visible, supports multiline input via Shift+Enter. Shows a [+N queued] badge when messages are pending
- Status bar: mode indicator, skill count, token usage, uptime
- Splash screen: colored block-letter “ZEPH” banner on startup
Keybindings
Normal Mode
| Key | Action |
|---|---|
i | Enter Insert mode (focus input) |
q | Quit application |
Ctrl+C | Quit application |
Up / k | Scroll chat up |
Down / j | Scroll chat down |
Page Up/Down | Scroll chat one page |
Home / End | Scroll to top / bottom |
Mouse wheel | Scroll chat up/down (3 lines per tick) |
d | Toggle side panels on/off |
Tab | Cycle side panel focus |
Insert Mode
| Key | Action |
|---|---|
Enter | Submit input to agent |
Shift+Enter | Insert newline (multiline input) |
Escape | Switch to Normal mode |
Ctrl+C | Quit application |
Ctrl+U | Clear input line |
Ctrl+K | Clear message queue |
Confirmation Modal
When a destructive command requires confirmation, a modal overlay appears:
| Key | Action |
|---|---|
Y / Enter | Confirm action |
N / Escape | Cancel action |
All other keys are blocked while the modal is visible.
Markdown Rendering
Chat messages are rendered with full markdown support via pulldown-cmark:
| Element | Rendering |
|---|---|
**bold** | Bold modifier |
*italic* | Italic modifier |
`inline code` | Blue text with dark background glow |
| Code blocks | Green text with dimmed language tag |
# Heading | Bold + underlined |
- list item | Green bullet (•) prefix |
> blockquote | Dimmed vertical bar (│) prefix |
~~strikethrough~~ | Crossed-out modifier |
--- | Horizontal rule (─) |
Thinking Blocks
When using Ollama models that emit reasoning traces (DeepSeek, Qwen), the <think>...</think> segments are rendered in a darker color (DarkGray) to visually separate model reasoning from the final response. Incomplete thinking blocks during streaming are also shown in the darker style.
Conversation History
On startup, the TUI loads the latest conversation from SQLite and displays it in the chat panel. This provides continuity across sessions.
Message Queueing
The TUI input line remains interactive during model inference, allowing you to queue up to 10 messages for sequential processing. This is useful for providing follow-up instructions without waiting for the current response to complete.
Queue Indicator
When messages are pending, a badge appears in the input area:
You: next message here [+3 queued]_
The counter shows how many messages are waiting to be processed. Queued messages are drained automatically after each response completes.
Message Merging
Consecutive messages submitted within 500ms are automatically merged with newline separators. This reduces context fragmentation when you send rapid-fire instructions.
Clearing the Queue
Press Ctrl+K in Insert mode to discard all queued messages. This is useful if you change your mind about pending instructions.
Alternatively, send the /clear-queue command to clear the queue programmatically.
Queue Limits
The queue holds a maximum of 10 messages. When full, new input is silently dropped until the agent drains the queue by processing pending messages.
Responsive Layout
The TUI adapts to terminal width:
| Width | Layout |
|---|---|
| >= 80 cols | Full layout: chat (70%) + side panels (30%) |
| < 80 cols | Side panels hidden, chat takes full width |
Live Metrics
The TUI dashboard displays real-time metrics collected from the agent loop via tokio::sync::watch channel:
| Panel | Metrics |
|---|---|
| Skills | Active/total skill count, matched skill names per query |
| Memory | SQLite message count, conversation ID, Qdrant status, embeddings generated, summaries count, tool output prunes |
| Resources | Prompt/completion/total tokens, API calls, last LLM latency (ms), provider and model name |
Metrics are updated at key instrumentation points in the agent loop:
- After each LLM call (api_calls, latency, prompt tokens)
- After streaming completes (completion tokens)
- After skill matching (active skills, total skills)
- After message persistence (sqlite message count)
- After summarization (summaries count)
Token counts use a chars/4 estimation (sufficient for dashboard display).
Deferred Model Warmup
When running with Ollama (or an orchestrator with Ollama sub-providers), model warmup is deferred until after the TUI interface renders. This means:
- The TUI appears immediately — no blank terminal while the model loads into GPU/CPU memory
- A status indicator (“warming up model…”) appears in the chat panel
- Warmup runs in the background via a spawned tokio task
- Once complete, the status updates to “model ready” and the agent loop begins processing
If you send a message before warmup finishes, it is queued and processed automatically once the model is ready.
Note: In non-TUI modes (CLI, Telegram), warmup still runs synchronously before the agent loop starts.
Architecture
The TUI runs as three concurrent loops:
- Crossterm event reader — dedicated OS thread (std::thread), sends key/tick/resize events via mpsc
- TUI render loop — tokio task, draws frames at 10 FPS via tokio::select!, polls watch::Receiver for the latest metrics before each draw
- Agent loop — existing Agent::run(), communicates via TuiChannel and emits metrics via watch::Sender
TuiChannel implements the Channel trait, so it plugs into the agent with zero changes to the generic signature. MetricsSnapshot and MetricsCollector live in zeph-core to avoid circular dependencies — zeph-tui re-exports them.
Tracing
When TUI is active, tracing output is redirected to zeph.log to avoid corrupting the terminal display.
Docker
Docker images are built without the tui feature by default (headless operation). To build a Docker image with TUI support:
docker build -f Dockerfile.dev --build-arg CARGO_FEATURES=tui -t zeph:tui .
Code Indexing
AST-based code indexing and semantic retrieval for project-aware context. The zeph-index crate parses source files via tree-sitter, chunks them by AST structure, embeds the chunks in Qdrant, and retrieves relevant code via hybrid search (semantic + grep routing) for injection into the agent context window.
Disabled by default. Enable via [index] enabled = true in config.
Why Code RAG
Cloud models with 200K token windows can afford multi-round agentic grep. Local models with 8K-32K windows cannot: a single grep cycle costs ~2K tokens (25% of an 8K budget), while 5 rounds would exceed the entire context. RAG retrieves 6-8 relevant chunks in ~3K tokens, preserving budget for history and response.
For cloud models, code RAG serves as pre-fill context alongside agentic search. For local models, it is the primary code retrieval mechanism.
Setup
- Start Qdrant (required for vector storage): docker compose up -d qdrant
- Enable indexing in config: set enabled = true under [index]
- Index your project: run zeph index, or let auto-indexing handle it on startup when auto_index = true (default)
Architecture
The zeph-index crate contains 7 modules:
| Module | Purpose |
|---|---|
| languages | Language detection from file extensions, tree-sitter grammar registry |
| chunker | AST-based chunking with greedy sibling merge (cAST-inspired algorithm) |
| context | Contextualized embedding text generation (file path + scope + imports + code) |
| store | Dual-write storage: Qdrant vectors + SQLite chunk metadata |
| indexer | Orchestrator: walk project tree, chunk files, embed, store with incremental change detection |
| retriever | Query classification, semantic search, budget-aware chunk packing |
| repo_map | Compact structural map of the project (signatures only, no function bodies) |
Pipeline
Source files
|
v
[languages.rs] detect language, load grammar
|
v
[chunker.rs] parse AST, split into chunks (target: ~600 non-ws chars)
|
v
[context.rs] prepend file path, scope chain, imports, language tag
|
v
[indexer.rs] embed via LlmProvider, skip unchanged (content hash)
|
v
[store.rs] upsert to Qdrant (vectors) + SQLite (metadata)
Retrieval
User query
|
v
[retriever.rs] classify_query()
|
+--> Semantic --> embed query --> Qdrant search --> budget pack --> inject
|
+--> Grep --> return empty (agent uses bash tools)
|
+--> Hybrid --> semantic search + hint to agent
Query Classification
The retriever classifies each query to route it to the appropriate search strategy:
| Strategy | Trigger | Action |
|---|---|---|
| Grep | Exact symbols: ::, fn , struct , CamelCase, snake_case identifiers | Agent handles via shell grep/ripgrep |
| Semantic | Conceptual queries: “how”, “where”, “why”, “explain” | Vector similarity search in Qdrant |
| Hybrid | Both symbol patterns and conceptual words | Semantic search + hint that grep may also help |
Default (no pattern match): Semantic.
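A simplified sketch of this routing heuristic follows; the actual classifier in retriever.rs may use different patterns and weights.

```rust
/// Rough sketch of query routing, not the real retriever.rs implementation.
#[derive(Debug, PartialEq)]
enum SearchStrategy {
    Grep,
    Semantic,
    Hybrid,
}

fn classify_query(query: &str) -> SearchStrategy {
    // Symbol-like patterns suggest the agent should just grep.
    let has_symbols = query.contains("::")
        || query.contains("fn ")
        || query.contains("struct ")
        || query.split_whitespace().any(|w| w.contains('_')) // snake_case
        || query.split_whitespace().any(|w| {
            // rough CamelCase check: starts uppercase and has another uppercase later
            let mut chars = w.chars();
            matches!(chars.next(), Some(c) if c.is_uppercase()) && chars.any(|c| c.is_uppercase())
        });

    // Conceptual wording suggests vector search.
    let lowered = query.to_lowercase();
    let is_conceptual = ["how", "where", "why", "explain"]
        .iter()
        .any(|kw| lowered.contains(*kw));

    match (has_symbols, is_conceptual) {
        (true, true) => SearchStrategy::Hybrid,
        (true, false) => SearchStrategy::Grep,
        _ => SearchStrategy::Semantic, // default
    }
}
```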
AST-Based Chunking
Files are parsed via tree-sitter into AST, then chunked by entity boundaries (functions, structs, classes, impl blocks). The algorithm uses greedy sibling merge:
- Target size: 600 non-whitespace characters (~300-400 tokens)
- Max size: 1200 non-ws chars (forced recursive split)
- Min size: 100 non-ws chars (merge with adjacent sibling)
Config files (TOML, JSON, Markdown, Bash) are indexed as single file-level chunks since they lack named entities.
Each chunk carries rich metadata: file path, language, AST node type, entity name, line range, scope chain (e.g. MyStruct > impl MyStruct > my_method), imports, and a BLAKE3 content hash for change detection.
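A conceptual sketch of the greedy sibling merge, operating on pre-extracted sibling strings rather than real tree-sitter nodes. The actual chunker also splits oversized entities recursively and folds chunks below the 100 non-whitespace-character minimum into an adjacent sibling.

```rust
/// Conceptual sketch of greedy sibling merge; sizes are counted in
/// non-whitespace characters, as described above.
const TARGET: usize = 600;
const MAX: usize = 1200;

fn non_ws_len(s: &str) -> usize {
    s.chars().filter(|c| !c.is_whitespace()).count()
}

fn merge_siblings(siblings: Vec<String>) -> Vec<String> {
    let mut chunks: Vec<String> = Vec::new();
    for node in siblings {
        // Merge into the previous chunk while it is still under the target
        // size and the merge would not blow past the hard maximum.
        let can_merge = chunks.last().is_some_and(|last| {
            non_ws_len(last) < TARGET && non_ws_len(last) + non_ws_len(&node) <= MAX
        });
        if can_merge {
            let last = chunks.last_mut().unwrap(); // non-empty: checked by can_merge
            last.push('\n');
            last.push_str(&node);
        } else {
            chunks.push(node);
        }
    }
    chunks
}
```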
Contextualized Embeddings
Embedding raw code alone yields poor retrieval quality for conceptual queries. Before embedding, each chunk is prepended with:
- File path (# src/agent.rs)
- Scope chain (# Scope: Agent > prepare_context)
- Language tag (# Language: rust)
- First 5 import/use statements
This contextualized form improves retrieval for queries like “where is auth handled?” where the code alone might not contain the word “auth”.
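A sketch of how the contextualized embedding text could be assembled; the struct and formatting details here are assumptions for illustration, not the actual context.rs implementation.

```rust
/// Illustrative only: builds the text that gets embedded instead of raw code.
struct ChunkContext<'a> {
    file_path: &'a str,
    scope_chain: &'a [&'a str],
    language: &'a str,
    imports: &'a [&'a str],
    code: &'a str,
}

fn contextualize(chunk: &ChunkContext) -> String {
    let mut out = String::new();
    out.push_str(&format!("# {}\n", chunk.file_path));
    if !chunk.scope_chain.is_empty() {
        out.push_str(&format!("# Scope: {}\n", chunk.scope_chain.join(" > ")));
    }
    out.push_str(&format!("# Language: {}\n", chunk.language));
    for import in chunk.imports.iter().take(5) {
        out.push_str(import);
        out.push('\n');
    }
    out.push('\n');
    out.push_str(chunk.code);
    out
}
```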
Storage
Chunks are dual-written to two stores:
| Store | Data | Purpose |
|---|---|---|
| Qdrant (zeph_code_chunks) | Embedding vectors + payload (code, metadata) | Semantic similarity search |
| SQLite (chunk_metadata) | File path, content hash, line range, language, node type | Change detection, cleanup of deleted files |
The Qdrant collection uses INT8 scalar quantization for ~4x memory reduction with minimal accuracy loss. Payload indexes on language, file_path, and node_type enable filtered search.
Incremental Indexing
On subsequent runs, the indexer skips unchanged chunks by checking BLAKE3 content hashes in SQLite. Only modified or new files are re-embedded. Deleted files are detected by comparing the current file set against the SQLite index, and their chunks are removed from both stores.
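A minimal sketch of the skip check, assuming the stored hash comes from the SQLite chunk metadata and using the blake3 crate.

```rust
/// Sketch only: a chunk is re-embedded when its BLAKE3 hash differs from the
/// stored one. `stored_hash` stands in for the SQLite lookup.
fn needs_reembedding(chunk_text: &str, stored_hash: Option<&str>) -> (bool, String) {
    let current = blake3::hash(chunk_text.as_bytes()).to_hex().to_string();
    let changed = stored_hash != Some(current.as_str());
    (changed, current) // persist `current` back to SQLite after embedding
}
```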
File Watcher
When watch = true (default), an IndexWatcher monitors project files for changes during the session. On file modification, the changed file is automatically re-indexed via reindex_file() without rebuilding the entire index. The watcher uses 1-second debounce to batch rapid changes and only processes files with indexable extensions.
Disable with:
[index]
watch = false
Repo Map
A lightweight structural map of the project, generated via tree-sitter signature extraction (no function bodies). Included in the system prompt and cached with a configurable TTL (default: 5 minutes) to avoid per-message filesystem traversal.
Example output:
<repo_map>
src/agent.rs :: struct:Agent, impl:Agent, fn:new, fn:run, fn:prepare_context
src/config.rs :: struct:Config, fn:load
src/main.rs :: fn:main, fn:setup_logging
... and 12 more files
</repo_map>
The map is budget-constrained (default: 1024 tokens) and sorted by symbol count (files with more symbols appear first). It gives the model a structural overview of the project without consuming significant context.
Budget-Aware Retrieval
Retrieved chunks are packed into a token budget (default: 40% of available context for code). Chunks are sorted by similarity score and greedily packed until the budget is exhausted. A minimum score threshold (default: 0.25) filters low-relevance results.
Retrieved code is injected as a transient <code_context> XML block before the conversation history. It is re-generated on every turn and never persisted.
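A sketch of the greedy packing step under the chars/4 token estimate used elsewhere in these docs; names are illustrative, not the retriever API.

```rust
/// Sketch of budget packing: filter by score threshold, sort by similarity,
/// then pack chunks until the token budget is exhausted.
struct ScoredChunk {
    score: f32,
    text: String,
}

fn pack_chunks(mut chunks: Vec<ScoredChunk>, budget_tokens: usize, min_score: f32) -> Vec<ScoredChunk> {
    chunks.retain(|c| c.score >= min_score);
    chunks.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap_or(std::cmp::Ordering::Equal));

    let mut used = 0;
    let mut packed = Vec::new();
    for chunk in chunks {
        let cost = chunk.text.chars().count() / 4; // rough token estimate
        if used + cost > budget_tokens {
            break; // budget exhausted
        }
        used += cost;
        packed.push(chunk);
    }
    packed
}
```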
Context Window Layout (with Code RAG)
When code indexing is enabled, the context window includes two additional sections:
+---------------------------------------------------+
| System prompt + environment + ZEPH.md |
+---------------------------------------------------+
| <repo_map> (structural overview, cached) | <= 1024 tokens
+---------------------------------------------------+
| <available_skills> |
+---------------------------------------------------+
| <code_context> (per-query RAG chunks, transient) | <= 30% available
+---------------------------------------------------+
| [semantic recall] past messages | <= 10% available
+---------------------------------------------------+
| Recent message history | <= 50% available
+---------------------------------------------------+
| [response reserve] | 20% of total
+---------------------------------------------------+
Configuration
[index]
# Enable codebase indexing for semantic code search.
# Requires Qdrant running (uses separate collection "zeph_code_chunks").
enabled = false
# Auto-index on startup and re-index changed files during session.
auto_index = true
# Directories to index (relative to cwd).
paths = ["."]
# Patterns to exclude (in addition to .gitignore).
exclude = ["target", "node_modules", ".git", "vendor", "dist", "build", "__pycache__"]
# Token budget for repo map in system prompt (0 = no repo map).
repo_map_budget = 1024
# Cache TTL for repo map in seconds (avoids per-message regeneration).
repo_map_ttl_secs = 300
[index.chunker]
# Target chunk size in non-whitespace characters (~300-400 tokens).
target_size = 600
# Maximum chunk size before forced split.
max_size = 1200
# Minimum chunk size — smaller chunks merge with siblings.
min_size = 100
[index.retrieval]
# Maximum chunks to fetch from Qdrant (before budget packing).
max_chunks = 12
# Minimum cosine similarity score to accept.
score_threshold = 0.25
# Maximum fraction of available context budget for code chunks.
budget_ratio = 0.40
Supported Languages
Language support is controlled by feature flags on the zeph-index crate. All default features are enabled when the index binary feature is active.
| Language | Feature | Extensions |
|---|---|---|
| Rust | lang-rust | .rs |
| Python | lang-python | .py, .pyi |
| JavaScript | lang-js | .js, .jsx, .mjs, .cjs |
| TypeScript | lang-js | .ts, .tsx, .mts, .cts |
| Go | lang-go | .go |
| Bash | lang-config | .sh, .bash, .zsh |
| TOML | lang-config | .toml |
| JSON | lang-config | .json, .jsonc |
| Markdown | lang-config | .md, .markdown |
Environment Variables
| Variable | Description | Default |
|---|---|---|
| ZEPH_INDEX_ENABLED | Enable code indexing | false |
| ZEPH_INDEX_AUTO_INDEX | Auto-index on startup | true |
| ZEPH_INDEX_REPO_MAP_BUDGET | Token budget for repo map | 1024 |
| ZEPH_INDEX_REPO_MAP_TTL_SECS | Cache TTL for repo map in seconds | 300 |
Embedding Model Recommendations
The indexer uses the same LlmProvider::embed() as semantic memory. Any embedding model works. For code-heavy workloads:
| Model | Dims | Notes |
|---|---|---|
| qwen3-embedding | 1024 | Current Zeph default, good general performance |
| nomic-embed-text | 768 | Lightweight universal model |
| nomic-embed-code | 768 | Optimized for code, higher RAM (~7.5GB) |
Architecture Overview
Cargo workspace (Edition 2024, resolver 3) with 10 crates + binary root.
Requires Rust 1.88+. Native async traits are used throughout — no async-trait crate.
Workspace Layout
zeph (binary) — thin bootstrap glue
├── zeph-core Agent loop, config, config hot-reload, channel trait, context builder
├── zeph-llm LlmProvider trait, Ollama + Claude + OpenAI + Candle backends, orchestrator, embeddings
├── zeph-skills SKILL.md parser, registry with lazy body loading, embedding matcher, resource resolver, hot-reload
├── zeph-memory SQLite + Qdrant, SemanticMemory orchestrator, summarization
├── zeph-channels Telegram adapter (teloxide) with streaming
├── zeph-tools ToolExecutor trait, ShellExecutor, WebScrapeExecutor, CompositeExecutor
├── zeph-index AST-based code indexing, hybrid retrieval, repo map (optional)
├── zeph-mcp MCP client via rmcp, multi-server lifecycle, unified tool matching (optional)
├── zeph-a2a A2A protocol client + server, agent discovery, JSON-RPC 2.0 (optional)
└── zeph-tui ratatui TUI dashboard with real-time metrics (optional)
Dependency Graph
zeph (binary)
└── zeph-core (orchestrates everything)
├── zeph-llm (leaf)
├── zeph-skills (leaf)
├── zeph-memory (leaf)
├── zeph-channels (leaf)
├── zeph-tools (leaf)
├── zeph-index (optional, leaf)
├── zeph-mcp (optional, leaf)
├── zeph-a2a (optional, leaf)
└── zeph-tui (optional, leaf)
zeph-core is the only crate that depends on other workspace crates. All leaf crates are independent and can be tested in isolation.
Agent Loop
The agent loop processes user input in a continuous cycle:
1. Read initial user message via channel.recv()
2. Build context from skills, memory, and environment
3. Stream LLM response token-by-token
4. Execute any tool calls in the response
5. Drain queued messages (if any) via channel.try_recv() and repeat from step 2
Queued messages are processed sequentially with full context rebuilding between each. Consecutive messages within 500ms are merged to reduce fragmentation. The queue holds a maximum of 10 messages; older messages are dropped when full.
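A pseudocode-level sketch of this cycle; the helper functions are illustrative stubs, not zeph-core APIs, and the real Agent::run is generic over provider, channel, and tool executor.

```rust
use std::collections::VecDeque;

/// Sketch of the agent cycle. In the real loop, responses are streamed and
/// the queue is fed by the channel rather than passed in up front.
async fn agent_cycle(first_message: String, mut queue: VecDeque<String>) {
    let mut message = first_message;              // 1. initial message from recv()
    loop {
        let context = build_context(&message);    // 2. skills + memory + environment
        let response = call_llm(&context).await;  // 3. LLM response
        execute_tool_calls(&response).await;       // 4. run any tool calls
        match queue.pop_front() {                  // 5. drain queued messages
            Some(next) => message = next,          //    and repeat from step 2
            None => break,
        }
    }
}

// Placeholder helpers so the sketch compiles on its own.
fn build_context(message: &str) -> String {
    format!("<context for: {message}>")
}
async fn call_llm(context: &str) -> String {
    format!("<response to: {context}>")
}
async fn execute_tool_calls(_response: &str) {}
```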
Key Design Decisions
- Generic Agent: Agent<P: LlmProvider + Clone + 'static, C: Channel, T: ToolExecutor> — fully generic over provider, channel, and tool executor
- TLS: rustls everywhere (no openssl-sys)
- Errors: thiserror for library crates, anyhow for application code (zeph-core, main.rs)
- Lints: workspace-level clippy::all + clippy::pedantic + clippy::nursery; unsafe_code = "deny"
- Dependencies: versions only in root [workspace.dependencies]; crates inherit via workspace = true
- Feature gates: optional crates (zeph-index, zeph-mcp, zeph-a2a, zeph-tui) are feature-gated in the binary
- Context engineering: proportional budget allocation, semantic recall injection, message trimming, runtime compaction, environment context injection, progressive skill loading, ZEPH.md project config discovery
Crates
Each workspace crate has a focused responsibility. All leaf crates are independent and testable in isolation; only zeph-core depends on other workspace members.
zeph-core
Agent loop, configuration loading, and context builder.
- Agent<P, C, T> — main agent loop with streaming support, message queue drain, configurable max_tool_iterations (default 10), doom-loop detection, and context budget check (stops at 80% threshold)
- Config — TOML config loading with env var overrides
- Channel trait — abstraction for I/O (CLI, Telegram, TUI) with recv(), try_recv(), send_queue_count() for queue management
- Context builder — assembles system prompt from skills, memory, summaries, environment, and project config
- Context engineering — proportional budget allocation, semantic recall injection, message trimming, runtime compaction
- EnvironmentContext — runtime gathering of cwd, git branch, OS, model name
- project.rs — ZEPH.md config discovery (walk up directory tree)
- VaultProvider trait — pluggable secret resolution
- MetricsSnapshot / MetricsCollector — real-time metrics via tokio::sync::watch for TUI dashboard
zeph-llm
LLM provider abstraction and backend implementations.
- LlmProvider trait — chat(), chat_stream(), embed(), supports_streaming(), supports_embeddings()
- OllamaProvider — local inference via ollama-rs
- ClaudeProvider — Anthropic Messages API with SSE streaming
- OpenAiProvider — OpenAI + compatible APIs (raw reqwest)
- CandleProvider — local GGUF model inference via candle
- AnyProvider — enum dispatch for runtime provider selection
- ModelOrchestrator — task-based multi-model routing with fallback chains
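For reference, a rough sketch of what the provider contract looks like with native async fn in traits. Signatures and the associated error type are assumptions; the streaming method is elided because its return type depends on the implementation. See zeph-llm for the real trait.

```rust
/// Illustrative sketch of the provider contract, not the zeph-llm definition.
pub struct Message {
    pub role: String,
    pub content: String,
}

pub trait LlmProvider {
    type Error;

    async fn chat(&self, messages: &[Message]) -> Result<String, Self::Error>;
    async fn embed(&self, text: &str) -> Result<Vec<f32>, Self::Error>;
    fn supports_streaming(&self) -> bool;
    fn supports_embeddings(&self) -> bool;
    // chat_stream() elided: streaming return types vary by backend.
}
```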
zeph-skills
SKILL.md loader, skill registry, and prompt formatter.
- SkillMeta / Skill — metadata + lazy body loading via OnceLock
- SkillRegistry — manages skill lifecycle, lazy body access
- SkillMatcher — in-memory cosine similarity matching
- QdrantSkillMatcher — persistent embeddings with BLAKE3 delta sync
- format_skills_prompt() — assembles prompt with OS-filtered resources
- format_skills_catalog() — description-only entries for non-matched skills
- resource.rs — discover_resources() + load_resource() with path traversal protection
- Filesystem watcher for hot-reload (500ms debounce)
zeph-memory
SQLite-backed conversation persistence with Qdrant vector search.
- SqliteStore — conversations, messages, summaries, skill usage, skill versions
- QdrantStore — vector storage and cosine similarity search
- SemanticMemory<P> — orchestrator coordinating SQLite + Qdrant + LlmProvider
- Automatic collection creation, graceful degradation without Qdrant
zeph-channels
Channel implementations for the Zeph agent.
- CliChannel — stdin/stdout with immediate streaming output, blocking recv (queue always empty)
- TelegramChannel — teloxide adapter with MarkdownV2 rendering, streaming via edit-in-place, user whitelisting, inline confirmation keyboards, mpsc-backed message queue with 500ms merge window
zeph-tools
Tool execution abstraction and shell backend.
- ToolExecutor trait — accepts LLM response or structured ToolCall, returns tool output
- ToolRegistry — typed definitions for 7 built-in tools (bash, read, edit, write, glob, grep, web_scrape), injected into system prompt as <tools> catalog
- ToolCall / execute_tool_call() — structured tool invocation with typed parameters alongside legacy bash extraction (dual-mode)
- FileExecutor — sandboxed file operations (read, write, edit, glob, grep) with ancestor-walk path canonicalization
- ShellExecutor — bash block parser, command safety filter, sandbox validation
- WebScrapeExecutor — HTML scraping with CSS selectors, SSRF protection
- CompositeExecutor<A, B> — generic chaining with first-match-wins dispatch, routes structured tool calls by tool_id to the appropriate backend
- AuditLogger — structured JSON audit trail for all executions
- truncate_tool_output() — head+tail split at 30K chars with UTF-8 safe boundaries
zeph-index
AST-based code indexing, semantic retrieval, and repo map generation (optional, feature-gated).
- Lang enum — supported languages with tree-sitter grammar registry, feature-gated per language group
- chunk_file() — AST-based chunking with greedy sibling merge, scope chains, import extraction
- contextualize_for_embedding() — prepends file path, scope, language, imports to code for better embedding quality
- CodeStore — dual-write storage: Qdrant vectors (zeph_code_chunks collection) + SQLite metadata with BLAKE3 content-hash change detection
- CodeIndexer<P> — project indexer orchestrator: walk, chunk, embed, store with incremental skip of unchanged chunks
- CodeRetriever<P> — hybrid retrieval with query classification (Semantic / Grep / Hybrid), budget-aware chunk packing
- generate_repo_map() — compact structural view via tree-sitter signature extraction, budget-constrained
zeph-mcp
MCP client for external tool servers (optional, feature-gated).
- McpClient / McpManager — multi-server lifecycle management
- McpToolExecutor — tool execution via MCP protocol
- McpToolRegistry — tool embeddings in Qdrant with delta sync
- Dual transport: Stdio (child process) and HTTP (Streamable HTTP)
- Dynamic server management via /mcp add, /mcp remove
zeph-a2a
A2A protocol client and server (optional, feature-gated).
- A2aClient — JSON-RPC 2.0 client with SSE streaming
- AgentRegistry — agent card discovery with TTL cache
- AgentCardBuilder — construct agent cards from runtime config
- A2A Server — axum-based HTTP server with bearer auth, rate limiting, body size limits
- TaskManager — in-memory task lifecycle management
zeph-tui
ratatui-based TUI dashboard (optional, feature-gated).
- TuiChannel — Channel trait implementation bridging agent loop and TUI render loop via mpsc, oneshot-based confirmation dialog, bounded message queue (max 10) with 500ms merge window
- App — TUI state machine with Normal/Insert/Confirm modes, keybindings, scroll, live metrics polling via watch::Receiver, queue badge indicator [+N queued], Ctrl+K to clear queue
- EventReader — crossterm event loop on dedicated OS thread (avoids tokio starvation)
- Side panel widgets: skills (active/total), memory (SQLite, Qdrant, embeddings), resources (tokens, API calls, latency)
- Chat widget with bottom-up message feed, pulldown-cmark markdown rendering, scrollbar with proportional thumb, mouse scroll, thinking block segmentation, and streaming cursor
- Splash screen widget with colored block-letter banner
- Conversation history loading from SQLite on startup
- Confirmation modal overlay widget with Y/N keybindings and focus capture
- Responsive layout: side panels hidden on terminals < 80 cols
- Multiline input via Shift+Enter
- Status bar with mode, skill count, tokens, Qdrant status, uptime
- Panic hook for terminal state restoration
- Re-exports MetricsSnapshot / MetricsCollector from zeph-core
Token Efficiency
Zeph’s prompt construction is designed to minimize token usage regardless of how many skills and MCP tools are installed.
The Problem
Naive AI agent implementations inject all available tools and instructions into every prompt. With 50 skills and 100 MCP tools, this means thousands of tokens consumed on every request — most of which are irrelevant to the user’s query.
Zeph’s Approach
Embedding-Based Selection
Per query, only the top-K most relevant skills (default: 5) are selected via cosine similarity of vector embeddings. The same pipeline handles MCP tools.
User query → embed(query) → cosine_similarity(query, skills) → top-K → inject into prompt
This makes prompt size O(K) instead of O(N), where:
- K = max_active_skills (default: 5, configurable)
- N = total skills + MCP tools installed
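A minimal sketch of the top-K selection step; the struct and function names are illustrative, not the zeph-skills API.

```rust
/// Rank installed skills by cosine similarity to the query embedding and
/// keep only the best K. Sketch only.
struct SkillEmbedding {
    name: String,
    vector: Vec<f32>,
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

fn top_k_skills(query: &[f32], skills: &[SkillEmbedding], k: usize) -> Vec<String> {
    let mut scored: Vec<(f32, &str)> = skills
        .iter()
        .map(|s| (cosine(query, &s.vector), s.name.as_str()))
        .collect();
    scored.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap_or(std::cmp::Ordering::Equal));
    scored.into_iter().take(k).map(|(_, name)| name.to_string()).collect()
}
```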
Progressive Loading
Even selected skills don’t load everything at once:
| Stage | What loads | When | Token cost |
|---|---|---|---|
| Startup | Skill metadata (name, description) | Once | ~100 tokens per skill |
| Query | Skill body (instructions, examples) | On match | <5000 tokens per skill |
| Query | Resource files (references, scripts) | On match + OS filter | Variable |
Metadata is always in memory for matching. Bodies are loaded lazily via OnceLock and cached after first access. Resources are loaded on demand with OS filtering (e.g., linux.md only loads on Linux).
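A sketch of the lazy-body pattern with OnceLock; the field and method names here are assumptions for illustration.

```rust
use std::path::PathBuf;
use std::sync::OnceLock;

/// Illustrative: metadata stays resident, the body is read from disk on
/// first access and cached for subsequent matches.
struct Skill {
    name: String,
    description: String, // metadata, always in memory (~100 tokens)
    body_path: PathBuf,
    body: OnceLock<String>,
}

impl Skill {
    fn body(&self) -> &str {
        self.body
            .get_or_init(|| std::fs::read_to_string(&self.body_path).unwrap_or_default())
    }
}
```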
Two-Tier Skill Catalog
Non-matched skills are listed in a description-only <other_skills> catalog — giving the model awareness of all available capabilities without injecting their full bodies. This means the model can request a specific skill if needed, while consuming only ~20 tokens per unmatched skill instead of thousands.
MCP Tool Matching
MCP tools follow the same pipeline:
- Tools are embedded in Qdrant (zeph_mcp_tools collection) with BLAKE3 content-hash delta sync
- Only re-embedded when tool definitions change
- Unified matching ranks both skills and MCP tools by relevance score
- Prompt contains only the top-K combined results
Practical Impact
| Scenario | Naive approach | Zeph |
|---|---|---|
| 10 skills, no MCP | ~50K tokens/prompt | ~25K tokens/prompt |
| 50 skills, 100 MCP tools | ~250K tokens/prompt | ~25K tokens/prompt |
| 200 skills, 500 MCP tools | ~1M tokens/prompt | ~25K tokens/prompt |
Prompt size stays constant as you add more capabilities. The only cost of more skills is a slightly larger embedding index in Qdrant or memory.
Two-Tier Context Pruning
Long conversations accumulate tool outputs that consume significant context space. Zeph uses a two-tier strategy: Tier 1 selectively prunes old tool outputs (cheap, no LLM call), and Tier 2 falls back to full LLM compaction only when Tier 1 is insufficient. See Context Engineering for details.
Configuration
[skills]
max_active_skills = 5 # Increase for broader context, decrease for faster/cheaper queries
export ZEPH_SKILLS_MAX_ACTIVE=3 # Override via env var
Security
Zeph implements defense-in-depth security for safe AI agent operations in production environments.
Shell Command Filtering
All shell commands from LLM responses pass through a security filter before execution. Commands matching blocked patterns are rejected with detailed error messages.
12 blocked patterns by default:
| Pattern | Risk Category | Examples |
|---|---|---|
| rm -rf /, rm -rf /* | Filesystem destruction | Prevents accidental system wipe |
| sudo, su | Privilege escalation | Blocks unauthorized root access |
| mkfs, fdisk | Filesystem operations | Prevents disk formatting |
| dd if=, dd of= | Low-level disk I/O | Blocks dangerous write operations |
| curl \| bash, wget \| sh | Arbitrary code execution | Prevents remote code injection |
| nc, ncat, netcat | Network backdoors | Blocks reverse shell attempts |
| shutdown, reboot, halt | System control | Prevents service disruption |
Configuration:
[tools.shell]
timeout = 30
blocked_commands = ["custom_pattern"] # Additional patterns (additive to defaults)
allowed_paths = ["/home/user/workspace"] # Restrict filesystem access
allow_network = true # false blocks curl/wget/nc
confirm_patterns = ["rm ", "git push -f"] # Destructive command patterns
Custom blocked patterns are additive — you cannot weaken default security. Matching is case-insensitive.
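A minimal sketch of this kind of case-insensitive, additive check; the real filter may match patterns more precisely.

```rust
/// Sketch only: a command is blocked if it contains any default or
/// user-supplied pattern, compared case-insensitively.
fn is_blocked(command: &str, default_patterns: &[&str], custom_patterns: &[&str]) -> bool {
    let cmd = command.to_lowercase();
    default_patterns
        .iter()
        .chain(custom_patterns.iter())
        .any(|pattern| cmd.contains(&pattern.to_lowercase()))
}

// is_blocked("SUDO rm -rf /", &["sudo", "rm -rf /"], &["custom_pattern"]) == true
```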
Shell Sandbox
Commands are validated against a configurable filesystem allowlist before execution:
- allowed_paths = [] (default) restricts access to the working directory only
- Paths are canonicalized to prevent traversal attacks (../../etc/passwd)
- allow_network = false blocks network tools (curl, wget, nc, ncat, netcat)
Destructive Command Confirmation
Commands matching confirm_patterns trigger an interactive confirmation before execution:
- CLI: y/N prompt on stdin
- Telegram: inline keyboard with Confirm/Cancel buttons
- Default patterns: rm, git push -f, git push --force, drop table, drop database, truncate
- Configurable via tools.shell.confirm_patterns in TOML
File Executor Sandbox
FileExecutor enforces the same allowed_paths sandbox as the shell executor for all file operations (read, write, edit, glob, grep).
Path validation:
- All paths are resolved to absolute form and canonicalized before access
- Non-existing paths (e.g., for write) use ancestor-walk canonicalization: the resolver walks up the path tree to the nearest existing ancestor, canonicalizes it, then re-appends the remaining segments. This prevents symlink and .. traversal on paths that do not yet exist on disk (see the sketch below)
- If the resolved path does not fall under any entry in allowed_paths, the operation is rejected with a SandboxViolation error
Glob and grep enforcement:
- glob results are post-filtered: matched paths outside the sandbox are silently excluded
- grep validates the search root directory before scanning begins
Configuration is shared with the shell sandbox:
[tools.shell]
allowed_paths = ["/home/user/workspace"] # Empty = cwd only
Permission Policy
The [tools.permissions] config section provides fine-grained, pattern-based access control for each tool. Rules are evaluated in order (first match wins) using case-insensitive glob patterns against the tool input. See Tool System — Permissions for configuration details.
Key security properties:
- Tools with all-deny rules are excluded from the LLM system prompt, preventing the model from attempting to use them
- Legacy blocked_commands and confirm_patterns are auto-migrated to equivalent permission rules when [tools.permissions] is absent
- Default action when no rule matches is Ask (confirmation required)
Audit Logging
Structured JSON audit log for all tool executions:
[tools.audit]
enabled = true
destination = "./data/audit.jsonl" # or "stdout"
Each entry includes timestamp, tool name, command, result (success/blocked/error/timeout), and duration in milliseconds.
Secret Redaction
LLM responses are scanned for common secret patterns before display:
- Detected patterns: sk-, AKIA, ghp_, gho_, xoxb-, xoxp-, sk_live_, sk_test_, -----BEGIN
- Secrets replaced with [REDACTED], preserving original whitespace formatting
- Enabled by default (security.redact_secrets = true), applied to both streaming and non-streaming responses
Timeout Policies
Configurable per-operation timeouts prevent hung connections:
[timeouts]
llm_seconds = 120 # LLM chat completion
embedding_seconds = 30 # Embedding generation
a2a_seconds = 30 # A2A remote calls
A2A Network Security
- TLS enforcement: a2a.require_tls = true rejects HTTP endpoints (HTTPS only)
- SSRF protection: a2a.ssrf_protection = true blocks private IP ranges (RFC 1918, loopback, link-local) via DNS resolution
- Payload limits: a2a.max_body_size caps request body (default: 1 MiB)
Safe execution model:
- Commands parsed for blocked patterns, then sandbox-validated, then confirmation-checked
- Timeout enforcement (default: 30s, configurable)
- Full errors logged to system, sanitized messages shown to users
- Audit trail for all tool executions (when enabled)
Container Security
| Security Layer | Implementation | Status |
|---|---|---|
| Base image | Oracle Linux 9 Slim | Production-hardened |
| Vulnerability scanning | Trivy in CI/CD | 0 HIGH/CRITICAL CVEs |
| User privileges | Non-root zeph user (UID 1000) | Enforced |
| Attack surface | Minimal package installation | Distroless-style |
Continuous security:
- Every release scanned with Trivy before publishing
- Automated Dependabot PRs for dependency updates
- cargo-deny checks in CI for license/vulnerability compliance
Code Security
Rust-native memory safety guarantees:
- Minimal unsafe: one audited unsafe block behind the candle feature flag (memory-mapped safetensors loading). Core crates enforce #![deny(unsafe_code)]
- No panic in production: unwrap() and expect() linted via clippy
- Secure dependencies: all crates audited with cargo-deny
- MSRV policy: Rust 1.88+ (Edition 2024) for latest security patches
Reporting Vulnerabilities
Do not open a public issue. Use GitHub Security Advisories to submit a private report.
Include: description, steps to reproduce, potential impact, suggested fix. Expect an initial response within 72 hours.
Feature Flags
Zeph uses Cargo feature flags to control optional functionality. Default features cover common use cases; platform-specific and experimental features are opt-in.
| Feature | Default | Description |
|---|---|---|
| a2a | Enabled | A2A protocol client and server for agent-to-agent communication |
| openai | Enabled | OpenAI-compatible provider (GPT, Together, Groq, Fireworks, etc.) |
| mcp | Enabled | MCP client for external tool servers via stdio/HTTP transport |
| candle | Enabled | Local HuggingFace model inference via candle (GGUF quantized models) |
| orchestrator | Enabled | Multi-model routing with task-based classification and fallback chains |
| self-learning | Enabled | Skill evolution via failure detection, self-reflection, and LLM-generated improvements |
| vault-age | Enabled | Age-encrypted vault backend for file-based secret storage (age) |
| index | Enabled | AST-based code indexing and semantic retrieval via tree-sitter (guide) |
| tui | Disabled | ratatui-based TUI dashboard with real-time agent metrics |
| metal | Disabled | Metal GPU acceleration for candle on macOS (implies candle) |
| cuda | Disabled | CUDA GPU acceleration for candle on Linux (implies candle) |
Build Examples
cargo build --release # all default features
cargo build --release --features metal # macOS with Metal GPU
cargo build --release --features cuda # Linux with NVIDIA GPU
cargo build --release --features tui # with TUI dashboard
cargo build --release --no-default-features # minimal binary
zeph-index Language Features
When index is enabled, tree-sitter grammars are controlled by sub-features on the zeph-index crate. All are enabled by default.
| Feature | Languages |
|---|---|
| lang-rust | Rust |
| lang-python | Python |
| lang-js | JavaScript, TypeScript |
| lang-go | Go |
| lang-config | Bash, TOML, JSON, Markdown |
Contributing
Thank you for considering contributing to Zeph.
Getting Started
- Fork the repository
- Clone your fork and create a branch from main
- Install Rust 1.88+ (Edition 2024 required)
- Run cargo build to verify the setup
Development
Build
cargo build
Test
# Run unit tests only (exclude integration tests)
cargo nextest run --workspace --lib --bins
# Run all tests including integration tests (requires Docker)
cargo nextest run --workspace --profile ci
Nextest profiles (.config/nextest.toml):
- default: Runs all tests (unit + integration)
- ci: CI environment, runs all tests with JUnit XML output for reporting
Integration Tests
Integration tests use testcontainers-rs to automatically spin up Docker containers for external services (Qdrant, etc.).
Prerequisites: Docker must be running on your machine.
# Run only integration tests
cargo nextest run --workspace --test '*integration*'
# Run unit tests only (skip integration tests)
cargo nextest run --workspace --lib --bins
# Run all tests
cargo nextest run --workspace
Integration test files are located in each crate’s tests/ directory and follow the *_integration.rs naming convention.
Lint
cargo +nightly fmt --check
cargo clippy --all-targets
Coverage
cargo llvm-cov --all-features --workspace
Workspace Structure
| Crate | Purpose |
|---|---|
| zeph-core | Agent loop, config, channel trait |
| zeph-llm | LlmProvider trait, Ollama + Claude + OpenAI + Candle backends |
| zeph-skills | SKILL.md parser, registry, prompt formatter |
| zeph-memory | SQLite conversation persistence, Qdrant vector search |
| zeph-channels | Telegram adapter |
| zeph-tools | Tool executor, shell sandbox, web scraper |
| zeph-index | AST-based code indexing, semantic retrieval, repo map |
| zeph-mcp | MCP client, multi-server lifecycle |
| zeph-a2a | A2A protocol client and server |
| zeph-tui | ratatui TUI dashboard with real-time metrics |
Pull Requests
- Create a feature branch: feat/<scope>/<description> or fix/<scope>/<description>
- Keep changes focused — one logical change per PR
- Add tests for new functionality
- Ensure all checks pass: cargo +nightly fmt, cargo clippy, cargo nextest run --lib --bins
- Write a clear PR description following the template
Commit Messages
- Use imperative mood: “Add feature” not “Added feature”
- Keep the first line under 72 characters
- Reference related issues when applicable
Code Style
- Follow workspace clippy lints (pedantic enabled)
- Use cargo +nightly fmt for formatting
- Avoid unnecessary comments — code should be self-explanatory
- Comments are only for cognitively complex blocks
License
By contributing, you agree that your contributions will be licensed under the MIT License.
Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog.
See the full CHANGELOG.md in the repository for the complete version history.