Zeph

Lightweight AI agent with hybrid inference (Ollama / Claude / OpenAI / HuggingFace via candle), skills-first architecture, semantic memory with Qdrant, MCP client, A2A protocol support, multi-model orchestration, self-learning skill evolution, and multi-channel I/O.

Only relevant skills and MCP tools are injected into each prompt via vector similarity — keeping token usage minimal regardless of how many are installed.

Cross-platform: Linux, macOS, Windows (x86_64 + ARM64).

Key Features

  • Hybrid inference — Ollama (local), Claude (Anthropic), OpenAI (GPT + compatible APIs), Candle (HuggingFace GGUF)
  • Skills-first architecture — embedding-based skill matching selects only top-K relevant skills per query, not all
  • Semantic memory — SQLite for structured data + Qdrant for vector similarity search
  • MCP client — connect external tool servers via Model Context Protocol (stdio + HTTP transport)
  • A2A protocol — agent-to-agent communication via JSON-RPC 2.0 with SSE streaming
  • Model orchestrator — route tasks to different providers with automatic fallback chains
  • Self-learning — skills evolve through failure detection, self-reflection, and LLM-generated improvements
  • Code indexing — AST-based code RAG with tree-sitter, hybrid retrieval (semantic + grep routing), repo map
  • Context engineering — proportional budget allocation, semantic recall injection, runtime compaction, smart tool output summarization, ZEPH.md project config
  • Multi-channel I/O — CLI, Telegram, and TUI with streaming support
  • Token-efficient — prompt size is O(K) not O(N), where K is max active skills and N is total installed

Quick Start

git clone https://github.com/bug-ops/zeph
cd zeph
cargo build --release
./target/release/zeph

See Installation for pre-built binaries and Docker options.

Requirements

  • Rust 1.88+ (Edition 2024)
  • Ollama (for local inference and embeddings) or cloud API key (Claude / OpenAI)
  • Docker (optional, for Qdrant semantic memory and containerized deployment)

Installation

Install Zeph from source, pre-built binaries, or Docker.

From Source

git clone https://github.com/bug-ops/zeph
cd zeph
cargo build --release

The binary is produced at target/release/zeph.

Pre-built Binaries

Download from GitHub Releases:

| Platform | Architecture | Download |
|----------|--------------|----------|
| Linux | x86_64 | zeph-x86_64-unknown-linux-gnu.tar.gz |
| Linux | aarch64 | zeph-aarch64-unknown-linux-gnu.tar.gz |
| macOS | x86_64 | zeph-x86_64-apple-darwin.tar.gz |
| macOS | aarch64 | zeph-aarch64-apple-darwin.tar.gz |
| Windows | x86_64 | zeph-x86_64-pc-windows-msvc.zip |

Docker

Pull the latest image from GitHub Container Registry:

docker pull ghcr.io/bug-ops/zeph:latest

Or use a specific version:

docker pull ghcr.io/bug-ops/zeph:v0.9.5

Images are scanned with Trivy in CI/CD and use Oracle Linux 9 Slim base with 0 HIGH/CRITICAL CVEs. Multi-platform: linux/amd64, linux/arm64.

See Docker Deployment for full deployment options including GPU support and age vault.

Quick Start

Run Zeph after building and interact via the CLI or Telegram, using a local Ollama model or a cloud provider.

CLI Mode (default)

Unix (Linux/macOS):

./target/release/zeph

Windows:

.\target\release\zeph.exe

Type messages at the You: prompt. Type exit, quit, or press Ctrl-D to stop.

Telegram Mode

Unix (Linux/macOS):

ZEPH_TELEGRAM_TOKEN="123:ABC" ./target/release/zeph

Windows:

$env:ZEPH_TELEGRAM_TOKEN="123:ABC"; .\target\release\zeph.exe

Restrict access by setting telegram.allowed_users in the config file:

[telegram]
allowed_users = ["your_username"]

Ollama Setup

When using Ollama (default provider), ensure both the LLM model and embedding model are pulled:

ollama pull mistral:7b
ollama pull qwen3-embedding

The default configuration uses mistral:7b for text generation and qwen3-embedding for vector embeddings.

Cloud Providers

For Claude:

ZEPH_CLAUDE_API_KEY=sk-ant-... ./target/release/zeph

For OpenAI:

ZEPH_LLM_PROVIDER=openai ZEPH_OPENAI_API_KEY=sk-... ./target/release/zeph

See Configuration for the full reference.

Configuration

Zeph loads config/default.toml at startup and applies environment variable overrides.

The config path can be overridden via CLI argument or environment variable:

# CLI argument (highest priority)
zeph --config /path/to/custom.toml

# Environment variable
ZEPH_CONFIG=/path/to/custom.toml zeph

# Default (fallback)
# config/default.toml

Priority: --config > ZEPH_CONFIG > config/default.toml.

Hot-Reload

Zeph watches the config file for changes and applies runtime-safe fields without restart. The file watcher uses 500ms debounce to avoid redundant reloads.

Reloadable fields (applied immediately):

| Section | Fields |
|---------|--------|
| [security] | redact_secrets |
| [timeouts] | llm_seconds, embedding_seconds, a2a_seconds |
| [memory] | history_limit, summarization_threshold, context_budget_tokens, compaction_threshold, compaction_preserve_tail, prune_protect_tokens, cross_session_score_threshold |
| [memory.semantic] | recall_limit |
| [index] | repo_map_ttl_secs, watch |
| [agent] | max_tool_iterations |
| [skills] | max_active_skills |

Not reloadable (require restart): LLM provider/model, SQLite path, Qdrant URL, Telegram token, MCP servers, A2A config, skill paths.

Check for config reloaded in the log to confirm a successful reload.

Configuration File

[agent]
name = "Zeph"
max_tool_iterations = 10  # Max tool loop iterations per response (default: 10)

[llm]
provider = "ollama"
base_url = "http://localhost:11434"
model = "mistral:7b"
embedding_model = "qwen3-embedding"  # Model for text embeddings

[llm.cloud]
model = "claude-sonnet-4-5-20250929"
max_tokens = 4096

# [llm.openai]
# base_url = "https://api.openai.com/v1"
# model = "gpt-5.2"
# max_tokens = 4096
# embedding_model = "text-embedding-3-small"
# reasoning_effort = "medium"  # low, medium, high (for reasoning models)

[skills]
paths = ["./skills"]
max_active_skills = 5  # Top-K skills per query via embedding similarity

[memory]
sqlite_path = "./data/zeph.db"
history_limit = 50
summarization_threshold = 100  # Trigger summarization after N messages
context_budget_tokens = 0      # 0 = unlimited (proportional split: 15% summaries, 25% recall, 60% recent)
compaction_threshold = 0.75    # Compact when context usage exceeds this fraction
compaction_preserve_tail = 4   # Keep last N messages during compaction
prune_protect_tokens = 40000   # Protect recent N tokens from tool output pruning
cross_session_score_threshold = 0.35  # Minimum relevance for cross-session results

[memory.semantic]
enabled = false               # Enable semantic search via Qdrant
recall_limit = 5              # Number of semantically relevant messages to inject

[tools]
enabled = true
summarize_output = false      # LLM-based summarization for long tool outputs

[tools.shell]
timeout = 30
blocked_commands = []
allowed_commands = []
allowed_paths = []          # Directories shell can access (empty = cwd only)
allow_network = true        # false blocks curl/wget/nc
confirm_patterns = ["rm ", "git push -f", "git push --force", "drop table", "drop database", "truncate "]

[tools.file]
allowed_paths = []          # Directories file tools can access (empty = cwd only)

# Pattern-based permissions per tool (optional; overrides legacy blocked_commands/confirm_patterns)
# [tools.permissions.bash]
# [[tools.permissions.bash]]
# pattern = "*sudo*"
# action = "deny"
# [[tools.permissions.bash]]
# pattern = "cargo *"
# action = "allow"
# [[tools.permissions.bash]]
# pattern = "*"
# action = "ask"

[tools.scrape]
timeout = 15
max_body_bytes = 1048576  # 1MB

[tools.audit]
enabled = false             # Structured JSON audit log for tool executions
destination = "stdout"      # "stdout" or file path

[security]
redact_secrets = true       # Redact API keys/tokens in LLM responses

[timeouts]
llm_seconds = 120           # LLM chat completion timeout
embedding_seconds = 30      # Embedding generation timeout
a2a_seconds = 30            # A2A remote call timeout

[vault]
backend = "env"  # "env" (default) or "age"; CLI --vault overrides this

[a2a]
enabled = false
host = "0.0.0.0"
port = 8080
# public_url = "https://agent.example.com"
# auth_token = "secret"
rate_limit = 60

Shell commands are sandboxed with path restrictions, network control, and destructive command confirmation. See Security for details.

Environment Variables

| Variable | Description |
|----------|-------------|
| ZEPH_LLM_PROVIDER | ollama, claude, openai, candle, or orchestrator |
| ZEPH_LLM_BASE_URL | Ollama API endpoint |
| ZEPH_LLM_MODEL | Model name for Ollama |
| ZEPH_LLM_EMBEDDING_MODEL | Embedding model for Ollama (default: qwen3-embedding) |
| ZEPH_CLAUDE_API_KEY | Anthropic API key (required for Claude) |
| ZEPH_OPENAI_API_KEY | OpenAI API key (required for OpenAI provider) |
| ZEPH_TELEGRAM_TOKEN | Telegram bot token (enables Telegram mode) |
| ZEPH_SQLITE_PATH | SQLite database path |
| ZEPH_QDRANT_URL | Qdrant server URL (default: http://localhost:6334) |
| ZEPH_MEMORY_SUMMARIZATION_THRESHOLD | Trigger summarization after N messages (default: 100) |
| ZEPH_MEMORY_CONTEXT_BUDGET_TOKENS | Context budget for proportional token allocation (default: 0 = unlimited) |
| ZEPH_MEMORY_COMPACTION_THRESHOLD | Compaction trigger threshold as fraction of context budget (default: 0.75) |
| ZEPH_MEMORY_COMPACTION_PRESERVE_TAIL | Messages preserved during compaction (default: 4) |
| ZEPH_MEMORY_PRUNE_PROTECT_TOKENS | Tokens protected from Tier 1 tool output pruning (default: 40000) |
| ZEPH_MEMORY_CROSS_SESSION_SCORE_THRESHOLD | Minimum relevance score for cross-session memory (default: 0.35) |
| ZEPH_MEMORY_SEMANTIC_ENABLED | Enable semantic memory with Qdrant (default: false) |
| ZEPH_MEMORY_RECALL_LIMIT | Max semantically relevant messages to recall (default: 5) |
| ZEPH_SKILLS_MAX_ACTIVE | Max skills per query via embedding match (default: 5) |
| ZEPH_AGENT_MAX_TOOL_ITERATIONS | Max tool loop iterations per response (default: 10) |
| ZEPH_TOOLS_SUMMARIZE_OUTPUT | Enable LLM-based tool output summarization (default: false) |
| ZEPH_TOOLS_TIMEOUT | Shell command timeout in seconds (default: 30) |
| ZEPH_TOOLS_SCRAPE_TIMEOUT | Web scrape request timeout in seconds (default: 15) |
| ZEPH_TOOLS_SCRAPE_MAX_BODY | Max response body size in bytes (default: 1048576) |
| ZEPH_A2A_ENABLED | Enable A2A server (default: false) |
| ZEPH_A2A_HOST | A2A server bind address (default: 0.0.0.0) |
| ZEPH_A2A_PORT | A2A server port (default: 8080) |
| ZEPH_A2A_PUBLIC_URL | Public URL for agent card discovery |
| ZEPH_A2A_AUTH_TOKEN | Bearer token for A2A server authentication |
| ZEPH_A2A_RATE_LIMIT | Max requests per IP per minute (default: 60) |
| ZEPH_A2A_REQUIRE_TLS | Require HTTPS for outbound A2A connections (default: true) |
| ZEPH_A2A_SSRF_PROTECTION | Block private/loopback IPs in A2A client (default: true) |
| ZEPH_A2A_MAX_BODY_SIZE | Max request body size in bytes (default: 1048576) |
| ZEPH_TOOLS_FILE_ALLOWED_PATHS | Comma-separated directories file tools can access (empty = cwd) |
| ZEPH_TOOLS_SHELL_ALLOWED_PATHS | Comma-separated directories shell can access (empty = cwd) |
| ZEPH_TOOLS_SHELL_ALLOW_NETWORK | Allow network commands from shell (default: true) |
| ZEPH_TOOLS_AUDIT_ENABLED | Enable audit logging for tool executions (default: false) |
| ZEPH_TOOLS_AUDIT_DESTINATION | Audit log destination: stdout or file path |
| ZEPH_SECURITY_REDACT_SECRETS | Redact secrets in LLM responses (default: true) |
| ZEPH_TIMEOUT_LLM | LLM call timeout in seconds (default: 120) |
| ZEPH_TIMEOUT_EMBEDDING | Embedding generation timeout in seconds (default: 30) |
| ZEPH_TIMEOUT_A2A | A2A remote call timeout in seconds (default: 30) |
| ZEPH_CONFIG | Path to config file (default: config/default.toml) |
| ZEPH_TUI | Enable TUI dashboard: true or 1 (requires tui feature) |

Skills

Zeph uses an embedding-based skill system that dramatically reduces token consumption: instead of injecting all skills into every prompt, only the top-K most relevant (default: 5) are selected per query via cosine similarity of vector embeddings. Combined with progressive loading (metadata at startup, bodies on activation, resources on demand), this keeps prompt size constant regardless of how many skills are installed.

How It Works

  1. You send a message — for example, “check disk usage on this server”
  2. Zeph embeds your query using the configured embedding model
  3. Top matching skills are selected — by default, the 5 most relevant ones ranked by vector similarity
  4. Selected skills are injected into the system prompt, giving Zeph specific instructions and examples for the task
  5. Zeph responds using the knowledge from matched skills

This happens automatically on every message. You don’t need to activate skills manually.
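
For illustration, skill selection can be thought of as a cosine-similarity top-K search over precomputed description embeddings. The sketch below is a simplified model of that step, assuming embeddings are plain f32 vectors; SkillMeta and select_top_k are illustrative names, not Zeph's actual API.

// Simplified top-K skill selection by cosine similarity (illustrative, not Zeph's code).
struct SkillMeta {
    name: String,
    embedding: Vec<f32>, // precomputed from the skill description
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

fn select_top_k<'a>(query: &[f32], skills: &'a [SkillMeta], k: usize) -> Vec<&'a SkillMeta> {
    let mut scored: Vec<(f32, &SkillMeta)> = skills
        .iter()
        .map(|s| (cosine(query, &s.embedding), s))
        .collect();
    scored.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap_or(std::cmp::Ordering::Equal));
    scored.into_iter().take(k).map(|(_, s)| s).collect()
}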

Matching Backends

Zeph supports two skill matching backends:

  • In-memory (default) — embeddings are computed on startup and matched via cosine similarity. No external dependencies required.
  • Qdrant — when semantic memory is enabled and Qdrant is reachable, skill embeddings are persisted in a zeph_skills collection. On startup, only changed skills are re-embedded using BLAKE3 content hash comparison. If Qdrant becomes unavailable, Zeph falls back to in-memory matching automatically.

The Qdrant backend significantly reduces startup time when you have many skills, since unchanged skills skip the embedding step entirely.
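
A minimal sketch of that content-hash comparison, assuming the blake3 crate; skill_changed is an illustrative helper, not part of Zeph's API.

// Re-embed only when the BLAKE3 digest of the skill body differs from the stored one.
fn skill_changed(body: &str, stored_hash_hex: &str) -> bool {
    blake3::hash(body.as_bytes()).to_hex().as_str() != stored_hash_hex
}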

Bundled Skills

| Skill | Description |
|-------|-------------|
| api-request | HTTP API requests using curl — GET, POST, PUT, DELETE with headers and JSON |
| docker | Docker container operations — build, run, ps, logs, compose |
| file-ops | File system operations — list, search, read, and analyze files |
| git | Git version control — status, log, diff, commit, branch management |
| mcp-generate | Generate MCP-to-skill bridges for external tool servers |
| setup-guide | Configuration reference — LLM providers, memory, tools, and operating modes |
| skill-audit | Spec compliance and security review of installed skills |
| skill-creator | Create new skills following the agentskills.io specification |
| system-info | System diagnostics — OS, disk, memory, processes, uptime |
| web-scrape | Extract structured data from web pages using CSS selectors |
| web-search | Search the internet for current information |

Use /skills in chat to see all available skills and their usage statistics.

Creating Custom Skills

A skill is a single SKILL.md file inside a named directory:

skills/
└── my-skill/
    └── SKILL.md

SKILL.md Format

Each file has two parts: a YAML header and a markdown body.

---
name: my-skill
description: Short description of what this skill does.
---
# My Skill

Instructions and examples go here.

Header fields:

| Field | Required | Description |
|-------|----------|-------------|
| name | Yes | Unique identifier (1-64 chars, lowercase, hyphens allowed) |
| description | Yes | Used for embedding-based matching against user queries |
| compatibility | No | Runtime requirements (e.g., “requires curl”) |
| license | No | Skill license |
| allowed-tools | No | Comma-separated tool names this skill can use |
| metadata | No | Arbitrary key-value pairs for forward compatibility |

Body: markdown with instructions, code examples, or reference material. Injected verbatim into the LLM context when the skill is selected.

Skill Resources

Skills can include additional resource directories:

skills/
└── system-info/
    ├── SKILL.md
    └── references/
        ├── linux.md
        ├── macos.md
        └── windows.md

Resources in scripts/, references/, and assets/ are loaded on demand with path traversal protection. OS-specific reference files (named linux.md, macos.md, windows.md) are automatically filtered by the current platform.

Name Validation

Skill names must be 1-64 characters, lowercase letters/numbers/hyphens only, no leading/trailing/consecutive hyphens, and must match the directory name.
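
A hedged sketch of these rules as a validation function (is_valid_skill_name is an illustrative name, not Zeph's implementation):

// Validates the naming rules described above.
fn is_valid_skill_name(name: &str, dir_name: &str) -> bool {
    let length_ok = (1..=64).contains(&name.len());
    let charset_ok = name
        .chars()
        .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-');
    let hyphens_ok = !name.starts_with('-') && !name.ends_with('-') && !name.contains("--");
    length_ok && charset_ok && hyphens_ok && name == dir_name
}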

Configuration

Skill Paths

By default, Zeph scans ./skills for skill directories. Add more paths in config:

[skills]
paths = ["./skills", "/home/user/my-skills"]

If a skill with the same name appears in multiple paths, the first one found takes priority.

Max Active Skills

Control how many skills are injected per query:

[skills]
max_active_skills = 5

Or via environment variable:

export ZEPH_SKILLS_MAX_ACTIVE=5

Lower values reduce prompt size but may miss relevant skills. Higher values include more context but use more tokens.

Progressive Loading

Only metadata (~100 tokens per skill) is loaded at startup for embedding and matching. Full body (<5000 tokens) is loaded lazily on first activation and cached via OnceLock. Resource files are loaded on demand.

With 50+ skills installed, a typical prompt still contains only 5 — saving thousands of tokens per request compared to naive full-injection approaches.

Hot Reload

SKILL.md file changes are detected via filesystem watcher (500ms debounce) and re-embedded without restart. Cached bodies are invalidated on reload.

With the Qdrant backend, hot-reload triggers a delta sync — only modified skills are re-embedded and updated in the collection.

Semantic Memory

Enable semantic search to retrieve contextually relevant messages from conversation history using vector similarity.

Requires an embedding model. Ollama with qwen3-embedding is the default. Claude API does not support embeddings natively — use the orchestrator to route embeddings through Ollama while using Claude for chat.

Setup

  1. Start Qdrant:

    docker compose up -d qdrant
    
  2. Enable semantic memory in config:

    [memory.semantic]
    enabled = true
    recall_limit = 5
    
  3. Automatic setup: Qdrant collection (zeph_conversations) is created automatically on first use with correct vector dimensions (1024 for qwen3-embedding) and Cosine distance metric. No manual initialization required.

How It Works

  • Automatic embedding: Messages are embedded asynchronously using the configured embedding_model and stored in Qdrant alongside SQLite.
  • Semantic recall: Context builder injects semantically relevant messages from full history, not just recent messages.
  • Graceful degradation: If Qdrant is unavailable, Zeph falls back to SQLite-only mode (recency-based history).
  • Startup backfill: On startup, if Qdrant is available, Zeph calls embed_missing() to backfill embeddings for any messages stored while Qdrant was offline. This ensures the vector index stays in sync with SQLite without manual intervention.

Storage Architecture

| Store | Purpose |
|-------|---------|
| SQLite | Source of truth for message text, conversations, summaries, skill usage |
| Qdrant | Vector index for semantic similarity search (embeddings only) |

Both stores work together: SQLite holds the data, Qdrant enables vector search over it. The embeddings_metadata table in SQLite maps message IDs to Qdrant point IDs.

Context Engineering

Zeph’s context engineering pipeline manages how information flows into the LLM context window. It combines semantic recall, proportional budget allocation, message trimming, environment injection, tool output management, and runtime compaction into a unified system.

All context engineering features are disabled by default (context_budget_tokens = 0). Set a non-zero budget or enable auto_budget = true to activate the pipeline.

Configuration

[memory]
context_budget_tokens = 128000    # Set to your model's context window size (0 = unlimited)
compaction_threshold = 0.75       # Compact when usage exceeds this fraction
compaction_preserve_tail = 4      # Keep last N messages during compaction
prune_protect_tokens = 40000      # Protect recent N tokens from Tier 1 tool output pruning
cross_session_score_threshold = 0.35  # Minimum relevance for cross-session results (0.0-1.0)

[memory.semantic]
enabled = true                    # Required for semantic recall
recall_limit = 5                  # Max semantically relevant messages to inject

[tools]
summarize_output = false          # Enable LLM-based tool output summarization

Context Window Layout

When context_budget_tokens > 0, the context window is structured as:

┌─────────────────────────────────────────────────┐
│ BASE_PROMPT (identity + guidelines + security)  │  ~300 tokens
├─────────────────────────────────────────────────┤
│ <environment> cwd, git branch, os, model        │  ~50 tokens
├─────────────────────────────────────────────────┤
│ <project_context> ZEPH.md contents              │  0-500 tokens
├─────────────────────────────────────────────────┤
│ <repo_map> structural overview (if index on)    │  0-1024 tokens
├─────────────────────────────────────────────────┤
│ <available_skills> matched skills (full body)   │  200-2000 tokens
│ <other_skills> remaining (description-only)     │  50-200 tokens
├─────────────────────────────────────────────────┤
│ <code_context> RAG chunks (if index on)         │  30% of available
├─────────────────────────────────────────────────┤
│ [semantic recall] relevant past messages        │  10-25% of available
├─────────────────────────────────────────────────┤
│ [compaction summary] if compacted               │  200-500 tokens
├─────────────────────────────────────────────────┤
│ Recent message history                          │  50-60% of available
├─────────────────────────────────────────────────┤
│ [reserved for response generation]              │  20% of total
└─────────────────────────────────────────────────┘

Proportional Budget Allocation

Available tokens (after reserving 20% for response) are split proportionally. When code indexing is enabled, the code context slot takes a share from summaries, recall, and history:

| Allocation | Without code index | With code index | Purpose |
|------------|--------------------|-----------------|---------|
| Summaries | 15% | 10% | Conversation summaries from SQLite |
| Semantic recall | 25% | 10% | Relevant messages from past conversations via Qdrant |
| Code context | | 30% | Retrieved code chunks from project index |
| Recent history | 60% | 50% | Most recent messages in current conversation |
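
As a rough model of the allocation above (percentages taken from the table, names illustrative, integer rounding ignored):

// Split the configured budget into the proportional slots described above.
struct ContextBudget {
    summaries: usize,
    recall: usize,
    code: usize,
    history: usize,
}

fn split_budget(total: usize, code_index_enabled: bool) -> ContextBudget {
    let available = total - total / 5; // reserve 20% of the total for the response
    if code_index_enabled {
        ContextBudget {
            summaries: available * 10 / 100,
            recall: available * 10 / 100,
            code: available * 30 / 100,
            history: available * 50 / 100,
        }
    } else {
        ContextBudget {
            summaries: available * 15 / 100,
            recall: available * 25 / 100,
            code: 0,
            history: available * 60 / 100,
        }
    }
}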

Semantic Recall Injection

When semantic memory is enabled, the agent queries Qdrant for messages relevant to the current user query. Results are injected as transient system messages (prefixed with [semantic recall]) that are:

  • Removed and re-injected on every turn (never stale)
  • Not persisted to SQLite
  • Bounded by the allocated token budget (25%, or 10% when code indexing is enabled)

Requires Qdrant and memory.semantic.enabled = true.

Message History Trimming

When recent messages exceed the 60% budget allocation, the oldest non-system messages are evicted. The system prompt and most recent messages are always preserved.

Environment Context

Every system prompt rebuild injects an <environment> block with:

  • Working directory
  • OS (linux, macos, windows)
  • Current git branch (if in a git repo)
  • Active model name

Two-Tier Context Pruning

When total message tokens exceed compaction_threshold (default: 75%) of the context budget, a two-tier pruning strategy activates:

Tier 1: Selective Tool Output Pruning

Before invoking the LLM for compaction, Zeph scans messages outside the protected tail for ToolOutput parts and replaces their content with a short placeholder. This is a cheap, synchronous operation that often frees enough tokens to stay under the threshold without an LLM call.

  • Only tool outputs in messages older than the protected tail are pruned
  • The most recent prune_protect_tokens tokens (default: 40,000) worth of messages are never pruned, preserving recent tool context
  • Pruned parts have their compacted_at timestamp set, body is cleared from memory to reclaim heap, and they are not pruned again
  • Pruned parts are persisted to SQLite before clearing, so pruning state survives session restarts
  • The tool_output_prunes metric tracks how many parts were pruned

Tier 2: LLM Compaction (Fallback)

If Tier 1 does not free enough tokens, the standard LLM compaction runs:

  1. Middle messages (between system prompt and last N recent) are extracted
  2. Sent to the LLM with a structured summarization prompt
  3. Replaced with a single summary message
  4. Last compaction_preserve_tail messages (default: 4) are always preserved

Both tiers are idempotent and run automatically during the agent loop.

Tool Output Management

Truncation

Tool outputs exceeding 30,000 characters are automatically truncated using a head+tail split with UTF-8 safe boundaries. Both the first and last ~15K chars are preserved.
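
The sketch below models that behavior under simple assumptions (character-based split, illustrative function name); it is not the actual implementation.

// Keep the head and tail halves of an over-long output, cutting on char boundaries
// so the result stays valid UTF-8.
fn truncate_head_tail(s: &str, max_chars: usize) -> String {
    if s.chars().count() <= max_chars {
        return s.to_string();
    }
    let half = max_chars / 2;
    let head: String = s.chars().take(half).collect();
    let tail: String = s
        .chars()
        .rev()
        .take(half)
        .collect::<Vec<_>>()
        .into_iter()
        .rev()
        .collect();
    format!("{head}\n[... output truncated ...]\n{tail}")
}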

Smart Summarization

When tools.summarize_output = true, long tool outputs are sent through the LLM with a prompt that preserves file paths, error messages, and numeric values. On LLM failure, falls back to truncation.

export ZEPH_TOOLS_SUMMARIZE_OUTPUT=true

Progressive Skill Loading

Skills matched by embedding similarity (top-K) are injected with their full body. Remaining skills are listed in a description-only <other_skills> catalog — giving the model awareness of all capabilities while consuming minimal tokens.

ZEPH.md Project Config

Zeph walks up the directory tree from the current working directory looking for:

  • ZEPH.md
  • ZEPH.local.md
  • .zeph/config.md

Found configs are concatenated (global first, then ancestors from root to cwd) and injected into the system prompt as a <project_context> block. Use this to provide project-specific instructions.

Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| ZEPH_MEMORY_CONTEXT_BUDGET_TOKENS | Context budget in tokens | 0 (unlimited) |
| ZEPH_MEMORY_COMPACTION_THRESHOLD | Compaction trigger threshold | 0.75 |
| ZEPH_MEMORY_COMPACTION_PRESERVE_TAIL | Messages preserved during compaction | 4 |
| ZEPH_MEMORY_PRUNE_PROTECT_TOKENS | Tokens protected from Tier 1 tool output pruning | 40000 |
| ZEPH_MEMORY_CROSS_SESSION_SCORE_THRESHOLD | Minimum relevance score for cross-session memory results | 0.35 |
| ZEPH_TOOLS_SUMMARIZE_OUTPUT | Enable LLM-based tool output summarization | false |

Conversation Summarization

Automatically compress long conversation histories using LLM-based summarization to stay within context budget limits.

Requires an LLM provider (Ollama or Claude). Set context_budget_tokens = 0 to disable proportional allocation and use unlimited context.

For the full context management pipeline (semantic recall, message trimming, compaction, tool output management), see Context Engineering.

Configuration

[memory]
summarization_threshold = 100
context_budget_tokens = 8000  # Set to LLM context window size (0 = unlimited)

How It Works

  • Triggered when message count exceeds summarization_threshold (default: 100)
  • Summaries stored in SQLite with token estimates
  • Batch size = threshold/2 to balance summary quality with LLM call frequency
  • Context builder allocates proportional token budget:
    • 15% for summaries
    • 25% for semantic recall (if enabled)
    • 60% for recent message history

Token Estimation

Token counts are estimated using a chars/4 heuristic (100x faster than tiktoken, ±25% accuracy). This is sufficient for proportional budget allocation where exact counts are not critical.
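
The heuristic amounts to a one-line helper (illustrative name):

// Rough token estimate: roughly four characters per token.
fn estimate_tokens(text: &str) -> usize {
    text.chars().count().div_ceil(4)
}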

Docker Deployment

Docker Compose automatically pulls the latest image from GitHub Container Registry. To use a specific version, set ZEPH_IMAGE=ghcr.io/bug-ops/zeph:v0.9.5.

Quick Start (Ollama + Qdrant in containers)

# Pull Ollama models first
docker compose --profile cpu run --rm ollama ollama pull mistral:7b
docker compose --profile cpu run --rm ollama ollama pull qwen3-embedding

# Start all services
docker compose --profile cpu up

Apple Silicon (Ollama on host with Metal GPU)

# Use Ollama on macOS host for Metal GPU acceleration
ollama pull mistral:7b
ollama pull qwen3-embedding
ollama serve &

# Start Zeph + Qdrant, connect to host Ollama
ZEPH_LLM_BASE_URL=http://host.docker.internal:11434 docker compose up

Linux with NVIDIA GPU

# Pull models first
docker compose --profile gpu run --rm ollama ollama pull mistral:7b
docker compose --profile gpu run --rm ollama ollama pull qwen3-embedding

# Start all services with GPU
docker compose --profile gpu -f docker-compose.yml -f docker-compose.gpu.yml up

Age Vault (Encrypted Secrets)

# Mount key and vault files into container
docker compose -f docker-compose.yml -f docker-compose.vault.yml up

Override file paths via environment variables:

ZEPH_VAULT_KEY=./my-key.txt ZEPH_VAULT_PATH=./my-secrets.age \
  docker compose -f docker-compose.yml -f docker-compose.vault.yml up

The image must be built with vault-age feature enabled. For local builds, use CARGO_FEATURES=vault-age with docker-compose.dev.yml.

Using Specific Version

# Use a specific release version
ZEPH_IMAGE=ghcr.io/bug-ops/zeph:v0.9.5 docker compose up

# Always pull latest
docker compose pull && docker compose up

Local Development

Full stack with debug tracing (builds from source via Dockerfile.dev, uses host Ollama via host.docker.internal):

# Build and start Qdrant + Zeph with debug logging
docker compose -f docker-compose.dev.yml up --build

# Build with optional features (e.g. vault-age, candle)
CARGO_FEATURES=vault-age docker compose -f docker-compose.dev.yml up --build

# Build with vault-age and mount vault files
CARGO_FEATURES=vault-age \
  docker compose -f docker-compose.dev.yml -f docker-compose.vault.yml up --build

Dependencies only (run zeph natively on host):

# Start Qdrant
docker compose -f docker-compose.deps.yml up

# Run zeph natively with debug tracing
RUST_LOG=zeph=debug,zeph_channels=trace cargo run

MCP Integration

Connect external tool servers via Model Context Protocol (MCP). Tools are discovered, embedded, and matched alongside skills using the same cosine similarity pipeline — only relevant MCP tools are injected into the prompt, so adding more servers does not inflate token usage.

Configuration

Stdio Transport (spawn child process)

[[mcp.servers]]
id = "filesystem"
command = "npx"
args = ["-y", "@anthropic/mcp-filesystem"]

HTTP Transport (remote server)

[[mcp.servers]]
id = "remote-tools"
url = "http://localhost:8080/mcp"

Security

[mcp]
allowed_commands = ["npx", "uvx", "node", "python", "python3"]
max_dynamic_servers = 10

allowed_commands restricts which binaries can be spawned as MCP servers. max_dynamic_servers limits the number of servers added at runtime.

Dynamic Management

Add and remove MCP servers at runtime via chat commands:

/mcp add filesystem npx -y @anthropic/mcp-filesystem
/mcp add remote-api http://localhost:8080/mcp
/mcp list
/mcp remove filesystem

After adding or removing a server, the Qdrant registry syncs automatically for semantic tool matching.

How Matching Works

MCP tools are embedded in Qdrant (zeph_mcp_tools collection) with BLAKE3 content-hash delta sync. Unified matching injects both skills and MCP tools into the system prompt by relevance score — keeping prompt size O(K) instead of O(N) where N is total tools across all servers.

OpenAI Provider

Use the OpenAI provider to connect to OpenAI API or any OpenAI-compatible service (Together AI, Groq, Fireworks, Perplexity).

ZEPH_LLM_PROVIDER=openai ZEPH_OPENAI_API_KEY=sk-... ./target/release/zeph

Configuration

[llm]
provider = "openai"

[llm.openai]
base_url = "https://api.openai.com/v1"
model = "gpt-5.2"
max_tokens = 4096
embedding_model = "text-embedding-3-small"   # optional, enables vector embeddings
reasoning_effort = "medium"                  # optional: low, medium, high (for o3, etc.)

Compatible APIs

Change base_url to point to any OpenAI-compatible API:

# Together AI
base_url = "https://api.together.xyz/v1"

# Groq
base_url = "https://api.groq.com/openai/v1"

# Fireworks
base_url = "https://api.fireworks.ai/inference/v1"

Embeddings

When embedding_model is set, Qdrant subsystems automatically use it for skill matching and semantic memory instead of the global llm.embedding_model.

Reasoning Models

Set reasoning_effort to control token budget for reasoning models like o3:

  • low — fast responses, less reasoning
  • medium — balanced
  • high — thorough reasoning, more tokens

Local Inference (Candle)

Run HuggingFace GGUF models locally via candle without external API dependencies. Metal and CUDA GPU acceleration are supported.

cargo build --release --features candle,metal  # macOS with Metal GPU

Configuration

[llm]
provider = "candle"

[llm.candle]
source = "huggingface"
repo_id = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
filename = "mistral-7b-instruct-v0.2.Q4_K_M.gguf"
chat_template = "mistral"          # llama3, chatml, mistral, phi3, raw
embedding_repo = "sentence-transformers/all-MiniLM-L6-v2"  # optional BERT embeddings

[llm.candle.generation]
temperature = 0.7
top_p = 0.9
top_k = 40
max_tokens = 2048
repeat_penalty = 1.1

Chat Templates

| Template | Models |
|----------|--------|
| llama3 | Llama 3, Llama 3.1 |
| chatml | Qwen, Yi, OpenHermes |
| mistral | Mistral, Mixtral |
| phi3 | Phi-3 |
| raw | No template (raw completion) |

Device Auto-Detection

  • macOS — Metal GPU (requires --features metal)
  • Linux with NVIDIA — CUDA (requires --features cuda)
  • Fallback — CPU

Model Orchestrator

Route tasks to different LLM providers based on content classification. Each task type maps to a provider chain with automatic fallback. Use the orchestrator to combine local and cloud models — for example, embeddings via Ollama and chat via Claude.

Configuration

[llm]
provider = "orchestrator"

[llm.orchestrator]
default = "claude"
embed = "ollama"

[llm.orchestrator.providers.ollama]
provider_type = "ollama"

[llm.orchestrator.providers.claude]
provider_type = "claude"

[llm.orchestrator.routes]
coding = ["claude", "ollama"]       # try Claude first, fallback to Ollama
creative = ["claude"]               # cloud only
analysis = ["claude", "ollama"]     # prefer cloud
general = ["claude"]                # cloud only

Provider Keys

  • default — provider for chat when no specific route matches
  • embed — provider for all embedding operations (skill matching, semantic memory)

Task Classification

Task types are classified via keyword heuristics:

| Task Type | Keywords |
|-----------|----------|
| coding | code, function, debug, refactor, implement |
| creative | write, story, poem, creative |
| analysis | analyze, compare, evaluate |
| translation | translate, convert language |
| summarization | summarize, summary, tldr |
| general | everything else |

Fallback Chains

Routes define provider preference order. If the first provider fails, the next one in the list is tried automatically.

coding = ["local", "cloud"]  # try local first, fallback to cloud

Hybrid Setup Example

Embeddings via free local Ollama, chat via paid Claude API:

[llm]
provider = "orchestrator"

[llm.orchestrator]
default = "claude"
embed = "ollama"

[llm.orchestrator.providers.ollama]
provider_type = "ollama"

[llm.orchestrator.providers.claude]
provider_type = "claude"

[llm.orchestrator.routes]
general = ["claude"]

Self-Learning Skills

Automatically improve skills based on execution outcomes. When a skill fails repeatedly, Zeph uses self-reflection and LLM-generated improvements to create better skill versions.

Configuration

[skills.learning]
enabled = true
auto_activate = false     # require manual approval for new versions
min_failures = 3          # failures before triggering improvement
improve_threshold = 0.7   # success rate below which improvement starts
rollback_threshold = 0.5  # auto-rollback when success rate drops below this
min_evaluations = 5       # minimum evaluations before rollback decision
max_versions = 10         # max auto-generated versions per skill
cooldown_minutes = 60     # cooldown between improvements for same skill

How It Works

  1. Each skill invocation is tracked as success or failure
  2. When a skill’s success rate drops below improve_threshold, Zeph triggers self-reflection
  3. The agent retries with adjusted context (1 retry per message)
  4. If failures persist beyond min_failures, the LLM generates an improved skill version
  5. New versions can be auto-activated or held for manual approval
  6. If an activated version performs worse than rollback_threshold, automatic rollback occurs

Chat Commands

| Command | Description |
|---------|-------------|
| /skill stats | View execution metrics per skill |
| /skill versions | List auto-generated versions |
| /skill activate <id> | Activate a specific version |
| /skill approve <id> | Approve a pending version |
| /skill reset <name> | Revert to original version |
| /feedback | Provide explicit quality feedback |

Set auto_activate = false (default) to review and manually approve LLM-generated skill improvements before they go live.

Skill versions and outcomes are stored in SQLite (skill_versions and skill_outcomes tables).

A2A Protocol

Zeph includes an embedded A2A protocol server for agent-to-agent communication. When enabled, other agents can discover and interact with Zeph via the standard A2A JSON-RPC 2.0 API.

Quick Start

ZEPH_A2A_ENABLED=true ZEPH_A2A_AUTH_TOKEN=secret ./target/release/zeph

Endpoints

| Endpoint | Description | Auth |
|----------|-------------|------|
| /.well-known/agent-card.json | Agent discovery | Public (no auth) |
| /a2a | JSON-RPC endpoint (message/send, tasks/get, tasks/cancel) | Bearer token |
| /a2a/stream | SSE streaming endpoint | Bearer token |

Set ZEPH_A2A_AUTH_TOKEN to secure the server with bearer token authentication. The agent card endpoint remains public per A2A spec.
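
For illustration, a client call to the /a2a endpoint might look like the sketch below. It assumes the tokio, reqwest (with the json feature), and serde_json crates, and the payload shape is simplified; consult the A2A specification for the full message/send schema.

use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    // Simplified message/send request; field names follow the A2A spec loosely.
    let body = json!({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "message/send",
        "params": {
            "message": {
                "role": "user",
                "parts": [{ "kind": "text", "text": "Hello from another agent" }]
            }
        }
    });

    let resp = reqwest::Client::new()
        .post("http://localhost:8080/a2a")
        .bearer_auth("secret") // the ZEPH_A2A_AUTH_TOKEN value
        .json(&body)
        .send()
        .await?;
    println!("{}", resp.text().await?);
    Ok(())
}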

Configuration

[a2a]
enabled = true
host = "0.0.0.0"
port = 8080
public_url = "https://agent.example.com"
auth_token = "secret"
rate_limit = 60

Network Security

  • TLS enforcement: a2a.require_tls = true rejects HTTP endpoints (HTTPS only)
  • SSRF protection: a2a.ssrf_protection = true blocks private IP ranges (RFC 1918, loopback, link-local) via DNS resolution
  • Payload limits: a2a.max_body_size caps request body (default: 1 MiB)
  • Rate limiting: per-IP sliding window (default: 60 requests/minute)

Task Processing

Incoming message/send requests are routed through AgentTaskProcessor, which forwards the message to the configured LLM provider for real inference. The processor creates a task, sends the user message to the LLM, and returns the model response as a completed A2A task artifact.

Current limitation: the A2A task processor runs inference only (no tool execution or memory context).

A2A Client

Zeph can also connect to other A2A agents as a client:

  • A2aClient wraps reqwest, uses JSON-RPC 2.0 for all RPC calls
  • AgentRegistry with TTL-based cache for agent card discovery
  • SSE streaming via eventsource-stream for real-time task updates
  • Bearer token auth passed per-call to all client methods

Secrets Management

Zeph resolves secrets (ZEPH_CLAUDE_API_KEY, ZEPH_OPENAI_API_KEY, ZEPH_TELEGRAM_TOKEN, ZEPH_A2A_AUTH_TOKEN) through a pluggable VaultProvider with redacted debug output via the Secret newtype.

Never commit secrets to version control. Use environment variables or age-encrypted vault files.

Backend Selection

The vault backend is determined by the following priority (highest to lowest):

  1. CLI flag: --vault env or --vault age
  2. Environment variable: ZEPH_VAULT_BACKEND
  3. Config file: vault.backend in TOML config
  4. Default: "env"

Backends

| Backend | Description | Activation |
|---------|-------------|------------|
| env (default) | Read secrets from environment variables | --vault env or omit |
| age | Decrypt age-encrypted JSON vault file at startup | --vault age --vault-key <identity> --vault-path <vault.age> |

Environment Variables (default)

Export secrets as environment variables:

export ZEPH_CLAUDE_API_KEY=sk-ant-...
export ZEPH_TELEGRAM_TOKEN=123:ABC
./target/release/zeph

Age Vault

For production deployments, encrypt secrets with age:

# Generate an age identity key
age-keygen -o key.txt

# Create a JSON secrets file and encrypt it
echo '{"ZEPH_CLAUDE_API_KEY":"sk-...","ZEPH_TELEGRAM_TOKEN":"123:ABC"}' | \
  age -r $(grep 'public key' key.txt | awk '{print $NF}') -o secrets.age

# Run with age vault
cargo build --release --features vault-age
./target/release/zeph --vault age --vault-key key.txt --vault-path secrets.age

The vault-age feature flag is enabled by default. When building with --no-default-features, add vault-age explicitly if needed.

Docker

Mount key and vault files into the container:

docker compose -f docker-compose.yml -f docker-compose.vault.yml up

Override paths:

ZEPH_VAULT_KEY=./my-key.txt ZEPH_VAULT_PATH=./my-secrets.age \
  docker compose -f docker-compose.yml -f docker-compose.vault.yml up

Channels

Zeph supports multiple I/O channels for interacting with the agent. Each channel implements the Channel trait and can be selected at runtime based on configuration or CLI flags.

Available Channels

| Channel | Activation | Streaming | Confirmation |
|---------|------------|-----------|--------------|
| CLI | Default (no config needed) | Token-by-token to stdout | y/N prompt |
| Telegram | ZEPH_TELEGRAM_TOKEN env var or [telegram] config | Edit-in-place every 10s | Reply “yes” to confirm |
| TUI | --tui flag or ZEPH_TUI=true (requires tui feature) | Real-time in chat panel | Auto-confirm (Phase 1) |

CLI Channel

The default channel. Reads from stdin, writes to stdout with immediate streaming output.

./zeph

No configuration required. Supports all slash commands (/skills, /mcp, /reset).

Telegram Channel

Run Zeph as a Telegram bot with streaming responses, MarkdownV2 formatting, and user whitelisting.

Setup

  1. Create a bot via @BotFather:

    • Send /newbot and follow the prompts
    • Copy the bot token (e.g., 123456:ABC-DEF1234ghIkl-zyx57W2v1u123ew11)
  2. Configure the token via environment variable or config file:

    # Environment variable
    ZEPH_TELEGRAM_TOKEN="123456:ABC-DEF1234ghIkl-zyx57W2v1u123ew11" ./zeph
    

    Or in config/default.toml:

    [telegram]
    allowed_users = ["your_username"]
    

    The token can also be stored in the age-encrypted vault:

    # Store in vault
    ZEPH_TELEGRAM_TOKEN=your-token
    

The token is resolved via the vault provider (ZEPH_TELEGRAM_TOKEN secret). When using the env vault backend (default), set the environment variable directly. With the age backend, store it in the encrypted vault file.

User Whitelisting

Restrict bot access to specific Telegram usernames:

[telegram]
allowed_users = ["alice", "bob"]

When allowed_users is empty, the bot accepts messages from all users. Messages from unauthorized users are silently rejected with a warning log.

Bot Commands

| Command | Description |
|---------|-------------|
| /start | Welcome message |
| /reset | Reset conversation context |
| /skills | List loaded skills |

Streaming Behavior

Telegram has API rate limits, so streaming works differently from CLI:

  • First chunk sends a new message immediately
  • Subsequent chunks edit the existing message in-place
  • Updates are throttled to one edit per 10 seconds to respect Telegram rate limits
  • On flush, a final edit delivers the complete response
  • Long messages (>4096 chars) are automatically split into multiple messages

MarkdownV2 Formatting

LLM responses are automatically converted from standard Markdown to Telegram’s MarkdownV2 format. Code blocks, bold, italic, and inline code are preserved. Special characters are escaped to prevent formatting errors.

Confirmation Prompts

When the agent needs user confirmation (e.g., destructive shell commands), Telegram sends a text prompt asking the user to reply “yes” to confirm.

TUI Dashboard

A rich terminal interface based on ratatui with real-time agent metrics. Requires the tui feature flag.

cargo build --release --features tui
./zeph --tui

See TUI Dashboard for full documentation including keybindings, layout, and architecture.

Message Queueing

Zeph maintains a bounded FIFO message queue (maximum 10 messages) to handle user input received during model inference. Queue behavior varies by channel:

CLI Channel

Blocking stdin read — the queue is always empty. CLI users cannot send messages while the agent is responding.

Telegram Channel

New messages are queued via an internal mpsc channel. Consecutive messages arriving within 500ms are automatically merged with a newline separator to reduce context fragmentation.

Use /clear-queue to discard queued messages.

TUI Channel

The input line remains interactive during model inference. Messages are queued in-order and drained after each response completes.

  • Queue badge: [+N queued] appears in the input area when messages are pending
  • Clear queue: Press Ctrl+K to discard all queued messages
  • Merging: Consecutive messages within 500ms are merged by newline

When the queue is full (10 messages), new input is silently dropped until space becomes available.

Channel Selection Logic

Zeph selects the channel at startup based on the following priority:

  1. --tui flag or ZEPH_TUI=true → TUI channel (requires tui feature)
  2. ZEPH_TELEGRAM_TOKEN set → Telegram channel
  3. Otherwise → CLI channel

Only one channel is active per session.

Tool System

Zeph provides a typed tool system that gives the LLM structured access to file operations, shell commands, and web scraping. Each executor owns its tool definitions with schemas derived from Rust structs via schemars, ensuring a single source of truth between deserialization and prompt generation.

Tool Registry

Each tool executor declares its definitions via tool_definitions(). On every LLM turn the agent collects all definitions into a ToolRegistry and renders them into the system prompt as a <tools> catalog. Tool parameter schemas are auto-generated from Rust structs using #[derive(JsonSchema)] from the schemars crate.
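
As a sketch of that pattern (ReadParams is an illustrative struct, not Zeph's actual type, and the example assumes the schemars, serde, and serde_json crates), a single Rust struct can drive both deserialization of tool calls and the schema rendered for the LLM:

use schemars::{schema_for, JsonSchema};
use serde::Deserialize;

// Parameters for a hypothetical "read" tool; doc comments become schema descriptions.
#[derive(Debug, Deserialize, JsonSchema)]
struct ReadParams {
    /// Path of the file to read
    path: String,
    /// Optional line offset to start from
    offset: Option<u64>,
    /// Optional maximum number of lines to return
    limit: Option<u64>,
}

fn main() {
    // The generated JSON schema is the kind of material a <tools> catalog entry is built from.
    let schema = schema_for!(ReadParams);
    println!("{}", serde_json::to_string_pretty(&schema).unwrap());
}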

| Tool ID | Description | Invocation | Required Parameters | Optional Parameters |
|---------|-------------|------------|---------------------|---------------------|
| bash | Execute a shell command | ```bash | command (string) | |
| read | Read file contents | ToolCall | path (string) | offset (integer), limit (integer) |
| edit | Replace a string in a file | ToolCall | path (string), old_string (string), new_string (string) | |
| write | Write content to a file | ToolCall | path (string), content (string) | |
| glob | Find files matching a glob pattern | ToolCall | pattern (string) | |
| grep | Search file contents with regex | ToolCall | pattern (string) | path (string), case_sensitive (boolean) |
| web_scrape | Scrape data from a web page via CSS selectors | ```scrape | url (string), select (string) | extract (string), limit (integer) |

FileExecutor

FileExecutor handles the file-oriented tools (read, write, edit, glob, grep) in a sandboxed environment. All file paths are validated against an allowlist before any I/O operation.

  • If allowed_paths is empty, the sandbox defaults to the current working directory.
  • Paths are resolved via ancestor-walk canonicalization to prevent traversal attacks on non-existing paths.
  • glob results are filtered post-match to exclude files outside the sandbox.
  • grep validates the search directory before scanning.

See Security for details on the path validation mechanism.

Dual-Mode Execution

The agent loop supports two tool invocation modes, distinguished by InvocationHint on each ToolDef:

  1. Fenced block (InvocationHint::FencedBlock("bash") / FencedBlock("scrape")) — the LLM emits a fenced code block with the specified tag. ShellExecutor handles ```bash blocks, WebScrapeExecutor handles ```scrape blocks containing JSON with CSS selectors.
  2. Structured tool call (InvocationHint::ToolCall) — the LLM emits a ToolCall with tool_id and typed params. CompositeExecutor routes the call to FileExecutor for file tools.

Both modes coexist in the same iteration. The system prompt includes invocation instructions per tool so the LLM knows exactly which format to use.

Iteration Control

The agent loop iterates tool execution until the LLM produces a response with no tool invocations, or one of the safety limits is hit.

Iteration cap

Controlled by max_tool_iterations (default: 10). The previous hardcoded limit of 3 is replaced by this configurable value.

[agent]
max_tool_iterations = 10

Environment variable: ZEPH_AGENT_MAX_TOOL_ITERATIONS.

Doom-loop detection

If 3 consecutive tool iterations produce identical output strings, the loop breaks and the agent notifies the user. This prevents infinite loops where the LLM repeatedly issues the same failing command.
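
A minimal model of that check (illustrative helper, not the real agent loop):

// True when the last `window` tool outputs are all identical.
fn is_doom_loop(recent_outputs: &[String], window: usize) -> bool {
    if window == 0 || recent_outputs.len() < window {
        return false;
    }
    let last = &recent_outputs[recent_outputs.len() - 1];
    recent_outputs.iter().rev().take(window).all(|o| o == last)
}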

Context budget check

At the start of each iteration, the agent estimates total token usage. If usage exceeds 80% of the configured context_budget_tokens, the loop stops to avoid exceeding the model’s context window.

Permissions

The [tools.permissions] section defines pattern-based access control per tool. Each tool ID maps to an ordered array of rules. Rules use glob patterns matched case-insensitively against the tool input (command string for bash, file path for file tools). First matching rule wins; if no rule matches, the default action is Ask.

Three actions are available:

| Action | Behavior |
|--------|----------|
| allow | Execute silently without confirmation |
| ask | Prompt the user for confirmation before execution |
| deny | Block execution; denied tools are hidden from the LLM system prompt |

[tools.permissions.bash]
[[tools.permissions.bash]]
pattern = "*sudo*"
action = "deny"

[[tools.permissions.bash]]
pattern = "cargo *"
action = "allow"

[[tools.permissions.bash]]
pattern = "*"
action = "ask"

When [tools.permissions] is absent, legacy blocked_commands and confirm_patterns from [tools.shell] are automatically converted to equivalent permission rules (deny and ask respectively).

Output Overflow

Tool output exceeding 30,000 characters is truncated (head + tail split) before being sent to the LLM. The full untruncated output is saved to ~/.zeph/data/tool-output/{uuid}.txt, and the truncated message includes the file path so the LLM can read the complete output if needed.

Stale overflow files older than 24 hours are cleaned up automatically on startup.

Configuration

[agent]
max_tool_iterations = 10   # Max tool loop iterations (default: 10)

[tools]
enabled = true
summarize_output = false

[tools.shell]
timeout = 30
allowed_paths = []         # Sandbox directories (empty = cwd only)

[tools.file]
allowed_paths = []         # Sandbox directories for file tools (empty = cwd only)

# Pattern-based permissions (optional; overrides legacy blocked_commands/confirm_patterns)
# [tools.permissions.bash]
# [[tools.permissions.bash]]
# pattern = "cargo *"
# action = "allow"

The tools.file.allowed_paths setting controls which directories FileExecutor can access for read, write, edit, glob, and grep operations. Shell and file sandboxes are configured independently.

| Variable | Description |
|----------|-------------|
| ZEPH_AGENT_MAX_TOOL_ITERATIONS | Max tool loop iterations (default: 10) |

TUI Dashboard

Zeph includes an optional ratatui-based Terminal User Interface that replaces the plain CLI with a rich dashboard showing real-time agent metrics, conversation history, and an always-visible input line.

Enabling

The TUI requires the tui feature flag (disabled by default):

cargo build --release --features tui

Running

# Via CLI argument
zeph --tui

# Via environment variable
ZEPH_TUI=true zeph

Layout

+-------------------------------------------------------------+
| Zeph v0.9.5 | Provider: orchestrator | Model: claude-son... |
+----------------------------------------+--------------------+
|                                        | Skills (3/15)      |
|                                        | - setup-guide      |
|                                        | - git-workflow     |
|                                        |                    |
| [user] Can you check my code?         +--------------------+
|                                        | Memory             |
| [zeph] Sure, let me look at           | SQLite: 142 msgs   |
|        the code structure...           | Qdrant: connected  |
|                                       ▲+--------------------+
+----------------------------------------+--------------------+
| You: write a rust function for fibon_                       |
+-------------------------------------------------------------+
| [Insert] | Skills: 3 | Tokens: 4.2k | Qdrant: OK | 2m 15s |
+-------------------------------------------------------------+
  • Chat panel (left 70%): bottom-up message feed with full markdown rendering (bold, italic, code blocks, lists, headings), scrollbar with proportional thumb, and scroll indicators (▲/▼). Mouse wheel scrolling supported
  • Side panels (right 30%): skills, memory, and resources metrics — hidden on terminals < 80 cols
  • Input line: always visible, supports multiline input via Shift+Enter. Shows [+N queued] badge when messages are pending
  • Status bar: mode indicator, skill count, token usage, uptime
  • Splash screen: colored block-letter “ZEPH” banner on startup

Keybindings

Normal Mode

| Key | Action |
|-----|--------|
| i | Enter Insert mode (focus input) |
| q | Quit application |
| Ctrl+C | Quit application |
| Up / k | Scroll chat up |
| Down / j | Scroll chat down |
| Page Up/Down | Scroll chat one page |
| Home / End | Scroll to top / bottom |
| Mouse wheel | Scroll chat up/down (3 lines per tick) |
| d | Toggle side panels on/off |
| Tab | Cycle side panel focus |

Insert Mode

| Key | Action |
|-----|--------|
| Enter | Submit input to agent |
| Shift+Enter | Insert newline (multiline input) |
| Escape | Switch to Normal mode |
| Ctrl+C | Quit application |
| Ctrl+U | Clear input line |
| Ctrl+K | Clear message queue |

Confirmation Modal

When a destructive command requires confirmation, a modal overlay appears:

| Key | Action |
|-----|--------|
| Y / Enter | Confirm action |
| N / Escape | Cancel action |

All other keys are blocked while the modal is visible.

Markdown Rendering

Chat messages are rendered with full markdown support via pulldown-cmark:

| Element | Rendering |
|---------|-----------|
| **bold** | Bold modifier |
| *italic* | Italic modifier |
| `inline code` | Blue text with dark background glow |
| Code blocks | Green text with dimmed language tag |
| # Heading | Bold + underlined |
| - list item | Green bullet (•) prefix |
| > blockquote | Dimmed vertical bar (│) prefix |
| ~~strikethrough~~ | Crossed-out modifier |
| --- | Horizontal rule (─) |

Thinking Blocks

When using Ollama models that emit reasoning traces (DeepSeek, Qwen), the <think>...</think> segments are rendered in a darker color (DarkGray) to visually separate model reasoning from the final response. Incomplete thinking blocks during streaming are also shown in the darker style.

Conversation History

On startup, the TUI loads the latest conversation from SQLite and displays it in the chat panel. This provides continuity across sessions.

Message Queueing

The TUI input line remains interactive during model inference, allowing you to queue up to 10 messages for sequential processing. This is useful for providing follow-up instructions without waiting for the current response to complete.

Queue Indicator

When messages are pending, a badge appears in the input area:

You: next message here [+3 queued]_

The counter shows how many messages are waiting to be processed. Queued messages are drained automatically after each response completes.

Message Merging

Consecutive messages submitted within 500ms are automatically merged with newline separators. This reduces context fragmentation when you send rapid-fire instructions.

Clearing the Queue

Press Ctrl+K in Insert mode to discard all queued messages. This is useful if you change your mind about pending instructions.

Alternatively, send the /clear-queue command to clear the queue programmatically.

Queue Limits

The queue holds a maximum of 10 messages. When full, new input is silently dropped until the agent drains the queue by processing pending messages.

Responsive Layout

The TUI adapts to terminal width:

| Width | Layout |
|-------|--------|
| >= 80 cols | Full layout: chat (70%) + side panels (30%) |
| < 80 cols | Side panels hidden, chat takes full width |

Live Metrics

The TUI dashboard displays real-time metrics collected from the agent loop via tokio::sync::watch channel:

| Panel | Metrics |
|-------|---------|
| Skills | Active/total skill count, matched skill names per query |
| Memory | SQLite message count, conversation ID, Qdrant status, embeddings generated, summaries count, tool output prunes |
| Resources | Prompt/completion/total tokens, API calls, last LLM latency (ms), provider and model name |

Metrics are updated at key instrumentation points in the agent loop:

  • After each LLM call (api_calls, latency, prompt tokens)
  • After streaming completes (completion tokens)
  • After skill matching (active skills, total skills)
  • After message persistence (sqlite message count)
  • After summarization (summaries count)

Token counts use a chars/4 estimation (sufficient for dashboard display).

Deferred Model Warmup

When running with Ollama (or an orchestrator with Ollama sub-providers), model warmup is deferred until after the TUI interface renders. This means:

  1. The TUI appears immediately — no blank terminal while the model loads into GPU/CPU memory
  2. A status indicator (“warming up model…”) appears in the chat panel
  3. Warmup runs in the background via a spawned tokio task
  4. Once complete, the status updates to “model ready” and the agent loop begins processing

If you send a message before warmup finishes, it is queued and processed automatically once the model is ready.

Note: In non-TUI modes (CLI, Telegram), warmup still runs synchronously before the agent loop starts.

Architecture

The TUI runs as three concurrent loops:

  1. Crossterm event reader — dedicated OS thread (std::thread), sends key/tick/resize events via mpsc
  2. TUI render loop — tokio task, draws frames at 10 FPS via tokio::select!, polls watch::Receiver for latest metrics before each draw
  3. Agent loop — existing Agent::run(), communicates via TuiChannel and emits metrics via watch::Sender

TuiChannel implements the Channel trait, so it plugs into the agent with zero changes to the generic signature. MetricsSnapshot and MetricsCollector live in zeph-core to avoid circular dependencies — zeph-tui re-exports them.

Tracing

When TUI is active, tracing output is redirected to zeph.log to avoid corrupting the terminal display.

Docker

Docker images are built without the tui feature by default (headless operation). To build a Docker image with TUI support:

docker build -f Dockerfile.dev --build-arg CARGO_FEATURES=tui -t zeph:tui .

Code Indexing

AST-based code indexing and semantic retrieval for project-aware context. The zeph-index crate parses source files via tree-sitter, chunks them by AST structure, embeds the chunks in Qdrant, and retrieves relevant code via hybrid search (semantic + grep routing) for injection into the agent context window.

Disabled by default. Enable via [index] enabled = true in config.

Why Code RAG

Cloud models with 200K token windows can afford multi-round agentic grep. Local models with 8K-32K windows cannot: a single grep cycle costs ~2K tokens (25% of an 8K budget), while 5 rounds would exceed the entire context. RAG retrieves 6-8 relevant chunks in ~3K tokens, preserving budget for history and response.

For cloud models, code RAG serves as pre-fill context alongside agentic search. For local models, it is the primary code retrieval mechanism.

Setup

  1. Start Qdrant (required for vector storage):

    docker compose up -d qdrant
    
  2. Enable indexing in config:

    [index]
    enabled = true
    
  3. Index your project:

    zeph index
    

    Or let auto-indexing handle it on startup when auto_index = true (default).

Architecture

The zeph-index crate contains 7 modules:

Module | Purpose
languages | Language detection from file extensions, tree-sitter grammar registry
chunker | AST-based chunking with greedy sibling merge (cAST-inspired algorithm)
context | Contextualized embedding text generation (file path + scope + imports + code)
store | Dual-write storage: Qdrant vectors + SQLite chunk metadata
indexer | Orchestrator: walk project tree, chunk files, embed, store with incremental change detection
retriever | Query classification, semantic search, budget-aware chunk packing
repo_map | Compact structural map of the project (signatures only, no function bodies)

Pipeline

Source files
    |
    v
[languages.rs] detect language, load grammar
    |
    v
[chunker.rs] parse AST, split into chunks (target: ~600 non-ws chars)
    |
    v
[context.rs] prepend file path, scope chain, imports, language tag
    |
    v
[indexer.rs] embed via LlmProvider, skip unchanged (content hash)
    |
    v
[store.rs] upsert to Qdrant (vectors) + SQLite (metadata)

Retrieval

User query
    |
    v
[retriever.rs] classify_query()
    |
    +--> Semantic  --> embed query --> Qdrant search --> budget pack --> inject
    |
    +--> Grep      --> return empty (agent uses bash tools)
    |
    +--> Hybrid    --> semantic search + hint to agent

Query Classification

The retriever classifies each query to route it to the appropriate search strategy:

Strategy | Trigger | Action
Grep | Exact symbols: ::, fn , struct , CamelCase, snake_case identifiers | Agent handles via shell grep/ripgrep
Semantic | Conceptual queries: “how”, “where”, “why”, “explain” | Vector similarity search in Qdrant
Hybrid | Both symbol patterns and conceptual words | Semantic search + hint that grep may also help

Default (no pattern match): Semantic.
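
A simplified sketch of this routing heuristic; the real classify_query() in retriever.rs may use different patterns and thresholds:

enum SearchStrategy { Semantic, Grep, Hybrid }

fn classify_query(query: &str) -> SearchStrategy {
    // Crude symbol detection: path separators, item keywords, snake_case,
    // or an interior capital letter (approximates CamelCase identifiers).
    let has_symbol = query.contains("::")
        || query.contains("fn ")
        || query.contains("struct ")
        || query.split_whitespace().any(|w| {
            w.contains('_') || w.chars().skip(1).any(|c| c.is_ascii_uppercase())
        });

    // Crude conceptual detection via question keywords.
    let lower = query.to_lowercase();
    let conceptual = ["how", "where", "why", "explain"]
        .iter()
        .any(|kw| lower.contains(*kw));

    match (has_symbol, conceptual) {
        (true, true) => SearchStrategy::Hybrid,
        (true, false) => SearchStrategy::Grep,
        _ => SearchStrategy::Semantic, // default: Semantic
    }
}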

AST-Based Chunking

Files are parsed via tree-sitter into AST, then chunked by entity boundaries (functions, structs, classes, impl blocks). The algorithm uses greedy sibling merge:

  • Target size: 600 non-whitespace characters (~300-400 tokens)
  • Max size: 1200 non-ws chars (forced recursive split)
  • Min size: 100 non-ws chars (merge with adjacent sibling)

Config files (TOML, JSON, Markdown, Bash) are indexed as single file-level chunks since they lack named entities.

Each chunk carries rich metadata: file path, language, AST node type, entity name, line range, scope chain (e.g. MyStruct > impl MyStruct > my_method), imports, and a BLAKE3 content hash for change detection.
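
A sketch of the greedy sibling merge over already-parsed sibling spans; the Span type and the exact merge conditions are illustrative:

const TARGET: usize = 600;  // non-whitespace chars (~300-400 tokens)
const MAX: usize = 1200;    // forced recursive split above this (omitted here)
const MIN: usize = 100;     // merge with an adjacent sibling below this

struct Span { text: String }

fn non_ws(text: &str) -> usize {
    text.chars().filter(|c| !c.is_whitespace()).count()
}

/// Greedily merge consecutive sibling spans until the target size is reached.
fn merge_siblings(spans: Vec<Span>) -> Vec<String> {
    let mut chunks: Vec<String> = Vec::new();
    for span in spans {
        let size = non_ws(&span.text);
        // Merge while the previous chunk is still under target, or when this
        // span alone is too small, as long as the result stays under MAX.
        let should_merge = chunks.last().is_some_and(|prev| {
            (non_ws(prev) < TARGET || size < MIN) && non_ws(prev) + size <= MAX
        });
        if should_merge {
            let prev = chunks.last_mut().expect("checked above");
            prev.push('\n');
            prev.push_str(&span.text);
        } else {
            chunks.push(span.text);
        }
    }
    chunks
}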

Contextualized Embeddings

Embedding raw code alone yields poor retrieval quality for conceptual queries. Before embedding, each chunk is prepended with:

  • File path (# src/agent.rs)
  • Scope chain (# Scope: Agent > prepare_context)
  • Language tag (# Language: rust)
  • First 5 import/use statements

This contextualized form improves retrieval for queries like “where is auth handled?” where the code alone might not contain the word “auth”.
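
A sketch of what building the contextualized embedding text could look like; the field names are illustrative rather than the real context.rs types:

struct Chunk {
    file_path: String,
    scope_chain: Vec<String>, // e.g. ["Agent", "prepare_context"]
    language: String,
    imports: Vec<String>,
    code: String,
}

fn contextualize_for_embedding(chunk: &Chunk) -> String {
    let mut text = format!("# {}\n", chunk.file_path);
    if !chunk.scope_chain.is_empty() {
        text.push_str(&format!("# Scope: {}\n", chunk.scope_chain.join(" > ")));
    }
    text.push_str(&format!("# Language: {}\n", chunk.language));
    for import in chunk.imports.iter().take(5) {
        text.push_str(import);
        text.push('\n');
    }
    text.push_str(&chunk.code);
    text
}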

Storage

Chunks are dual-written to two stores:

Store | Data | Purpose
Qdrant (zeph_code_chunks) | Embedding vectors + payload (code, metadata) | Semantic similarity search
SQLite (chunk_metadata) | File path, content hash, line range, language, node type | Change detection, cleanup of deleted files

The Qdrant collection uses INT8 scalar quantization for ~4x memory reduction with minimal accuracy loss. Payload indexes on language, file_path, and node_type enable filtered search.

Incremental Indexing

On subsequent runs, the indexer skips unchanged chunks by checking BLAKE3 content hashes in SQLite. Only modified or new files are re-embedded. Deleted files are detected by comparing the current file set against the SQLite index, and their chunks are removed from both stores.
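
A sketch of the hash-based skip using the blake3 crate; the stored-hash lookup stands in for the real SQLite query:

/// Returns true when the chunk content is unchanged and re-embedding can be skipped.
fn is_unchanged(content: &str, stored_hash: Option<&str>) -> bool {
    let current = blake3::hash(content.as_bytes()).to_hex();
    stored_hash == Some(current.as_str())
}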

File Watcher

When watch = true (default), an IndexWatcher monitors project files for changes during the session. On file modification, the changed file is automatically re-indexed via reindex_file() without rebuilding the entire index. The watcher uses 1-second debounce to batch rapid changes and only processes files with indexable extensions.

Disable with:

[index]
watch = false

Repo Map

A lightweight structural map of the project, generated via tree-sitter signature extraction (no function bodies). Included in the system prompt and cached with a configurable TTL (default: 5 minutes) to avoid per-message filesystem traversal.

Example output:

<repo_map>
  src/agent.rs :: struct:Agent, impl:Agent, fn:new, fn:run, fn:prepare_context
  src/config.rs :: struct:Config, fn:load
  src/main.rs :: fn:main, fn:setup_logging
  ... and 12 more files
</repo_map>

The map is budget-constrained (default: 1024 tokens) and sorted by symbol count (files with more symbols appear first). It gives the model a structural overview of the project without consuming significant context.

Budget-Aware Retrieval

Retrieved chunks are packed into a token budget (default: 40% of available context for code). Chunks are sorted by similarity score and greedily packed until the budget is exhausted. A minimum score threshold (default: 0.25) filters low-relevance results.

Retrieved code is injected as a transient <code_context> XML block before the conversation history. It is re-generated on every turn and never persisted.
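
A sketch of the greedy packing step, assuming chunks come back from Qdrant with a similarity score; the defaults mirror the configuration shown below:

struct ScoredChunk { score: f32, text: String }

/// Keep the highest-scoring chunks that fit within the token budget.
fn pack_chunks(mut chunks: Vec<ScoredChunk>, budget_tokens: usize, threshold: f32) -> Vec<String> {
    chunks.retain(|c| c.score >= threshold);            // drop low-relevance results (default 0.25)
    chunks.sort_by(|a, b| b.score.total_cmp(&a.score)); // best matches first
    let mut packed = Vec::new();
    let mut used = 0;
    for chunk in chunks {
        let cost = chunk.text.chars().count() / 4;      // chars/4 token estimate
        if used + cost > budget_tokens {
            break;
        }
        used += cost;
        packed.push(chunk.text);
    }
    packed
}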

Context Window Layout (with Code RAG)

When code indexing is enabled, the context window includes two additional sections:

+---------------------------------------------------+
| System prompt + environment + ZEPH.md             |
+---------------------------------------------------+
| <repo_map> (structural overview, cached)          |  <= 1024 tokens
+---------------------------------------------------+
| <available_skills>                                |
+---------------------------------------------------+
| <code_context> (per-query RAG chunks, transient)  |  <= 30% available
+---------------------------------------------------+
| [semantic recall] past messages                   |  <= 10% available
+---------------------------------------------------+
| Recent message history                            |  <= 50% available
+---------------------------------------------------+
| [response reserve]                                |  20% of total
+---------------------------------------------------+

Configuration

[index]
# Enable codebase indexing for semantic code search.
# Requires Qdrant running (uses separate collection "zeph_code_chunks").
enabled = false

# Auto-index on startup and re-index changed files during session.
auto_index = true

# Directories to index (relative to cwd).
paths = ["."]

# Patterns to exclude (in addition to .gitignore).
exclude = ["target", "node_modules", ".git", "vendor", "dist", "build", "__pycache__"]

# Token budget for repo map in system prompt (0 = no repo map).
repo_map_budget = 1024

# Cache TTL for repo map in seconds (avoids per-message regeneration).
repo_map_ttl_secs = 300

[index.chunker]
# Target chunk size in non-whitespace characters (~300-400 tokens).
target_size = 600
# Maximum chunk size before forced split.
max_size = 1200
# Minimum chunk size — smaller chunks merge with siblings.
min_size = 100

[index.retrieval]
# Maximum chunks to fetch from Qdrant (before budget packing).
max_chunks = 12
# Minimum cosine similarity score to accept.
score_threshold = 0.25
# Maximum fraction of available context budget for code chunks.
budget_ratio = 0.40

Supported Languages

Language support is controlled by feature flags on the zeph-index crate. All language features are enabled by default when the binary's index feature is active.

Language | Feature | Extensions
Rust | lang-rust | .rs
Python | lang-python | .py, .pyi
JavaScript | lang-js | .js, .jsx, .mjs, .cjs
TypeScript | lang-js | .ts, .tsx, .mts, .cts
Go | lang-go | .go
Bash | lang-config | .sh, .bash, .zsh
TOML | lang-config | .toml
JSON | lang-config | .json, .jsonc
Markdown | lang-config | .md, .markdown

Environment Variables

Variable | Description | Default
ZEPH_INDEX_ENABLED | Enable code indexing | false
ZEPH_INDEX_AUTO_INDEX | Auto-index on startup | true
ZEPH_INDEX_REPO_MAP_BUDGET | Token budget for repo map | 1024
ZEPH_INDEX_REPO_MAP_TTL_SECS | Cache TTL for repo map in seconds | 300

Embedding Model Recommendations

The indexer uses the same LlmProvider.embed() as semantic memory. Any embedding model works. For code-heavy workloads:

Model | Dims | Notes
qwen3-embedding | 1024 | Current Zeph default, good general performance
nomic-embed-text | 768 | Lightweight universal model
nomic-embed-code | 768 | Optimized for code, higher RAM (~7.5GB)

Architecture Overview

Cargo workspace (Edition 2024, resolver 3) with 10 crates + binary root.

Requires Rust 1.88+. Native async traits are used throughout — no async-trait crate.

Workspace Layout

zeph (binary) — thin bootstrap glue
├── zeph-core       Agent loop, config, config hot-reload, channel trait, context builder
├── zeph-llm        LlmProvider trait, Ollama + Claude + OpenAI + Candle backends, orchestrator, embeddings
├── zeph-skills     SKILL.md parser, registry with lazy body loading, embedding matcher, resource resolver, hot-reload
├── zeph-memory     SQLite + Qdrant, SemanticMemory orchestrator, summarization
├── zeph-channels   Telegram adapter (teloxide) with streaming
├── zeph-tools      ToolExecutor trait, ShellExecutor, WebScrapeExecutor, CompositeExecutor
├── zeph-index      AST-based code indexing, hybrid retrieval, repo map (optional)
├── zeph-mcp        MCP client via rmcp, multi-server lifecycle, unified tool matching (optional)
├── zeph-a2a        A2A protocol client + server, agent discovery, JSON-RPC 2.0 (optional)
└── zeph-tui        ratatui TUI dashboard with real-time metrics (optional)

Dependency Graph

zeph (binary)
  └── zeph-core (orchestrates everything)
        ├── zeph-llm (leaf)
        ├── zeph-skills (leaf)
        ├── zeph-memory (leaf)
        ├── zeph-channels (leaf)
        ├── zeph-tools (leaf)
        ├── zeph-index (optional, leaf)
        ├── zeph-mcp (optional, leaf)
        ├── zeph-a2a (optional, leaf)
        └── zeph-tui (optional, leaf)

zeph-core is the only crate that depends on other workspace crates. All leaf crates are independent and can be tested in isolation.

Agent Loop

The agent loop processes user input in a continuous cycle:

  1. Read initial user message via channel.recv()
  2. Build context from skills, memory, and environment
  3. Stream LLM response token-by-token
  4. Execute any tool calls in the response
  5. Drain queued messages (if any) via channel.try_recv() and repeat from step 2

Queued messages are processed sequentially with full context rebuilding between each. Consecutive messages within 500ms are merged to reduce fragmentation. The queue holds a maximum of 10 messages; older messages are dropped when full.
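
In outline, the loop could look like the following; the Channel trait and helper functions are simplified stand-ins for the real zeph-core types:

// Simplified outline; not the real zeph-core implementation.
trait Channel {
    async fn recv(&mut self) -> Option<String>;
    fn try_recv(&mut self) -> Option<String>;
}

async fn build_context(_message: &str) -> String { String::new() }       // stub
async fn stream_llm_response(_context: &str) -> String { String::new() } // stub
async fn execute_tool_calls(_response: &str) {}                          // stub

async fn run_agent(channel: &mut impl Channel) {
    // 1. Block on the next user message.
    while let Some(mut message) = channel.recv().await {
        loop {
            // 2-4. Rebuild context, stream the response, execute tool calls.
            let context = build_context(&message).await;
            let response = stream_llm_response(&context).await;
            execute_tool_calls(&response).await;

            // 5. Drain queued messages; each gets a full context rebuild.
            match channel.try_recv() {
                Some(next) => message = next,
                None => break,
            }
        }
    }
}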

Key Design Decisions

  • Generic Agent: Agent<P: LlmProvider + Clone + 'static, C: Channel, T: ToolExecutor> — fully generic over provider, channel, and tool executor
  • TLS: rustls everywhere (no openssl-sys)
  • Errors: thiserror for library crates, anyhow for application code (zeph-core, main.rs)
  • Lints: workspace-level clippy::all + clippy::pedantic + clippy::nursery; unsafe_code = "deny"
  • Dependencies: versions only in root [workspace.dependencies]; crates inherit via workspace = true
  • Feature gates: optional crates (zeph-index, zeph-mcp, zeph-a2a, zeph-tui) are feature-gated in the binary
  • Context engineering: proportional budget allocation, semantic recall injection, message trimming, runtime compaction, environment context injection, progressive skill loading, ZEPH.md project config discovery

Crates

Each workspace crate has a focused responsibility. All leaf crates are independent and testable in isolation; only zeph-core depends on other workspace members.

zeph-core

Agent loop, configuration loading, and context builder.

  • Agent<P, C, T> — main agent loop with streaming support, message queue drain, configurable max_tool_iterations (default 10), doom-loop detection, and context budget check (stops at 80% threshold)
  • Config — TOML config loading with env var overrides
  • Channel trait — abstraction for I/O (CLI, Telegram, TUI) with recv(), try_recv(), send_queue_count() for queue management
  • Context builder — assembles system prompt from skills, memory, summaries, environment, and project config
  • Context engineering — proportional budget allocation, semantic recall injection, message trimming, runtime compaction
  • EnvironmentContext — runtime gathering of cwd, git branch, OS, model name
  • project.rs — ZEPH.md config discovery (walk up directory tree)
  • VaultProvider trait — pluggable secret resolution
  • MetricsSnapshot / MetricsCollector — real-time metrics via tokio::sync::watch for TUI dashboard

zeph-llm

LLM provider abstraction and backend implementations.

  • LlmProvider trait — chat(), chat_stream(), embed(), supports_streaming(), supports_embeddings()
  • OllamaProvider — local inference via ollama-rs
  • ClaudeProvider — Anthropic Messages API with SSE streaming
  • OpenAiProvider — OpenAI + compatible APIs (raw reqwest)
  • CandleProvider — local GGUF model inference via candle
  • AnyProvider — enum dispatch for runtime provider selection
  • ModelOrchestrator — task-based multi-model routing with fallback chains

zeph-skills

SKILL.md loader, skill registry, and prompt formatter.

  • SkillMeta / Skill — metadata + lazy body loading via OnceLock
  • SkillRegistry — manages skill lifecycle, lazy body access
  • SkillMatcher — in-memory cosine similarity matching
  • QdrantSkillMatcher — persistent embeddings with BLAKE3 delta sync
  • format_skills_prompt() — assembles prompt with OS-filtered resources
  • format_skills_catalog() — description-only entries for non-matched skills
  • resource.rs — discover_resources() + load_resource() with path traversal protection
  • Filesystem watcher for hot-reload (500ms debounce)

zeph-memory

SQLite-backed conversation persistence with Qdrant vector search.

  • SqliteStore — conversations, messages, summaries, skill usage, skill versions
  • QdrantStore — vector storage and cosine similarity search
  • SemanticMemory<P> — orchestrator coordinating SQLite + Qdrant + LlmProvider
  • Automatic collection creation, graceful degradation without Qdrant

zeph-channels

Channel implementations for the Zeph agent.

  • CliChannel — stdin/stdout with immediate streaming output, blocking recv (queue always empty)
  • TelegramChannel — teloxide adapter with MarkdownV2 rendering, streaming via edit-in-place, user whitelisting, inline confirmation keyboards, mpsc-backed message queue with 500ms merge window

zeph-tools

Tool execution abstraction and shell backend.

  • ToolExecutor trait — accepts LLM response or structured ToolCall, returns tool output
  • ToolRegistry — typed definitions for 7 built-in tools (bash, read, edit, write, glob, grep, web_scrape), injected into system prompt as <tools> catalog
  • ToolCall / execute_tool_call() — structured tool invocation with typed parameters alongside legacy bash extraction (dual-mode)
  • FileExecutor — sandboxed file operations (read, write, edit, glob, grep) with ancestor-walk path canonicalization
  • ShellExecutor — bash block parser, command safety filter, sandbox validation
  • WebScrapeExecutor — HTML scraping with CSS selectors, SSRF protection
  • CompositeExecutor<A, B> — generic chaining with first-match-wins dispatch, routes structured tool calls by tool_id to the appropriate backend
  • AuditLogger — structured JSON audit trail for all executions
  • truncate_tool_output() — head+tail split at 30K chars with UTF-8 safe boundaries

zeph-index

AST-based code indexing, semantic retrieval, and repo map generation (optional, feature-gated).

  • Lang enum — supported languages with tree-sitter grammar registry, feature-gated per language group
  • chunk_file() — AST-based chunking with greedy sibling merge, scope chains, import extraction
  • contextualize_for_embedding() — prepends file path, scope, language, imports to code for better embedding quality
  • CodeStore — dual-write storage: Qdrant vectors (zeph_code_chunks collection) + SQLite metadata with BLAKE3 content-hash change detection
  • CodeIndexer<P> — project indexer orchestrator: walk, chunk, embed, store with incremental skip of unchanged chunks
  • CodeRetriever<P> — hybrid retrieval with query classification (Semantic / Grep / Hybrid), budget-aware chunk packing
  • generate_repo_map() — compact structural view via tree-sitter signature extraction, budget-constrained

zeph-mcp

MCP client for external tool servers (optional, feature-gated).

  • McpClient / McpManager — multi-server lifecycle management
  • McpToolExecutor — tool execution via MCP protocol
  • McpToolRegistry — tool embeddings in Qdrant with delta sync
  • Dual transport: Stdio (child process) and HTTP (Streamable HTTP)
  • Dynamic server management via /mcp add, /mcp remove

zeph-a2a

A2A protocol client and server (optional, feature-gated).

  • A2aClient — JSON-RPC 2.0 client with SSE streaming
  • AgentRegistry — agent card discovery with TTL cache
  • AgentCardBuilder — construct agent cards from runtime config
  • A2A Server — axum-based HTTP server with bearer auth, rate limiting, body size limits
  • TaskManager — in-memory task lifecycle management

zeph-tui

ratatui-based TUI dashboard (optional, feature-gated).

  • TuiChannel — Channel trait implementation bridging agent loop and TUI render loop via mpsc, oneshot-based confirmation dialog, bounded message queue (max 10) with 500ms merge window
  • App — TUI state machine with Normal/Insert/Confirm modes, keybindings, scroll, live metrics polling via watch::Receiver, queue badge indicator [+N queued], Ctrl+K to clear queue
  • EventReader — crossterm event loop on dedicated OS thread (avoids tokio starvation)
  • Side panel widgets: skills (active/total), memory (SQLite, Qdrant, embeddings), resources (tokens, API calls, latency)
  • Chat widget with bottom-up message feed, pulldown-cmark markdown rendering, scrollbar with proportional thumb, mouse scroll, thinking block segmentation, and streaming cursor
  • Splash screen widget with colored block-letter banner
  • Conversation history loading from SQLite on startup
  • Confirmation modal overlay widget with Y/N keybindings and focus capture
  • Responsive layout: side panels hidden on terminals < 80 cols
  • Multiline input via Shift+Enter
  • Status bar with mode, skill count, tokens, Qdrant status, uptime
  • Panic hook for terminal state restoration
  • Re-exports MetricsSnapshot / MetricsCollector from zeph-core

Token Efficiency

Zeph’s prompt construction is designed to minimize token usage regardless of how many skills and MCP tools are installed.

The Problem

Naive AI agent implementations inject all available tools and instructions into every prompt. With 50 skills and 100 MCP tools, this means thousands of tokens consumed on every request — most of which are irrelevant to the user’s query.

Zeph’s Approach

Embedding-Based Selection

Per query, only the top-K most relevant skills (default: 5) are selected via cosine similarity of vector embeddings. The same pipeline handles MCP tools.

User query → embed(query) → cosine_similarity(query, skills) → top-K → inject into prompt

This makes prompt size O(K) instead of O(N), where:

  • K = max_active_skills (default: 5, configurable)
  • N = total skills + MCP tools installed
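
A minimal sketch of the selection step, assuming the query and every skill already have an embedding vector:

fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    let denom = norm(a) * norm(b);
    if denom == 0.0 { 0.0 } else { dot / denom }
}

/// Return the indices of the top-K skills by cosine similarity to the query.
fn top_k_skills(query: &[f32], skills: &[Vec<f32>], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = skills
        .iter()
        .enumerate()
        .map(|(i, emb)| (i, cosine_similarity(query, emb)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}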

Progressive Loading

Even selected skills don’t load everything at once:

Stage | What loads | When | Token cost
Startup | Skill metadata (name, description) | Once | ~100 tokens per skill
Query | Skill body (instructions, examples) | On match | <5000 tokens per skill
Query | Resource files (references, scripts) | On match + OS filter | Variable

Metadata is always in memory for matching. Bodies are loaded lazily via OnceLock and cached after first access. Resources are loaded on demand with OS filtering (e.g., linux.md only loads on Linux).
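
A sketch of the lazy-body pattern with std::sync::OnceLock; the Skill struct here is a simplified stand-in:

use std::path::PathBuf;
use std::sync::OnceLock;

struct Skill {
    name: String,
    description: String,    // metadata, always in memory for matching
    body_path: PathBuf,
    body: OnceLock<String>, // full instructions, loaded on first access
}

impl Skill {
    fn body(&self) -> &str {
        self.body.get_or_init(|| {
            std::fs::read_to_string(&self.body_path).unwrap_or_default()
        })
    }
}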

Two-Tier Skill Catalog

Non-matched skills are listed in a description-only <other_skills> catalog — giving the model awareness of all available capabilities without injecting their full bodies. This means the model can request a specific skill if needed, while consuming only ~20 tokens per unmatched skill instead of thousands.

MCP Tool Matching

MCP tools follow the same pipeline:

  • Tools are embedded in Qdrant (zeph_mcp_tools collection) with BLAKE3 content-hash delta sync
  • Only re-embedded when tool definitions change
  • Unified matching ranks both skills and MCP tools by relevance score
  • Prompt contains only the top-K combined results

Practical Impact

Scenario | Naive approach | Zeph
10 skills, no MCP | ~50K tokens/prompt | ~25K tokens/prompt
50 skills, 100 MCP tools | ~250K tokens/prompt | ~25K tokens/prompt
200 skills, 500 MCP tools | ~1M tokens/prompt | ~25K tokens/prompt

Prompt size stays constant as you add more capabilities. The only cost of more skills is a slightly larger embedding index in Qdrant or memory.

Two-Tier Context Pruning

Long conversations accumulate tool outputs that consume significant context space. Zeph uses a two-tier strategy: Tier 1 selectively prunes old tool outputs (cheap, no LLM call), and Tier 2 falls back to full LLM compaction only when Tier 1 is insufficient. See Context Engineering for details.

Configuration

[skills]
max_active_skills = 5  # Increase for broader context, decrease for faster/cheaper queries

Or override via environment variable:

export ZEPH_SKILLS_MAX_ACTIVE=3

Security

Zeph implements defense-in-depth security for safe AI agent operations in production environments.

Shell Command Filtering

All shell commands from LLM responses pass through a security filter before execution. Commands matching blocked patterns are rejected with detailed error messages.

12 blocked patterns by default:

  • rm -rf /, rm -rf /* (filesystem destruction): prevents accidental system wipe
  • sudo, su (privilege escalation): blocks unauthorized root access
  • mkfs, fdisk (filesystem operations): prevents disk formatting
  • dd if=, dd of= (low-level disk I/O): blocks dangerous write operations
  • curl | bash, wget | sh (arbitrary code execution): prevents remote code injection
  • nc, ncat, netcat (network backdoors): blocks reverse shell attempts
  • shutdown, reboot, halt (system control): prevents service disruption

Configuration:

[tools.shell]
timeout = 30
blocked_commands = ["custom_pattern"]  # Additional patterns (additive to defaults)
allowed_paths = ["/home/user/workspace"]  # Restrict filesystem access
allow_network = true  # false blocks curl/wget/nc
confirm_patterns = ["rm ", "git push -f"]  # Destructive command patterns

Custom blocked patterns are additive — you cannot weaken default security. Matching is case-insensitive.
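
A sketch of the additive, case-insensitive check; the default pattern list is abbreviated relative to the list above:

const DEFAULT_BLOCKED: &[&str] = &["rm -rf /", "sudo", "mkfs", "dd if=", "curl | bash"];

/// Returns the pattern that blocks the command, if any.
/// Custom patterns are checked in addition to the defaults, never instead of them.
fn blocked_by(command: &str, custom: &[String]) -> Option<String> {
    let lower = command.to_lowercase();
    DEFAULT_BLOCKED
        .iter()
        .map(|p| p.to_string())
        .chain(custom.iter().cloned())
        .find(|pattern| lower.contains(&pattern.to_lowercase()))
}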

Shell Sandbox

Commands are validated against a configurable filesystem allowlist before execution:

  • allowed_paths = [] (default) restricts access to the working directory only
  • Paths are canonicalized to prevent traversal attacks (../../etc/passwd)
  • allow_network = false blocks network tools (curl, wget, nc, ncat, netcat)

Destructive Command Confirmation

Commands matching confirm_patterns trigger an interactive confirmation before execution:

  • CLI: y/N prompt on stdin
  • Telegram: inline keyboard with Confirm/Cancel buttons
  • Default patterns: rm, git push -f, git push --force, drop table, drop database, truncate
  • Configurable via tools.shell.confirm_patterns in TOML

File Executor Sandbox

FileExecutor enforces the same allowed_paths sandbox as the shell executor for all file operations (read, write, edit, glob, grep).

Path validation:

  • All paths are resolved to absolute form and canonicalized before access
  • Non-existing paths (e.g., for write) use ancestor-walk canonicalization: the resolver walks up the path tree to the nearest existing ancestor, canonicalizes it, then re-appends the remaining segments. This prevents symlink and .. traversal on paths that do not yet exist on disk
  • If the resolved path does not fall under any entry in allowed_paths, the operation is rejected with a SandboxViolation error
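
A sketch of the ancestor-walk resolution and sandbox check described above, simplified relative to the real FileExecutor:

use std::io;
use std::path::{Path, PathBuf};

/// Canonicalize `path` even if it does not exist yet: walk up to the nearest
/// existing ancestor, canonicalize that, then re-append the remaining segments.
fn canonicalize_ancestor_walk(path: &Path) -> io::Result<PathBuf> {
    let mut existing = path;
    let mut remainder = Vec::new();
    while !existing.exists() {
        let Some(parent) = existing.parent() else {
            return Err(io::Error::new(io::ErrorKind::NotFound, "no existing ancestor"));
        };
        if let Some(name) = existing.file_name() {
            remainder.push(name.to_owned());
        }
        existing = parent;
    }
    let mut resolved = existing.canonicalize()?;
    for segment in remainder.iter().rev() {
        resolved.push(segment);
    }
    Ok(resolved)
}

/// Reject paths that resolve outside every allowed root (empty list = cwd only).
fn is_allowed(path: &Path, allowed: &[PathBuf]) -> io::Result<bool> {
    let resolved = canonicalize_ancestor_walk(path)?;
    let roots = if allowed.is_empty() {
        vec![std::env::current_dir()?]
    } else {
        allowed.to_vec()
    };
    Ok(roots.iter().any(|root| resolved.starts_with(root)))
}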

Glob and grep enforcement:

  • glob results are post-filtered: matched paths outside the sandbox are silently excluded
  • grep validates the search root directory before scanning begins

Configuration is shared with the shell sandbox:

[tools.shell]
allowed_paths = ["/home/user/workspace"]  # Empty = cwd only

Permission Policy

The [tools.permissions] config section provides fine-grained, pattern-based access control for each tool. Rules are evaluated in order (first match wins) using case-insensitive glob patterns against the tool input. See Tool System — Permissions for configuration details.

Key security properties:

  • Tools with all-deny rules are excluded from the LLM system prompt, preventing the model from attempting to use them
  • Legacy blocked_commands and confirm_patterns are auto-migrated to equivalent permission rules when [tools.permissions] is absent
  • Default action when no rule matches is Ask (confirmation required)

Audit Logging

Structured JSON audit log for all tool executions:

[tools.audit]
enabled = true
destination = "./data/audit.jsonl"  # or "stdout"

Each entry includes timestamp, tool name, command, result (success/blocked/error/timeout), and duration in milliseconds.
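
A sketch of what writing one JSONL entry could look like, using serde and serde_json; the field names are illustrative rather than the exact AuditLogger schema:

use serde::Serialize;

#[derive(Serialize)]
struct AuditEntry<'a> {
    timestamp: &'a str, // RFC 3339
    tool: &'a str,
    command: &'a str,
    result: &'a str,    // success | blocked | error | timeout
    duration_ms: u64,
}

fn append_entry(entry: &AuditEntry<'_>) -> std::io::Result<()> {
    use std::io::Write;
    let line = serde_json::to_string(entry).map_err(std::io::Error::other)?;
    let mut file = std::fs::OpenOptions::new()
        .create(true)
        .append(true)
        .open("./data/audit.jsonl")?;
    writeln!(file, "{line}")
}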

Secret Redaction

LLM responses are scanned for common secret patterns before display:

  • Detected patterns: sk-, AKIA, ghp_, gho_, xoxb-, xoxp-, sk_live_, sk_test_, -----BEGIN
  • Secrets replaced with [REDACTED] preserving original whitespace formatting
  • Enabled by default (security.redact_secrets = true), applied to both streaming and non-streaming responses
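
A sketch of prefix-based redaction that preserves surrounding whitespace; this simplified version only checks the token prefixes listed above and skips the -----BEGIN PEM case:

const SECRET_PREFIXES: &[&str] = &["sk-", "AKIA", "ghp_", "gho_", "xoxb-", "xoxp-", "sk_live_", "sk_test_"];

/// Replace whitespace-delimited tokens that look like secrets with [REDACTED],
/// keeping the original whitespace layout intact.
fn redact_secrets(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut word = String::new();
    for ch in text.chars() {
        if ch.is_whitespace() {
            out.push_str(flush(&word));
            out.push(ch);
            word.clear();
        } else {
            word.push(ch);
        }
    }
    out.push_str(flush(&word));
    out
}

fn flush(word: &str) -> &str {
    if SECRET_PREFIXES.iter().any(|p| word.starts_with(p)) {
        "[REDACTED]"
    } else {
        word
    }
}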

Timeout Policies

Configurable per-operation timeouts prevent hung connections:

[timeouts]
llm_seconds = 120       # LLM chat completion
embedding_seconds = 30  # Embedding generation
a2a_seconds = 30        # A2A remote calls

A2A Network Security

  • TLS enforcement: a2a.require_tls = true rejects HTTP endpoints (HTTPS only)
  • SSRF protection: a2a.ssrf_protection = true blocks private IP ranges (RFC 1918, loopback, link-local) via DNS resolution
  • Payload limits: a2a.max_body_size caps request body (default: 1 MiB)

Safe execution model:

  • Commands parsed for blocked patterns, then sandbox-validated, then confirmation-checked
  • Timeout enforcement (default: 30s, configurable)
  • Full errors logged to system, sanitized messages shown to users
  • Audit trail for all tool executions (when enabled)

Container Security

Security Layer | Implementation | Status
Base image | Oracle Linux 9 Slim | Production-hardened
Vulnerability scanning | Trivy in CI/CD | 0 HIGH/CRITICAL CVEs
User privileges | Non-root zeph user (UID 1000) | Enforced
Attack surface | Minimal package installation | Distroless-style

Continuous security:

  • Every release scanned with Trivy before publishing
  • Automated Dependabot PRs for dependency updates
  • cargo-deny checks in CI for license/vulnerability compliance

Code Security

Rust-native memory safety guarantees:

  • Minimal unsafe: One audited unsafe block behind candle feature flag (memory-mapped safetensors loading). Core crates enforce #![deny(unsafe_code)]
  • No panic in production: unwrap() and expect() linted via clippy
  • Secure dependencies: All crates audited with cargo-deny
  • MSRV policy: Rust 1.88+ (Edition 2024) for latest security patches

Reporting Vulnerabilities

Do not open a public issue. Use GitHub Security Advisories to submit a private report.

Include: description, steps to reproduce, potential impact, suggested fix. Expect an initial response within 72 hours.

Feature Flags

Zeph uses Cargo feature flags to control optional functionality. Default features cover common use cases; platform-specific and experimental features are opt-in.

Feature | Default | Description
a2a | Enabled | A2A protocol client and server for agent-to-agent communication
openai | Enabled | OpenAI-compatible provider (GPT, Together, Groq, Fireworks, etc.)
mcp | Enabled | MCP client for external tool servers via stdio/HTTP transport
candle | Enabled | Local HuggingFace model inference via candle (GGUF quantized models)
orchestrator | Enabled | Multi-model routing with task-based classification and fallback chains
self-learning | Enabled | Skill evolution via failure detection, self-reflection, and LLM-generated improvements
vault-age | Enabled | Age-encrypted vault backend for file-based secret storage (age)
index | Enabled | AST-based code indexing and semantic retrieval via tree-sitter (guide)
tui | Disabled | ratatui-based TUI dashboard with real-time agent metrics
metal | Disabled | Metal GPU acceleration for candle on macOS (implies candle)
cuda | Disabled | CUDA GPU acceleration for candle on Linux (implies candle)

Build Examples

cargo build --release                                     # all default features
cargo build --release --features metal                    # macOS with Metal GPU
cargo build --release --features cuda                     # Linux with NVIDIA GPU
cargo build --release --features tui                      # with TUI dashboard
cargo build --release --no-default-features               # minimal binary

zeph-index Language Features

When index is enabled, tree-sitter grammars are controlled by sub-features on the zeph-index crate. All are enabled by default.

Feature | Languages
lang-rust | Rust
lang-python | Python
lang-js | JavaScript, TypeScript
lang-go | Go
lang-config | Bash, TOML, JSON, Markdown

Contributing

Thank you for considering contributing to Zeph.

Getting Started

  1. Fork the repository
  2. Clone your fork and create a branch from main
  3. Install Rust 1.88+ (Edition 2024 required)
  4. Run cargo build to verify the setup

Development

Build

cargo build

Test

# Run unit tests only (exclude integration tests)
cargo nextest run --workspace --lib --bins

# Run all tests including integration tests (requires Docker)
cargo nextest run --workspace --profile ci

Nextest profiles (.config/nextest.toml):

  • default: Runs all tests (unit + integration)
  • ci: CI environment, runs all tests with JUnit XML output for reporting

Integration Tests

Integration tests use testcontainers-rs to automatically spin up Docker containers for external services (Qdrant, etc.).

Prerequisites: Docker must be running on your machine.

# Run only integration tests
cargo nextest run --workspace --test '*integration*'

# Run unit tests only (skip integration tests)
cargo nextest run --workspace --lib --bins

# Run all tests
cargo nextest run --workspace

Integration test files are located in each crate’s tests/ directory and follow the *_integration.rs naming convention.

Lint

cargo +nightly fmt --check
cargo clippy --all-targets

Coverage

cargo llvm-cov --all-features --workspace

Workspace Structure

Crate | Purpose
zeph-core | Agent loop, config, channel trait
zeph-llm | LlmProvider trait, Ollama + Claude + OpenAI + Candle backends
zeph-skills | SKILL.md parser, registry, prompt formatter
zeph-memory | SQLite conversation persistence, Qdrant vector search
zeph-channels | Telegram adapter
zeph-tools | Tool executor, shell sandbox, web scraper
zeph-index | AST-based code indexing, semantic retrieval, repo map
zeph-mcp | MCP client, multi-server lifecycle
zeph-a2a | A2A protocol client and server
zeph-tui | ratatui TUI dashboard with real-time metrics

Pull Requests

  1. Create a feature branch: feat/<scope>/<description> or fix/<scope>/<description>
  2. Keep changes focused — one logical change per PR
  3. Add tests for new functionality
  4. Ensure all checks pass: cargo +nightly fmt, cargo clippy, cargo nextest run --lib --bins
  5. Write a clear PR description following the template

Commit Messages

  • Use imperative mood: “Add feature” not “Added feature”
  • Keep the first line under 72 characters
  • Reference related issues when applicable

Code Style

  • Follow workspace clippy lints (pedantic enabled)
  • Use cargo +nightly fmt for formatting
  • Avoid unnecessary comments — code should be self-explanatory
  • Comments are only for cognitively complex blocks

License

By contributing, you agree that your contributions will be licensed under the MIT License.

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog.

See the full CHANGELOG.md in the repository for the complete version history.