Token Efficiency
Zeph’s prompt construction is designed to minimize token usage regardless of how many skills and MCP tools are installed.
The Problem
Naive AI agent implementations inject all available tools and instructions into every prompt. With 50 skills and 100 MCP tools, this means thousands of tokens consumed on every request — most of which are irrelevant to the user’s query.
Zeph’s Approach
Embedding-Based Selection
Per query, only the top-K most relevant skills (default: 5) are selected via cosine similarity of vector embeddings. The same pipeline handles MCP tools.
User query → embed(query) → cosine_similarity(query, skills) → top-K → inject into prompt
This makes prompt size O(K) instead of O(N), where:
- K =
max_active_skills(default: 5, configurable) - N = total skills + MCP tools installed
Progressive Loading
Even selected skills don’t load everything at once:
| Stage | What loads | When | Token cost |
|---|---|---|---|
| Startup | Skill metadata (name, description) | Once | ~100 tokens per skill |
| Query | Skill body (instructions, examples) | On match | <5000 tokens per skill |
| Query | Resource files (references, scripts) | On match + OS filter | Variable |
Metadata is always in memory for matching. Bodies are loaded lazily via OnceLock and cached after first access. Resources are loaded on demand with OS filtering (e.g., linux.md only loads on Linux).
Two-Tier Skill Catalog
Non-matched skills are listed in a description-only <other_skills> catalog — giving the model awareness of all available capabilities without injecting their full bodies. This means the model can request a specific skill if needed, while consuming only ~20 tokens per unmatched skill instead of thousands.
MCP Tool Matching
MCP tools follow the same pipeline:
- Tools are embedded in Qdrant (
zeph_mcp_toolscollection) with BLAKE3 content-hash delta sync - Only re-embedded when tool definitions change
- Unified matching ranks both skills and MCP tools by relevance score
- Prompt contains only the top-K combined results
Practical Impact
| Scenario | Naive approach | Zeph |
|---|---|---|
| 10 skills, no MCP | ~50K tokens/prompt | ~25K tokens/prompt |
| 50 skills, 100 MCP tools | ~250K tokens/prompt | ~25K tokens/prompt |
| 200 skills, 500 MCP tools | ~1M tokens/prompt | ~25K tokens/prompt |
Prompt size stays constant as you add more capabilities. The only cost of more skills is a slightly larger embedding index in Qdrant or memory.
Output Filter Pipeline
Tool output is compressed before it enters the LLM context. A command-aware filter pipeline matches each shell command against a set of built-in filters (test runner output, Clippy diagnostics, git log/diff, directory listings, log deduplication) and strips noise while preserving signal. The pipeline runs synchronously inside the tool executor, so the LLM never sees raw output.
Typical savings by command type:
| Command | Raw lines | Filtered lines | Savings |
|---|---|---|---|
cargo test (100 passing, 2 failing) | ~340 | ~30 | ~91% |
cargo clippy (many warnings) | ~200 | ~50 | ~75% |
git log --oneline -50 | 50 | 20 | 60% |
After each filtered execution, CLI mode prints a one-line stats summary and TUI mode accumulates the savings in the Resources panel. See Tool System — Output Filter Pipeline for configuration details.
Token Savings Tracking
MetricsSnapshot tracks cumulative filter metrics across the session:
filter_raw_tokens/filter_saved_tokens— aggregate volume before and after filteringfilter_total_commands/filter_filtered_commands— hit rate denominator/numeratorfilter_confidence_full/partial/fallback— distribution of filter confidence levels
These feed into the TUI filter metrics display and are emitted as tracing::debug! every 50 commands.
Token Counting
TokenCounter (in zeph-memory) provides accurate BPE-based token counting using tiktoken-rs with the cl100k_base tokenizer — the same encoding used by GPT-4 and Claude-compatible APIs. This replaces the previous chars / 4 heuristic.
Key design decisions:
- DashMap cache (10K entry cap) provides amortized O(1) lookups for repeated text fragments (system prompts, skill bodies, tool schemas). Random eviction on overflow keeps memory bounded.
- Input size guard — inputs exceeding 64 KiB bypass BPE encoding and fall back to
chars / 4without caching. This prevents CPU amplification and cache pollution from pathologically large tool outputs. - Graceful fallback — if the tiktoken tokenizer fails to initialize (e.g., missing data files), all counting falls back to
chars / 4silently. - Tool schema counting —
count_tool_schema_tokens()implements the OpenAI function-calling token formula, accounting for per-function overhead, property keys, enum items, and nested object traversal. This enables accurate context budget allocation when tools are registered. - Shared instance — a single
Arc<TokenCounter>is constructed during bootstrap and shared acrossAgentandSemanticMemory, ensuring cache hits are maximized across subsystems.
The token_safety_margin config multiplier (default: 1.0) still applies on top of the counted value for conservative budgeting.
Tiered Context Compaction
Long conversations accumulate tool outputs that consume significant context space. Zeph uses a tiered compaction strategy. The soft tier (soft_compaction_threshold, default 0.70) batch-applies pre-computed tool pair summaries and prunes old tool outputs — both without an LLM call — preserving the message prefix for prompt cache hits. The hard tier (hard_compaction_threshold, default 0.90) first attempts the same lightweight steps, then falls back to adaptive chunked LLM compaction — splitting messages into ~4096-token chunks, summarizing up to 4 in parallel, and merging results.
When hard-tier LLM compaction itself hits a context length error, progressive middle-out tool response removal reduces the input at 10/20/50/100% tiers before retrying. If all LLM attempts fail, a metadata-only fallback produces a summary without any LLM call. LLM calls in the agent loop also reactively intercept context length errors — compacting and retrying up to 2 times before propagating the error. See Context Engineering for details.
Compaction Probe Validation
After hard-tier compaction produces a candidate summary, an optional compaction probe validates that critical facts survived compression. The probe generates factual questions from the original messages, answers them using only the summary, and scores the answers. Verdicts range from Pass (commit summary) through SoftFail (commit with warning) to HardFail (block compaction, preserve originals). See Context Engineering — Compaction Probe for configuration.
Structured Anchored Summarization
The anchored summarization path replaces free-form prose summaries with structured AnchoredSummary objects containing five sections: session intent, files modified, decisions made, open questions, and next steps. The structured format preserves actionable detail more reliably than prose, reducing the rate of compaction probe HardFail verdicts.
Subgoal-Aware Compaction
When task orchestration is active, the SubgoalRegistry prevents compaction from destroying context that active subgoals depend on. Messages within active subgoal ranges are preserved; completed subgoal ranges are aggressively compacted. This makes long multi-step orchestration sessions feasible within bounded context windows.
Message Dual-Visibility
Every Message carries a MessageMetadata struct with two boolean flags — agent_visible and user_visible — that control whether the message is included in the LLM context window, the UI history, or both. By default both flags are true.
Compaction leverages these flags via replace_conversation(): compacted originals are set to agent_visible=false, user_visible=true (preserved for the user to scroll through, hidden from the LLM), while the inserted summary is agent_visible=true, user_visible=false (injected into the LLM context, hidden from the user). This replaces the previous destructive compaction that deleted original messages.
Semantic recall and keyword search (FTS5) filter by agent_visible=1 so compacted messages never pollute retrieval results. History loading supports filtered queries via load_history_filtered(conversation_id, agent_visible, user_visible) for visibility-aware access.
Configuration
[skills]
max_active_skills = 5 # Increase for broader context, decrease for faster/cheaper queries
export ZEPH_SKILLS_MAX_ACTIVE=3 # Override via env var