Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Why Zeph?

Token Efficiency

Most agent frameworks inject all available tools and instructions into every prompt. Zeph takes a different approach at every layer:

  • Skill selection — only the top-K most relevant skills per query (default: 5) are loaded via embedding similarity. With 50 skills installed, a typical prompt contains ~2,500 tokens of skill context instead of ~50,000. Progressive loading fetches metadata first (~100 tokens each), full body on activation, and resource files on demand.
  • Tool schema filtering — tool definitions are filtered per-turn based on semantic relevance to the current task, removing irrelevant schemas from the context window entirely.
  • TAFC (Think-Augmented Function Calling) — for complex tools, the model reasons about parameter values before committing, reducing error-driven retries that waste tokens.
  • Tool result caching — deterministic tool results are cached within the session, eliminating redundant executions and their token overhead.
  • Semantic response caching — LLM responses are cached by embedding similarity, so semantically equivalent queries reuse previous answers without an API call.

Prompt size is O(K), not O(N) — and every layer actively works to keep it there.

Intelligent Context Management

Long conversations are the norm, not an edge case. Zeph manages context pressure automatically:

  • Structured anchored summarization — summaries follow a typed schema with mandatory sections (goal, files modified, decisions, open questions, next steps), preventing the compressor from silently dropping critical facts.
  • Compaction probe validation — after every summarization, a Q&A probe verifies that key facts survived compression. If the probe fails, the agent falls back to keeping original turns.
  • Subgoal-aware compaction (HiAgent) — during multi-step tasks, the agent tracks the current subgoal and only compresses information that is no longer relevant to it, preserving active working memory.
  • Write-time importance scoring — memory entries receive an importance score at write time based on content markers, information density, and role, so frequently-referenced and explicitly important memories surface higher during retrieval.

Graph Memory

Beyond flat vector search, Zeph builds a structured knowledge graph from conversations:

  • MAGMA typed edges — relationships between entities are classified into five types (Causal, Temporal, Semantic, CoOccurrence, Hierarchical), enabling type-filtered traversal.
  • SYNAPSE spreading activation — retrieval activates a seed entity and propagates through the graph with hop-by-hop decay and lateral inhibition, surfacing multi-hop connections that flat similarity search misses.
  • Community detection — label propagation identifies entity clusters, providing topic-level context for retrieval.

Ask “why did we choose Kafka?” and Zeph follows causal edges from Kafka through the decision graph to surface the original rationale — not just documents that mention the word.

Hybrid Inference

Mix local and cloud models in a single setup. Run embeddings through free local Ollama while routing chat to Claude or OpenAI. The orchestrator classifies tasks and routes them to the best provider with automatic fallback chains — if the primary provider fails, the next one takes over. Thompson Sampling exploration balances cost and quality across providers. Switch providers with a single config change. Any OpenAI-compatible endpoint works out of the box (Together AI, Groq, Fireworks, and others).

Skills-First Architecture

Skills are plain markdown files — easy to write, version control, and share. Zeph matches skills by embedding similarity, not keywords, so “check disk space” finds the system-info skill even without exact keyword overlap. Edit a SKILL.md file and changes apply immediately via hot-reload, no restart required.

Skills evolve autonomously: when the agent detects repeated failures via the multi-language FeedbackDetector (supporting 7 languages), it reflects on the cause and generates improved skill versions. Wilson score re-ranking ensures that well-performing skills surface first.

Task Orchestration

For complex goals, Zeph decomposes work into a task DAG and executes it with parallel scheduling:

  • Plan template caching — successful plans are cached by goal embedding, so similar future requests reuse an adapted template instead of replanning from scratch (50% cost reduction, 27% latency improvement).
  • Tool dependency graph — tools declare ordering constraints (requires for hard gates, prefers for soft boosts), enabling the agent to present tools in the right sequence without hardcoded execution order.

Privacy and Security

Run fully local with Ollama — no API calls, no data leaves your machine. Store API keys in an age-encrypted vault instead of plaintext environment variables. Tools are sandboxed: configure allowed directories, block network access from shell commands, require confirmation for destructive operations like rm or git push --force. Imported skills start in quarantine with restricted tool access until explicitly trusted. Content from untrusted sources (web scraping, tool output, MCP servers) is sanitized through a multi-layer isolation pipeline before reaching the agent.

Multi-Channel

Deploy Zeph across CLI, TUI dashboard, Telegram, Discord, and Slack with consistent feature parity across all channels. The TUI provides real-time metrics, a command palette, and live status indicators for background operations. All 7 channels support the same 16-method Channel trait — no feature is silently missing in any mode.

Lightweight and Fast

Zeph compiles to a single Rust binary (~12 MB). No Python runtime, no Node.js, no JVM dependency. Native async throughout with no garbage collector overhead. Builds and runs on Linux, macOS, and Windows across x86_64 and ARM64 architectures.