Semantic Memory

Enable semantic search to retrieve contextually relevant messages from conversation history using vector similarity.

Requires an embedding model. Ollama with qwen3-embedding is the default. Claude API does not support embeddings natively — use the orchestrator to route embeddings through Ollama while using Claude for chat.

Vector Backend

Zeph supports two vector backends for storing embeddings:

| Backend | Best for | External dependencies |
|---|---|---|
| `qdrant` (default) | Production, multi-user, large datasets | Qdrant server |
| `sqlite` | Development, single-user, offline, quick setup | None |

The sqlite backend stores vectors in the same SQLite database as conversation history and performs cosine similarity search in-process. It requires no external services, making it ideal for local development and single-user deployments.
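As an illustrative sketch (not Zeph's actual implementation), an in-process cosine-similarity scan over stored vectors might look like this:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query: list[float], points: dict[str, list[float]], limit: int = 5):
    """Brute-force scan of every stored vector, best matches first."""
    scored = [(pid, cosine_similarity(query, vec)) for pid, vec in points.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:limit]

# Three stored embeddings; the query is closest to "a", then "c".
points = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
top = search([1.0, 0.1], points, limit=2)
```

A linear scan like this is O(n) per query, which is fine for single-user datasets but is why larger deployments benefit from Qdrant's indexed search.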

Setup with SQLite Backend (Quickstart)

No external services needed:

[memory]
vector_backend = "sqlite"

[memory.semantic]
enabled = true
recall_limit = 5

The vector tables are created automatically via migration 011_vector_store.sql.

Setup with Qdrant Backend

  1. Start Qdrant:

    docker compose up -d qdrant
    
  2. Enable semantic memory in config:

    [memory]
    vector_backend = "qdrant"  # default, can be omitted
    
    [memory.semantic]
    enabled = true
    recall_limit = 5
    
  3. Automatic setup: the Qdrant collection (zeph_conversations) is created on first use with the correct vector dimensions (1024 for qwen3-embedding) and the Cosine distance metric. No manual initialization is required.

How It Works

  • Hybrid search: Recall uses both Qdrant vector similarity and SQLite FTS5 keyword search, merging results with configurable weights. This improves recall quality, especially for exact term matches.
  • Automatic embedding: Messages are embedded asynchronously using the configured embedding_model and stored in Qdrant alongside SQLite.
  • FTS5 index: All messages are automatically indexed in an SQLite FTS5 virtual table via triggers, enabling BM25-ranked keyword search with zero configuration.
  • Graceful degradation: If Qdrant is unavailable, Zeph falls back to FTS5-only keyword search instead of returning empty results.
  • Startup backfill: On startup, if Qdrant is available, Zeph calls embed_missing() to backfill embeddings for any messages stored while Qdrant was offline.

Hybrid Search Weights

Configure the balance between vector (semantic) and keyword (BM25) search:

[memory.semantic]
enabled = true
recall_limit = 5
vector_weight = 0.7   # Weight for Qdrant vector similarity
keyword_weight = 0.3  # Weight for FTS5 keyword relevance

When Qdrant is unavailable, only keyword search runs (effectively keyword_weight = 1.0).
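A minimal sketch of how such a weighted merge could work, assuming both sources return scores normalized to [0, 1] (illustrative only, not Zeph's actual merge logic):

```python
def merge_scores(vector_hits, keyword_hits, vector_weight=0.7, keyword_weight=0.3):
    """Blend per-message scores; a message missing from one source
    contributes 0.0 for that component."""
    ids = set(vector_hits) | set(keyword_hits)
    combined = {
        mid: vector_weight * vector_hits.get(mid, 0.0)
        + keyword_weight * keyword_hits.get(mid, 0.0)
        for mid in ids
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

vec = {"m1": 0.9, "m2": 0.4}    # normalized vector similarities
kw = {"m2": 1.0, "m3": 0.8}     # normalized BM25 relevance
ranked = merge_scores(vec, kw)  # m1 (0.63), m2 (0.58), m3 (0.24)
```

Note how m1, found only by vector search, still outranks m2, which both sources returned, because of the 0.7/0.3 weighting.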

Temporal Decay

Enable time-based score attenuation to prefer recent context over stale information:

[memory.semantic]
temporal_decay_enabled = true
temporal_decay_half_life_days = 30  # Score halves every 30 days

Scores decay exponentially: at 1 half-life a message retains 50% of its original score, at 2 half-lives 25%, and so on. Adjust temporal_decay_half_life_days based on how quickly your project context changes.
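The decay described above reduces to a one-line formula (sketch, assuming age is measured in days):

```python
def decayed_score(score: float, age_days: float, half_life_days: float = 30.0) -> float:
    """Exponential decay: the score halves every half_life_days."""
    return score * 0.5 ** (age_days / half_life_days)

# A 30-day-old message keeps half its score; a 60-day-old one a quarter.
month_old = decayed_score(1.0, age_days=30)   # 0.5
two_months = decayed_score(1.0, age_days=60)  # 0.25
```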

MMR Re-ranking

Enable Maximal Marginal Relevance to diversify recall results and reduce redundancy:

[memory.semantic]
mmr_enabled = true
mmr_lambda = 0.7  # 0.0 = max diversity, 1.0 = pure relevance

MMR iteratively selects results that are both relevant to the query and dissimilar to already-selected items. The default mmr_lambda = 0.7 works well for most use cases. Lower it if you see too many semantically similar results in recall.
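The greedy selection can be sketched as follows (illustrative implementation of the standard MMR formula, not Zeph's actual code; similarities are assumed normalized):

```python
def mmr(query_sim, pairwise_sim, k, lam=0.7):
    """Greedy MMR: each step picks the candidate that maximizes
    lam * relevance - (1 - lam) * (max similarity to items already picked)."""
    candidates = set(query_sim)
    selected = []
    while candidates and len(selected) < k:
        best = max(
            candidates,
            key=lambda c: lam * query_sim[c]
            - (1 - lam)
            * max((pairwise_sim[frozenset((c, s))] for s in selected), default=0.0),
        )
        selected.append(best)
        candidates.remove(best)
    return selected

# "a" and "b" are near-duplicates; MMR trades "b" for the more diverse "c".
query_sim = {"a": 0.9, "b": 0.85, "c": 0.5}
pairwise = {
    frozenset(("a", "b")): 0.95,
    frozenset(("a", "c")): 0.1,
    frozenset(("b", "c")): 0.2,
}
picked = mmr(query_sim, pairwise, k=2, lam=0.7)  # ["a", "c"]
```

With `lam = 1.0` the same input yields `["a", "b"]` (pure relevance), which is exactly the redundancy MMR is meant to avoid.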

Autosave Assistant Responses

By default, only user messages are embedded. Enable autosave_assistant to also embed assistant responses for richer semantic recall:

[memory]
autosave_assistant = true
autosave_min_length = 20  # Skip embedding for very short replies

Short responses (below autosave_min_length bytes) are still saved to SQLite but skip the embedding step. User messages always generate embeddings regardless of this setting.
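The decision logic described above amounts to a small predicate (sketch under the stated rules; the function name is hypothetical):

```python
def should_embed(role: str, text: str,
                 autosave_assistant: bool = True,
                 autosave_min_length: int = 20) -> bool:
    """User messages always embed; assistant replies only when autosave
    is enabled and the reply is at least autosave_min_length bytes."""
    if role == "user":
        return True
    if role != "assistant" or not autosave_assistant:
        return False
    return len(text.encode("utf-8")) >= autosave_min_length
```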

Memory Export and Import

Back up or migrate conversation data with portable JSON snapshots:

zeph memory export conversations.json
zeph memory import conversations.json

See CLI Reference — zeph memory for details.

Semantic Response Caching

Complement exact-match response caching with embedding-based similarity matching:

[llm]
response_cache_enabled = true
semantic_cache_enabled = true          # Enable semantic cache (default: false)
semantic_cache_threshold = 0.95        # Cosine similarity for cache hit (default: 0.95)
semantic_cache_max_candidates = 10     # Max entries examined per lookup (default: 10)

Lower the threshold (e.g., 0.92) for more cache hits with slightly less precise matching. Increase semantic_cache_max_candidates for better recall at the cost of lookup latency.
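Conceptually, a semantic cache lookup compares the query embedding against cached prompt embeddings and returns the best match only if it clears the threshold. A minimal sketch (assumed data layout; not Zeph's actual cache):

```python
import math

def semantic_cache_lookup(query_vec, cache, threshold=0.95, max_candidates=10):
    """Return the cached response whose prompt embedding is most similar
    to the query, if that similarity clears the threshold."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    best, best_sim = None, 0.0
    for vec, response in cache[:max_candidates]:  # bound lookup latency
        sim = cos(query_vec, vec)
        if sim > best_sim:
            best, best_sim = response, sim
    return best if best_sim >= threshold else None

cache = [([1.0, 0.0], "cached answer"), ([0.0, 1.0], "other")]
hit = semantic_cache_lookup([0.99, 0.01], cache)   # "cached answer"
miss = semantic_cache_lookup([0.5, 0.5], cache)    # None: best sim ~0.71 < 0.95
```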

Write-Time Importance Scoring

Score messages by decision-relevance at write time to improve recall quality:

[memory.semantic]
importance_enabled = true         # Enable importance scoring (default: false)
importance_weight = 0.15          # Blend weight in recall ranking (default: 0.15)

Messages with high importance scores (architectural decisions, key constraints, user preferences) receive a recall boost proportional to importance_weight. The score is computed by an LLM classifier at message persist time and stored in the importance_score column (migration 039).
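One plausible form of the boost, assuming both scores are normalized to [0, 1] (the exact blend Zeph uses is not specified here):

```python
def boosted_score(recall_score: float, importance_score: float,
                  importance_weight: float = 0.15) -> float:
    """Additive boost proportional to importance_weight."""
    return recall_score + importance_weight * importance_score

# An architectural-decision message (importance 1.0) gets a 0.15 boost.
decision = boosted_score(0.8, 1.0)   # 0.95
chitchat = boosted_score(0.8, 0.0)   # 0.80
```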

SleepGate: Automatic Forgetting

Over time, the vector index accumulates stale embeddings. Enable SleepGate to periodically remove low-value entries:

[memory.forgetting]
enabled = true
interval_secs = 86400          # Run every 24 hours (default)
retention_threshold = 0.30     # Score below which entries are forgotten (default: 0.30)

SleepGate scores entries on recency, access frequency, and semantic density. A built-in compression predictor preserves load-bearing entries even if their retention score is low.

Forgotten entries are soft-deleted — removed from the vector index but retained in SQLite for potential restoration.
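In outline, the forgetting decision combines the component scores and spares load-bearing entries. A sketch under assumed semantics (equal weighting and [0, 1] components are assumptions, not SleepGate's documented formula):

```python
def should_forget(recency: float, access_freq: float, density: float,
                  threshold: float = 0.30, load_bearing: bool = False) -> bool:
    """Average the component scores; load-bearing entries are always kept."""
    retention = (recency + access_freq + density) / 3.0
    return retention < threshold and not load_bearing

# A stale, rarely-accessed entry is forgotten unless the compression
# predictor marked it load-bearing.
stale = should_forget(0.1, 0.1, 0.1)                      # True
kept = should_forget(0.1, 0.1, 0.1, load_bearing=True)    # False
```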

See SleepGate for tuning guidelines and interaction with other memory features.

Storage Architecture

| Store | Purpose |
|---|---|
| SQLite | Source of truth for message text, conversations, summaries, skill usage |
| Qdrant or SQLite vectors | Vector index for semantic similarity search (embeddings only) |

Both stores work together: SQLite holds the data, the vector backend enables similarity search over it. With the Qdrant backend, the embeddings_metadata table in SQLite maps message IDs to Qdrant point IDs. With the SQLite backend, vectors are stored directly in vector_points and vector_point_payloads tables.

The messages table includes agent_visible, user_visible, and compacted_at columns (migration 013_message_metadata.sql) plus an index on conversation_id. Semantic recall and FTS5 keyword search filter by agent_visible=1, ensuring compacted messages are excluded from retrieval results.
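In simplified form (hypothetical minimal schema, not Zeph's actual migrations), the visibility filter amounts to a WHERE clause on every retrieval query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE messages (
        id INTEGER PRIMARY KEY,
        conversation_id INTEGER,
        content TEXT,
        agent_visible INTEGER DEFAULT 1
    )
""")
conn.executemany(
    "INSERT INTO messages (conversation_id, content, agent_visible) VALUES (?, ?, ?)",
    [(1, "keep me", 1), (1, "compacted away", 0)],
)
# Compacted messages (agent_visible = 0) never reach recall results.
rows = conn.execute(
    "SELECT content FROM messages WHERE agent_visible = 1"
).fetchall()
```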