Context Budgets

Zeph manages how much of the LLM’s context window is used for each category of information. When context_budget_tokens is set, the available space is divided proportionally so that no single category dominates the prompt.

Budget Allocation

Category	Share	What it contains
Summaries	15%	Compressed conversation history from past compaction events
Semantic recall	25%	Relevant messages retrieved from past sessions via vector search
Recent history	60%	The most recent messages in the current conversation

These shares are taken from the space left after the system prompt, active skills, graph memory facts (4% when enabled), and tool schemas are accounted for.
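As a minimal sketch of the proportional split, the following assumes the three shares are applied to whatever token count is available to these categories. The function name `split_budget` and the rounding rule (leftover tokens go to recent history) are illustrative assumptions, not Zeph's implementation.

```python
# Shares from the allocation table, as integer percentages
# so the arithmetic stays exact.
SHARES_PERCENT = {
    "summaries": 15,
    "semantic_recall": 25,
    "recent_history": 60,
}


def split_budget(available_tokens: int) -> dict[str, int]:
    """Divide available_tokens across the three categories.

    Each share is rounded down; any leftover tokens are given to
    recent history (an assumption made for this sketch).
    """
    if available_tokens < 0:
        raise ValueError("available_tokens must be non-negative")
    alloc = {
        name: available_tokens * pct // 100
        for name, pct in SHARES_PERCENT.items()
    }
    alloc["recent_history"] += available_tokens - sum(alloc.values())
    return alloc
```

For example, `split_budget(100_000)` gives 15,000 tokens to summaries, 25,000 to semantic recall, and 60,000 to recent history.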

[agent]
context_budget_tokens = 128000   # 0 = auto-detect (default)

When left at 0, Zeph queries the provider for its context window size and uses that as the budget. If the provider does not report a context window (e.g., some local models), Zeph falls back to 128,000 tokens as a safe default. This fallback also applies during reload_config() to prevent unbounded memory growth. Set this value explicitly to override auto-detection (e.g., 128000 for a 200K-token model with margin for the response).
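The resolution order described above can be sketched as follows. The function `resolve_budget` and its parameters are hypothetical names used for illustration; only the `0 = auto-detect` convention and the 128,000-token fallback come from this page.

```python
from typing import Optional

# Fallback used when the provider does not report a context window.
FALLBACK_BUDGET_TOKENS = 128_000


def resolve_budget(configured: int, provider_window: Optional[int]) -> int:
    """Return the effective context budget in tokens.

    configured      -- value of context_budget_tokens (0 means auto-detect)
    provider_window -- context window reported by the provider, or None
    """
    if configured > 0:
        # An explicit setting overrides auto-detection.
        return configured
    if provider_window:
        # Auto-detect: use the provider's reported window.
        return provider_window
    # Provider reported nothing: fall back to the safe default.
    return FALLBACK_BUDGET_TOKENS
```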

BATS Budget Hints

Budget-Aware Token Steering (BATS) injects a hint into the system prompt that tells the LLM how much context space remains. This helps the model:

Produce appropriately-sized responses instead of exhausting the remaining budget
Decide whether to call a tool (which adds tokens) or answer from existing context
Choose concise tool arguments when budget is tight

BATS also implements a utility-based action policy that evaluates each turn against five action categories:

Action	When preferred
Respond	Enough context to answer directly
Search	Information gap detected, memory search likely to help
Tool-use	Task requires external action (shell, file, web)
Delegate	Sub-task is independent enough for a sub-agent
Wait	Ambiguous request, better to ask for clarification

The action with the highest expected utility given the current budget and conversation state is selected. This prevents the agent from making expensive tool calls when the budget is nearly exhausted.
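To make the selection rule concrete, here is a toy sketch of a budget-aware utility policy. The `Candidate` type, the `benefit` and `token_cost` estimates, and the scoring formula are all assumptions chosen for illustration; this page does not specify how Zeph computes expected utility.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    action: str       # e.g. "Respond", "Search", "Tool-use"
    benefit: float    # hypothetical usefulness estimate in [0, 1]
    token_cost: int   # hypothetical estimate of tokens the action adds


def choose_action(candidates: list[Candidate], remaining_tokens: int) -> str:
    """Pick the candidate with the highest utility for the remaining budget."""

    def utility(c: Candidate) -> float:
        if c.token_cost > remaining_tokens:
            # The action would not fit in the remaining budget.
            return float("-inf")
        # Penalize each action by the fraction of the budget it consumes.
        return c.benefit - c.token_cost / max(remaining_tokens, 1)

    return max(candidates, key=utility).action
```

With `Candidate("Respond", 0.5, 500)` and `Candidate("Tool-use", 0.9, 4000)`, this sketch prefers the tool call when 100,000 tokens remain and a direct response when only 5,000 remain.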

Skill Prompt Modes

When the context budget is tight, skill injection adapts automatically:

Mode	Behavior
`auto` (default)	Full skill bodies when budget allows, compact XML when tight
`compact`	Always use condensed format (~80% smaller)
`full`	Always inject full skill bodies

[skills]
prompt_mode = "auto"   # "auto", "compact", or "full"

In compact mode, only the skill name, description, and trigger phrases are included — the full body is omitted. This keeps skill matching functional even when the context window is nearly full.
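The mode resolution can be sketched as below. The three mode names come from the table above; the function `resolve_skill_format` and the rule used to decide that the budget is "tight" (the full skill bodies do not fit in the remaining tokens) are assumptions for illustration only.

```python
def resolve_skill_format(
    prompt_mode: str,
    remaining_tokens: int,
    full_skills_tokens: int,
) -> str:
    """Return "full" or "compact" for the current turn.

    prompt_mode        -- value of [skills].prompt_mode
    remaining_tokens   -- tokens still available for skills (hypothetical input)
    full_skills_tokens -- size of the full skill bodies (hypothetical input)
    """
    if prompt_mode in ("compact", "full"):
        # Fixed modes ignore the budget entirely.
        return prompt_mode
    if prompt_mode != "auto":
        raise ValueError(f"unknown prompt_mode: {prompt_mode!r}")
    # Assumed rule: use full bodies only if they fit.
    return "full" if full_skills_tokens <= remaining_tokens else "compact"
```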

Compaction Tiers

As message history grows toward the budget, Zeph applies two tiers of compression:

Soft compaction (at 70% of budget) — prunes old tool outputs and applies pre-computed deferred summaries. No LLM call needed.
Hard compaction (at 90% of budget) — runs chunked LLM-based summarization. Messages are split into ~4096-token chunks, summarized in parallel, then merged.

Both tiers use dual-visibility flags: original messages become hidden from the LLM but remain visible in the UI. Summaries are visible to the LLM but hidden from the UI.
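The dual-visibility idea can be illustrated with a small sketch. The `Message` type and its field names `visible_to_llm` and `visible_in_ui` are hypothetical; Zeph's actual data model may differ.

```python
from dataclasses import dataclass


@dataclass
class Message:
    text: str
    visible_to_llm: bool = True
    visible_in_ui: bool = True


def apply_summary(messages: list[Message], summary_text: str) -> list[Message]:
    """Hide the originals from the LLM and append an LLM-only summary."""
    for message in messages:
        message.visible_to_llm = False  # still shown in the UI
    summary = Message(summary_text, visible_to_llm=True, visible_in_ui=False)
    return messages + [summary]


def llm_view(messages: list[Message]) -> list[str]:
    return [m.text for m in messages if m.visible_to_llm]


def ui_view(messages: list[Message]) -> list[str]:
    return [m.text for m in messages if m.visible_in_ui]
```

After compaction, the LLM sees only the summary while the UI still shows the full original transcript.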

[memory]
soft_compaction_threshold = 0.70   # fraction of budget (default: 0.70)
hard_compaction_threshold = 0.90   # fraction of budget (default: 0.90)
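A minimal sketch of how the two thresholds might select a tier is shown below. The function name `compaction_tier` and the use of `>=` at each boundary are assumptions; the default values mirror the config above.

```python
def compaction_tier(
    used_tokens: int,
    budget_tokens: int,
    soft_threshold: float = 0.70,
    hard_threshold: float = 0.90,
) -> str:
    """Return "none", "soft", or "hard" for the current context usage."""
    if budget_tokens <= 0:
        raise ValueError("budget_tokens must be positive")
    usage = used_tokens / budget_tokens
    if usage >= hard_threshold:
        return "hard"   # LLM-based chunked summarization
    if usage >= soft_threshold:
        return "soft"   # prune tool outputs, apply deferred summaries
    return "none"
```

With a 128,000-token budget and the default thresholds, soft compaction would begin at 89,600 tokens and hard compaction at 115,200 tokens.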

Next Steps

Context Engineering — full compaction pipeline, proactive compression, and tuning
Memory and Context — how memory and context work together
Token Efficiency — how tokens are counted and optimized