Model Orchestrator
Tip: For simple fallback chains with adaptive routing (Thompson Sampling or EMA), use
routing = "cascade"orrouting = "thompson"in[llm]instead. See Adaptive Inference.
Note:
routing = "task"was removed as unimplemented in #3248. If your config uses it,--migrate-configwill drop it with a warning and fall back to default single-provider routing.
Use a multi-provider setup to combine local and cloud models — for example, embeddings via Ollama and chat via Claude. Provider selection is controlled via default = true and embed = true markers.
Configuration
[[llm.providers]]
name = "ollama"
type = "ollama"
model = "qwen3:8b"
embedding_model = "qwen3-embedding"
embed = true # use this provider for all embedding operations
[[llm.providers]]
name = "claude"
type = "claude"
model = "claude-sonnet-4-6"
max_tokens = 4096
default = true # default provider for chat
Provider Entry Fields
Each [[llm.providers]] entry supports:
| Field | Type | Description |
|---|---|---|
type | string | Provider backend: ollama, claude, openai, gemini, candle, compatible |
name | string? | Identifier for routing; required for type = "compatible" |
model | string? | Chat model |
base_url | string? | API endpoint (Ollama / Compatible) |
embedding_model | string? | Embedding model |
embed | bool | Mark as the embedding provider for skill matching and semantic memory |
default | bool | Mark as the primary chat provider |
filename | string? | GGUF filename (Candle only) |
device | string? | Compute device: cpu, metal, cuda (Candle only) |
Provider Selection
default = true— provider used for chat when no other routing rule matchesembed = true— provider used for all embedding operations (skill matching, semantic memory)
Capability Delegation
SubProvider and ModelOrchestrator fully delegate capability queries to the underlying provider:
context_window()— returns the actual context window size from the sub-provider. This is required for correctauto_budget, semantic recall sizing, and graph recall budget allocation when using the orchestrator.supports_vision()— returnstrueonly when the active sub-provider supports image inputs.supports_structured_output()— returns the sub-provider’s actual value.last_usage()andlast_cache_usage()— delegate to the last-used provider. Token metrics are accurate even when the orchestrator routes across multiple providers within a session.
Interactive Setup
Run zeph init and select Multi-provider as the LLM setup. The wizard prompts for:
- Primary provider — select from Ollama, Claude, OpenAI, or Compatible. Provide the model name, base URL, and API key as needed.
- Fallback provider — same selection. The fallback activates when the primary fails.
- Embedding model — used for skill matching and semantic memory.
The wizard generates a complete [[llm.providers]] section with named entries and embed/default markers.
Multi-Instance Example
Two Ollama servers on different ports — one for chat, one for embeddings:
[llm]
[[llm.providers]]
name = "ollama-chat"
type = "ollama"
base_url = "http://localhost:11434"
model = "mistral:7b"
default = true
[[llm.providers]]
name = "ollama-embed"
type = "ollama"
base_url = "http://localhost:11435" # second Ollama instance
embedding_model = "nomic-embed-text" # dedicated embedding model
embed = true
Orchestration-Tier Provider Routing
Sub-agent orchestration runs several internal LLM tasks that are distinct from user-facing reasoning:
- Scheduling and aggregation — combining multiple sub-agent outputs into a coherent result
- Predicate evaluation — deciding whether a task completed successfully (true/false classifiers)
- Task verification — double-checking a result before returning it to the user
These tasks can often be handled by smaller/faster models without impacting overall quality. The orchestrator_provider field routes all three through a single dedicated provider:
[[llm.providers]]
name = "fast"
type = "ollama"
model = "qwen3:1.7b"
[[llm.providers]]
name = "quality"
type = "claude"
model = "claude-sonnet-4-6"
default = true
[orchestration]
orchestrator_provider = "fast" # Use fast model for scheduling-tier LLM calls
planner_provider = "quality" # Use quality model for planning (stays on quality provider)
The resolution order is:
LlmAggregator(output synthesis) →orchestrator_provider→ primaryPlanVerifier(verification check) →verify_provider→orchestrator_provider→ primaryPredicateEvaluator(predicate logic) →predicate_provider→orchestrator_provider→ primary
When planner_provider is explicitly set, it is NOT overridden by orchestrator_provider. Planning is a complex task and always uses the quality provider.
Warning
Routing
LlmAggregatorthrough a cheap/fast model may reduce final output quality because aggregation produces user-visible text. Test thoroughly with your workload before relying on this optimization in production.
Admission Control and Concurrency Limits
To prevent provider overcommit when many sub-agents are running, set max_concurrent per provider. This limits the number of simultaneous in-flight orchestration calls to that provider:
[[llm.providers]]
name = "api"
type = "openai"
model = "gpt-4o"
max_concurrent = 10 # Allow up to 10 concurrent sub-agent API calls
[[llm.providers]]
name = "local"
type = "ollama"
model = "qwen3:8b"
max_concurrent = 4 # Ollama server has less capacity
The AdmissionGate enforces these limits at spawn time. When a provider reaches its limit, new tasks are deferred with exponential backoff until a previous task completes and frees a permit.
Currently the concurrency limit is enforced (tasks are delayed), but cost budgets are warn-only: when a task completes with token usage exceeding [orchestration] default_task_budget_cents, a warning is logged but the task is not rejected. Hard budget enforcement is deferred pending per-task CostTracker scoping.
SLM Provider Recommendations
Each Zeph subsystem that calls an LLM exposes a *_provider config field. Matching the model size to task complexity reduces cost and latency without sacrificing quality. The table below lists the recommended model tier for each subsystem:
| Subsystem | Config field | Recommended tier | Rationale |
|---|---|---|---|
| Skill matching | [skills] match_provider | Fast / SLM | Binary relevance signal; a 1.7B–8B model is sufficient |
| Tool-pair summarization | [llm] summary_model or [llm.summary_provider] | Fast / SLM | 1–2 sentence summaries; speed matters more than depth |
| Memory admission (A-MAC) | [memory.admission] admission_provider | Fast / SLM | Binary admit/reject decision; cheap models work well |
| MemScene consolidation | [memory.tiers] scene_provider | Fast / medium | Short scene summaries; medium model improves coherence |
| Compaction probe | [memory.compression.probe] model | Fast / medium | Question answering over a summary; Haiku-class is sufficient |
| Compress context (autonomous) | [memory.compression] compress_provider | Medium | Full compaction requires reasonable summarization quality |
| Complexity triage | [llm.complexity_routing] triage_provider | Fast / SLM | Single-word classification; any small model works |
| Graph entity extraction | [memory.graph] extract_provider | Fast / medium | NER + relation extraction; 8B models handle most cases |
| Session shutdown summary | [memory] summary_provider | Fast | Short session digest; latency is visible to the user |
| Orchestration planning | [orchestration] planner_provider | Quality / expert | Multi-step DAG planning requires high-capability models |
MCP tool discovery (Llm strategy) | [mcp.tool_discovery] | Fast / medium | Relevance ranking from a short list |
A typical cost-optimized setup uses a local Ollama model (e.g., qwen3:1.7b) for all fast-tier subsystems and a cloud model (e.g., claude-sonnet-4-6) for quality-tier tasks:
[[llm.providers]]
name = "fast"
type = "ollama"
model = "qwen3:1.7b"
embed = true
[[llm.providers]]
name = "quality"
type = "claude"
model = "claude-sonnet-4-6"
default = true
# Route cheap subsystems to the local model
[memory.admission]
admission_provider = "fast"
[memory.tiers]
scene_provider = "fast"
[memory.compression]
compress_provider = "fast"
[llm.complexity_routing]
triage_provider = "fast"
[orchestration]
planner_provider = "quality"
Hybrid Setup Example
Embeddings via free local Ollama, chat via paid Claude API:
[llm]
[[llm.providers]]
name = "ollama"
type = "ollama"
model = "qwen3:8b"
embedding_model = "qwen3-embedding"
embed = true
[[llm.providers]]
name = "claude"
type = "claude"
model = "claude-sonnet-4-6"
max_tokens = 4096
default = true