SkillOrchestra: RL-Based Skill Routing
SkillOrchestra adds a reinforcement learning routing head on top of the standard BM25+cosine skill matcher. It learns from execution outcomes to adjust skill selection probabilities, preferring skills that succeed for a given query type over time.
How It Works
The standard skill matcher selects the top-K skills by semantic similarity. SkillOrchestra wraps this with a contextual bandit algorithm (LinUCB) that re-ranks candidates based on historical outcomes:
User query
|
v
BM25 + Cosine matcher --> top-K candidates
|
v
SkillOrchestra RL head --> re-ranked candidates
|
v
Top skill injected into prompt
After each skill execution, the outcome (success/failure) is fed back to the RL model as a reward signal. Over time, the model learns which skills work best for which types of queries, even when multiple skills have similar embeddings.
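To make the re-ranking and feedback loop concrete, here is a minimal sketch of textbook disjoint LinUCB in plain Python. It is a generic illustration of the algorithm named above, not Zeph's implementation; the names `LinUCBArm`, `rerank`, and `alpha` are hypothetical, and the feature vector `x` stands in for whatever query features the router uses.

```python
import math


class LinUCBArm:
    """Per-skill LinUCB state (hypothetical, illustrative only)."""

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha  # exploration strength (assumed parameter)
        # A_inv starts as the identity matrix (ridge prior); b starts at zero.
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _a_inv_x(self, x):
        return [sum(row[j] * x[j] for j in range(len(x))) for row in self.a_inv]

    def ucb(self, x):
        """Upper confidence bound: predicted reward plus exploration bonus."""
        a_inv_x = self._a_inv_x(x)
        # theta . x == b . (A_inv x) because A_inv is symmetric.
        mean = sum(bi * v for bi, v in zip(self.b, a_inv_x))
        variance = sum(xi * v for xi, v in zip(x, a_inv_x))
        return mean + self.alpha * math.sqrt(max(variance, 0.0))

    def update(self, x, reward):
        """Fold one observed outcome (1.0 success, 0.0 failure) into the state."""
        a_inv_x = self._a_inv_x(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, a_inv_x))
        n = len(x)
        # Sherman-Morrison rank-one update of A_inv for A += x x^T.
        for i in range(n):
            for j in range(n):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


def rerank(candidates, arms, x):
    """Order the top-K candidate skill names by their UCB score for features x."""
    return sorted(candidates, key=lambda name: arms[name].ucb(x), reverse=True)
```

In this sketch, a skill that keeps succeeding for a given feature vector rises in the ranking, while rarely tried skills keep a larger exploration bonus until enough outcomes have been observed.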
Cold Start
When SkillOrchestra has insufficient observations for a query type, it falls back to the standard BM25+cosine ranking. The transition from cold-start to RL-guided routing is gradual — the RL head’s confidence increases as observations accumulate, and its influence on the final ranking scales accordingly.
Configuration
[skills]
rl_routing_enabled = true # Enable RL-based skill routing (default: false)
SkillOrchestra requires [skills.learning] enabled = true to collect outcome data. Without the learning system, there are no reward signals to train on.
RL Routing Configuration
The SkillOrchestra routing head is a linear layer that takes a query embedding as input and produces a score for each skill candidate. Scores are blended with cosine similarity via rl_weight. Weights are updated via REINFORCE after each observed outcome and persisted to SQLite every rl_persist_interval updates.
Thompson Sampling / RL update cycle:
- At match time, cosine similarity candidates are re-ranked using the linear head’s predicted scores.
- The blend formula is: final_score = (1 - rl_weight) * cosine + rl_weight * rl_score.
- After execution, the outcome (success = 1.0, failure = 0.0) is used as the REINFORCE reward to update the head weights.
- For the first rl_warmup_updates weight updates, the RL score is not blended: the routing head observes outcomes but does not influence selection. This prevents cold-start bias.
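The cycle above can be sketched in a few lines of Python. This is a sketch under stated assumptions, not Zeph's code: the class `RoutingHead` and its methods are hypothetical, the policy is assumed to be a softmax over the linear head's outputs for the current candidates, and `rl_score` is assumed to be that softmax probability. The update is plain REINFORCE without a baseline, so a failure (reward 0.0) leaves the weights unchanged.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class RoutingHead:
    """Hypothetical linear routing head: one weight row per skill."""

    def __init__(self, dim, rl_learning_rate=0.01, rl_weight=0.3, rl_warmup_updates=50):
        self.dim = dim
        self.rl_learning_rate = rl_learning_rate
        self.rl_weight = rl_weight
        self.rl_warmup_updates = rl_warmup_updates
        self.weights = {}  # skill name -> weight row of length dim
        self.updates = 0   # number of weight updates applied so far

    def _row(self, skill):
        return self.weights.setdefault(skill, [0.0] * self.dim)

    def policy(self, query, candidates):
        """Softmax over linear scores (assumed form of rl_score)."""
        logits = [sum(w * q for w, q in zip(self._row(s), query)) for s in candidates]
        return dict(zip(candidates, softmax(logits)))

    def rank(self, query, cosine):
        """cosine maps skill name -> similarity; returns names, best first."""
        candidates = list(cosine)
        if self.updates < self.rl_warmup_updates:
            final = dict(cosine)  # warmup: cosine-only ranking
        else:
            rl = self.policy(query, candidates)
            final = {
                s: (1 - self.rl_weight) * cosine[s] + self.rl_weight * rl[s]
                for s in candidates
            }
        return sorted(candidates, key=final.get, reverse=True)

    def update(self, query, candidates, chosen, reward):
        """REINFORCE step: reward * gradient of log pi(chosen | query)."""
        probs = self.policy(query, candidates)
        for s in candidates:
            indicator = 1.0 if s == chosen else 0.0
            step = self.rl_learning_rate * reward * (indicator - probs[s])
            self.weights[s] = [w + step * q for w, q in zip(self._row(s), query)]
        self.updates += 1
```

During warmup the head still calls `update`, so by the time blending starts its weights already reflect observed outcomes.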
Enable RL routing only after the agent has accumulated at least 50 turns of skill usage so the warmup phase completes quickly and the head has enough signal to learn meaningful routing patterns.
[skills]
rl_routing_enabled = true # Enable RL routing head (default: false)
rl_learning_rate = 0.01 # REINFORCE weight update step size (default: 0.01)
rl_weight = 0.3 # Blend: (1-rl_weight)*cosine + rl_weight*rl_score (default: 0.3)
rl_persist_interval = 10 # Persist weights every N updates; 0 = every update (default: 10)
rl_warmup_updates = 50 # Updates before RL score influences ranking (default: 50)
rl_embed_dim = 768 # Must match embedding provider output dim; None → 1536 (default: null)
Important
rl_embed_dim must match the vector dimension produced by your embedding provider. A mismatch causes a dim mismatch error at startup, and the routing head falls back to cosine-only ranking. For Ollama providers using nomic-embed-text or similar 768-dim models, set rl_embed_dim = 768. For OpenAI text-embedding-3-small, set rl_embed_dim = 1536.
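The dimension rule can be expressed as a small check. The sketch below only mirrors the behaviour described in this note; the function names `resolve_embed_dim` and `routing_mode` are hypothetical, and the real startup check may differ in detail.

```python
DEFAULT_EMBED_DIM = 1536  # value used when rl_embed_dim is unset, per the config comment


def resolve_embed_dim(rl_embed_dim):
    """Return the dimension the routing head expects."""
    return DEFAULT_EMBED_DIM if rl_embed_dim is None else rl_embed_dim


def routing_mode(rl_embed_dim, provider_dim):
    """Return "rl" when dimensions agree, otherwise "cosine-only" (the fallback)."""
    if resolve_embed_dim(rl_embed_dim) != provider_dim:
        return "cosine-only"
    return "rl"
```

For example, leaving rl_embed_dim unset while using a 768-dim model would land in the cosine-only fallback in this sketch, because the head would expect 1536 dimensions.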
When to Enable
Enable SkillOrchestra when:
- You have 10+ skills with overlapping descriptions that confuse the cosine matcher
- Skills with similar embeddings have different success rates for different query types
- You run Zeph over extended periods and want skill selection to improve automatically
Do not enable it for small skill sets (<5 skills) or short-lived sessions where the RL model cannot accumulate enough observations.
Interaction with Other Systems
- D2Skill: D2Skill corrects individual steps within a skill; SkillOrchestra selects which skill to use in the first place. They operate at different levels and complement each other.
- Wilson Score: Wilson scores measure per-skill reliability. SkillOrchestra uses them as a feature in the bandit model alongside query-skill similarity and historical outcome patterns.
- Hybrid Search: SkillOrchestra operates after BM25+cosine fusion. It does not replace hybrid search — it re-ranks its output.
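For readers unfamiliar with Wilson scores, the sketch below computes the standard lower bound of the Wilson score interval from a skill's success count. It assumes the lower bound is the value used as the reliability feature and a 95% confidence level (z = 1.96); neither detail is confirmed here, and the function name is hypothetical.

```python
import math


def wilson_lower_bound(successes, total, z=1.96):
    """Lower bound of the Wilson score interval for a success rate."""
    if total == 0:
        return 0.0  # no observations yet: assume no evidence of reliability
    p = successes / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom
```

The bound rewards evidence as well as success rate: a skill with 90 successes out of 100 scores higher than one with 9 out of 10, even though both have a 90% raw success rate.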
Monitoring
Use /skill stats to see RL routing metrics alongside Wilson scores:
/skill stats
The output includes the RL exploration rate and per-skill selection counts when SkillOrchestra is active.
Next Steps
- Self-Learning Skills — the full learning pipeline
- Skills — how skill matching works
- Enable Self-Learning Skills — setup guide