
# zeph-experiments

Autonomous experiment engine for adaptive agent behavior testing and hyperparameter tuning.

Extracted from `zeph-core` in epic #1973 (Phase 1d). Gated behind the `experiments` feature flag.

## Purpose

zeph-experiments implements a closed-loop system that automatically tests agent behavior variations and selects configurations that maximize LLM-judged quality. It is used by the agent’s self-improvement loop to discover better hyperparameters (temperature, context budget, skill prompt mode, etc.) without human intervention.

The engine operates on a search space of discrete and continuous parameter ranges. It explores the space using three strategies: grid search, random sampling, and neighborhood (hill-climbing). For each variation it runs a set of benchmark cases, scores them with an LLM judge, and persists the results.
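The closed loop described above can be sketched as follows. This is an illustrative outline, not the crate's actual code: `Variation`, `judge_score`, and `run_session` are hypothetical stand-ins, and the judge is replaced by a deterministic scoring function for the sake of a runnable example.

```rust
// Hypothetical sketch of the closed evaluation loop: generate candidate
// variations, score each one, and keep the best-scoring configuration.

#[derive(Clone)]
struct Variation {
    temperature: f64,
}

// Stand-in for the LLM judge: a deterministic function that peaks at 0.7.
fn judge_score(v: &Variation) -> f64 {
    1.0 - (v.temperature - 0.7).abs()
}

// Evaluate every candidate and return the best variation with its score.
fn run_session(candidates: &[Variation]) -> (Variation, f64) {
    candidates
        .iter()
        .map(|v| (v.clone(), judge_score(v)))
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .expect("non-empty candidate set")
}

fn main() {
    let candidates: Vec<Variation> = [0.1, 0.4, 0.7, 1.0]
        .iter()
        .map(|&t| Variation { temperature: t })
        .collect();
    let (best, score) = run_session(&candidates);
    println!("best temperature = {}, score = {}", best.temperature, score);
}
```

In the real engine the scoring step is an LLM call and the results are persisted per run, but the select-best-and-report shape is the same.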

## Key Types

| Type | Description |
|------|-------------|
| `ExperimentEngine` | Top-level orchestrator: runs a full experiment session, writes snapshots, returns a report |
| `ExperimentSessionReport` | Session summary: best variation found, score delta, number of cases run |
| `SearchSpace` | Defines the hyperparameter ranges to explore (`ParameterRange` per parameter) |
| `ParameterRange` | Single dimension: `Float(min, max, step)` or `Enum(Vec<String>)` |
| `VariationGenerator` | Trait implemented by `GridStep`, `Random`, `Neighborhood`; produces candidate variations |
| `GridStep` | Systematic grid traversal over the search space |
| `Random` | Random sampling using a `SmallRng` for reproducible runs |
| `Neighborhood` | Hill-climbing: perturb the current best by one step in each dimension |
| `Evaluator` | Runs benchmark cases against the agent using a variation's config, scores with `JudgeOutput` |
| `BenchmarkSet` | Collection of `BenchmarkCase` entries: prompt + expected behavior description |
| `BenchmarkCase` | Single test: input prompt and a human-readable quality criterion |
| `EvalReport` | Aggregated scores across all cases for a single variation |
| `CaseScore` | Per-case score (0.0–1.0) with judge rationale |
| `ConfigSnapshot` | Serializable snapshot of the current agent config used as the experiment baseline |
| `GenerationOverrides` | Delta overrides applied on top of `ConfigSnapshot` for a variation |
| `ExperimentResult` | Persisted result record: variation, score, timestamp, session ID |
| `EvalError` | Typed error enum for evaluation failures |
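To make the `ParameterRange` shapes concrete, here is a minimal sketch of how a `Float(min, max, step)` dimension might be materialized into discrete values. The enum layout and the `values` helper are assumptions for illustration; the crate's actual definitions may differ.

```rust
// Illustrative shapes for a search-space dimension.
#[derive(Clone, Debug)]
enum ParameterRange {
    Float { min: f64, max: f64, step: f64 },
    Enum(Vec<String>),
}

impl ParameterRange {
    /// Materialize the discrete values this dimension can take.
    fn values(&self) -> Vec<String> {
        match self {
            ParameterRange::Float { min, max, step } => {
                let mut out = Vec::new();
                let mut v = *min;
                // Small epsilon guards against floating-point drift at the upper bound.
                while v <= *max + 1e-9 {
                    out.push(format!("{v:.2}"));
                    v += step;
                }
                out
            }
            ParameterRange::Enum(options) => options.clone(),
        }
    }
}

fn main() {
    let temperature = ParameterRange::Float { min: 0.2, max: 0.8, step: 0.3 };
    println!("{:?}", temperature.values()); // ["0.20", "0.50", "0.80"]
}
```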

## Search Strategies

### Grid Search (`GridStep`)

Exhaustively iterates over the Cartesian product of all parameter ranges. Suitable for small search spaces (e.g., 3 temperature values × 2 skill modes = 6 candidates).
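The 3 × 2 example from the text amounts to a Cartesian product over two discrete dimensions, which can be sketched as a pair of nested loops (`build_grid` is a hypothetical helper, not the crate's API):

```rust
/// Cartesian product of two discrete dimensions (illustrative helper).
fn build_grid<'a>(temps: &[f64], modes: &'a [&'a str]) -> Vec<(f64, &'a str)> {
    let mut grid = Vec::new();
    for &t in temps {
        for &m in modes {
            grid.push((t, m));
        }
    }
    grid
}

fn main() {
    // The example from the text: 3 temperature values × 2 skill modes = 6 candidates.
    let grid = build_grid(&[0.2, 0.5, 0.8], &["full", "compact"]);
    assert_eq!(grid.len(), 6);
    println!("{grid:?}");
}
```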

### Random Sampling (`Random`)

Samples parameter combinations uniformly at random. Efficient for large search spaces where exhaustive search is too slow.
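What makes seeded random sampling reproducible is simply that the same seed yields the same candidate sequence. The sketch below uses a tiny linear congruential generator as a dependency-free stand-in for `SmallRng`; the real crate uses `rand`'s `SmallRng`.

```rust
// Minimal LCG as a stand-in for SmallRng, seeded for reproducibility.
struct Lcg(u64);

impl Lcg {
    fn next_f64(&mut self) -> f64 {
        // Knuth's MMIX LCG constants; top 53 bits mapped into [0, 1).
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Sample one value uniformly from [min, max).
fn sample(rng: &mut Lcg, min: f64, max: f64) -> f64 {
    min + (max - min) * rng.next_f64()
}

fn main() {
    // Same seed → same candidate sequence, which is what makes runs reproducible.
    let mut a = Lcg(42);
    let mut b = Lcg(42);
    for _ in 0..3 {
        assert_eq!(sample(&mut a, 0.0, 1.0), sample(&mut b, 0.0, 1.0));
    }
    println!("ok");
}
```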

### Neighborhood / Hill-Climbing (`Neighborhood`)

Starts from the current best variation and generates all single-parameter perturbations. Runs those candidates, adopts the best as the new starting point, and repeats. Converges quickly but may find local optima.
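The perturb-adopt-repeat loop can be sketched in one dimension. This is an illustrative reduction (a single `f64` parameter, a deterministic objective in place of the LLM judge), not the crate's implementation:

```rust
// Stand-in objective for the judge score; peaks at 0.7.
fn score(x: f64) -> f64 {
    -(x - 0.7) * (x - 0.7)
}

/// 1-D hill climbing: step toward the better neighbor until neither
/// single-step perturbation improves the score (a local optimum).
fn hill_climb(start: f64, step: f64) -> f64 {
    let mut best = start;
    loop {
        // Generate both single-step neighbors and keep the better scorer.
        let candidate = if score(best + step) >= score(best - step) {
            best + step
        } else {
            best - step
        };
        if score(candidate) <= score(best) {
            return best; // no neighbor improves: converged
        }
        best = candidate;
    }
}

fn main() {
    println!("converged at {:.1}", hill_climb(0.0, 0.1));
}
```

Note the trade-off the text mentions: this loop climbs quickly, but on a multi-modal objective it would stop at whichever peak is nearest the starting point.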

## Feature Flag

All modules in `zeph-experiments` are gated behind `#[cfg(feature = "experiments")]`. The crate compiles to an empty library when the feature is off.

To enable:

```toml
# root Cargo.toml (or workspace member)
[features]
experiments = ["zeph-experiments/experiments"]
```

Or build with the `full` or `experiments` feature:

```sh
cargo build --features experiments
```

## Integration with zeph-core

When the `experiments` feature is enabled, `zeph-core` constructs an `ExperimentEngine` from `ExperimentConfig` during `AppBuilder::build()`. The engine is scheduled via `zeph-scheduler` for periodic automated runs (when both the `experiments` and `scheduler` features are active).

```toml
# config.toml
[experiments]
enabled = true
schedule = "0 3 * * *"   # run at 03:00 every night
cases_per_run = 10
```

The agent exposes `/experiments` TUI commands to manually trigger runs and inspect results.

## Benchmark Dataset

`BenchmarkSet` is loaded from TOML files in the skills directory or defined inline in the config. Each case contains a prompt and a quality-criterion string that the LLM judge uses to score the agent's response.

```toml
# Example benchmark case
[[experiments.cases]]
prompt = "Summarize the last three git commits in one sentence."
criterion = "The summary must mention commit count and be a single sentence."
```

## LLM-as-Judge

The `Evaluator` sends each (prompt, response) pair to an LLM along with the quality criterion and asks it to return a `JudgeOutput` with a score (0.0–1.0) and a brief rationale. The judge model is typically a small, fast model separate from the agent's main provider.

```rust
// JudgeOutput schema (simplified)
struct JudgeOutput {
    score: f64,       // 0.0 = fail, 1.0 = perfect
    rationale: String,
}
```
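Per-case judge scores are then rolled up into a per-variation summary. A minimal sketch of that aggregation, assuming `EvalReport` exposes something like a mean over its `CaseScore` entries (the `mean_score` helper is hypothetical):

```rust
// Per-case score with the judge's rationale, mirroring CaseScore.
struct CaseScore {
    score: f64,
    rationale: String,
}

/// Average the per-case judge scores into a single session-level number,
/// roughly what an EvalReport would summarize for one variation.
fn mean_score(cases: &[CaseScore]) -> f64 {
    if cases.is_empty() {
        return 0.0;
    }
    cases.iter().map(|c| c.score).sum::<f64>() / cases.len() as f64
}

fn main() {
    let cases = vec![
        CaseScore { score: 1.0, rationale: "criterion fully met".into() },
        CaseScore { score: 0.5, rationale: "partially correct".into() },
    ];
    for c in &cases {
        println!("{:.2}: {}", c.score, c.rationale);
    }
    println!("mean = {}", mean_score(&cases)); // mean = 0.75
}
```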

## Source Code

`crates/zeph-experiments/`

## See Also