Experiments
The experiments engine lets Zeph autonomously tune its own configuration by running controlled A/B trials against a benchmark. Inspired by karpathy/autoresearch, it varies a single parameter at a time, evaluates both baseline and candidate responses using an LLM-as-judge, and keeps the variation only if the candidate outscores the baseline by at least min_improvement. This is an optional, feature-gated component (--features experiments) that persists results in SQLite.
Prerequisites
Enable the experiments feature flag before building:
cargo build --release --features experiments
The experiments feature is also included in the full feature set:
cargo build --release --features full
See Feature Flags for the full flag list.
How It Works
Each experiment session follows a four-step loop:
- Select a parameter: pick one tunable parameter (e.g., temperature, top_p, retrieval_top_k) and generate a candidate value.
- Run baseline: send a benchmark prompt with the current configuration and record the response.
- Run candidate: send the same prompt with the varied parameter and record the response.
- Judge: an LLM evaluator scores both responses on a numeric scale. If the candidate exceeds the baseline by at least min_improvement, the variation is accepted; otherwise it is reverted.
The engine repeats this loop up to max_experiments times per session, staying within max_wall_time_secs and eval_budget_tokens limits.
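The loop above can be sketched in a few lines. This is a minimal illustration, not the engine's code: Trial, is_accepted, and run_session are hypothetical names, the score_candidate closure stands in for "run the candidate and ask the judge", and the sketch assumes an accepted candidate becomes the baseline for later trials. Wall-time and token limits are left out.

```rust
/// Outcome of one A/B trial (hypothetical type for illustration).
#[derive(Debug, Clone, PartialEq)]
pub struct Trial {
    pub baseline_score: f64,
    pub candidate_score: f64,
    pub accepted: bool,
}

/// Accept only when the candidate beats the baseline by at least `min_improvement`.
pub fn is_accepted(baseline: f64, candidate: f64, min_improvement: f64) -> bool {
    candidate - baseline >= min_improvement
}

/// Runs up to `max_experiments` trials. `score_candidate(i)` stands in for
/// running the i-th candidate and scoring it with the judge.
pub fn run_session<F>(
    mut baseline: f64,
    max_experiments: usize,
    min_improvement: f64,
    mut score_candidate: F,
) -> Vec<Trial>
where
    F: FnMut(usize) -> f64,
{
    let mut trials = Vec::with_capacity(max_experiments);
    for i in 0..max_experiments {
        let candidate = score_candidate(i);
        let accepted = is_accepted(baseline, candidate, min_improvement);
        trials.push(Trial {
            baseline_score: baseline,
            candidate_score: candidate,
            accepted,
        });
        if accepted {
            // Assumption: an accepted variation becomes the new baseline.
            baseline = candidate;
        }
        // Otherwise the variation is reverted and the baseline is unchanged.
    }
    trials
}
```

With min_improvement = 0.5 and a baseline of 6.0, a candidate scoring 6.2 is rejected while one scoring 7.0 is accepted.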
Tunable Parameters
The engine can vary the following parameters:
| Parameter | Type | Description |
|---|---|---|
| temperature | float | LLM sampling temperature |
| top_p | float | Nucleus sampling threshold |
| top_k | int | Top-K sampling limit |
| frequency_penalty | float | Penalize repeated tokens |
| presence_penalty | float | Penalize tokens already present |
| retrieval_top_k | int | Number of memory results to retrieve |
| similarity_threshold | float | Minimum similarity for memory recall |
| temporal_decay | float | Weight decay for older memories |
Search Space
The search space defines the bounds and resolution for each tunable parameter. It is represented by a SearchSpace containing a list of ParameterRange entries.
Each ParameterRange specifies:
| Field | Type | Description |
|---|---|---|
| kind | ParameterKind | Which parameter this range controls |
| min | f64 | Lower bound of the range |
| max | f64 | Upper bound of the range |
| step | Option<f64> | Discrete step size for grid and quantization. None means continuous |
| default | f64 | Default value used as the baseline starting point |
The default search space covers five LLM generation parameters:
| Parameter | Min | Max | Step | Default |
|---|---|---|---|---|
| temperature | 0.0 | 1.0 | 0.1 | 0.7 |
| top_p | 0.1 | 1.0 | 0.05 | 0.9 |
| top_k | 1 | 100 | 5 | 40 |
| frequency_penalty | -2.0 | 2.0 | 0.2 | 0.0 |
| presence_penalty | -2.0 | 2.0 | 0.2 | 0.0 |
You can customize the search space by adding or removing parameters. The remaining tunable parameters (retrieval_top_k, similarity_threshold, temporal_decay) are not included in the default space but can be added manually.
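As a rough illustration of what a range entry and its quantization might look like, the sketch below mirrors the fields listed earlier. The struct is a hypothetical stand-in (a name string replaces ParameterKind), and the retrieval_top_k bounds are made-up example values, not defaults taken from Zeph.

```rust
/// Hypothetical mirror of the documented fields; the real type may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct ParameterRange {
    pub name: &'static str, // stands in for `kind: ParameterKind`
    pub min: f64,
    pub max: f64,
    pub step: Option<f64>,
    pub default: f64,
}

impl ParameterRange {
    /// Clamp into [min, max], then snap to the nearest multiple of `step` counted from `min`.
    pub fn quantize(&self, value: f64) -> f64 {
        let clamped = value.clamp(self.min, self.max);
        match self.step {
            Some(step) if step > 0.0 => {
                let n = ((clamped - self.min) / step).round();
                (self.min + n * step).clamp(self.min, self.max)
            }
            _ => clamped, // `None` means a continuous range
        }
    }
}

/// Example of an extra range for a retrieval parameter.
/// The bounds below are illustrative assumptions only.
pub fn retrieval_top_k_range() -> ParameterRange {
    ParameterRange {
        name: "retrieval_top_k",
        min: 1.0,
        max: 20.0,
        step: Some(1.0),
        default: 5.0,
    }
}
```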
Config Snapshot
A ConfigSnapshot captures the values of all tunable parameters for a single experiment arm. It serves as the bridge between the runtime configuration and the variation engine.
- The baseline snapshot is created from the current Config via ConfigSnapshot::from_config.
- Each variation produces a new snapshot with exactly one parameter changed (snapshot.apply(&variation)).
- The diff method compares two snapshots and returns the single Variation that differs, or None if zero or more than one parameter changed.
Snapshots also provide to_generation_overrides() to extract LLM-relevant parameters for use during evaluation.
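The apply and diff behavior described above can be illustrated with a reduced snapshot. The types below are hypothetical and cover only three parameters; they show the "exactly one difference" rule rather than the actual ConfigSnapshot implementation.

```rust
/// Hypothetical, reduced snapshot with three of the tunable parameters.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Snapshot {
    pub temperature: f64,
    pub top_p: f64,
    pub top_k: f64,
}

/// A single-parameter change: which parameter and its new value.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Variation {
    pub parameter: &'static str,
    pub value: f64,
}

impl Snapshot {
    fn fields(&self) -> [(&'static str, f64); 3] {
        [
            ("temperature", self.temperature),
            ("top_p", self.top_p),
            ("top_k", self.top_k),
        ]
    }

    /// Returns a copy with the named parameter replaced.
    pub fn apply(&self, v: &Variation) -> Snapshot {
        let mut next = *self;
        match v.parameter {
            "temperature" => next.temperature = v.value,
            "top_p" => next.top_p = v.value,
            "top_k" => next.top_k = v.value,
            _ => {} // unknown names are ignored in this sketch
        }
        next
    }

    /// Some(variation) only when exactly one field of `other` differs from `self`.
    pub fn diff(&self, other: &Snapshot) -> Option<Variation> {
        let a = self.fields();
        let b = other.fields();
        let mut changed = a
            .iter()
            .zip(b.iter())
            .filter(|&(x, y)| x.1 != y.1)
            .map(|(_, y)| Variation { parameter: y.0, value: y.1 });
        let first = changed.next()?; // zero differences -> None
        if changed.next().is_some() {
            None // more than one difference
        } else {
            Some(first)
        }
    }
}
```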
Variation Strategies
The variation engine uses a VariationGenerator trait to produce candidate parameter values. Each call to next() returns a Variation that changes exactly one parameter from the baseline. This one-at-a-time constraint isolates the effect of each change, making it possible to attribute score differences to a specific parameter.
All strategies track visited variations via a HashSet<Variation> to avoid re-testing the same configuration. Floating-point values use OrderedFloat for reliable hashing and equality.
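Plain f64 implements neither Eq nor Hash, which is why a wrapper is needed before a variation can go into a HashSet. The sketch below uses a std-only stand-in for OrderedFloat that hashes and compares the bit pattern; FloatKey and Visited are hypothetical names, and unlike OrderedFloat this simplified wrapper does not treat all NaN payloads as equal.

```rust
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Std-only stand-in for `OrderedFloat<f64>`: equality and hashing use the bit pattern.
#[derive(Debug, Clone, Copy)]
pub struct FloatKey(pub f64);

impl FloatKey {
    fn bits(self) -> u64 {
        // Treat -0.0 and 0.0 as the same value.
        if self.0 == 0.0 {
            0.0f64.to_bits()
        } else {
            self.0.to_bits()
        }
    }
}

impl PartialEq for FloatKey {
    fn eq(&self, other: &Self) -> bool {
        self.bits() == other.bits()
    }
}
impl Eq for FloatKey {}
impl Hash for FloatKey {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.bits().hash(state)
    }
}

/// Hashable variation key (hypothetical shape).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct Variation {
    pub parameter: &'static str,
    pub value: FloatKey,
}

/// Set of variations that were already tested.
#[derive(Default)]
pub struct Visited(HashSet<Variation>);

impl Visited {
    /// Returns true the first time a variation is seen, false on repeats.
    pub fn mark(&mut self, parameter: &'static str, value: f64) -> bool {
        self.0.insert(Variation { parameter, value: FloatKey(value) })
    }
}
```

Bit-pattern equality only deduplicates reliably when values are quantized first, so that the same grid point always produces the same f64.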
Grid
GridStep performs a systematic sweep of every parameter through its discrete steps from min to max. Parameters are swept one at a time: all grid points for the first parameter are enumerated before moving to the next. Already-visited variations are skipped. Returns None when the full grid has been covered.
Grid is the default starting strategy. It provides complete coverage of the discrete search space and is deterministic (no randomness involved). Values are quantized to the nearest step to avoid floating-point accumulation errors.
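One way to avoid accumulation error is to compute each grid point from its integer index instead of adding step repeatedly. The sketch below assumes that approach; grid_points and grid_sweep are hypothetical helpers, and ranges are passed as plain (name, min, max, step) tuples for brevity.

```rust
/// Grid points for one parameter, computed as min + i * step so that
/// rounding error does not accumulate the way repeated addition would.
pub fn grid_points(min: f64, max: f64, step: f64) -> Vec<f64> {
    assert!(step > 0.0 && max >= min, "step must be positive and max >= min");
    // Small epsilon guards against a quotient landing just below an integer.
    let count = ((max - min) / step + 1e-9).floor() as usize;
    (0..=count).map(|i| min + i as f64 * step).collect()
}

/// Sweeps parameters one at a time, in order, skipping already-visited points.
pub fn grid_sweep(
    ranges: &[(&'static str, f64, f64, f64)], // (name, min, max, step)
    visited: &[(&'static str, f64)],
) -> Vec<(&'static str, f64)> {
    let mut out = Vec::new();
    for &(name, min, max, step) in ranges {
        for value in grid_points(min, max, step) {
            let seen = visited
                .iter()
                .any(|&(n, v)| n == name && (v - value).abs() < step * 1e-6);
            if !seen {
                out.push((name, value));
            }
        }
    }
    out
}
```

Under these assumptions the default temperature range (0.0 to 1.0, step 0.1) yields 11 points, and the default top_k range (1 to 100, step 5) yields 20 points ending at 96.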
Random
Random samples uniformly within each parameter’s bounds. At each call, it picks a random parameter, samples a random value from its [min, max] range, and quantizes to the nearest step. The sample is rejected if already visited. After 1000 consecutive rejections, the space is considered exhausted.
Random sampling is seeded (SmallRng::seed_from_u64) for reproducibility. It is useful when the grid is too large to sweep exhaustively or when you want to explore the space without systematic bias.
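The rejection loop can be sketched without external crates. Here a small SplitMix64 generator stands in for the seeded SmallRng, and the sketch handles a single parameter whose grid is addressed by step index (value = min + index * step); Rng and next_sample are hypothetical names.

```rust
use std::collections::HashSet;

/// Tiny SplitMix64 generator: a std-only stand-in for a seeded `SmallRng`.
pub struct Rng(u64);

impl Rng {
    pub fn seed_from_u64(seed: u64) -> Self {
        Rng(seed)
    }

    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform float in [0, 1).
    pub fn unit(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

const MAX_REJECTIONS: usize = 1000;

/// Samples an unvisited grid index in 0..=steps for a single parameter.
/// Returns None once 1000 consecutive samples were already visited.
pub fn next_sample(rng: &mut Rng, steps: u64, visited: &mut HashSet<u64>) -> Option<u64> {
    for _ in 0..MAX_REJECTIONS {
        // Sample a continuous position, then quantize to the nearest step index.
        let idx = (rng.unit() * steps as f64).round() as u64;
        if visited.insert(idx) {
            return Some(idx);
        }
    }
    None // the space is treated as exhausted
}
```

Because the generator is seeded, two runs with the same seed visit the grid in the same order.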
Neighborhood
Neighborhood perturbs the current best configuration by a small amount. At each call, it picks a random parameter and computes a new value as baseline + U(-radius, radius) * step, then clamps the result to [min, max] and quantizes it to the nearest step. This focuses exploration around a known-good region.
Neighborhood is most useful as a refinement step after a grid or random sweep has identified a promising baseline. The radius parameter (must be positive) controls the perturbation range in units of step. For example, radius = 1.0 with step = 0.1 means perturbations of at most ±0.1 from the baseline value.
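The perturbation arithmetic is small enough to check by hand. In the sketch below the random draw u is passed in as an argument (it would normally come from a uniform distribution over (-radius, radius)) so the result is deterministic; perturb is a hypothetical helper, not the engine's function.

```rust
/// One neighborhood step. `u` is the random draw, expected to lie in (-radius, radius).
pub fn perturb(baseline: f64, u: f64, radius: f64, min: f64, max: f64, step: f64) -> f64 {
    assert!(radius > 0.0 && step > 0.0, "radius and step must be positive");
    let u = u.clamp(-radius, radius);
    let raw = baseline + u * step; // perturb in units of `step`
    let clamped = raw.clamp(min, max); // stay inside the range
    let n = ((clamped - min) / step).round(); // snap to the nearest grid point
    (min + n * step).clamp(min, max)
}
```

For temperature (baseline 0.7, step 0.1, radius 1.0), the extreme draws u = 1.0 and u = -1.0 give approximately 0.8 and 0.6, and a small draw such as u = 0.3 quantizes back to 0.7.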
Strategy Selection
Choose a strategy based on your goals:
| Strategy | Best for | Deterministic | Coverage |
|---|---|---|---|
| Grid | Small search spaces, complete coverage | Yes | Exhaustive |
| Random | Large spaces, quick exploration | Seeded | Stochastic |
| Neighborhood | Refinement around a known-good config | Seeded | Local |
A typical workflow combines strategies across sessions: start with Grid or Random to identify promising regions, then switch to Neighborhood for fine-tuning.
Benchmark Dataset
A benchmark dataset is a TOML file containing a list of test cases. Each case defines a prompt to send to the subject model, with optional context, reference answer, and tags.
[[cases]]
prompt = "Explain the difference between TCP and UDP"
tags = ["knowledge", "networking"]
[[cases]]
prompt = "Write a Python function to find the longest palindromic substring"
reference = "Dynamic programming approach with O(n^2) time"
tags = ["coding", "algorithms"]
[[cases]]
prompt = "Summarize the key ideas of the transformer architecture"
context = "The transformer was introduced in 'Attention Is All You Need' (2017)..."
tags = ["knowledge", "ml"]
Case Fields
| Field | Type | Required | Description |
|---|---|---|---|
| prompt | string | yes | The prompt sent to the subject model |
| context | string | no | System context injected before the prompt |
| reference | string | no | Reference answer the judge uses to calibrate scoring |
| tags | string array | no | Labels for filtering or grouping in reports |
Load a dataset from disk with BenchmarkSet::from_file:
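use std::path::Path;
use zeph_core::experiments::BenchmarkSet;

// Assumes the crate's error types implement std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the benchmark cases from a TOML file on disk.
    let dataset = BenchmarkSet::from_file(Path::new("benchmarks/default.toml"))?;
    dataset.validate()?; // rejects empty case lists
    Ok(())
}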
LLM-as-Judge Evaluator
The Evaluator scores a subject model’s responses by sending each one to a separate judge model. The judge rates responses on a 1–10 scale across four weighted criteria:
| Criterion | Weight |
|---|---|
| Accuracy | 30% |
| Completeness | 25% |
| Clarity | 25% |
| Relevance | 20% |
The judge returns structured JSON output (JudgeOutput) containing a numeric score and a one-sentence justification.
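To make the weighting concrete, the arithmetic below combines four per-criterion ratings into one score. This is only an illustration of the weights in the table: the judge itself returns a single numeric score, and Ratings and weighted_score are hypothetical names.

```rust
/// Per-criterion ratings on the 1-10 scale (hypothetical struct).
pub struct Ratings {
    pub accuracy: f64,
    pub completeness: f64,
    pub clarity: f64,
    pub relevance: f64,
}

/// Weighted combination using the documented weights: 30% / 25% / 25% / 20%.
pub fn weighted_score(r: &Ratings) -> f64 {
    let score = 0.30 * r.accuracy
        + 0.25 * r.completeness
        + 0.25 * r.clarity
        + 0.20 * r.relevance;
    score.clamp(1.0, 10.0) // keep the result on the 1-10 scale
}
```

For ratings of 8, 6, 7, and 9 the result is 2.4 + 1.5 + 1.75 + 1.8 = 7.45.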
Evaluation Flow
- Subject calls: the evaluator sends each benchmark case to the subject model sequentially, collecting responses.
- Judge calls: responses are scored in parallel (up to parallel_evals concurrent tasks, default 3) using a separate judge model.
- Budget check: before each judge call, the evaluator checks cumulative token usage against the configured budget. If the budget is exhausted, remaining cases are skipped.
- Report: per-case scores are aggregated into an EvalReport.
Security
Subject responses are wrapped in <subject_response> XML boundary tags before being sent to the judge. XML metacharacters (&, <, >) in the response and reference fields are escaped to prevent prompt injection from the evaluated model.
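A minimal sketch of the escaping and wrapping step follows. The function names are hypothetical; the point is that the ampersand, less-than, and greater-than characters are replaced with entities so a response cannot close the boundary tag itself.

```rust
/// Escapes the three XML metacharacters named above.
pub fn escape_xml(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            other => out.push(other),
        }
    }
    out
}

/// Wraps an escaped response in the boundary tag sent to the judge.
pub fn wrap_subject_response(response: &str) -> String {
    format!("<subject_response>{}</subject_response>", escape_xml(response))
}
```

A response that tries to inject a closing tag ends up as inert text: the only literal closing tag in the wrapped output is the one added by the wrapper.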
Creating an Evaluator
use std::sync::Arc;
use zeph_core::experiments::{BenchmarkSet, Evaluator};
use zeph_llm::any::AnyProvider;

// Assumes the crate's error types implement std::error::Error.
fn example(
    judge: Arc<AnyProvider>,
    benchmark: BenchmarkSet,
) -> Result<Evaluator, Box<dyn std::error::Error>> {
    let evaluator = Evaluator::new(
        judge,     // judge model provider
        benchmark, // loaded benchmark dataset
        100_000,   // token budget for all judge calls
    )?
    .with_parallel_evals(5); // override default concurrency (3)
    Ok(evaluator)
}
Run the evaluation:
use zeph_core::experiments::Evaluator;
use zeph_llm::any::AnyProvider;

// Assumes the crate's error types implement std::error::Error.
async fn example(
    evaluator: &Evaluator,
    subject: &AnyProvider,
) -> Result<(), Box<dyn std::error::Error>> {
    let report = evaluator.evaluate(subject).await?;
    println!(
        "Mean score: {:.1}/10 ({} of {} cases)",
        report.mean_score, report.cases_scored, report.cases_total
    );
    Ok(())
}
Evaluation Report
EvalReport contains aggregate metrics and per-case detail:
| Field | Type | Description |
|---|---|---|
| mean_score | f64 | Mean score across scored cases (NaN if none succeeded) |
| p50_latency_ms | u64 | Median latency of judge calls |
| p95_latency_ms | u64 | 95th-percentile latency of judge calls |
| total_tokens | u64 | Total tokens consumed by judge calls |
| cases_scored | usize | Number of successfully scored cases |
| cases_total | usize | Total cases in the benchmark set |
| is_partial | bool | True if budget was exceeded or errors occurred |
| error_count | usize | Number of failed cases (LLM error, parse error, or budget) |
| per_case | Vec<CaseScore> | Per-case scores ordered by case index |
Each CaseScore entry contains:
| Field | Type | Description |
|---|---|---|
| case_index | usize | Zero-based index into the benchmark cases |
| score | f64 | Clamped score in [1.0, 10.0] |
| reason | String | Judge’s one-sentence justification |
| latency_ms | u64 | Wall-clock time for the judge call |
| tokens | u64 | Tokens consumed by this judge call |
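The aggregate fields can be derived from the per-case entries roughly as follows. This sketch uses reduced, hypothetical structs, a nearest-rank percentile (the method actually used is not stated in this document), and approximates is_partial as "fewer cases scored than exist".

```rust
/// Reduced per-case entry (the `reason` field is omitted in this sketch).
#[derive(Debug, Clone)]
pub struct CaseScore {
    pub case_index: usize,
    pub score: f64,
    pub latency_ms: u64,
    pub tokens: u64,
}

/// Reduced aggregate report.
#[derive(Debug)]
pub struct Summary {
    pub mean_score: f64,
    pub p50_latency_ms: u64,
    pub p95_latency_ms: u64,
    pub total_tokens: u64,
    pub cases_scored: usize,
    pub cases_total: usize,
    pub is_partial: bool,
}

/// Nearest-rank percentile over a sorted slice (an assumed method).
fn percentile(sorted: &[u64], pct: usize) -> u64 {
    if sorted.is_empty() {
        return 0;
    }
    let rank = (pct * sorted.len() + 99) / 100; // ceil(pct / 100 * n)
    sorted[rank.max(1) - 1]
}

pub fn summarize(scores: &[CaseScore], cases_total: usize) -> Summary {
    let n = scores.len();
    let mean_score = if n == 0 {
        f64::NAN // no case succeeded
    } else {
        scores.iter().map(|c| c.score.clamp(1.0, 10.0)).sum::<f64>() / n as f64
    };
    let mut latencies: Vec<u64> = scores.iter().map(|c| c.latency_ms).collect();
    latencies.sort_unstable();
    Summary {
        mean_score,
        p50_latency_ms: percentile(&latencies, 50),
        p95_latency_ms: percentile(&latencies, 95),
        total_tokens: scores.iter().map(|c| c.tokens).sum(),
        cases_scored: n,
        cases_total,
        is_partial: n < cases_total,
    }
}
```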
Budget Enforcement
The evaluator tracks cumulative token usage across all judge calls with an atomic counter. Before each judge call, the current total is checked against the configured budget_tokens. If the budget is exhausted:
- The current batch of in-flight judge calls is drained
- Remaining cases are excluded from scoring
- The report is marked as partial (is_partial = true)
Budget exhaustion is not a fatal error – the evaluator returns a valid EvalReport with partial results.
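The check-before-call pattern can be sketched with a shared atomic counter. This is a sequential simplification with hypothetical names (Budget, score_with_budget); the judge closure stands in for a real judge call and returns a score plus the tokens it consumed.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Shared token counter with a fixed limit.
pub struct Budget {
    used: AtomicU64,
    limit: u64,
}

impl Budget {
    pub fn new(limit: u64) -> Self {
        Budget { used: AtomicU64::new(0), limit }
    }

    pub fn exhausted(&self) -> bool {
        self.used.load(Ordering::Relaxed) >= self.limit
    }

    pub fn record(&self, tokens: u64) {
        self.used.fetch_add(tokens, Ordering::Relaxed);
    }
}

/// Scores cases in order. Returns the collected scores and an `is_partial` flag.
pub fn score_with_budget<F>(cases: usize, budget: &Budget, mut judge: F) -> (Vec<f64>, bool)
where
    F: FnMut(usize) -> (f64, u64),
{
    let mut scores = Vec::new();
    for i in 0..cases {
        if budget.exhausted() {
            return (scores, true); // remaining cases are skipped, not an error
        }
        let (score, tokens) = judge(i);
        budget.record(tokens);
        scores.push(score);
    }
    (scores, false)
}
```

Because the check happens before each call, the final call can push usage past the limit: with a budget of 100 and 40 tokens per call, three cases are scored (120 tokens used) before the fourth is skipped.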
Parallel Evaluation
Judge calls run concurrently using FuturesUnordered with a Semaphore controlling the maximum number of in-flight requests. The default concurrency limit is 3 and can be overridden with with_parallel_evals. Subject calls remain sequential to avoid overwhelming the subject model.
Each parallel judge task receives a cloned provider instance so per-task token usage tracking is isolated. The shared atomic token counter aggregates usage across all tasks for budget enforcement.
Safety Model
The experiments engine uses a conservative, double opt-in design:
- Feature gate: the experiments feature must be compiled in. It is off by default.
- Config gate: enabled = true must be set in [experiments]. Default is false.
- No auto-apply: auto_apply defaults to false. When disabled, accepted variations are recorded but not written back to the live configuration. Set to true only when you want the agent to self-tune in production.
- Budget limits: max_experiments, max_wall_time_secs, and eval_budget_tokens cap resource usage per session.
- Sandboxed scope: experiments only vary inference and retrieval parameters. They cannot modify tool permissions, security settings, or system prompts.
Configuration
Add an [experiments] section to config.toml:
[experiments]
enabled = true
# eval_model = "claude-sonnet-4-20250514" # Model for LLM-as-judge evaluation (default: agent's model)
# benchmark_file = "benchmarks/eval.toml" # Prompt set for A/B comparison
max_experiments = 20 # Max variations per session (default: 20, range: 1-1000)
max_wall_time_secs = 3600 # Wall-clock budget per session in seconds (default: 3600, range: 60-86400)
min_improvement = 0.5 # Minimum score delta to accept a variation (default: 0.5, range: 0.0-100.0)
eval_budget_tokens = 100000 # Token budget for all judge calls in a session (default: 100000, range: 1000-10000000)
auto_apply = false # Write accepted variations to live config (default: false)
[experiments.schedule]
enabled = false # Enable cron-based automatic runs (default: false)
cron = "0 3 * * *" # Cron expression for scheduled runs (default: daily at 03:00)
max_experiments_per_run = 20 # Max variations per scheduled run (default: 20, range: 1-100)
max_wall_time_secs = 1800 # Wall-time cap per scheduled run in seconds (default: 1800, range: 60-86400)
Field Reference
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Master switch for the experiments engine |
| eval_model | string | agent’s model | Model used for LLM-as-judge scoring |
| benchmark_file | path | none | Path to a TOML file with evaluation prompts |
| max_experiments | u32 | 20 | Maximum variations per session |
| max_wall_time_secs | u64 | 3600 | Wall-clock time limit per session |
| min_improvement | f64 | 0.5 | Minimum score delta to accept a variation |
| eval_budget_tokens | u64 | 100000 | Token budget across all judge calls |
| auto_apply | bool | false | Apply accepted variations to live config |
| schedule.enabled | bool | false | Enable automatic scheduled experiment runs |
| schedule.cron | string | "0 3 * * *" | Cron expression (5-field) for scheduled runs |
| schedule.max_experiments_per_run | u32 | 20 | Cap per scheduled run |
| schedule.max_wall_time_secs | u64 | 1800 | Wall-time cap per scheduled run (overrides max_wall_time_secs) |
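The defaults and the ranges noted in the configuration comments can be expressed as a small validation routine. The struct below is a hypothetical reduction covering four top-level limits; it is not the ExperimentConfig type, and the error messages are illustrative.

```rust
/// Hypothetical subset of the [experiments] limits.
#[derive(Debug, Clone, PartialEq)]
pub struct ExperimentLimits {
    pub max_experiments: u32,
    pub max_wall_time_secs: u64,
    pub min_improvement: f64,
    pub eval_budget_tokens: u64,
}

impl Default for ExperimentLimits {
    fn default() -> Self {
        ExperimentLimits {
            max_experiments: 20,
            max_wall_time_secs: 3600,
            min_improvement: 0.5,
            eval_budget_tokens: 100_000,
        }
    }
}

impl ExperimentLimits {
    /// Checks each field against the range given in the config comments.
    pub fn validate(&self) -> Result<(), String> {
        if !(1..=1000).contains(&self.max_experiments) {
            return Err("max_experiments must be in 1..=1000".to_string());
        }
        if !(60..=86_400).contains(&self.max_wall_time_secs) {
            return Err("max_wall_time_secs must be in 60..=86400".to_string());
        }
        // A NaN value also fails this check because it is not inside the range.
        if !(0.0..=100.0).contains(&self.min_improvement) {
            return Err("min_improvement must be in 0.0..=100.0".to_string());
        }
        if !(1_000..=10_000_000).contains(&self.eval_budget_tokens) {
            return Err("eval_budget_tokens must be in 1000..=10000000".to_string());
        }
        Ok(())
    }
}
```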
Persistence
Experiment results are stored in the experiment_results SQLite table (same database as memory). Each row tracks:
- session_id: groups results from a single experiment run
- parameter: which parameter was varied (e.g., temperature)
- value_json: the candidate value as JSON
- baseline_score / candidate_score: numeric scores from the judge
- delta: score difference (candidate minus baseline)
- latency_ms: wall-clock time for the trial
- tokens_used: tokens consumed by the judge call
- accepted: whether the variation met the min_improvement threshold
- source: manual or scheduled
Error Handling
| Error | Cause | Effect |
|---|---|---|
| BenchmarkLoad | File not found or unreadable | Evaluator construction fails |
| BenchmarkParse | Invalid TOML syntax | Evaluator construction fails |
| EmptyBenchmarkSet | No cases in the dataset | Evaluator construction fails |
| PathTraversal | Benchmark path escapes allowed directory | Evaluator construction fails |
| BenchmarkTooLarge | Benchmark file exceeds 10 MiB | Evaluator construction fails |
| Llm | Subject model call fails | Evaluation aborts (fatal) |
| JudgeParse | Judge returns invalid or non-finite score | Case excluded, logged as warning |
| BudgetExceeded | Token budget exhausted | Remaining cases skipped, partial report returned |
Scheduler Integration
When both experiments and scheduler features are enabled, the experiment engine can run automatically on a cron schedule. This is configured via the [experiments.schedule] section.
How It Works
- At startup, if experiments.enabled and experiments.schedule.enabled are both true, the scheduler registers an auto-experiment periodic task with the configured cron expression.
- When the cron fires, an ExperimentTaskHandler spawns a non-blocking tokio::spawn task that runs a full experiment session.
- An AtomicBool running guard prevents overlapping sessions. If a previous session is still in progress when the next cron trigger fires, the new run is skipped with a warning log.
- Scheduled runs use ExperimentSource::Scheduled tagging so results can be distinguished from manual runs in the persistence layer (the source column in experiment_results).
- The schedule.max_wall_time_secs field (default: 1800s) overrides the top-level max_wall_time_secs for scheduled runs, ensuring background sessions finish before the next cron trigger on typical schedules.
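An AtomicBool running guard of this kind is commonly written as a small RAII type: acquiring flips the flag with compare_exchange, and dropping the guard clears it. The sketch below shows that pattern with a hypothetical RunningGuard type; it is not the handler's actual code, and a spawned task would typically hold the flag behind an Arc rather than a plain reference.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// RAII guard: clears the flag when the session ends, even on early return.
pub struct RunningGuard<'a>(&'a AtomicBool);

impl<'a> RunningGuard<'a> {
    /// Returns None if another session already holds the flag.
    pub fn try_acquire(flag: &'a AtomicBool) -> Option<Self> {
        flag.compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .ok()
            .map(|_| RunningGuard(flag))
    }
}

impl<'a> Drop for RunningGuard<'a> {
    fn drop(&mut self) {
        self.0.store(false, Ordering::Release);
    }
}
```

A second try_acquire while the first guard is alive returns None, which corresponds to the "skip with a warning" branch; once the first guard is dropped, the next trigger can acquire the flag again.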
Requirements
- Both experiments and scheduler feature flags must be compiled in.
- A valid benchmark_file must be configured (the handler loads the benchmark set on each run).
- The agent’s LLM provider must be available for both subject and judge calls.
Task Kind
The scheduler uses a dedicated TaskKind::Experiment variant (kind string: "experiment"). This can also be used in [[scheduler.tasks]] config entries, though the [experiments.schedule] section is the recommended way to configure automatic runs.
CLI Flags
Two flags provide headless experiment access (requires experiments feature):
| Flag | Description |
|---|---|
| --experiment-run | Run a single experiment session and exit. Loads the benchmark file, creates a provider for both subject and judge roles, runs the full experiment loop, and prints a summary before exiting. |
| --experiment-report | Print a summary of past experiment results and exit. Reads directly from the SQLite store without starting an LLM provider. |
Both flags cause the process to exit after completion — they do not start the interactive agent loop.
# Run a one-shot experiment session
zeph --experiment-run --config config.toml
# View past results
zeph --experiment-report
See CLI Reference for the full flag list.
TUI Commands
The following /experiment commands are available in the TUI dashboard:
| Command | Description |
|---|---|
| /experiment start [N] | Start a new experiment session. Optional N overrides max_experiments for this run. |
| /experiment stop | Cancel the running session gracefully via CancellationToken. Partial results are preserved. |
| /experiment status | Show progress of the current session (experiment count, accepted count, elapsed time). |
| /experiment report | Display results from past sessions stored in SQLite. |
| /experiment best | Show the best accepted variation per parameter across all sessions. |
Only one experiment session can run at a time. Starting a new session while one is already running returns an error message. The TUI displays a spinner with status updates during experiment execution.
Init Wizard
The zeph init wizard includes an experiments step (after the scheduler section). It prompts:
- Enable autonomous experiments: master switch (enabled field, default: no).
- Judge model: model used for LLM-as-judge evaluation (eval_model, default: claude-sonnet-4-20250514).
- Schedule automatic runs: enable cron-based experiment sessions (schedule.enabled, default: no).
- Cron schedule: 5-field cron expression (schedule.cron, default: 0 3 * * *).
The wizard generates the corresponding [experiments] and [experiments.schedule] sections in the output config file. The ExperimentConfig struct is always compiled (not feature-gated), so the wizard step is available regardless of the experiments feature flag.
See Configuration Wizard for the full wizard walkthrough.
Related
- Scheduler: cron-based task scheduler that drives automatic experiment runs
- Daemon & Scheduler: running the scheduler alongside the gateway and A2A server
- Self-Learning Skills: passive feedback detection and Wilson score ranking
- Model Orchestrator: multi-model routing and fallback chains
- Feature Flags: enabling the experiments feature
- Configuration: full config reference
- Adaptive Inference: runtime model routing that experiments can tune