# ShadowSentinel: AI Safety Probing
ShadowSentinel is a safety capability governance system that runs LLM-based probes on high-risk tool calls before they execute. It maintains a persistent audit trail of all safety events across sessions.
Phase 2 adds the SafetyProbe trait and ShadowProbeExecutor, enabling real-time safety classification with confidence scoring and bounded latency.
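A minimal sketch of what that probe surface might look like, written as a synchronous simplification; the crate's actual trait is presumably async and its exact signatures may differ:

```rust
/// Outcome of one safety probe (hypothetical shape; field names mirror
/// the safety_shadow_events columns described below).
pub struct ProbeResult {
    /// Safety confidence in [0.0, 1.0], compared against confidence_threshold.
    pub confidence: f64,
    /// The probe LLM's explanation, logged for audit.
    pub reasoning: String,
}

/// Sketch of the Phase 2 SafetyProbe trait; the real trait likely carries
/// more context (category, session) than shown here.
pub trait SafetyProbe {
    /// Classify a pending tool call before it executes.
    fn probe(&self, tool_name: &str, args_preview: &str) -> ProbeResult;
}
```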
## How It Works
Before executing a tool, ShadowSentinel asks the LLM: “Is this tool call safe to execute?” For high-risk tool categories (shell commands, file writes, exfil-capable MCP tools), the system:
- Extracts tool metadata (name, arguments, category)
- Sends a structured probe request to the configured LLM (`probe_provider`, falling back to the main provider); a sketch of the request shape follows this list
- Receives a confidence score (0.0–1.0) and reasoning
- Logs the event to the persistent `safety_shadow_events` table
- Allows execution if confidence ≥ the threshold (default: 0.5), or fails open if the probe times out
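For illustration, the structured probe request might contain something like the following; these field names and the prompt text are assumptions, not the actual wire format:

```rust
/// Illustrative probe request; the real format is internal to
/// ShadowProbeExecutor and may differ.
struct ProbeRequest<'a> {
    tool_name: &'a str,
    category: &'a str,
    /// Truncated to 512 chars, matching the tool_args_preview column.
    args_preview: &'a str,
}

impl ProbeRequest<'_> {
    /// Render the prompt sent to the probe provider.
    fn to_prompt(&self) -> String {
        format!(
            "Is this tool call safe to execute?\n\
             tool: {} (category: {})\n\
             args: {}\n\
             Respond with a confidence score in [0.0, 1.0] and brief reasoning.",
            self.tool_name, self.category, self.args_preview
        )
    }
}
```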
The probe is transparent — it runs before policy gates and doesn’t block concurrent tool execution.
## Configuration
Enable ShadowSentinel in your `config.toml`:

```toml
[security.shadow_sentinel]
enabled = true             # Enable safety probing (default: false)
probe_provider = "fast"    # Optional: provider name for probing (empty = main provider)
confidence_threshold = 0.5 # Minimum confidence for approval (0.0-1.0, default: 0.5)
max_probes_per_turn = 10   # Rate limit: max probes per agent turn (default: 10)
probe_timeout_ms = 5000    # Max time for one probe (default: 5000 ms)
```
| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | `false` | Enable ShadowSentinel probing |
| `probe_provider` | string | `""` | Provider name for probes; empty uses the main provider |
| `confidence_threshold` | float | `0.5` | Approval threshold (0.0–1.0); higher = stricter |
| `max_probes_per_turn` | int | `10` | Rate limit per agent turn to prevent probe spam |
| `probe_timeout_ms` | int | `5000` | Max milliseconds per probe before timeout |
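The table maps onto a config struct along these lines; this is a sketch assuming serde-based deserialization, and the crate's actual struct name and layout may differ:

```rust
use serde::Deserialize;

/// Hypothetical mirror of [security.shadow_sentinel].
#[derive(Debug, Deserialize)]
#[serde(default)]
pub struct ShadowSentinelConfig {
    pub enabled: bool,
    pub probe_provider: String,
    pub confidence_threshold: f64,
    pub max_probes_per_turn: u32,
    pub probe_timeout_ms: u64,
}

impl Default for ShadowSentinelConfig {
    fn default() -> Self {
        Self {
            enabled: false,                // probing is opt-in
            probe_provider: String::new(), // empty = main provider
            confidence_threshold: 0.5,
            max_probes_per_turn: 10,
            probe_timeout_ms: 5000,
        }
    }
}
```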
### Choosing a Probe Provider
Probes need to be fast and inexpensive, since probe latency directly delays tool execution. Use a fast, cheap model:
```toml
[[llm.providers]]
name = "fast"
type = "openai"
model = "gpt-4o-mini"

[security.shadow_sentinel]
probe_provider = "fast"  # Use the cheap model for quick safety checks
```
When `probe_provider` is empty, ShadowSentinel falls back to the agent's main provider.
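The fallback amounts to a simple check, sketched here for illustration (not the crate's actual resolution code):

```rust
/// An empty probe_provider means "use the agent's main provider".
fn resolve_probe_provider<'a>(probe_provider: &'a str, main_provider: &'a str) -> &'a str {
    if probe_provider.is_empty() {
        main_provider
    } else {
        probe_provider
    }
}
```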
## Probed Tool Categories
ShadowSentinel probes are issued before executing these high-risk tool types:
| Category | Examples | Risk Profile |
|---|---|---|
| Shell | `shell`, `bash`, `sh` | Arbitrary OS command execution |
| File Write | `write_file`, `edit_file` | Persistent state changes |
| Exfil-capable MCP | `brave_search`, `web_scrape` | Network access, data exfiltration |
Tools in other categories (code execution, math, read-only file access) skip probing.
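The category gate reduces to something like the following sketch; the crate's actual category taxonomy may differ:

```rust
/// Illustrative tool categories; only the first three are probed.
enum ToolCategory {
    Shell,
    FileWrite,
    ExfilCapableMcp,
    CodeExecution,
    Math,
    FileRead,
}

/// High-risk categories get a pre-execution probe; everything else skips it.
fn requires_probe(category: &ToolCategory) -> bool {
    matches!(
        category,
        ToolCategory::Shell | ToolCategory::FileWrite | ToolCategory::ExfilCapableMcp
    )
}
```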
## Safety Events Table
Every probe is logged to the `safety_shadow_events` SQLite table (created by migration 085) for audit and analysis:
```sql
CREATE TABLE safety_shadow_events (
    id INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL,       -- Agent session ID
    timestamp TEXT NOT NULL,        -- Event timestamp (ISO 8601)
    tool_name TEXT NOT NULL,        -- Tool being probed (e.g., "shell")
    tool_args_preview TEXT,         -- First 512 chars of tool args
    probe_request_tokens INTEGER,   -- LLM input tokens
    probe_response_tokens INTEGER,  -- LLM output tokens
    confidence REAL,                -- Safety confidence (0.0-1.0)
    reasoning TEXT,                 -- LLM's explanation
    result TEXT,                    -- 'approved', 'rejected', 'timeout'
    execution_allowed BOOLEAN       -- Whether execution proceeded
);
```
This allows you to:
- Audit all safety decisions across all sessions
- Analyze probe latency and cost
- Review LLM reasoning for rejected calls
- Detect patterns in tool usage and safety concerns
### Querying Safety Events
```bash
# Count safety events by tool
sqlite3 ~/.zeph/zeph.db "SELECT tool_name, COUNT(*) FROM safety_shadow_events GROUP BY tool_name;"

# Find rejected probes
sqlite3 ~/.zeph/zeph.db "SELECT tool_name, confidence, reasoning FROM safety_shadow_events WHERE result = 'rejected';"

# Recent safety events
sqlite3 ~/.zeph/zeph.db "SELECT timestamp, tool_name, confidence, result FROM safety_shadow_events ORDER BY timestamp DESC LIMIT 20;"
```
## Probe Behavior
### Success (Confident Approval)
If the probe LLM returns `confidence >= confidence_threshold`, the tool executes immediately. The event is logged with `result = 'approved'` and `execution_allowed = true`.
### Timeout
If the probe takes longer than `probe_timeout_ms`, ShadowSentinel fails open: the tool executes and the event is logged with `result = 'timeout'` and `execution_allowed = true`. This prevents slow probes from blocking operations.
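Fail-open timeout handling can be sketched with a tokio timeout wrapper; this is an assumption about the mechanism, not the actual executor internals:

```rust
use std::future::Future;
use std::time::Duration;
use tokio::time::timeout;

/// Fail-open wrapper: a probe that exceeds probe_timeout_ms counts as
/// approval, so slow probes never block the tool (illustrative only).
async fn probe_with_timeout(
    probe: impl Future<Output = f64>, // future resolving to a confidence score
    probe_timeout_ms: u64,
    confidence_threshold: f64,
) -> bool {
    match timeout(Duration::from_millis(probe_timeout_ms), probe).await {
        Ok(confidence) => confidence >= confidence_threshold, // normal decision
        Err(_elapsed) => true, // timeout: fail open, logged as result = 'timeout'
    }
}
```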
### Rate Limiting
At most `max_probes_per_turn` probes are issued per agent turn. If the limit is reached, subsequent tools skip probing for that turn. This prevents probe spam when many tools are called in a single step.
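Conceptually this is a per-turn budget, reset when a new turn begins; a minimal sketch:

```rust
/// Per-turn probe budget; reset at the start of each agent turn.
struct ProbeBudget {
    used: u32,
    max_probes_per_turn: u32,
}

impl ProbeBudget {
    /// Returns true if another probe may be issued this turn.
    fn try_acquire(&mut self) -> bool {
        if self.used < self.max_probes_per_turn {
            self.used += 1;
            true
        } else {
            false // limit reached: remaining tools skip probing this turn
        }
    }
}
```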
### Rejection (Low Confidence)
When `confidence < confidence_threshold`, the tool does not execute. The event is logged with `result = 'rejected'` and `execution_allowed = false`. The agent receives a `ToolError::SafetyCheckFailed` result with the probe reasoning.
The agent can acknowledge the safety concern and retry, or choose a different approach.
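On the agent side, handling might look like this; the `ToolError::SafetyCheckFailed` variant comes from the text above, but its payload shape and this handler are assumptions:

```rust
/// Sketch of the error variant; the crate's real ToolError has more
/// variants and possibly a different payload.
enum ToolError {
    SafetyCheckFailed { reasoning: String },
}

/// Turn a rejected probe into feedback the agent can act on.
fn describe_tool_result(result: Result<String, ToolError>) -> String {
    match result {
        Ok(output) => output,
        Err(ToolError::SafetyCheckFailed { reasoning }) => {
            // Surface the probe's reasoning so the agent can adjust and retry.
            format!("tool blocked by ShadowSentinel: {reasoning}")
        }
    }
}
```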
## Multi-Provider Safety (Optional)
For extra safety, probe with a different provider than the main inference engine:
```toml
[[llm.providers]]
name = "main"
type = "openai"
model = "gpt-4-turbo"

[[llm.providers]]
name = "safety-check"
type = "anthropic"
model = "claude-opus-4"

[security.shadow_sentinel]
probe_provider = "safety-check"  # Use Anthropic for safety, OpenAI for main inference
```
This creates an independent safety review layer using a different model and provider, reducing the chance that both share the same blind spots.
## Disabling Probes for Specific Tools
There is no per-tool override for probing. If you trust certain tools completely and want to skip probing:
- Recommendation: Keep probing enabled at the category level. The cost is low and the safety benefit is high.
- Alternative: Disable ShadowSentinel entirely and rely on policy gates and permission checks.
## Cost Considerations
Each probe:
- Costs ~100 prompt tokens + ~50 response tokens (varies by tool complexity)
- At $0.0001 per 1K tokens (typical for cheap models), costs ~0.0015¢ per probe
- With `max_probes_per_turn = 10`, the maximum cost per turn is ~0.015¢
For most workloads, probe overhead is negligible compared to main LLM inference.
## See Also
- Skill Trust & Security — Policy enforcement and permission models
- File Read Sandbox — Sandboxed file access restrictions
- MCP Security — MCP server vetting and privilege isolation