# ShadowSentinel: AI Safety Probing
ShadowSentinel is a safety capability governance system that runs LLM-based probes on high-risk tool calls before they execute. It maintains a persistent audit trail of all safety events across sessions.
Phase 2 adds the SafetyProbe trait and ShadowProbeExecutor, enabling real-time safety classification with confidence scoring and bounded latency.
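A minimal sketch of what that probe surface might look like, written as a synchronous simplification; the crate's actual trait is presumably async and its exact signatures may differ:

```rust
/// Outcome of one safety probe (hypothetical shape; field names mirror
/// the safety_shadow_events columns described below).
pub struct ProbeResult {
    /// Safety confidence in [0.0, 1.0], compared against confidence_threshold.
    pub confidence: f64,
    /// The probe LLM's explanation, logged for audit.
    pub reasoning: String,
}

/// Sketch of the Phase 2 SafetyProbe trait; the real trait likely carries
/// more context (category, session) than shown here.
pub trait SafetyProbe {
    /// Classify a pending tool call before it executes.
    fn probe(&self, tool_name: &str, args_preview: &str) -> ProbeResult;
}
```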
## How It Works
Before executing a tool, ShadowSentinel asks the LLM: “Is this tool call safe to execute?” For high-risk tool categories (shell commands, file writes, exfil-capable MCP tools), the system:
- Extracts tool metadata (name, arguments, category)
- Sends a structured probe request to the configured LLM (`probe_provider`, falling back to the main provider); a sketch of the request shape follows this list
- Receives a confidence score (0.0–1.0) and reasoning
- Logs the event to the persistent `safety_shadow_events` table
- Allows execution if confidence ≥ the threshold (default: 0.5), or fails open if the probe times out
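For illustration, the structured probe request might contain something like the following; these field names and the prompt text are assumptions, not the actual wire format:

```rust
/// Illustrative probe request; the real format is internal to
/// ShadowProbeExecutor and may differ.
struct ProbeRequest<'a> {
    tool_name: &'a str,
    category: &'a str,
    /// Truncated to 512 chars, matching the tool_args_preview column.
    args_preview: &'a str,
}

impl ProbeRequest<'_> {
    /// Render the prompt sent to the probe provider.
    fn to_prompt(&self) -> String {
        format!(
            "Is this tool call safe to execute?\n\
             tool: {} (category: {})\n\
             args: {}\n\
             Respond with a confidence score in [0.0, 1.0] and brief reasoning.",
            self.tool_name, self.category, self.args_preview
        )
    }
}
```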
The probe is transparent — it runs before policy gates and doesn’t block concurrent tool execution.
## Configuration
Enable ShadowSentinel in your `config.toml`:

```toml
[security.shadow_sentinel]
enabled = true             # Enable safety probing (default: false)
probe_provider = "fast"    # Optional: provider name for probing (empty = main provider)
confidence_threshold = 0.5 # Minimum confidence for approval (0.0-1.0, default: 0.5)
max_probes_per_turn = 10   # Rate limit: max probes per agent turn (default: 10)
probe_timeout_ms = 5000    # Max time for one probe (default: 5000 ms)
```
| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | `false` | Enable ShadowSentinel probing |
| `probe_provider` | string | `""` | Provider name for probes; empty uses the main provider |
| `confidence_threshold` | float | `0.5` | Approval threshold (0.0–1.0); higher = stricter |
| `max_probes_per_turn` | int | `10` | Rate limit per agent turn to prevent probe spam |
| `probe_timeout_ms` | int | `5000` | Max milliseconds per probe before timeout |
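The table maps onto a config struct along these lines; this is a sketch assuming serde-based deserialization, and the crate's actual struct name and layout may differ:

```rust
use serde::Deserialize;

/// Hypothetical mirror of [security.shadow_sentinel].
#[derive(Debug, Deserialize)]
#[serde(default)]
pub struct ShadowSentinelConfig {
    pub enabled: bool,
    pub probe_provider: String,
    pub confidence_threshold: f64,
    pub max_probes_per_turn: u32,
    pub probe_timeout_ms: u64,
}

impl Default for ShadowSentinelConfig {
    fn default() -> Self {
        Self {
            enabled: false,                // probing is opt-in
            probe_provider: String::new(), // empty = main provider
            confidence_threshold: 0.5,
            max_probes_per_turn: 10,
            probe_timeout_ms: 5000,
        }
    }
}
```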
### Choosing a Probe Provider
Probes need to be fast and inexpensive, since probe latency directly delays tool execution. Use a fast, cheap model:
```toml
[[llm.providers]]
name = "fast"
type = "openai"
model = "gpt-4o-mini"

[security.shadow_sentinel]
probe_provider = "fast"  # Use the cheap model for quick safety checks
```
When `probe_provider` is empty, ShadowSentinel falls back to the agent's main provider.
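The fallback amounts to a simple check, sketched here for illustration (not the crate's actual resolution code):

```rust
/// An empty probe_provider means "use the agent's main provider".
fn resolve_probe_provider<'a>(probe_provider: &'a str, main_provider: &'a str) -> &'a str {
    if probe_provider.is_empty() {
        main_provider
    } else {
        probe_provider
    }
}
```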
## Probed Tool Categories
ShadowSentinel probes are issued before executing these high-risk tool types:
| Category | Examples | Risk Profile |
|---|---|---|
| Shell | `shell`, `bash`, `sh` | Arbitrary OS command execution |
| File Write | `write_file`, `edit_file` | Persistent state changes |
| Exfil-capable MCP | `brave_search`, `web_scrape` | Network access, data exfiltration |
Tools in other categories (code execution, math, read-only file access) skip probing.
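The category gate reduces to something like the following sketch; the crate's actual category taxonomy may differ:

```rust
/// Illustrative tool categories; only the first three are probed.
enum ToolCategory {
    Shell,
    FileWrite,
    ExfilCapableMcp,
    CodeExecution,
    Math,
    FileRead,
}

/// High-risk categories get a pre-execution probe; everything else skips it.
fn requires_probe(category: &ToolCategory) -> bool {
    matches!(
        category,
        ToolCategory::Shell | ToolCategory::FileWrite | ToolCategory::ExfilCapableMcp
    )
}
```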
## Safety Events Table
Every probe is logged to the `safety_shadow_events` SQLite table (created by migration 085) for audit and analysis:
```sql
CREATE TABLE safety_shadow_events (
    id INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL,       -- Agent session ID
    timestamp TEXT NOT NULL,        -- Event timestamp (ISO 8601)
    tool_name TEXT NOT NULL,        -- Tool being probed (e.g., "shell")
    tool_args_preview TEXT,         -- First 512 chars of tool args
    probe_request_tokens INTEGER,   -- LLM input tokens
    probe_response_tokens INTEGER,  -- LLM output tokens
    confidence REAL,                -- Safety confidence (0.0-1.0)
    reasoning TEXT,                 -- LLM's explanation
    result TEXT,                    -- 'approved', 'rejected', 'timeout'
    execution_allowed BOOLEAN       -- Whether execution proceeded
);
```
This allows you to:
- Audit all safety decisions across all sessions
- Analyze probe latency and cost
- Review LLM reasoning for rejected calls
- Detect patterns in tool usage and safety concerns
### Querying Safety Events
```bash
# Count safety events by tool
sqlite3 ~/.zeph/zeph.db "SELECT tool_name, COUNT(*) FROM safety_shadow_events GROUP BY tool_name;"

# Find rejected probes
sqlite3 ~/.zeph/zeph.db "SELECT tool_name, confidence, reasoning FROM safety_shadow_events WHERE result = 'rejected';"

# Recent safety events
sqlite3 ~/.zeph/zeph.db "SELECT timestamp, tool_name, confidence, result FROM safety_shadow_events ORDER BY timestamp DESC LIMIT 20;"
```
## Probe Behavior
### Success (Confident Approval)
If the probe LLM returns `confidence >= confidence_threshold`, the tool executes immediately. The event is logged with `result = 'approved'` and `execution_allowed = true`.
### Timeout
If the probe takes longer than `probe_timeout_ms`, ShadowSentinel fails open: the tool executes and the event is logged with `result = 'timeout'` and `execution_allowed = true`. This prevents slow probes from blocking operations.
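Fail-open timeout handling can be sketched with a tokio timeout wrapper; this is an assumption about the mechanism, not the actual executor internals:

```rust
use std::future::Future;
use std::time::Duration;
use tokio::time::timeout;

/// Fail-open wrapper: a probe that exceeds probe_timeout_ms counts as
/// approval, so slow probes never block the tool (illustrative only).
async fn probe_with_timeout(
    probe: impl Future<Output = f64>, // future resolving to a confidence score
    probe_timeout_ms: u64,
    confidence_threshold: f64,
) -> bool {
    match timeout(Duration::from_millis(probe_timeout_ms), probe).await {
        Ok(confidence) => confidence >= confidence_threshold, // normal decision
        Err(_elapsed) => true, // timeout: fail open, logged as result = 'timeout'
    }
}
```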
### Rate Limiting
At most `max_probes_per_turn` probes are issued per agent turn. If the limit is reached, subsequent tools skip probing for that turn. This prevents probe spam when many tools are called in a single step.
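Conceptually this is a per-turn budget, reset when a new turn begins; a minimal sketch:

```rust
/// Per-turn probe budget; reset at the start of each agent turn.
struct ProbeBudget {
    used: u32,
    max_probes_per_turn: u32,
}

impl ProbeBudget {
    /// Returns true if another probe may be issued this turn.
    fn try_acquire(&mut self) -> bool {
        if self.used < self.max_probes_per_turn {
            self.used += 1;
            true
        } else {
            false // limit reached: remaining tools skip probing this turn
        }
    }
}
```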
### Rejection (Low Confidence)
When `confidence < confidence_threshold`, the tool does not execute. The event is logged with `result = 'rejected'` and `execution_allowed = false`. The agent receives a `ToolError::SafetyCheckFailed` result with the probe reasoning.
The agent can acknowledge the safety concern and retry, or choose a different approach.
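On the agent side, handling might look like this; the `ToolError::SafetyCheckFailed` variant comes from the text above, but its payload shape and this handler are assumptions:

```rust
/// Sketch of the error variant; the crate's real ToolError has more
/// variants and possibly a different payload.
enum ToolError {
    SafetyCheckFailed { reasoning: String },
}

/// Turn a rejected probe into feedback the agent can act on.
fn describe_tool_result(result: Result<String, ToolError>) -> String {
    match result {
        Ok(output) => output,
        Err(ToolError::SafetyCheckFailed { reasoning }) => {
            // Surface the probe's reasoning so the agent can adjust and retry.
            format!("tool blocked by ShadowSentinel: {reasoning}")
        }
    }
}
```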
## Multi-Provider Safety (Optional)
For extra safety, probe with a different provider than the main inference engine:
```toml
[[llm.providers]]
name = "main"
type = "openai"
model = "gpt-4-turbo"

[[llm.providers]]
name = "safety-check"
type = "anthropic"
model = "claude-opus-4"

[security.shadow_sentinel]
probe_provider = "safety-check"  # Use Anthropic for safety, OpenAI for main inference
```
This creates an independent safety review layer using a different model and provider, reducing the chance that both share the same blind spots.
## Disabling Probes for Specific Tools
There is no per-tool override for probing. If you trust certain tools completely and want to skip probing:
- Recommendation: Keep probing enabled at the category level. The cost is low and the safety benefit is high.
- Alternative: Disable ShadowSentinel entirely and rely on policy gates and permission checks.
## Cost Considerations
Each probe:
- Costs ~100 prompt tokens + ~50 response tokens (varies by tool complexity)
- At $0.0001 per 1K tokens (typical for cheap models), costs ~0.0015¢ per probe
- With `max_probes_per_turn = 10`, the maximum cost per turn is ~0.015¢
For most workloads, probe overhead is negligible compared to main LLM inference.
## See Also
- Skill Trust & Security — Policy enforcement and permission models
- File Read Sandbox — Sandboxed file access restrictions
- MCP Security — MCP server vetting and privilege isolation