Local Inference (Candle)
Run HuggingFace GGUF models locally via candle without external API dependencies. Metal and CUDA GPU acceleration are supported.
cargo build --release --features candle,metal # macOS with Metal GPU
Configuration
[llm]
provider = "candle"
[llm.candle]
source = "huggingface"
repo_id = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
filename = "mistral-7b-instruct-v0.2.Q4_K_M.gguf"
chat_template = "mistral" # llama3, chatml, mistral, phi3, raw
embedding_repo = "sentence-transformers/all-MiniLM-L6-v2" # optional BERT embeddings
[llm.candle.generation]
temperature = 0.7
top_p = 0.9
top_k = 40
max_tokens = 2048
repeat_penalty = 1.1
Chat Templates
| Template | Models |
|---|---|
llama3 | Llama 3, Llama 3.1 |
chatml | Qwen, Yi, OpenHermes |
mistral | Mistral, Mixtral |
phi3 | Phi-3 |
raw | No template (raw completion) |
Device Auto-Detection
- macOS — Metal GPU (requires
--features metal) - Linux with NVIDIA — CUDA (requires
--features cuda) - Fallback — CPU
Candle-Backed Classifiers
When built with the classifiers feature, Zeph uses Candle to run DeBERTa-based models directly for injection detection and PII detection — no external API calls required.
Injection Detection (CandleClassifier)
CandleClassifier runs protectai/deberta-v3-small-prompt-injection-v2 (sequence classification) to detect prompt injection attempts in incoming messages. When the model scores above injection_threshold, the message is flagged and existing injection-handling logic applies.
Long inputs are split into overlapping chunks (448 tokens each, 64-token overlap). The final score is the maximum across all chunks.
PII Detection (CandlePiiClassifier)
CandlePiiClassifier runs iiiorg/piiranha-v1-detect-personal-information (NER token classification) to detect personal information in messages. Detected spans are merged with the existing regex-based PII filter — the union of both result sets is used.
Per-token confidence below pii_threshold is treated as O (no entity). Entity types include: GIVENNAME, EMAIL, PHONE, DRIVERLICENSE, PASSPORT, IBAN, and others as defined by the model.
Configuration
[classifiers]
enabled = true # Master switch (default: false)
timeout_ms = 5000 # Per-inference timeout in ms (default: 5000)
injection_model = "protectai/deberta-v3-small-prompt-injection-v2"
injection_threshold = 0.8 # Minimum score to classify as injection (default: 0.8)
# injection_model_sha256 = "abc123..." # Optional: verify model file integrity at load
pii_enabled = true # Enable NER PII detection (default: false)
pii_model = "iiiorg/piiranha-v1-detect-personal-information"
pii_threshold = 0.75 # Minimum per-token confidence (default: 0.75)
# pii_model_sha256 = "def456..." # Optional: verify model file integrity at load
SHA-256 verification: Set injection_model_sha256 or pii_model_sha256 to the hex digest of the model’s safetensors file. Zeph verifies the file before loading and aborts startup on mismatch. Use this in security-sensitive deployments to detect corruption or tampering.
Timeout fallback: When an inference call exceeds timeout_ms, Zeph falls back to the existing regex-based detection. Classifiers never block the agent — degraded mode is always available.
Model download: Models are downloaded from HuggingFace on first use and cached locally. Subsequent startups load from cache. Set injection_model / pii_model to a custom HuggingFace repo ID to use alternative models with the same DeBERTa architecture.