Audio and Vision

Zeph supports audio transcription and image input; availability varies by channel, as noted in the sections below.

Audio Input

Pipeline: Audio attachment → STT provider → Transcribed text → Agent loop

Configuration

Enable the stt feature flag:

cargo build --release --features stt

Then add an [llm.stt] section to the configuration:

[llm.stt]
provider = "whisper"
model = "whisper-1"

When base_url is omitted, the provider uses the OpenAI API key from [llm.openai] or ZEPH_OPENAI_API_KEY. Set base_url to point at any OpenAI-compatible server (no API key required for local servers). The language field accepts an ISO-639-1 code (e.g. ru, en, de) or auto for automatic detection.
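
For example, a configuration that relies on the OpenAI key and auto-detects the spoken language might look like this (illustrative values):

[llm.stt]
provider = "whisper"
model = "whisper-1"
language = "auto"   # or an ISO-639-1 code such as "ru"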

Environment variable overrides: ZEPH_STT_PROVIDER, ZEPH_STT_MODEL, ZEPH_STT_LANGUAGE, ZEPH_STT_BASE_URL.
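
These are handy for quick experiments without editing the config file, for example (illustrative values):

export ZEPH_STT_LANGUAGE=de
export ZEPH_STT_BASE_URL=http://127.0.0.1:8080/v1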

Backends

Backend                  | Provider       | Feature | Description
OpenAI Whisper API       | whisper        | stt     | Cloud-based transcription
OpenAI-compatible server | whisper        | stt     | Any local server with /v1/audio/transcriptions
Local Whisper            | candle-whisper | candle  | Fully offline via candle

Local Whisper Server (whisper.cpp)

The recommended setup for local speech-to-text. Uses Metal acceleration on Apple Silicon and handles all audio formats (including Telegram OGG/Opus) server-side.

Install and run:

brew install whisper-cpp

# Download a model
curl -L -o ~/.cache/whisper/ggml-large-v3.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3.bin

# Start the server
whisper-server \
  --model ~/.cache/whisper/ggml-large-v3.bin \
  --host 127.0.0.1 --port 8080 \
  --inference-path "/v1/audio/transcriptions" \
  --convert
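
Before pointing Zeph at the server, you can verify it answers on the expected path by posting an audio file directly (a minimal check; sample.wav stands in for any local audio file, and whisper-server transcribes with the model it was started with):

curl http://127.0.0.1:8080/v1/audio/transcriptions \
  -F file=@sample.wav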

Configure Zeph:

[llm.stt]
provider = "whisper"
model = "large-v3"
base_url = "http://127.0.0.1:8080/v1"
language = "en"   # ISO-639-1 code or "auto"
ModelParametersDiskNotes
ggml-tiny39M~75 MBFastest, lower accuracy
ggml-base74M~142 MBGood balance
ggml-small244M~466 MBBetter accuracy
ggml-large-v31.5B~2.9 GBBest accuracy

Local Whisper (Candle)

cargo build --release --features candle   # CPU
cargo build --release --features metal    # macOS Metal GPU
cargo build --release --features cuda     # NVIDIA GPU

Then configure the provider:

[llm.stt]
provider = "candle-whisper"
model = "openai/whisper-tiny"

Model                | Parameters | Disk
openai/whisper-tiny  | 39M        | ~150 MB
openai/whisper-base  | 74M        | ~290 MB
openai/whisper-small | 244M       | ~950 MB

Models are downloaded from HuggingFace on first use. Device auto-detection: Metal → CUDA → CPU.

Channel Support

  • Telegram: voice notes and audio files downloaded automatically
  • Slack: audio uploads detected, downloaded via url_private_download (25 MB limit, .slack.com host validation). Requires files:read OAuth scope
  • CLI/TUI: no audio input mechanism

Limits

  • 5-minute audio duration guard (candle backend)
  • 25 MB file size limit
  • No streaming transcription — entire file processed in one pass
  • One audio attachment per message

Image Input

Pipeline: Image attachment → MessagePart::Image → LLM provider (base64) → Response

Provider Support

Provider | Vision | Notes
Claude   | Yes    | Anthropic image content block
OpenAI   | Yes    | image_url data-URI
Ollama   | Yes    | Optional vision_model routing
Candle   | No     | Text-only

Ollama Vision Model

Route image requests to a dedicated model while keeping a smaller text model for regular queries:

[llm]
model = "mistral:7b"
vision_model = "llava:13b"

Sending Images

  • CLI/TUI: /image /path/to/screenshot.png What is shown in this image?
  • Telegram: send a photo directly; the caption becomes the prompt

Limits

  • 20 MB maximum image size
  • One image per message
  • No image generation (input only)