Local Inference (Candle)

Run HuggingFace GGUF models locally via candle without external API dependencies. Metal and CUDA GPU acceleration are supported.

cargo build --release --features candle,metal  # macOS with Metal GPU
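
The same pattern applies to the other backends listed under Device Auto-Detection below. The CUDA feature name comes from that section; building with only the candle feature for a CPU-only binary is an assumption about how the features compose:

cargo build --release --features candle,cuda   # Linux with an NVIDIA GPU
cargo build --release --features candle        # CPU only (assumed: no GPU feature enabled)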

Configuration

[llm]
provider = "candle"

[llm.candle]
source = "huggingface"
repo_id = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
filename = "mistral-7b-instruct-v0.2.Q4_K_M.gguf"
chat_template = "mistral"          # llama3, chatml, mistral, phi3, raw
embedding_repo = "sentence-transformers/all-MiniLM-L6-v2"  # optional BERT embeddings

[llm.candle.generation]
temperature = 0.7
top_p = 0.9
top_k = 40
max_tokens = 2048
repeat_penalty = 1.1
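
If you want to load this configuration from your own Rust code, the TOML above maps directly onto serde structs. The following is a minimal sketch, assuming the serde, toml, and anyhow crates; the struct names are illustrative and mirror the TOML keys rather than the project's actual types:

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct LlmConfig {
    provider: String,
    candle: CandleConfig,
}

#[derive(Debug, Deserialize)]
struct CandleConfig {
    source: String,
    repo_id: String,
    filename: String,
    chat_template: String,
    embedding_repo: Option<String>, // optional BERT embedding model
    generation: GenerationConfig,
}

#[derive(Debug, Deserialize)]
struct GenerationConfig {
    temperature: f64,
    top_p: f64,
    top_k: usize,
    max_tokens: usize,
    repeat_penalty: f32,
}

fn main() -> anyhow::Result<()> {
    // The [llm] table wraps everything, so deserialize through a top-level struct.
    #[derive(Debug, Deserialize)]
    struct Root { llm: LlmConfig }

    let text = std::fs::read_to_string("config.toml")?;
    let root: Root = toml::from_str(&text)?;
    println!("{:#?}", root.llm);
    Ok(())
}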

Chat Templates

Template   Models
llama3     Llama 3, Llama 3.1
chatml     Qwen, Yi, OpenHermes
mistral    Mistral, Mixtral
phi3       Phi-3
raw        No template (raw completion)
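
Each template name selects a different prompt wrapper around your messages. As a rough illustration of what these wrappers look like for a single user turn (the strings below follow the public model cards; the exact output the project emits may differ):

/// Wrap a single user prompt in the named chat template.
/// "raw" passes the text through unchanged.
fn apply_template(template: &str, user: &str) -> String {
    match template {
        "llama3" => format!(
            "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>\
             <|start_header_id|>assistant<|end_header_id|>\n\n"
        ),
        "chatml" => format!(
            "<|im_start|>user\n{user}<|im_end|>\n<|im_start|>assistant\n"
        ),
        "mistral" => format!("<s>[INST] {user} [/INST]"),
        "phi3" => format!("<|user|>\n{user}<|end|>\n<|assistant|>\n"),
        _ => user.to_string(), // "raw": no template applied
    }
}

fn main() {
    println!("{}", apply_template("mistral", "Hello!"));
}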

Device Auto-Detection

  • macOS — Metal GPU (requires --features metal)
  • Linux with NVIDIA — CUDA (requires --features cuda)
  • Fallback — CPU
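
Device selection at startup can lean on candle's own availability helpers. A sketch using candle_core (the project's actual selection logic may differ in details):

use candle_core::{Device, Result};

/// Pick the best available device: Metal on macOS, CUDA on NVIDIA Linux, else CPU.
fn auto_device() -> Result<Device> {
    if candle_core::utils::metal_is_available() {
        // Built with --features metal on macOS
        Device::new_metal(0)
    } else if candle_core::utils::cuda_is_available() {
        // Built with --features cuda on a machine with an NVIDIA GPU
        Device::new_cuda(0)
    } else {
        Ok(Device::Cpu)
    }
}

fn main() -> Result<()> {
    let device = auto_device()?;
    println!("using device: {device:?}");
    Ok(())
}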