audio: add local Whisper backend via candle (feature-gated) #523

@bug-ops

Description

Parent: #520
Depends on: #522

Context

For offline/air-gapped environments and zero API cost, provide local speech-to-text using Whisper via candle. Feature-gated behind `candle` (or a new `whisper-local` flag).

Design

Implementation

Use the candle-transformers Whisper example as a reference:

  • Download model weights from Hugging Face (whisper-tiny, whisper-base, whisper-small)
  • Audio decoding: use the `symphonia` crate for container/codec handling (ogg, mp3, wav, flac)
  • Mel spectrogram computation and Whisper inference via candle
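The mel step above hinges on a Hz-to-mel mapping used to place the filterbank edges. A minimal sketch of that mapping, using the common HTK-style formula (Whisper's actual filterbank, derived from librosa, differs slightly in the upper range; the function names here are illustrative, not the crate's API):

```rust
// Hz <-> mel conversion used when laying out the triangular filters
// for a log-mel spectrogram. HTK-style formula: mel = 2595 * log10(1 + f/700).
pub fn hz_to_mel(hz: f64) -> f64 {
    2595.0 * (1.0 + hz / 700.0).log10()
}

pub fn mel_to_hz(mel: f64) -> f64 {
    // Exact inverse of hz_to_mel.
    700.0 * (10f64.powf(mel / 2595.0) - 1.0)
}

fn main() {
    // Whisper uses 80 mel bins over 0..8000 Hz (16 kHz audio);
    // 80 triangular filters need 82 evenly spaced mel edge points.
    let (lo, hi) = (hz_to_mel(0.0), hz_to_mel(8000.0));
    let step = (hi - lo) / 81.0;
    let edges: Vec<f64> = (0..82).map(|i| mel_to_hz(lo + step * i as f64)).collect();
    println!("first edges (Hz): {:.1} {:.1} {:.1}", edges[0], edges[1], edges[2]);
}
```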

Model options

| Model | Size | VRAM | Quality |
|-------|------|------|---------|
| whisper-tiny | 39 MB | ~1 GB | Good for short commands |
| whisper-base | 74 MB | ~1 GB | Better accuracy |
| whisper-small | 244 MB | ~2 GB | Best quality, still fast |
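The three model names map onto the official `openai/whisper-*` repos on Hugging Face. A sketch of that mapping (the function and error type are placeholders, not the crate's real API):

```rust
// Illustrative mapping from the config's model name to a Hugging Face
// repo id, accepting both short and long spellings.
pub fn model_repo(name: &str) -> Result<&'static str, String> {
    match name {
        "whisper-tiny" | "tiny" => Ok("openai/whisper-tiny"),
        "whisper-base" | "base" => Ok("openai/whisper-base"),
        "whisper-small" | "small" => Ok("openai/whisper-small"),
        other => Err(format!("unsupported whisper model: {other}")),
    }
}

fn main() {
    println!("{}", model_repo("whisper-base").unwrap());
}
```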

Config

```toml
[audio]
backend = "whisper-local"

[audio.whisper_local]
model = "whisper-base"  # tiny, base, small
device = "auto"         # cpu, metal, cuda
```
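A sketch of how `device = "auto"` could resolve. In the real backend the checks would be candle's `candle_core::utils::metal_is_available()` / `cuda_is_available()`; compile-time `cfg!` tests stand in here so the sketch runs without the candle dependency, and the fallback heuristic is illustrative only:

```rust
// Resolve the configured device string to a concrete backend device.
pub fn resolve_device(requested: &str) -> &'static str {
    match requested {
        "cpu" => "cpu",
        "metal" => "metal",
        "cuda" => "cuda",
        // "auto": prefer an accelerator the platform could have,
        // otherwise fall back to CPU (placeholder for runtime probes).
        _ => {
            if cfg!(target_os = "macos") {
                "metal"
            } else {
                "cpu"
            }
        }
    }
}

fn main() {
    println!("auto -> {}", resolve_device("auto"));
}
```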

Feature gate

```toml
[features]
whisper-local = ["dep:symphonia", "dep:candle-core", "dep:candle-nn", "dep:candle-transformers", "dep:hf-hub"]
```
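With that feature declared, the backend can be kept entirely out of the default build via `cfg` gating. A minimal sketch (`whisper_local` module and `backend_available` are illustrative names, not the crate's real API):

```rust
// The whole module only compiles when the feature is enabled, so the
// heavy candle/symphonia dependencies never touch the default build.
#[cfg(feature = "whisper-local")]
#[allow(dead_code)]
mod whisper_local {
    pub struct WhisperLocal;
}

// Backend registry can report availability based on the same cfg.
pub fn backend_available(name: &str) -> bool {
    match name {
        "whisper-local" => cfg!(feature = "whisper-local"),
        _ => false,
    }
}

fn main() {
    // In a default build (feature off) this prints false.
    println!("{}", backend_available("whisper-local"));
}
```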

Acceptance criteria

  • Implements `SpeechToText` trait
  • Auto-downloads model on first use via hf-hub
  • Supports Metal (macOS) and CUDA (Linux) acceleration
  • CPU fallback works
  • Feature-gated, does not affect default build
  • Transcription of 10s audio completes in <2s on Metal
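The latency criterion above can be checked with a simple timing wrapper around the backend call. A sketch with a stub in place of the real transcription (all names hypothetical):

```rust
use std::time::{Duration, Instant};

// Stand-in for the real whisper-local backend call.
fn transcribe(_samples: &[f32]) -> String {
    String::from("(transcript)")
}

// Time a transcription so the "<2 s for 10 s of audio" budget can be
// asserted in an integration test.
pub fn timed_transcribe(samples: &[f32]) -> (String, Duration) {
    let start = Instant::now();
    let text = transcribe(samples);
    (text, start.elapsed())
}

fn main() {
    let ten_seconds = vec![0.0f32; 16_000 * 10]; // 10 s of 16 kHz mono audio
    let (text, took) = timed_transcribe(&ten_seconds);
    println!("{text} in {took:?}");
}
```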

Labels: enhancement (New feature or request)