audio: add SpeechToText trait and OpenAI Whisper backend #522

@bug-ops

Description

Parent: #520
Depends on: #521

Context

We need a pluggable transcription abstraction so that different STT backends (the OpenAI Whisper API, local Whisper, future providers) can be used interchangeably.

Design

New module: crates/zeph-llm/src/stt.rs (or a separate zeph-stt crate)

use std::future::Future;

pub trait SpeechToText: Send + Sync {
    fn transcribe(
        &self,
        audio: &[u8],
        mime_type: &str,
    ) -> impl Future<Output = Result<Transcript, SttError>> + Send;
}

#[derive(Debug)]
pub struct Transcript {
    pub text: String,
    pub language: Option<String>,
    pub duration_secs: Option<f32>,
}

#[derive(Debug)]
pub enum SttError {
    UnsupportedFormat(String),
    FileTooLarge { size: usize, max: usize },
    TranscriptionFailed(String),
    NetworkError(String),
}
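
To confirm the abstraction composes, a caller can stay generic over the trait. A minimal sketch; handle_voice_message is a hypothetical helper, not part of this design:

async fn handle_voice_message<S: SpeechToText>(
    stt: &S,
    audio: &[u8],
) -> Result<String, SttError> {
    // Any backend (WhisperApi, a local model, ...) can be swapped in here.
    let transcript = stt.transcribe(audio, "audio/ogg").await?;
    Ok(transcript.text)
}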

OpenAI Whisper backend

pub struct WhisperApi {
    client: reqwest::Client,
    api_key: String,
    model: String,  // "whisper-1"
}
  • Uses POST /v1/audio/transcriptions with a multipart form (see the sketch after this list)
  • Enforces the API's 25 MB file size limit before upload
  • Supported formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm
  • Auto-detects the language, or accepts an explicit language parameter
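
A minimal sketch of the backend, assuming reqwest with the multipart feature and serde_json for response parsing. The endpoint URL is hard-coded for brevity, and the size constant, format list, error mapping, and MIME handling are illustrative, not final:

use reqwest::multipart::{Form, Part};

const MAX_UPLOAD_BYTES: usize = 25 * 1024 * 1024; // OpenAI's documented 25 MB limit
const SUPPORTED_FORMATS: &[&str] =
    &["flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"];

impl SpeechToText for WhisperApi {
    async fn transcribe(&self, audio: &[u8], mime_type: &str) -> Result<Transcript, SttError> {
        // Simplistic MIME-to-extension mapping ("audio/ogg" -> "ogg");
        // a real implementation would handle aliases like "audio/x-m4a".
        let ext = mime_type.rsplit('/').next().unwrap_or(mime_type);
        if !SUPPORTED_FORMATS.contains(&ext) {
            return Err(SttError::UnsupportedFormat(mime_type.to_owned()));
        }
        // Reject oversized payloads before spending bandwidth on the upload.
        if audio.len() > MAX_UPLOAD_BYTES {
            return Err(SttError::FileTooLarge { size: audio.len(), max: MAX_UPLOAD_BYTES });
        }

        let part = Part::bytes(audio.to_vec())
            .file_name(format!("audio.{ext}"))
            .mime_str(mime_type)
            .map_err(|e| SttError::TranscriptionFailed(e.to_string()))?;
        let form = Form::new()
            .part("file", part)
            .text("model", self.model.clone());

        let resp = self
            .client
            .post("https://api.openai.com/v1/audio/transcriptions")
            .bearer_auth(&self.api_key)
            .multipart(form)
            .send()
            .await
            .map_err(|e| SttError::NetworkError(e.to_string()))?;
        if !resp.status().is_success() {
            return Err(SttError::TranscriptionFailed(resp.status().to_string()));
        }

        // The default (json) response format is `{"text": "..."}`.
        let body: serde_json::Value = resp
            .json()
            .await
            .map_err(|e| SttError::TranscriptionFailed(e.to_string()))?;
        Ok(Transcript {
            text: body["text"].as_str().unwrap_or_default().to_owned(),
            language: None,
            duration_secs: None,
        })
    }
}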

Config

[audio]
enabled = true
backend = "whisper-api"  # or "whisper-local", "none"
language = "auto"        # or "en", "ru", etc.

[audio.whisper_api]
model = "whisper-1"
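
One possible shape for the deserialized config types; this assumes serde-based config parsing, which the issue doesn't specify, and the struct and field names are illustrative:

use serde::Deserialize;

#[derive(Debug, Deserialize)]
pub struct AudioConfig {
    pub enabled: bool,
    /// "whisper-api", "whisper-local", or "none"
    pub backend: String,
    /// "auto" or an explicit ISO 639-1 code like "en" or "ru"
    pub language: String,
    pub whisper_api: Option<WhisperApiConfig>,
}

#[derive(Debug, Deserialize)]
pub struct WhisperApiConfig {
    pub model: String, // "whisper-1"
}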

Acceptance criteria

  • SpeechToText trait defined
  • WhisperApi implementation with multipart upload
  • Supported format validation
  • File size check before upload
  • Config section for audio settings
  • Unit tests with mock HTTP responses
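
A sketch of the mock-HTTP unit test using the wiremock crate. It assumes wiremock and tokio as dev-dependencies and a hypothetical with_base_url constructor so the test can point the client at the mock server:

#[cfg(test)]
mod tests {
    use super::*;
    use wiremock::matchers::{method, path};
    use wiremock::{Mock, MockServer, ResponseTemplate};

    #[tokio::test]
    async fn transcribes_ogg_sample() {
        let server = MockServer::start().await;
        Mock::given(method("POST"))
            .and(path("/v1/audio/transcriptions"))
            .respond_with(ResponseTemplate::new(200).set_body_json(serde_json::json!({
                "text": "hello world"
            })))
            .mount(&server)
            .await;

        // `with_base_url` is a hypothetical test hook, not part of this design.
        let stt = WhisperApi::with_base_url(&server.uri(), "test-key", "whisper-1");
        let transcript = stt.transcribe(b"fake-ogg-bytes", "audio/ogg").await.unwrap();
        assert_eq!(transcript.text, "hello world");
    }
}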
