Conversation
Pull request overview
This PR implements local Whisper transcription for voice dictation, adding offline speech-to-text capabilities using Candle ML framework with GPU acceleration support (CUDA/Metal).
Changes:
- Adds local Whisper model support with voice activity detection (VAD) for automatic speech segmentation
- Implements model download management with progress tracking and cancellation
- Introduces auto-submit feature for voice commands ending with "submit"
Reviewed changes
Copilot reviewed 18 out of 22 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| ui/desktop/src/utils/analytics.ts | Adds 'auto_submit' action to voice dictation analytics tracking |
| ui/desktop/src/hooks/useAudioRecorder.ts | Replaces MediaRecorder with AudioContext/AudioWorklet for real-time VAD and automatic transcription |
| ui/desktop/src/components/settings/dictation/LocalModelManager.tsx | New component for managing local Whisper model downloads and selection |
| ui/desktop/src/components/settings/dictation/DictationSettings.tsx | Integrates LocalModelManager and updates provider labels |
| ui/desktop/src/components/ChatInput.tsx | Adds auto-submit detection, improved status display, and local provider error handling |
| ui/desktop/src/api/types.gen.ts | Adds API types for local provider, model management, and download progress |
| ui/desktop/src/api/sdk.gen.ts | Generates SDK methods for new dictation model endpoints |
| ui/desktop/src/api/index.ts | Exports new API types and methods |
| ui/desktop/openapi.json | Defines OpenAPI schema for model management endpoints |
| crates/goose/src/lib.rs | Exposes new dictation module |
| crates/goose/src/dictation/whisper.rs | Implements Whisper transcription with GGUF model loading and audio processing |
| crates/goose/src/dictation/providers.rs | Unified provider interface with local transcription support |
| crates/goose/src/dictation/mod.rs | Module definition for dictation subsystem |
| crates/goose/src/dictation/download_manager.rs | Manages async model downloads with progress tracking |
| crates/goose/Cargo.toml | Adds Candle ML dependencies with Metal/CUDA feature flags |
| crates/goose-server/src/routes/dictation.rs | Implements REST endpoints for model management and adds Groq provider |
| crates/goose-server/src/openapi.rs | Updates OpenAPI documentation with new schemas |
| crates/goose-server/Cargo.toml | Adds dependencies for model management |
Comments suppressed due to low confidence (1)

ui/desktop/src/hooks/useAudioRecorder.ts:1
- The regex removes content within parentheses, but it's unclear why. Add a comment explaining that this is likely removing Whisper's hallucinated or metadata text commonly output in parentheses, such as '(music playing)' or '(inaudible)'.

```ts
import { useState, useRef, useCallback, useEffect } from 'react';
```
```ts
// Import the worklet module - Vite will handle this correctly
const WORKLET_URL = new URL('../audio-capture-worklet.js', import.meta.url).href;
```
The worklet file 'audio-capture-worklet.js' is referenced but not included in the PR. This will cause a runtime error when attempting to load the AudioWorklet module. The file needs to be created or the path needs to be updated to point to an existing worklet implementation.
Suggested change:

```diff
-// Import the worklet module - Vite will handle this correctly
-const WORKLET_URL = new URL('../audio-capture-worklet.js', import.meta.url).href;
+// Inline AudioWorklet module to avoid relying on an external file.
+const WORKLET_CODE = `
+class AudioCaptureProcessor extends AudioWorkletProcessor {
+  process(inputs) {
+    const input = inputs[0];
+    if (input && input[0]) {
+      // Send the first channel's samples to the main thread.
+      this.port.postMessage(input[0]);
+    }
+    return true;
+  }
+}
+registerProcessor('audio-capture-processor', AudioCaptureProcessor);
+`;
+const WORKLET_URL = URL.createObjectURL(
+  new Blob([WORKLET_CODE], { type: 'application/javascript' }),
+);
```
```rust
    Ok(resampled)
}
```
The function save_wav_16khz is defined but never used in the file. Consider removing it if it's not needed, or add a usage comment if it's intended for future use or debugging purposes.
Suggested change:

```rust
// Helper function kept for debugging and manual inspection of 16kHz audio.
// Not used in normal operation; retained for future troubleshooting.
```
```rust
        suppress_tokens: SUPPRESS_TOKENS.to_vec(),
        max_target_positions: 448,
    },
    _ => panic!("Unknown model: {}", self.id),
```
Using panic! for an unknown model ID creates a poor user experience. Since this is a runtime configuration issue, it should return a Result with a descriptive error instead of panicking. Consider using anyhow::bail! to maintain the function's error handling pattern.
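A minimal sketch of the `Result`-based shape (the `Config` struct here is a stand-in for the real candle Whisper config, not the PR's type, and the enclosing function is assumed to change from returning `Config` to `anyhow::Result<Config>`):

```rust
use anyhow::{bail, Result};

// Illustrative stand-in for the real candle Whisper Config type.
struct Config {
    num_mel_bins: usize,
}

// Returning Result lets callers surface "unknown model" as a normal error
// instead of crashing the process on a runtime configuration issue.
fn whisper_config(model_id: &str) -> Result<Config> {
    match model_id {
        "tiny" => Ok(Config { num_mel_bins: 80 }),
        // ...other known model arms as in the original match...
        _ => bail!("Unknown model: {model_id}"),
    }
}
```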
```ts
if (!filteredText) {
  return;
}
```
The regex pattern for auto-submit detection is complex and should be documented. Add a comment explaining what patterns it matches (e.g., 'submit', 'submit.', 'submit!', etc.) and why trailing punctuation and whitespace are included.
Suggested change:

```ts
// Detect spoken "submit" commands at the end of the transcribed text.
// - \bsubmit      : match the word "submit" at a word boundary (case-insensitive because of /i)
// - [.,!?;'"\s]*$ : allow any trailing punctuation or whitespace up to the end of the string
// This lets users say phrases like "submit", "submit.", "submit!", or "submit " to auto-send
// the message, while excluding the "submit" keyword and its trailing punctuation/whitespace
// from the final message content when we later strip this pattern out.
```
```rust
let mut response = client.get(url).send().await?;

if !response.status().is_success() {
    anyhow::bail!("Failed to download: HTTP {}", response.status());
```
The error message only includes the HTTP status code without the response body. For better debugging, include the response text if available, similar to the error handling pattern in providers.rs lines 216-227.
Suggested change:

```diff
-anyhow::bail!("Failed to download: HTTP {}", response.status());
+let status = response.status();
+let body_text = match response.text().await {
+    Ok(text) => text,
+    Err(e) => format!("<failed to read response body: {e}>"),
+};
+anyhow::bail!("Failed to download: HTTP {status}, body: {body_text}");
```
```ts
};

const deleteModel = async (modelId: string) => {
  if (!confirm('Delete this model? You can re-download it later.')) return;
```
Using the native confirm dialog is not ideal for a modern UI. Consider using a custom confirmation modal or dialog component that matches the application's design system and provides better UX with clear action buttons.
```rust
use std::sync::Arc;
use std::time::Duration;
use utoipa::ToSchema;
```
The maximum audio size was increased from 25MB to 50MB, but the OpenAPI documentation at line 130 still states 'max 25MB'. Update the documentation to reflect the new 50MB limit or add a comment explaining the discrepancy.
Suggested change:

```rust
/// Maximum allowed audio size in bytes for dictation uploads (currently 50MB).
/// NOTE: This value was increased from 25MB to 50MB; ensure the OpenAPI documentation
/// reflects this 50MB limit and is kept in sync with this constant.
```
```ts
const RMS_THRESHOLD = 0.015;

// Import the worklet module - Vite will handle this correctly
const WORKLET_URL = new URL('../audio-capture-worklet.js', import.meta.url).href;
```
The audio worklet file path resolution using import.meta.url may fail in production builds if Vite doesn't properly handle the worklet module. Verify that the Vite configuration includes proper worklet handling, or consider inlining the worklet code as a blob URL for more reliable runtime behavior.
```ts
// Append transcribed text to the current input
const newValue = displayValue.trim() ? `${displayValue.trim()} ${text}` : text;

let filteredText = text.replace(/\([^)]*\)/g, '').trim();
```
The filter regex /\([^)]*\)/g on line 267 removes all text within parentheses from transcriptions. This could unintentionally remove legitimate content if the transcription service includes parenthetical information. Document why this filtering is necessary or make it more specific to the expected format.
Suggested change:

```diff
-let filteredText = text.replace(/\([^)]*\)/g, '').trim();
+// Some dictation providers wrap non-verbal markers (e.g. pauses or background noise)
+// in parentheses; we strip only these known markers to avoid removing legitimate
+// parenthetical content from the user's message.
+const NON_VERBAL_MARKER_REGEX = /\((?:pause|background noise|inaudible|unintelligible)\)/gi;
+let filteredText = text.replace(NON_VERBAL_MARKER_REGEX, '').trim();
```
```ts
const resp = await getDictationConfig();
setIsEnabled(!!resp.data?.[pref]?.configured);
setProvider(pref);
} catch {
```
Empty catch block silently swallows errors when loading models or selected model configuration. This makes it impossible to debug configuration issues. Consider logging the error before setting defaults.
Suggested change:

```diff
-} catch {
+} catch (error) {
+  onErrorRef.current(errorMessage(error));
```
```tsx
onBlur={() => setIsFocused(false)}
ref={textAreaRef}
rows={1}
readOnly={isRecording}
```
The textarea is set to readOnly when recording (line 1252), but the onChange handler is still active. This could lead to confusion if paste events or programmatic changes bypass the readonly attribute. Consider also disabling the onChange handler or checking isRecording in the handler.
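A minimal sketch of guarding the handler as well (the handler name is illustrative; `isRecording` and `setDisplayValue` are taken from the snippets quoted in this thread):

```tsx
// Hypothetical change handler: ignore edits while recording, so paste and
// other non-typing input paths can't bypass the readOnly attribute.
const handleTextAreaChange = (e: React.ChangeEvent<HTMLTextAreaElement>) => {
  if (isRecording) return;
  setDisplayValue(e.target.value);
};
```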
```ts
const MAX_RECORDING_DURATION_SECONDS = 10 * 60;
const SAMPLE_RATE = 16000;
const SILENCE_MS = 1200;
const MIN_SPEECH_MS = 300;
```
The RMS_THRESHOLD constant (0.015) is a magic number without explanation. This threshold determines speech detection sensitivity, which is critical for VAD functionality. Add a comment explaining how this value was determined and what audio amplitude range it corresponds to.
Suggested change:

```diff
 const MIN_SPEECH_MS = 300;
+// RMS is computed over Web Audio Float32 samples in the [-1, 1] range; 0.015 (~1.5% of full-scale)
+// was chosen empirically as the quietest level that reliably distinguishes normal speech from
+// typical background noise for 16 kHz mono mic input without clipping early speech onsets.
```
```rust
        suppress_tokens: SUPPRESS_TOKENS.to_vec(),
        max_target_positions: 448,
    },
    _ => panic!("Unknown model: {}", self.id),
```
The panic at line 143 violates Rust error handling best practices. Instead of panicking in library code, return an error using anyhow::bail! to maintain graceful error handling.
Suggested change:

```diff
-_ => panic!("Unknown model: {}", self.id),
+// Fallback to the "tiny" configuration for unknown model IDs to avoid panicking.
+_ => Config {
+    num_mel_bins: 80,
+    max_source_positions: 1500,
+    d_model: 384,
+    encoder_attention_heads: 6,
+    encoder_layers: 4,
+    decoder_attention_heads: 6,
+    decoder_layers: 4,
+    vocab_size: 51865,
+    suppress_tokens: SUPPRESS_TOKENS.to_vec(),
+    max_target_positions: 448,
+},
```
```rust
.map_err(|e| {
    e
})
.context("Failed to create audio decoder - please ensure browser sends WAV format audio")?;
```
The variable 'e' is intentionally silenced on line 602, but there's no logging or error handling. If the decoder creation fails, users won't know why, making debugging difficult. Consider adding a tracing::error! or including error context in the returned error.
Suggested change:

```diff
-.map_err(|e| {
-    e
-})
-.context("Failed to create audio decoder - please ensure browser sends WAV format audio")?;
+.context("Failed to create audio decoder - please ensure browser sends WAV format audio")?;
```
```ts
};

const deleteModel = async (modelId: string) => {
  if (!confirm('Delete this model? You can re-download it later.')) return;
```
Using window.confirm for a delete confirmation in the UI creates a blocking, non-customizable browser dialog. Consider using a proper modal component that matches the application's design system and provides better UX.
```ts
if (hasDownloadedNonRecommended && !showAllModels) {
  setShowAllModels(true);
}
}, [models]);
```
The useEffect at lines 41-51 has an incomplete dependency array. It reads 'models' and 'showAllModels' but only includes 'models' in the dependencies. This could lead to stale closures. Add 'showAllModels' to the dependency array or restructure the logic.
Suggested change:

```diff
-}, [models]);
+}, [models, showAllModels]);
```
```ts
  return;
}

const shouldAutoSubmit = /\bsubmit[.,!?;'"\s]*$/i.test(filteredText);
```
The regex pattern /\bsubmit[.,!?;'"\s]*$/i on line 275 only matches "submit" at the end of the text. If a user says "please submit this", it won't trigger auto-submit because "this" comes after "submit". Confirm that requiring "submit" as the final word (followed only by punctuation/whitespace) is the intended behavior, and document it.
```toml
repository.workspace = true
description.workspace = true

[features]
```
will this mean it only works on macOS?

I think it means we compile this for all platforms, but it will only trigger on macOS. We should test, of course!
```ts
const pollDownloadProgress = (modelId: string) => {
  const interval = setInterval(async () => {
    try {
      const response = await getDownloadProgress({ path: { model_id: modelId } });
      if (response.data) {
        const progress = response.data;
        setDownloads((prev) => new Map(prev).set(modelId, progress));

        if (progress.status === 'completed') {
          clearInterval(interval);
          await loadModels(); // Refresh model list
          // Backend auto-selects, but also update frontend state
          await loadSelectedModel();
        } else if (progress.status === 'failed') {
          clearInterval(interval);
          await loadModels();
        }
      } else {
        clearInterval(interval);
      }
    } catch (error) {
      clearInterval(interval);
    }
  }, 500);
};
```
The polling interval is never cleaned up properly. If the component unmounts while polling is active, the interval will continue running indefinitely, causing a memory leak. Store the interval ID in a ref and clean it up in a useEffect cleanup function.
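A minimal sketch of that cleanup pattern, to live inside the component (the polling body is elided; `pollDownloadProgress` is the function quoted above):

```tsx
import { useEffect, useRef } from 'react';

// Track interval IDs in a ref so unmount can clear them.
const intervalsRef = useRef<Map<string, ReturnType<typeof setInterval>>>(new Map());

const pollDownloadProgress = (modelId: string) => {
  const interval = setInterval(async () => {
    // ...poll getDownloadProgress and clearInterval(interval) on completed/failed,
    // as in the original implementation...
  }, 500);
  intervalsRef.current.set(modelId, interval);
};

useEffect(() => {
  const intervals = intervalsRef.current;
  return () => {
    // Runs on unmount: stop any still-active polls.
    intervals.forEach((id) => clearInterval(id));
    intervals.clear();
  };
}, []);
```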
```rust
pub fn cancel_download(&self, model_id: &str) -> Result<()> {
    let mut downloads = self
        .downloads
        .lock()
        .map_err(|_| anyhow::anyhow!("Failed to acquire lock"))?;

    if let Some(progress) = downloads.get_mut(model_id) {
        progress.status = DownloadStatus::Cancelled;
        Ok(())
    } else {
        anyhow::bail!("Download not found")
    }
}
```
The cancel_download method only sets a flag in the DownloadProgress struct, but the download task doesn't check for cancellation signals properly. The task checks the status inside the downloads HashMap on each chunk, but there's no mechanism to interrupt the ongoing HTTP request. Consider using a cancellation token (e.g., tokio_util::sync::CancellationToken) to properly abort the download.
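For illustration, a minimal sketch of the token approach (the struct shape and field names are assumptions, not the PR's actual types; assumes a tokio-util dependency):

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use tokio_util::sync::CancellationToken;

// Hypothetical: one token per in-flight download, kept alongside the progress map.
struct CancellableDownloads {
    tokens: Mutex<HashMap<String, CancellationToken>>,
}

impl CancellableDownloads {
    fn cancel_download(&self, model_id: &str) {
        if let Some(token) = self.tokens.lock().unwrap().get(model_id) {
            token.cancel(); // wakes the select! below and aborts the in-flight read
        }
    }
}

// Inside the download task, race each chunk read against cancellation:
async fn read_chunks(
    response: &mut reqwest::Response,
    token: &CancellationToken,
) -> anyhow::Result<()> {
    loop {
        tokio::select! {
            _ = token.cancelled() => anyhow::bail!("download cancelled"),
            chunk = response.chunk() => match chunk? {
                Some(bytes) => { /* write bytes to the destination file */ let _ = bytes; }
                None => return Ok(()),
            },
        }
    }
}
```

This interrupts the HTTP transfer at the next chunk boundary rather than merely flagging the status map.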
```rust
} else {
    usage_ratio > threshold
};
```
The removed debug logging provided valuable information about token usage and compaction thresholds. While reducing logging is generally good, this specific logging was useful for debugging context management issues. Consider whether this information is truly unnecessary or if it should be kept at a higher log level.
michaelneale left a comment:
seems ok - was able to download, and it worked with my accent (though not sure if I was doing it wrong, as it didn't submit for me - but that was ok).
Probably some valid nits from Copilot, but nice (assuming I tried the right thing). I wonder if we could make it more obvious that we recommend the local one. The word "Offline" at first made me think it was busted!
(could also have the microphone icon back on main chat which takes you to the config, as the config seems so complex now, or ... maybe rich I should say!)
Looks good! I think we can do a little better with the platform-specific candle features. If we move the candle-core and candle-nn deps to be target-specific dependencies, we can automatically set the features on the appropriate targets without needing our own cuda/metal features:
untested:
```toml
[dependencies]
# ...
candle-core = { version = "0.8" }
candle-nn = { version = "0.8" }

[target.'cfg(target_os = "macos")'.dependencies]
candle-core = { version = "0.8", features = ["metal"] }
candle-nn = { version = "0.8", features = ["metal"] }

[target.'cfg(target_os = "linux")'.dependencies]
candle-core = { version = "0.8", features = ["cuda"] }
candle-nn = { version = "0.8", features = ["cuda"] }
```

(and remove the features on our crate as no longer needed)
```text
# Conflicts:
#   Cargo.lock
```

```text
# Conflicts:
#   crates/goose-server/src/openapi.rs
#   crates/goose-server/src/routes/dictation.rs
#   ui/desktop/openapi.json
#   ui/desktop/src/api/index.ts
#   ui/desktop/src/api/sdk.gen.ts
#   ui/desktop/src/api/types.gen.ts
#   ui/desktop/src/components/ChatInput.tsx
#   ui/desktop/src/components/settings/dictation/DictationSettings.tsx
#   ui/desktop/src/hooks/useAudioRecorder.ts
```
Actually @DOsinga that feels true for metal, but cuda you could have on any platform, no?
```ts
const shouldAutoSubmit = /\bsubmit[.,!?;'"\s]*$/i.test(filteredText);

const cleanedText = shouldAutoSubmit
  ? filteredText.replace(/\bsubmit[.,!?;'"\s]*$/i, '').trim()
```
The regex pattern allows "submit" to appear anywhere within a word (e.g., "resubmit", "submitted"). The word boundary \b only applies at the start. Consider using \bsubmit\b to match only the complete word "submit".
Suggested change:

```diff
-const shouldAutoSubmit = /\bsubmit[.,!?;'"\s]*$/i.test(filteredText);
-const cleanedText = shouldAutoSubmit
-  ? filteredText.replace(/\bsubmit[.,!?;'"\s]*$/i, '').trim()
+const shouldAutoSubmit = /\bsubmit\b[.,!?;'"\s]*$/i.test(filteredText);
+const cleanedText = shouldAutoSubmit
+  ? filteredText.replace(/\bsubmit\b[.,!?;'"\s]*$/i, '').trim()
```
```rust
tokio::spawn(async move {
    match Self::download_file(&url, &destination, &downloads, &model_id_clone).await {
        Ok(_) => {
            if let Ok(mut downloads) = downloads.lock() {
                if let Some(progress) = downloads.get_mut(&model_id_clone) {
                    progress.status = DownloadStatus::Completed;
                    progress.progress_percent = 100.0;
                }
            }

            let _ = crate::config::Config::global()
                .set_param(LOCAL_WHISPER_MODEL_CONFIG_KEY, model_id_clone.clone());
        }
        Err(e) => {
            if let Ok(mut downloads) = downloads.lock() {
                if let Some(progress) = downloads.get_mut(&model_id_clone) {
                    progress.status = DownloadStatus::Failed;
                    progress.error = Some(e.to_string());
                }
            }
        }
    }
});
```
The download is spawned in a background task (tokio::spawn) but errors during download only update the internal state. Users have no notification if a download fails - they need to poll the progress endpoint to discover this. Consider adding a notification mechanism or event system to alert users of failures.
```rust
if should_cancel {
    // Clean up partial download
    let _ = tokio::fs::remove_file(destination).await;
    return Ok(());
}
```
The partial download file is only cleaned up if the cancellation check succeeds (line 189). If the task panics or is aborted before reaching this check, or if there's an error after the check but before completion, the partial file may be left on disk. Consider using RAII patterns or ensuring cleanup happens in all error paths.
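A minimal sketch of the RAII idea: a guard that deletes the partial file on drop unless the download completes and disarms it (the type and its names are illustrative, not from the PR):

```rust
use std::path::PathBuf;

// Hypothetical guard: removes the partial file on drop unless disarmed.
struct PartialFileGuard {
    path: PathBuf,
    armed: bool,
}

impl PartialFileGuard {
    fn new(path: PathBuf) -> Self {
        Self { path, armed: true }
    }

    // Call once the download has finished successfully.
    fn disarm(&mut self) {
        self.armed = false;
    }
}

impl Drop for PartialFileGuard {
    fn drop(&mut self) {
        if self.armed {
            // Runs on early return, error propagation, and panic unwinding alike.
            let _ = std::fs::remove_file(&self.path);
        }
    }
}
```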
```diff
@@ -212,17 +212,6 @@ pub async fn check_if_compaction_needed(
 } else {
     usage_ratio > threshold
 };
```
The comment says the logging has been removed to reduce noise ("the codebase needs less logging, not more"), but this was debug-level logging that could be useful for troubleshooting token usage and context compaction issues. Consider keeping this at trace level instead of removing it entirely.
Suggested change:

```diff
 };
+tracing::trace!(
+    current_tokens = current_tokens,
+    token_source = token_source,
+    context_limit = context_limit,
+    usage_ratio = usage_ratio,
+    threshold = threshold,
+    needs_compaction = needs_compaction,
+    "context compaction decision"
+);
```
```rust
let _ = crate::config::Config::global()
    .set_param(LOCAL_WHISPER_MODEL_CONFIG_KEY, model_id_clone.clone());
```
Using .set_param() could silently fail here. Consider adding error handling or logging if the auto-selection fails after a successful download.
Suggested change:

```diff
-let _ = crate::config::Config::global()
-    .set_param(LOCAL_WHISPER_MODEL_CONFIG_KEY, model_id_clone.clone());
+if let Err(e) = crate::config::Config::global()
+    .set_param(LOCAL_WHISPER_MODEL_CONFIG_KEY, model_id_clone.clone())
+{
+    eprintln!(
+        "Failed to set local whisper model config after download for model {}: {}",
+        model_id_clone, e
+    );
+}
```
```rust
if let Some(tokenizer_json) = bundled_tokenizer {
    if let Some(parent) = tokenizer_path.parent() {
        std::fs::create_dir_all(parent)?;
    }
    std::fs::write(&tokenizer_path, tokenizer_json)?;
    return Tokenizer::from_file(tokenizer_path)
        .map_err(|e| anyhow::anyhow!("Failed to load tokenizer: {}", e));
}
```
The tokenizer is written to disk (line 270) and then immediately re-read via Tokenizer::from_file. Write errors are propagated by the `?`, but the round trip is unnecessary and a post-write read can still fail unexpectedly; consider constructing the tokenizer directly from the in-memory bytes (e.g. Tokenizer::from_bytes).
```tsx
onBlur={() => setIsFocused(false)}
ref={textAreaRef}
rows={1}
readOnly={isRecording}
```
The textarea is made read-only during recording (line 1244), but the value can still be programmatically updated via setValue() and setDisplayValue() from transcription callbacks. If a transcription arrives while still recording, the user won't be able to edit the inserted text until they stop recording. Consider whether read-only should only prevent manual typing but not programmatic updates, or if the UI should allow editing during recording.
```ts
let filteredText = text.replace(/\([^)]*\)/g, '').trim();
```
The parentheses removal regex replace(/\([^)]*\)/g, '') will remove any content in parentheses, which could inadvertently remove legitimate content if the transcription service includes parenthesized text. Consider being more specific about what patterns to filter, or document why all parenthesized content should be removed.
Suggested change:

```diff
-let filteredText = text.replace(/\([^)]*\)/g, '').trim();
+let filteredText = text.trim();
+// Remove leading/trailing parenthesized annotations (e.g. "(laughter)")
+// that some dictation providers include, but keep inline parentheses.
+filteredText = filteredText.replace(/^\([^)]*\)\s*|\s*\([^)]*\)$/g, '').trim();
```
```rust
}

// Transcribe the audio
let (_, transcriber) = transcriber_lock.as_mut().unwrap();
```
The Option inside the locked state is unwrapped with .unwrap(), which will panic if the transcriber has not been initialized; in a multi-threaded context that panic also poisons the mutex and causes cascading failures in other threads. Consider propagating the error gracefully instead.
Suggested change:

```diff
-let (_, transcriber) = transcriber_lock.as_mut().unwrap();
+let (_, transcriber) = transcriber_lock
+    .as_mut()
+    .ok_or_else(|| anyhow::anyhow!("Transcriber not initialized"))?;
```
```ts
setTimeout(() => {
  performSubmit(newValue);
}, 100);
```
The cleanup timeout is hardcoded to 100ms. If the user clicks the microphone button again within 100ms while transcription is processing, this could cause a race condition where performSubmit is called with stale data, or the UI state becomes inconsistent.
Suggested change:

```diff
-setTimeout(() => {
-  performSubmit(newValue);
-}, 100);
+performSubmit(newValue);
```
```ts
// Determine if we should show all models by default (if non-recommended models are downloaded)
useEffect(() => {
  if (models.length === 0) return;

  const hasDownloadedNonRecommended = models.some(
    (model) => model.downloaded && !model.recommended
  );

  if (hasDownloadedNonRecommended && !showAllModels) {
    setShowAllModels(true);
  }
}, [models]);
```
The showAllModels state defaults to false but is set to true if non-recommended models are downloaded. However, this effect runs every time models array changes. If a user manually collapses the list after it auto-expands, it will auto-expand again on the next models refresh. Consider using a separate "hasAutoExpanded" flag or checking if showAllModels was user-set to avoid overriding user preference.
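A minimal sketch of the one-shot flag approach, reusing the state names from the snippet above (the ref name is illustrative):

```tsx
import { useEffect, useRef } from 'react';

// Hypothetical one-shot flag: auto-expand at most once, then respect the user's choice.
const hasAutoExpandedRef = useRef(false);

useEffect(() => {
  if (models.length === 0 || hasAutoExpandedRef.current) return;

  const hasDownloadedNonRecommended = models.some(
    (model) => model.downloaded && !model.recommended
  );

  if (hasDownloadedNonRecommended) {
    setShowAllModels(true);
    hasAutoExpandedRef.current = true; // never auto-expand again this mount
  }
}, [models]);
```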
```ts
const pollDownloadProgress = (modelId: string) => {
  const interval = setInterval(async () => {
    try {
      const response = await getDownloadProgress({ path: { model_id: modelId } });
      if (response.data) {
        const progress = response.data;
        setDownloads((prev) => new Map(prev).set(modelId, progress));

        if (progress.status === 'completed') {
          clearInterval(interval);
          await loadModels(); // Refresh model list
          // Backend auto-selects, but also update frontend state
          await loadSelectedModel();
        } else if (progress.status === 'failed') {
          clearInterval(interval);
          await loadModels();
        }
      } else {
        clearInterval(interval);
      }
    } catch (error) {
      clearInterval(interval);
    }
  }, 500);
};
```
The pollDownloadProgress function sets up an interval but never clears it in case of exception during error handling. If getDownloadProgress throws an exception at line 95, the catch block at line 113 clears the interval, but if the exception occurs elsewhere (e.g., loadModels or loadSelectedModel), the interval keeps running. Consider wrapping the entire try block content or using finally.
```ts
const mimeType = supportedTypes.find((type) => MediaRecorder.isTypeSupported(type)) || '';
const ctx = new AudioContext({ sampleRate: SAMPLE_RATE });
audioContextRef.current = ctx;
```
The RMS_THRESHOLD constant is defined with a specific comment explaining it's empirically determined for 16kHz mono input, but there's no validation that the audio context is actually using 16kHz. If the browser or system provides audio at a different sample rate, the threshold may be inaccurate, leading to incorrect speech detection. Consider validating that ctx.sampleRate matches SAMPLE_RATE or adjusting the threshold dynamically.
Suggested change:

```ts
// Validate that the actual sample rate matches our assumptions.
if (ctx.sampleRate !== SAMPLE_RATE) {
  await ctx.close();
  audioContextRef.current = null;
  streamRef.current?.getTracks().forEach((t) => t.stop());
  streamRef.current = null;
  onError(
    `Browser audio sample rate ${ctx.sampleRate} Hz is not supported; expected ${SAMPLE_RATE} Hz for dictation.`,
  );
  return;
}
```
crates/goose/Cargo.toml (outdated)
```toml
# Platform-specific GPU acceleration for Whisper
[target.'cfg(target_os = "macos")'.dependencies]
candle-core = { version = "0.8.4", features = ["metal"] }
candle-nn = { version = "0.8.4", features = ["metal"] }

[target.'cfg(not(target_os = "macos"))'.dependencies]
candle-core = { version = "0.8.4", features = ["cuda"] }
candle-nn = { version = "0.8.4", features = ["cuda"] }
```
The candle-core and candle-nn dependencies are duplicated - they're specified both in the main dependencies section (lines 96-98) without features, and again in platform-specific sections with metal/cuda features (lines 123-128). This creates a conflict where the platform-specific features won't be applied because the main dependencies take precedence. Remove the entries from the main dependencies section and keep only the platform-specific ones.
Suggested change (keep only the platform-specific entries):

```toml
# Platform-specific GPU acceleration for Whisper
[target.'cfg(target_os = "macos")'.dependencies]
candle-core = { version = "0.8.4", features = ["metal"] }
candle-nn = { version = "0.8.4", features = ["metal"] }

[target.'cfg(not(target_os = "macos"))'.dependencies]
candle-core = { version = "0.8.4", features = ["cuda"] }
candle-nn = { version = "0.8.4", features = ["cuda"] }
```
```rust
    Ok(text)
})
.await
.context("Transcription task failed")?
```
The LOCAL_TRANSCRIBER is a global mutable static with a Mutex. If transcription fails (e.g., due to corrupted model file or OOM), the transcriber remains in the mutex and subsequent requests will keep failing until the server restarts. Consider adding error recovery logic that clears the cached transcriber on failure so the next request can attempt to reload it.
Suggested change:

```diff
-.context("Transcription task failed")?
+.context("Transcription task failed")
+.and_then(|result| {
+    if result.is_err() {
+        if let Ok(mut guard) = LOCAL_TRANSCRIBER.lock() {
+            *guard = None;
+        }
+    }
+    result
+})
```
```ts
const cleanedText = shouldAutoSubmit
  ? filteredText.replace(/\bsubmit[.,!?;'"\s]*$/i, '').trim()
  : filteredText;
```
The regex removes the word "submit" and surrounding punctuation, but it doesn't handle multiple consecutive spaces that could result from the removal. For example, "please do this submit" would become "please do this " with double spaces. Consider trimming or normalizing whitespace after removal.
Suggested change:

```diff
-const cleanedText = shouldAutoSubmit
-  ? filteredText.replace(/\bsubmit[.,!?;'"\s]*$/i, '').trim()
-  : filteredText;
+const cleanedTextRaw = shouldAutoSubmit
+  ? filteredText.replace(/\bsubmit[.,!?;'"\s]*$/i, '').trim()
+  : filteredText;
+const cleanedText = cleanedTextRaw.replace(/\s+/g, ' ').trim();
```
```rust
// Initialize logging
tracing_subscriber::fmt::init();

let audio_path = "/tmp/whisper_audio_16k.wav";
```
The example hardcodes the audio file path to "/tmp/whisper_audio_16k.wav" which won't exist for most users trying to run this example. Consider either: 1) Using std::env::args() to accept a file path argument, 2) Generating a simple test audio file, or 3) Adding a comment explaining where to get a test file. This makes the example more useful for testing.
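A minimal sketch of option 1, keeping the hardcoded path as a default (assumes it sits inside the example's main function):

```rust
// Accept the audio path as an optional CLI argument, falling back to the
// previous hardcoded default so existing invocations still work.
let audio_path = std::env::args()
    .nth(1)
    .unwrap_or_else(|| "/tmp/whisper_audio_16k.wav".to_string());
println!("Transcribing {audio_path} (pass a .wav path as the first argument to override)");
```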
```rust
pub mod download_manager;
pub mod providers;
pub mod whisper;
```
The entire dictation module lacks test coverage. Given that the codebase has comprehensive testing (with tests in most other modules), this new dictation functionality should have tests for: 1) Model configuration and loading, 2) Download manager functionality, 3) Provider transcription flows, 4) Error handling paths. This is especially important for the complex audio processing and model management code.
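As a seed for that, a hedged sketch of one such test; `DownloadManager::default()` is an assumption, not the PR's verified constructor, so adjust to the real API:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    // Relies on the cancel_download signature quoted earlier in this thread;
    // cancelling a download that was never started should be a normal error.
    #[test]
    fn cancel_unknown_download_is_an_error() {
        let manager = DownloadManager::default();
        assert!(manager.cancel_download("no-such-model").is_err());
    }
}
```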
Random commentary: Living for this reference, actually stoked for local whisper support!!
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
```ts
let filteredText = text.replace(/\([^)]*\)/g, '').trim();

if (!filteredText) {
  return;
}

const shouldAutoSubmit = /\bsubmit[.,!?;'"\s]*$/i.test(filteredText);

const cleanedText = shouldAutoSubmit
  ? filteredText.replace(/\bsubmit[.,!?;'"\s]*$/i, '').trim()
```
The regex patterns /\([^)]*\)/g and /\bsubmit[.,!?;'"\s]*$/i could be compiled once outside the callback to avoid recompiling on every transcription. Move these to module-level constants for better performance.
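A minimal sketch of that hoisting (the constant names are illustrative):

```ts
// Hypothetical module-level constants so the patterns are compiled once
// instead of on every transcription callback.
// Note: the /g regex is only used with .replace() here, so its shared
// lastIndex state is not an issue; avoid reusing it with .test()/.exec().
const PAREN_ANNOTATION_RE = /\([^)]*\)/g;
const AUTO_SUBMIT_RE = /\bsubmit[.,!?;'"\s]*$/i;

// Inside the callback:
// const filteredText = text.replace(PAREN_ANNOTATION_RE, '').trim();
// const shouldAutoSubmit = AUTO_SUBMIT_RE.test(filteredText);
```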
```ts
const stopRecording = useCallback(() => {
  if (isSpeakingRef.current && samplesRef.current.length > 0) {
    flushRef.current();
  }
  isSpeakingRef.current = false;
  silenceStartRef.current = 0;
```
The silenceStartRef is set to 0 on line 188 but this happens after potentially flushing on line 185. If flush() triggers a transcription that takes time, there's a race condition where isSpeakingRef could be set back to true before the flush completes. Consider the order of operations here.
```rust
PROVIDERS
    .iter()
    .find(|def| def.provider == provider)
    .unwrap() // Safe because all enum variants are in PROVIDERS
```
Using .unwrap() on line 88 is unsafe. If all enum variants aren't covered in PROVIDERS (though they should be), this will panic. Consider using .expect() with a descriptive message, or better yet, return a Result to handle this gracefully.
Suggested change:

```diff
-.unwrap() // Safe because all enum variants are in PROVIDERS
+.expect("DictationProvider missing from PROVIDERS; ensure all enum variants are listed")
```
```rust
let mut transcriber_lock = LOCAL_TRANSCRIBER
    .lock()
    .map_err(|e| anyhow::anyhow!("Failed to lock transcriber: {}", e))?;
```
The transcriber mutex could be poisoned if a panic occurs during transcription (line 149). The current error handling with .map_err() converts the poison error to anyhow but doesn't attempt recovery. Consider explicitly handling PoisonError to potentially clear the poisoned transcriber and retry initialization.
Suggested change:

```diff
-let mut transcriber_lock = LOCAL_TRANSCRIBER
-    .lock()
-    .map_err(|e| anyhow::anyhow!("Failed to lock transcriber: {}", e))?;
+let mut transcriber_lock = match LOCAL_TRANSCRIBER.lock() {
+    Ok(guard) => guard,
+    Err(poisoned) => {
+        // Recover from a poisoned mutex by clearing the cached transcriber
+        tracing::warn!("Transcriber mutex was poisoned; resetting cached transcriber");
+        let mut guard = poisoned.into_inner();
+        *guard = None;
+        guard
+    }
+};
```
```rust
pub fn local_path(&self) -> PathBuf {
    let filename = self.url.rsplit('/').next().unwrap_or("");
```
Note that rsplit('/').next() always yields Some (for a URL without '/', it returns the whole string), so the unwrap_or("") fallback on line 86 can never actually fire; a malformed URL silently becomes the filename instead. While Whisper model URLs are hardcoded and should always contain '/', consider falling back to self.id (e.g. .unwrap_or(self.id)) for a more predictable filename.
```rust
if let Some(tokenizer_json) = bundled_tokenizer {
    if let Some(parent) = tokenizer_path.parent() {
        std::fs::create_dir_all(parent)?;
    }
    std::fs::write(&tokenizer_path, tokenizer_json)?;
    return Tokenizer::from_file(tokenizer_path)
        .map_err(|e| anyhow::anyhow!("Failed to load tokenizer: {}", e));
}
```
The tokenizer file is written to disk every time it doesn't exist (lines 267-271), but if multiple requests come in simultaneously, this could lead to a race condition where multiple threads try to write the same file. Consider using std::fs::OpenOptions with create_new(true) to atomically check-and-create, or add a lock around this operation.
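A minimal sketch of the atomic check-and-create variant, assuming tokenizer_json is a &str of the bundled JSON and this sits in the existing function (which returns an anyhow::Result):

```rust
use std::fs::OpenOptions;
use std::io::{ErrorKind, Write};

// create_new(true) makes the existence check and file creation a single
// atomic step, so two concurrent callers can't both decide to write.
match OpenOptions::new()
    .write(true)
    .create_new(true)
    .open(&tokenizer_path)
{
    Ok(mut file) => file.write_all(tokenizer_json.as_bytes())?,
    Err(e) if e.kind() == ErrorKind::AlreadyExists => {
        // Another thread or process already wrote the tokenizer; read it below.
    }
    Err(e) => return Err(e.into()),
}
```

A reader could still observe a partially written file from the winning writer; writing to a temp file and renaming it into place would close that window too.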
```rust
const MODELS: &[WhisperModel] = &[
    WhisperModel {
        id: "tiny",
        size_mb: 40,
        url: "https://huggingface.co/oxide-lab/whisper-tiny-GGUF/resolve/main/model-tiny-q80.gguf",
        description: "Fastest, ~2-3x realtime on CPU (5-10x with GPU)",
    },
    WhisperModel {
        id: "base",
        size_mb: 78,
        url: "https://huggingface.co/oxide-lab/whisper-base-GGUF/resolve/main/whisper-base-q8_0.gguf",
        description: "Good balance, ~1.5-2x realtime on CPU (4-8x with GPU)",
    },
    WhisperModel {
        id: "small",
        size_mb: 247,
        url: "https://huggingface.co/oxide-lab/whisper-small-GGUF/resolve/main/whisper-small-q8_0.gguf",
        description: "High accuracy, ~0.8-1x realtime on CPU (3-5x with GPU)",
    },
    WhisperModel {
        id: "medium",
        size_mb: 777,
        url: "https://huggingface.co/oxide-lab/whisper-medium-GGUF/resolve/main/whisper-medium-q8_0.gguf",
        description: "Highest accuracy, ~0.5x realtime on CPU (2-4x with GPU)",
    },
];
```
The hardcoded HuggingFace URLs use oxide-lab as the organization. These models should be verified to exist and be maintained. If these are temporary or test models, consider adding a comment indicating the source and whether they're production-ready.
```ts
};

const deleteModel = async (modelId: string) => {
  if (!window.confirm('Delete this model? You can re-download it later.')) return;
```
The LocalModelManager component uses window.confirm() for delete confirmation on line 134. This is not ideal for a React application and doesn't match the UI patterns used elsewhere. Consider using a proper confirmation modal component for consistency with the rest of the application.
```rust
use crate::config::paths::Paths;

pub const LOCAL_WHISPER_MODEL_CONFIG_KEY: &str = "LOCAL_WHISPER_MODEL";
```
The constant LOCAL_WHISPER_MODEL_CONFIG_KEY is defined in both whisper.rs (line 11) and used in download_manager.rs (line 1) via import. However, in providers.rs it's also used (line 2). This creates tight coupling. Consider whether this should be moved to a shared config module to avoid circular dependencies and improve maintainability.
```diff
 let context_limit = provider.get_model_config().context_limit();

-let (current_tokens, token_source) = match session.total_tokens {
+let (current_tokens, _token_source) = match session.total_tokens {
```
The debug logging that showed token usage and compaction decisions was removed. While the guideline says "the codebase needs less logging, not more", this particular logging was useful for debugging context window management issues. The variable _token_source (line 191) is now unused; consider removing the assignment entirely if the logging is intentionally removed.
Co-authored-by: Douwe Osinga <douwe@squareup.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Harrison <hcstebbins@gmail.com>
Summary
Trans-atlantic plane wifi FTW.
This is based on an earlier restructuring of how we record audio, but besides that, it implements local Whisper support with model download & recommendation.
It also updates the client to detect pauses for pseudo-streaming, and to auto-submit if you end your sentence on, well, "submit".
Bonus: also adds Groq as a provider (faster than OpenAI).