Refactor with continuous VAD stream + AI bump #1358
Conversation
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Replaced the hypr-chunker crate with a new hypr-vad crate across the workspace, removing the old chunker implementation and exports and adding a continuous VAD stream and speech-chunking API. Updated downstream services, some crate manifests, and Whisper/LLM dependency pins.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Mic as AsyncSource (mic/speaker)
    participant VAD as ContinuousVadStream (hypr-vad)
    participant Svc as Streaming Service
    participant ASR as Transcriber
    Mic->>VAD: stream audio samples
    VAD-->>Svc: VadStreamItem::AudioSamples
    VAD-->>Svc: VadStreamItem::SpeechStart
    VAD-->>Svc: VadStreamItem::SpeechEnd(samples, timestamps)
    Svc->>ASR: AudioChunk(samples, start, end)
    ASR-->>Svc: transcription segments
```
```mermaid
sequenceDiagram
    participant Source as AsyncSource
    participant Ext as VadExt::speech_chunks()
    participant Consumer as process_vad_stream
    Source->>Ext: build VadConfig (redemption_time)
    Ext->>Ext: run ContinuousVadStream -> detect SpeechEnd -> emit AudioChunk
    Ext-->>Consumer: Stream<Result<AudioChunk, Error>>
    Consumer-->>Downstream: forward SimpleAudioChunk
```
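To illustrate the second diagram's filtering step, here is a self-contained, synchronous sketch of how `speech_chunks()` keeps only `SpeechEnd` items and maps them to `AudioChunk`s. The real `hypr-vad` types are async-stream based; the enum and struct below are mocks that mirror the shapes discussed in the review, not the crate's actual definitions.

```rust
// Mock of the hypr-vad item shapes described in this PR; illustrative only.
#[derive(Debug, Clone, PartialEq)]
enum VadStreamItem {
    AudioSamples(Vec<f32>),
    SpeechStart { timestamp_ms: u64 },
    SpeechEnd { samples: Vec<f32>, start_timestamp_ms: u64, end_timestamp_ms: u64 },
}

#[derive(Debug, Clone, PartialEq)]
struct AudioChunk {
    samples: Vec<f32>,
    start_timestamp_ms: u64,
    end_timestamp_ms: u64,
}

// Same filtering speech_chunks() performs: keep only SpeechEnd, drop the rest.
fn speech_chunks(items: Vec<VadStreamItem>) -> Vec<AudioChunk> {
    items
        .into_iter()
        .filter_map(|item| match item {
            VadStreamItem::SpeechEnd { samples, start_timestamp_ms, end_timestamp_ms } => {
                Some(AudioChunk { samples, start_timestamp_ms, end_timestamp_ms })
            }
            _ => None,
        })
        .collect()
}

fn main() {
    let items = vec![
        VadStreamItem::AudioSamples(vec![0.0; 480]),
        VadStreamItem::SpeechStart { timestamp_ms: 120 },
        VadStreamItem::SpeechEnd {
            samples: vec![0.1; 960],
            start_timestamp_ms: 120,
            end_timestamp_ms: 180,
        },
    ];
    let chunks = speech_chunks(items);
    // Only the SpeechEnd item survives as a chunk.
    assert_eq!(chunks.len(), 1);
    assert_eq!(chunks[0].start_timestamp_ms, 120);
}
```

In the actual crate this happens lazily over an async `Stream` via `filter_map`, but the per-item mapping is the same.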
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
cubic analysis
1 issue found across 17 files • Review in cubic
Actionable comments posted: 2
🧹 Nitpick comments (9)
crates/whisper-local/src/model.rs (2)
158-160: Consider tracking confidence calculation as technical debt

The hardcoded confidence value of 1.0, together with the comment about API changes, suggests this is a temporary workaround. Consider creating a tracking issue for properly implementing confidence scoring once the new whisper-rs API is better understood.
Would you like me to create an issue to track the re-implementation of confidence calculation once the whisper-rs v1.7.6+ API is better documented?
322-322: Consider removing or conditionalizing debug output

The `println!` statement appears to be debug output that should be removed or wrapped in a debug flag before merging to production.

```diff
 let segments = whisper.transcribe(&audio).unwrap();
-println!("segments: {:#?}", segments);
+#[cfg(debug_assertions)]
+println!("segments: {:#?}", segments);
 assert!(segments.len() > 0);
```

crates/vad/src/lib.rs (3)
15-45: Equality test for continuous VAD audio passthrough — OK, but heavy

Exact f32 equality is acceptable here since the VAD path does not transform samples; however, this test decodes and buffers the entire asset twice, which is heavy for CI. Consider reducing the test asset size, or marking the test `#[ignore]` for quick CI cycles and enabling it in a nightly/extended test run.
47-69: Replace magic number 16.0 with explicit sample rate

Avoid magic constants; derive time from an explicit sample rate for clarity and maintainability. Apply:

```diff
-    let how_many_sec = (all_audio_from_vad.len() as f64 / 16.0) / 1000.0;
+    const SAMPLE_RATE_HZ: f64 = 16_000.0;
+    let how_many_sec = all_audio_from_vad.len() as f64 / SAMPLE_RATE_HZ;
```
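The arithmetic behind this suggestion is simple enough to verify in isolation: at 16 kHz mono, dividing the sample count by the sample rate yields seconds directly, with no separate milliseconds step. A minimal sketch (the constant name is from the suggested diff; the helper function is illustrative):

```rust
// 16 kHz mono: sample_count / 16_000 gives seconds directly,
// replacing the `/ 16.0` then `/ 1000.0` two-step.
const SAMPLE_RATE_HZ: f64 = 16_000.0;

fn samples_to_secs(sample_count: usize) -> f64 {
    sample_count as f64 / SAMPLE_RATE_HZ
}

fn main() {
    assert_eq!(samples_to_secs(16_000), 1.0); // one second of audio
    assert_eq!(samples_to_secs(8_000), 0.5);  // half a second
}
```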
70-80: Avoid writing artifacts during tests

Writing `./test.wav` in tests pollutes the workspace and can cause conflicts in parallel runs. Gate it behind an env var or write to a temp dir. Apply:

```diff
-    let wav = hound::WavSpec {
-        channels: 1,
-        sample_rate: 16000,
-        bits_per_sample: 32,
-        sample_format: hound::SampleFormat::Float,
-    };
-    let mut writer = hound::WavWriter::create("./test.wav", wav).unwrap();
-    for sample in all_audio_from_vad {
-        writer.write_sample(sample).unwrap();
-    }
+    if std::env::var("WRITE_VAD_TEST_WAV").is_ok() {
+        let wav = hound::WavSpec {
+            channels: 1,
+            sample_rate: 16000,
+            bits_per_sample: 32,
+            sample_format: hound::SampleFormat::Float,
+        };
+        let mut writer = hound::WavWriter::create("./test.wav", wav).unwrap();
+        for sample in all_audio_from_vad {
+            writer.write_sample(sample).unwrap();
+        }
+    }
```

crates/transcribe-whisper-local/src/service/streaming.rs (3)
157-160: Rename var to reflect new semantics (speech chunks, not VAD frames).

Minor readability nit: the source now yields speech chunks; the variable name can follow suit. Apply this diff:

```diff
-    let vad_chunks = audio_source.speech_chunks(redemption_time);
+    let speech_chunks = audio_source.speech_chunks(redemption_time);

-    let chunked = hypr_whisper_local::AudioChunkStream(process_vad_stream(vad_chunks, "mixed"));
+    let chunked = hypr_whisper_local::AudioChunkStream(process_vad_stream(speech_chunks, "mixed"));
```
176-178: Consistent naming: mic_speech_chunks.

Align naming with the new API for clarity.

```diff
-    let mic_vad_chunks = mic_source.speech_chunks(redemption_time);
-    hypr_whisper_local::AudioChunkStream(process_vad_stream(mic_vad_chunks, "mic"))
+    let mic_speech_chunks = mic_source.speech_chunks(redemption_time);
+    hypr_whisper_local::AudioChunkStream(process_vad_stream(mic_speech_chunks, "mic"))
```
181-183: Consistent naming: speaker_speech_chunks.

Same nit for the speaker side.

```diff
-    let speaker_vad_chunks = speaker_source.speech_chunks(redemption_time);
-    hypr_whisper_local::AudioChunkStream(process_vad_stream(speaker_vad_chunks, "speaker"))
+    let speaker_speech_chunks = speaker_source.speech_chunks(redemption_time);
+    hypr_whisper_local::AudioChunkStream(process_vad_stream(speaker_speech_chunks, "speaker"))
```
57-69: Consider caching the source stream to avoid rebuilding per poll.

Recreating `as_stream()` and pinning it on every `poll_next` can add overhead. If `as_stream()` isn't a trivial proxy, store it once in the struct and poll that directly. If you choose to refactor, add a `source_stream` field initialized in `new()` and use it in `poll_next` instead of re-deriving it each time.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (1)
- `Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (14)
- `Cargo.toml` (1 hunks)
- `crates/chunker/src/chunker.rs` (0 hunks)
- `crates/chunker/src/lib.rs` (0 hunks)
- `crates/llama/Cargo.toml` (1 hunks)
- `crates/transcribe-moonshine/Cargo.toml` (1 hunks)
- `crates/transcribe-moonshine/src/service/streaming.rs` (4 hunks)
- `crates/transcribe-whisper-local/Cargo.toml` (1 hunks)
- `crates/transcribe-whisper-local/src/service/streaming.rs` (4 hunks)
- `crates/vad/Cargo.toml` (1 hunks)
- `crates/vad/src/continuous.rs` (1 hunks)
- `crates/vad/src/lib.rs` (1 hunks)
- `crates/whisper-local/Cargo.toml` (1 hunks)
- `crates/whisper-local/src/model.rs` (8 hunks)
- `plugins/listener/Cargo.toml` (1 hunks)
💤 Files with no reviewable changes (2)
- crates/chunker/src/lib.rs
- crates/chunker/src/chunker.rs
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{js,ts,tsx,rs}
⚙️ CodeRabbit Configuration File
**/*.{js,ts,tsx,rs}: 1. No error handling.
2. No unused imports, variables, or functions.
3. For comments, keep it minimal. It should be about "Why", not "What".
Files:
- `crates/transcribe-whisper-local/src/service/streaming.rs`
- `crates/vad/src/lib.rs`
- `crates/transcribe-moonshine/src/service/streaming.rs`
- `crates/whisper-local/src/model.rs`
- `crates/vad/src/continuous.rs`
🧬 Code Graph Analysis (4)
crates/transcribe-whisper-local/src/service/streaming.rs (1)
crates/transcribe-moonshine/src/service/streaming.rs (1)
process_vad_stream(220-302)
crates/vad/src/lib.rs (1)
crates/vad/src/continuous.rs (1)
new(39-54)
crates/transcribe-moonshine/src/service/streaming.rs (1)
crates/transcribe-whisper-local/src/service/streaming.rs (1)
process_vad_stream(274-303)
crates/whisper-local/src/model.rs (1)
crates/transcribe-whisper-local/src/service/streaming.rs (1)
params(96-100)
🔇 Additional comments (20)
crates/llama/Cargo.toml (1)
19-20: Verify the branch name format consistency

The branch name `update-llama-cpp-2025-08-16` appears to use a future date (August 16, 2025). This seems unusual as we're currently in early 2025. Please verify whether this is intentional or whether it should be `2024-08-16` instead.

crates/whisper-local/Cargo.toml (1)
32-32: LGTM! Migration to crates.io version improves stability

Good move switching from a Git revision to the published crates.io version 0.15.0. This improves dependency stability and makes builds more reproducible.
crates/whisper-local/src/model.rs (3)
8-8: LGTM! Consistent type migration to WhisperTokenIdThe migration from
WhisperTokentoWhisperTokenIdis consistently applied throughout the file, which aligns with the whisper-rs 0.15.0 API changes.
134-146: LGTM! Proper segment API migration

The updated segment retrieval logic correctly handles the new API:

- Uses `full_n_segments()` for the count
- Properly handles optional segments with `match`
- Correctly converts timestamps to seconds with the `/100.0` division
268-269: LGTM! Consistent f64 precision for timestamps

The migration from `f32` to `f64` for timestamp fields improves precision, which is beneficial for accurate time calculations, especially for longer audio segments.

Also applies to: 283-292
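The `/100.0` division mentioned above works because whisper segment timestamps are integers in 10 ms (centisecond) units; dividing by 100 into an `f64` yields seconds. A tiny sketch of that conversion (the helper name is illustrative, not from the PR):

```rust
// Segment timestamps arrive as integer centiseconds (10 ms units);
// dividing by 100.0 into an f64 yields seconds without f32 precision loss
// on long recordings.
fn centis_to_secs(t: i64) -> f64 {
    t as f64 / 100.0
}

fn main() {
    assert_eq!(centis_to_secs(150), 1.5);           // 1.5 s into the audio
    assert_eq!(centis_to_secs(360_000), 3_600.0);   // a one-hour mark stays exact in f64
}
```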
crates/transcribe-whisper-local/Cargo.toml (1)
27-27: LGTM! Consistent VAD migrationThe replacement of
hypr-chunkerwithhypr-vadaligns with the workspace-wide migration to the new VAD streaming architecture.crates/vad/Cargo.toml (2)
2-2: LGTM! Clear package rename from chunker to vad

The package rename from "chunker" to "vad" better reflects the module's purpose and aligns with the workspace-wide VAD migration.
9-9: LGTM! Appropriate dev dependency for audio testing

Adding `rodio` as a dev dependency is appropriate for testing VAD functionality with audio streams.

crates/transcribe-moonshine/Cargo.toml (1)
18-18: No remaining hypr-chunker references – dependency swap approved

I've searched the entire codebase (code, TOML manifests, lockfile) and found no lingering `chunker` or `hypr-chunker` references. The switch to `hypr-vad = { workspace = true }` is consistent with the PR objectives and other crates.

plugins/listener/Cargo.toml (1)
30-30: Approve dependency update to hypr-vad
Updated `plugins/listener/Cargo.toml` (line 30) to use the workspace-wide `hypr-vad` per the VAD migration. Ran a recursive grep for any `hypr-chunker` or `chunker::` occurrences and found none.

Cargo.toml (1)
74-74: Confirm complete migration of VAD crate — LGTM

All automated checks passed:

- No leftover references to `hypr-chunker`.
- Verified usage of `VadExt::with_vad` and `speech_chunks` in dependent crates.
- No duplicate `vad` entries in Cargo.toml.

The addition of `hypr-vad = { path = "crates/vad", package = "vad" }` to workspace dependencies is correct. Approving the change.
crates/transcribe-moonshine/src/service/streaming.rs (5)
191-193: Mic stream migrated to speech_chunks — consistent

The mic path mirrors the single-channel change; looks good.

196-198: Speaker stream migrated to speech_chunks — consistent

The speaker path mirrors the mic path; looks good.
226-226: Stream bound now returns hypr_vad::AudioChunk — correct

The new bound matches the new chunk type; downstream consumption of `chunk.samples` remains valid.
21-21: Verify `.speech_chunks` usage on `WebSocketAudioSource`

The optional sanity check confirms:

- `.speech_chunks(...)` is called on `WebSocketAudioSource` in `crates/transcribe-moonshine/src/service/streaming.rs` (lines 174, 191, 196).
- The `WebSocketAudioSource` struct is declared in `crates/ws-utils/src/lib.rs` (lines 81–84), but there is no direct `impl Source for WebSocketAudioSource` in that crate.

Before approving, please ensure that `WebSocketAudioSource` implements the `Source` trait from `hypr_vad`, either by:

- adding `impl hypr_vad::Source for WebSocketAudioSource { … }` in `crates/ws-utils/src/lib.rs`, or
- confirming that a blanket implementation (`impl<T: Source> VadExt for T`) in `hypr_vad` covers this type via another path.

Once `WebSocketAudioSource: Source` is established, the `.speech_chunks(...)` extension method will resolve correctly.
174-175: speech_chunks migration: redemption window clamp verified

The redemption window in `crates/transcribe-moonshine/src/service/streaming.rs` is still clamped to 300–1200 ms (default 500 ms) and passed unchanged into `speech_chunks`. The calls for `audio_source`, `mic_source`, and `speaker_source` now all use VAD-driven chunking with the correct redemption window:

- Lines 151–155:

```rust
let redemption_time = Duration::from_millis(std::cmp::min(
    std::cmp::max(params.redemption_time_ms.unwrap_or(500), 300),
    1200,
));
```

- Line 174 (and similarly at 191, 196):

```rust
let vad_chunks = audio_source.speech_chunks(redemption_time);
```

crates/vad/src/lib.rs (1)
1-6: Re-exports establish a clean crate root
`pub use continuous::*;` and `pub use error::*;` provide a concise public surface. No unused items evident.

crates/transcribe-whisper-local/src/service/streaming.rs (2)
20-20: Switch to hypr_vad import looks correct.

The `VadExt` trait is used below via `speech_chunks(...)` and compiles cleanly with the new crate path.
279-281: Stream bound updated to hypr_vad::AudioChunk — LGTM.

The generic bound now matches `speech_chunks`'s item type; downstream mapping remains compatible.

crates/vad/src/continuous.rs (1)
38-55: Constructor sets sample rate and 30ms framing — LGTM.

Syncing `config.sample_rate` with the source and establishing a 30ms frame size fits Silero's expectations.
Actionable comments posted: 0
♻️ Duplicate comments (2)
crates/vad/src/continuous.rs (2)
184-199: Good: chunk emission uses SpeechEnd-provided samples (prevents leading-edge truncation).

Emitting chunks from VadTransition::SpeechEnd samples fixes the "start-of-speech cut-off" issue noted in earlier reviews and aligns with continuous VAD semantics.
161-168: Avoid panic in library code: make with_vad fallible and handle it in speech_chunks.

Unwrapping here will crash callers if VAD construction fails. Propagate the error and emit a one-shot error stream from speech_chunks instead. Apply this diff:

```diff
 @@ pub trait VadExt: AsyncSource + Sized {
-    fn with_vad(self, config: VadConfig) -> ContinuousVadStream<Self>
+    fn with_vad(self, config: VadConfig) -> Result<ContinuousVadStream<Self>, crate::Error>
     where
         Self: Unpin,
     {
-        ContinuousVadStream::new(self, config).unwrap()
+        ContinuousVadStream::new(self, config)
     }
 @@
-    fn speech_chunks(
+    fn speech_chunks(
         self,
         redemption_time: Duration,
-    ) -> impl Stream<Item = Result<AudioChunk, crate::Error>>
+    ) -> Pin<Box<dyn Stream<Item = Result<AudioChunk, crate::Error>>>>
     where
         Self: Unpin + 'static,
     {
         let config = VadConfig {
             redemption_time,
             pre_speech_pad: redemption_time,
             post_speech_pad: Duration::from_millis(0),
             min_speech_time: Duration::from_millis(50),
             ..Default::default()
         };
-        self.with_vad(config).filter_map(|item| {
-            future::ready(match item {
-                Ok(VadStreamItem::SpeechEnd {
-                    samples,
-                    start_timestamp_ms,
-                    end_timestamp_ms,
-                }) => Some(Ok(AudioChunk {
-                    samples,
-                    start_timestamp_ms,
-                    end_timestamp_ms,
-                })),
-                Ok(_) => None,
-                Err(e) => Some(Err(e)),
-            })
-        })
+        match self.with_vad(config) {
+            Ok(vad) => Box::pin(vad.filter_map(|item| {
+                future::ready(match item {
+                    Ok(VadStreamItem::SpeechEnd {
+                        samples,
+                        start_timestamp_ms,
+                        end_timestamp_ms,
+                    }) => Some(Ok(AudioChunk {
+                        samples,
+                        start_timestamp_ms,
+                        end_timestamp_ms,
+                    })),
+                    Ok(_) => None,
+                    Err(e) => Some(Err(e)),
+                })
+            })),
+            Err(e) => Box::pin(futures_util::stream::once(future::ready(Err(e)))),
+        }
     }
 }
```

Note: This changes the extension trait's API; ensure downstream call sites are updated accordingly.
Also applies to: 169-199
🧹 Nitpick comments (2)
crates/vad/src/continuous.rs (2)
44-47: Avoid float rounding for chunk size; use integer math.

Floating-point conversion can introduce off-by-one sample drift across long streams. Compute 30ms frames with integer arithmetic.

```diff
-    let chunk_duration = Duration::from_millis(30);
-    let chunk_samples = (chunk_duration.as_secs_f64() * config.sample_rate as f64) as usize;
+    const CHUNK_MS: u64 = 30;
+    // 30ms worth of samples at the configured sample rate
+    let chunk_samples = ((config.sample_rate as u64) * CHUNK_MS / 1000) as usize;
```
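The integer arithmetic suggested above can be checked standalone; the function below mirrors the diff's expression (the function name itself is illustrative):

```rust
// 30 ms frame length in samples via integer arithmetic, as the diff suggests.
// rate * 30 / 1000 stays exact for any rate, with no float rounding step.
const CHUNK_MS: u64 = 30;

fn chunk_samples(sample_rate: u32) -> usize {
    ((sample_rate as u64) * CHUNK_MS / 1000) as usize
}

fn main() {
    assert_eq!(chunk_samples(16_000), 480);  // Silero's expected 30 ms frame at 16 kHz
    assert_eq!(chunk_samples(8_000), 240);
    assert_eq!(chunk_samples(44_100), 1_323);
}
```

Multiplying before dividing keeps the result exact whenever `rate * 30` is divisible by 1000, which holds for the common audio rates.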
84-116: Duplicate chunk-processing logic; consider DRYing with a helper.

The "process -> enqueue AudioSamples + transitions -> return first item" sequence appears twice. Extract it to a private helper to reduce branching and keep behavior consistent. Example shape:

```rust
impl<S: AsyncSource + Unpin> ContinuousVadStream<S> {
    fn process_and_enqueue(
        &mut self,
        chunk: Vec<f32>,
    ) -> Result<Option<VadStreamItem>, crate::Error> {
        match self.vad_session.process(&chunk) {
            Ok(transitions) => {
                self.pending_items.push_back(VadStreamItem::AudioSamples(chunk));
                for t in transitions {
                    match t {
                        VadTransition::SpeechStart { timestamp_ms } => self
                            .pending_items
                            .push_back(VadStreamItem::SpeechStart { timestamp_ms }),
                        VadTransition::SpeechEnd {
                            start_timestamp_ms,
                            end_timestamp_ms,
                            samples,
                        } => self.pending_items.push_back(VadStreamItem::SpeechEnd {
                            start_timestamp_ms,
                            end_timestamp_ms,
                            samples,
                        }),
                    }
                }
                Ok(self.pending_items.pop_front())
            }
            Err(e) => Err(crate::Error::VadProcessingFailed(e.to_string())),
        }
    }
}
```

Then use it in both sites to return `Poll::Ready(Some(Ok(item)))` or `Err` accordingly.
Also applies to: 126-157
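The enqueue-then-pop ordering the helper relies on can be exercised with plain collections. This standalone sketch mocks the item and transition types (the real ones live in hypr-vad and silero) to show that the audio chunk is always yielded before any transition queued in the same call:

```rust
use std::collections::VecDeque;

// Mocked shapes; the real types live in hypr-vad / silero-rs.
#[derive(Debug, Clone, PartialEq)]
enum Item {
    AudioSamples(Vec<f32>),
    SpeechStart { timestamp_ms: u64 },
    SpeechEnd { start_timestamp_ms: u64, end_timestamp_ms: u64, samples: Vec<f32> },
}

enum Transition {
    SpeechStart { timestamp_ms: u64 },
    SpeechEnd { start_timestamp_ms: u64, end_timestamp_ms: u64, samples: Vec<f32> },
}

// Mirrors the helper: enqueue audio first, then transitions, then pop the head.
fn process_and_enqueue(
    pending: &mut VecDeque<Item>,
    chunk: Vec<f32>,
    transitions: Vec<Transition>,
) -> Option<Item> {
    pending.push_back(Item::AudioSamples(chunk));
    for t in transitions {
        match t {
            Transition::SpeechStart { timestamp_ms } => {
                pending.push_back(Item::SpeechStart { timestamp_ms })
            }
            Transition::SpeechEnd { start_timestamp_ms, end_timestamp_ms, samples } => {
                pending.push_back(Item::SpeechEnd { start_timestamp_ms, end_timestamp_ms, samples })
            }
        }
    }
    pending.pop_front()
}

fn main() {
    let mut pending = VecDeque::new();
    let first = process_and_enqueue(
        &mut pending,
        vec![0.0; 480],
        vec![Transition::SpeechStart { timestamp_ms: 30 }],
    );
    // The audio chunk is yielded first; the transition stays queued for the next poll.
    assert_eq!(first, Some(Item::AudioSamples(vec![0.0; 480])));
    assert_eq!(pending.len(), 1);
}
```

The FIFO `VecDeque` is what guarantees downstream consumers see `AudioSamples` before the `SpeechStart`/`SpeechEnd` produced by the same input chunk.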
📜 Review details
📒 Files selected for processing (1)
crates/vad/src/continuous.rs(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{js,ts,tsx,rs}
⚙️ CodeRabbit Configuration File
**/*.{js,ts,tsx,rs}: 1. No error handling.
2. No unused imports, variables, or functions.
3. For comments, keep it minimal. It should be about "Why", not "What".
Files:
crates/vad/src/continuous.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: ci (windows, windows-latest)
- GitHub Check: ci (macos, macos-latest)
🔇 Additional comments (3)
crates/vad/src/continuous.rs (3)
69-71: Confirm AsyncSource::as_stream semantics; recreating the stream every poll.

`poll_next` constructs a fresh pinned stream view on each call. This is fine if `as_stream` returns a lightweight borrowed view into S and doesn't maintain internal waker/state. If it constructs a stateful stream, this can regress wakeups or waste work.
- Please confirm as_stream returns a borrow-based, stateless view into the underlying source. If not, consider storing a pinned stream inside ContinuousVadStream to avoid recreations.
169-199: End-of-stream speech finalization: verify last chunk isn't lost if stream ends mid-speech.

If the input ends without enough trailing silence to trigger a VadTransition::SpeechEnd, the final speech segment may never be emitted by speech_chunks. Ensure silero-rs finalizes on EOF, or add a flush strategy (e.g., feeding synthetic silence or a library-provided finalize).
- Please confirm via tests that a speech segment that ends at EOS still yields a SpeechEnd event and chunk.
12-23: API shape looks solid.

Public stream items and chunk struct provide the right granularity for downstream services and testing.
Also applies to: 25-30
Summary by cubic
Refactored audio processing to a new VAD crate with a continuous stream that emits audio and speech events, and updated transcription services to use a simpler speech_chunks API. Also bumped llama and whisper dependencies and adapted code to the new whisper API.
New Features
Refactors