
Implement binary diarization #1015

Merged
yujonglee merged 3 commits into main from binary-diarization on Jul 7, 2025

@yujonglee
Contributor

Resolves #1013

@yujonglee changed the title from "Implement binarydiarization" to "Implement binary diarization" Jun 28, 2025
@coderabbitai
Contributor

coderabbitai bot commented Jun 28, 2025

📝 Walkthrough

This change introduces support for dual audio (microphone and speaker) input streams throughout the listening and transcription pipeline. New enum variants, struct fields, and methods are added to represent and handle dual audio. The WebSocket and client infrastructure is refactored to support both single and dual audio modes, including new builder methods and client types. Metadata handling is also updated for consistency.

Changes

| File(s) | Change Summary |
|---|---|
| plugins/listener-interface/src/lib.rs | Added AudioMode enum, DualAudio variant to ListenInputChunk, meta field to ListenOutputChunk, audio_mode to ListenParams, and derived Default for relevant structs. |
| plugins/listener-interface/Cargo.toml, plugins/listener/Cargo.toml, crates/ws-utils/Cargo.toml | Added/updated dependencies: strum, voice_activity_detector, hypr-audio-utils, and updated features for specta. |
| crates/audio-utils/src/lib.rs | Added bytes_to_f32_samples function for converting raw audio bytes to normalized f32 samples. |
| crates/ws-utils/src/lib.rs | Replaced manual audio conversion with bytes_to_f32_samples; added support for DualAudio input by mixing mic and speaker streams. |
| crates/ws/src/client.rs, crates/whisper-cloud/src/client.rs | Extended WebSocketIO trait with associated type Data; updated method signatures for more flexible data handling. |
| plugins/listener/src/client.rs | Added ListenClientDual struct and builder; implemented dual audio WebSocket client logic; updated single audio client build method and trait implementations. |
| plugins/listener/src/fsm.rs | Changed client builder usage to .build_single(); extracted meta from results for future use. |
| plugins/local-stt/src/server.rs | Split websocket handler into websocket_single_channel and websocket_dual_channel; dispatched based on audio_mode; included meta in output. |
| apps/app/server/src/native/listen/realtime.rs | Added match arm for ListenInputChunk::DualAudio (currently unimplemented). |
| crates/stt/src/realtime/clova.rs, crates/stt/src/realtime/deepgram.rs, crates/stt/src/realtime/whisper.rs | Used struct update syntax (..Default::default()) for ListenOutputChunk to ensure all fields are initialized. |
| crates/whisper-local/src/model.rs, crates/whisper-local/src/stream.rs | Renamed metadata to meta in structs and methods; changed metadata handling from reference to owned value and simplified assignment logic. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant ListenerInterface
    participant Listener
    participant WSUtils
    participant LocalSTT

    Client->>ListenerInterface: Send ListenParams (audio_mode: Single/Dual)
    Client->>Listener: Open WebSocket (audio_mode param)
    Client->>Listener: Stream ListenInputChunk::SingleAudio or DualAudio
    Listener->>WSUtils: Convert input chunk(s) to f32 samples
    alt SingleAudio
        WSUtils->>LocalSTT: Forward single audio samples
    else DualAudio
        WSUtils->>LocalSTT: Mix mic/speaker samples, forward mixed samples
    end
    LocalSTT->>Listener: Stream ListenOutputChunk (with meta)
    Listener->>Client: Send ListenOutputChunk (with meta)
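
The "Mix mic/speaker samples" step in the diagram can be illustrated with a minimal sketch. The mixing strategy the PR actually uses is not shown in this conversation, so the averaging below is an assumption, and `mix_samples` is a hypothetical name.

```rust
/// Mixes two mono sample buffers by averaging each pair of samples.
/// Assumption: both buffers hold normalized f32 samples at the same sample rate.
/// If the lengths differ, the shorter buffer is treated as padded with silence.
pub fn mix_samples(mic: &[f32], speaker: &[f32]) -> Vec<f32> {
    let len = mic.len().max(speaker.len());
    (0..len)
        .map(|i| {
            let m = mic.get(i).copied().unwrap_or(0.0);
            let s = speaker.get(i).copied().unwrap_or(0.0);
            // Averaging keeps the result within [-1.0, 1.0]; the clamp guards
            // against inputs that were already out of range.
            ((m + s) * 0.5).clamp(-1.0, 1.0)
        })
        .collect()
}
```

Averaging halves the level of each source but cannot clip. Summing with a clamp would be louder, at the cost of possible distortion.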

Assessment against linked issues

| Objective (Issue #) | Addressed | Explanation |
|---|---|---|
| Support dual audio (mic/speaker) input handling (#1013) | | |
| Update interfaces and data structures for dual audio (#1013) | | |
| Ensure metadata is handled and propagated (#1013) | | |
| Refactor WebSocket and client logic for audio modes (#1013) | | |

Assessment against linked issues: Out-of-scope changes

| Code Change | Explanation |
|---|---|
| Addition of voice_activity_detector dependency in root and plugin Cargo.toml files | The linked issue does not mention voice activity detection or related functionality; this appears preparatory or unrelated. |
| Addition of strum dependency and AudioMode enum deriving AsRefStr (plugins/listener-interface/Cargo.toml, src/lib.rs) | The AsRefStr trait derivation is not required by the linked issue, but is a minor, non-functional addition. |
| Addition of hypr-audio-utils dependency (crates/ws-utils/Cargo.toml) | The linked issue does not mention this utility; its addition is not directly tied to the stated objectives. |

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
.vscode/settings.json (1)

16-16: Consider the impact of disabling macro-error diagnostics.

While this change may suppress noise from the structural changes, disabling diagnostics can hide real macro-related issues. Ensure this is truly necessary and consider re-enabling once the structural changes stabilize.

crates/whisper-local/src/stream.rs (1)

191-193: Consider the performance impact of cloning metadata for each segment.

The current implementation clones the metadata for every segment in the batch. If metadata objects are large or segments are numerous, this could impact performance.

Consider using Rc<Option<serde_json::Value>> or a similar reference-counted approach if performance becomes an issue:

-                for segment in &mut segments {
-                    segment.meta = meta.clone();
-                }
+                let shared_meta = meta.map(|m| std::rc::Rc::new(m));
+                for segment in &mut segments {
+                    segment.meta = shared_meta.as_ref().map(|rc| (**rc).clone());
+                }

However, the current approach is simpler and likely acceptable for most use cases.
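
One caveat about the suggestion above: a reference-counted pointer only avoids the per-segment copy if each segment stores the pointer itself. Dereferencing and cloning the inner value, as `(**rc).clone()` does, still copies the payload once per segment. A minimal sketch of the sharing variant follows, assuming a hypothetical `Segment` whose `meta` field holds an `Rc`, with `String` standing in for `serde_json::Value`.

```rust
use std::rc::Rc;

/// Hypothetical segment type; in the PR the field is described as an owned
/// `Option<serde_json::Value>`, so this shape is an assumption.
pub struct Segment {
    pub text: String,
    pub meta: Option<Rc<String>>,
}

/// Attaches the same metadata to every segment without duplicating the payload.
pub fn attach_meta(segments: &mut [Segment], meta: Option<String>) {
    let shared = meta.map(Rc::new);
    for segment in segments.iter_mut() {
        // Cloning an Rc copies a pointer and bumps a counter; the payload is not copied.
        segment.meta = shared.clone();
    }
}
```

`Rc` is not `Send`, so if segments cross thread or task boundaries, `Arc` would be the likely substitute.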

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1bc7d01 and f8a37ea.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (12)
  • .vscode/settings.json (1 hunks)
  • Cargo.toml (1 hunks)
  • crates/stt/src/realtime/clova.rs (1 hunks)
  • crates/stt/src/realtime/deepgram.rs (1 hunks)
  • crates/stt/src/realtime/whisper.rs (1 hunks)
  • crates/whisper-local/src/model.rs (2 hunks)
  • crates/whisper-local/src/stream.rs (4 hunks)
  • plugins/listener-interface/Cargo.toml (1 hunks)
  • plugins/listener-interface/src/lib.rs (2 hunks)
  • plugins/listener/Cargo.toml (1 hunks)
  • plugins/listener/src/fsm.rs (1 hunks)
  • plugins/local-stt/src/server.rs (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{js,ts,tsx,rs}`: 1. No error handling. 2. No unused imports, variables, or functions. 3. For comments, keep it minimal. It should be about "Why", not "What".


⚙️ Source: CodeRabbit Configuration File

List of files the instruction was applied to:

  • crates/stt/src/realtime/clova.rs
  • plugins/listener/src/fsm.rs
  • crates/stt/src/realtime/whisper.rs
  • crates/stt/src/realtime/deepgram.rs
  • plugins/local-stt/src/server.rs
  • crates/whisper-local/src/model.rs
  • plugins/listener-interface/src/lib.rs
  • crates/whisper-local/src/stream.rs
🧬 Code Graph Analysis (3)
crates/whisper-local/src/model.rs (1)
crates/whisper-local/src/stream.rs (2)
  • meta (22-22)
  • meta (36-38)
plugins/listener-interface/src/lib.rs (3)
crates/whisper-local/src/model.rs (1)
  • meta (266-268)
crates/whisper-local/src/stream.rs (2)
  • meta (22-22)
  • meta (36-38)
plugins/listener/src/client.rs (1)
  • serde_json (87-87)
crates/whisper-local/src/stream.rs (1)
crates/whisper-local/src/model.rs (1)
  • meta (266-268)
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: ci (macos, macos-latest)
  • GitHub Check: ci (windows, windows-latest)
  • GitHub Check: ci
🔇 Additional comments (18)
Cargo.toml (1)

191-191: LGTM - New voice activity detection dependency added.

The addition of voice_activity_detector aligns with the PR's binary diarization implementation. The dependency follows the workspace pattern correctly.

Please verify that version 0.2.0 is the latest stable version:

What is the latest version of the voice_activity_detector Rust crate?
crates/whisper-local/src/model.rs (2)

242-242: LGTM - Field renamed for consistency.

The rename from metadata to meta aligns with the broader refactoring across the codebase.


266-268: LGTM - Method updated to return owned value.

The method rename and change to return Option<serde_json::Value> (owned) instead of Option<&serde_json::Value> (reference) is consistent with the trait definition in crates/whisper-local/src/stream.rs. The clone() correctly creates an owned copy.

Note: This is a breaking change for any external consumers of this method.

plugins/listener/Cargo.toml (1)

57-59: LGTM - Voice activity detector dependency added.

The workspace dependency addition is correctly implemented and aligns with the new functionality being introduced.

plugins/listener-interface/Cargo.toml (1)

16-16: LGTM - Added serde_json feature for metadata support.

Adding the "serde_json" feature to specta is necessary to support serialization of the new meta field of type Option<serde_json::Value>.

crates/stt/src/realtime/clova.rs (1)

39-39: LGTM: Proper default initialization pattern.

The addition of ..Default::default() ensures all fields of ListenOutputChunk are properly initialized, which is especially important with the new meta field. This aligns with similar changes across other STT implementations.
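
As an illustration of the pattern, here is a minimal sketch of struct update syntax with a derived `Default`. The field names and types are assumptions: the PR's `meta` is described as `Option<serde_json::Value>`, and `String` stands in for it here. `chunk_from_words` is a hypothetical helper.

```rust
/// Hypothetical, simplified versions of the types named in the review.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Word {
    pub text: String,
    pub start: u64,
    pub end: u64,
}

#[derive(Debug, Default, Clone, PartialEq)]
pub struct ListenOutputChunk {
    pub words: Vec<Word>,
    pub meta: Option<String>,
}

/// Builds a chunk from words only; every other field takes its default value.
pub fn chunk_from_words(words: Vec<Word>) -> ListenOutputChunk {
    ListenOutputChunk {
        words,
        // Newly added fields such as `meta` are filled in here, so call sites
        // written before the field existed keep compiling.
        ..Default::default()
    }
}
```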

crates/stt/src/realtime/whisper.rs (1)

33-33: LGTM: Consistent default initialization.

The addition of ..Default::default() maintains consistency with other STT implementations and ensures proper field initialization for the ListenOutputChunk struct.

crates/stt/src/realtime/deepgram.rs (1)

85-88: LGTM: Consistent default initialization pattern.

The restructured ListenOutputChunk construction using ..Default::default() maintains consistency with other STT implementations and ensures all fields are properly initialized.

plugins/local-stt/src/server.rs (2)

155-155: LGTM: Consistent default initialization.

Using ..Default::default() for SimpleAudioChunk construction follows the same good pattern applied throughout the codebase.


170-182: LGTM: Proper metadata extraction and propagation.

The metadata is correctly extracted from the chunk and properly included in the ListenOutputChunk, completing the metadata handling pipeline from transcription to output.

plugins/listener-interface/src/lib.rs (3)

19-19: LGTM: Default derivation enables easier struct initialization.

The addition of Default trait derivation allows for convenient initialization using ..Default::default() pattern, which aligns with the metadata handling improvements across the codebase.


40-40: LGTM: Consistent Default derivation for output structure.

Adding Default to ListenOutputChunk enables the same initialization pattern and supports the new optional meta field.


42-42: LGTM: Optional metadata field maintains backward compatibility.

The new meta field is appropriately optional, ensuring existing code continues to work while enabling metadata propagation through the transcription pipeline.

crates/whisper-local/src/stream.rs (5)

22-22: LGTM: Consistent naming and ownership model for metadata.

The method rename from metadata() to meta() and the change to return owned Option<serde_json::Value> instead of borrowed values creates a cleaner, more consistent API that aligns with the pattern used in crates/whisper-local/src/model.rs.


28-28: LGTM: Field rename maintains consistency.

The field rename from metadata to meta aligns with the trait method rename and creates consistent naming throughout the codebase.


36-38: LGTM: Proper implementation of owned metadata return.

The implementation correctly returns a cloned value to match the new trait signature. The cloning is necessary to convert from the stored Option<serde_json::Value> to the owned return type.
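
A minimal sketch of the owned-return accessor described here. The trait and struct names follow the review text, but the fields are assumptions, and `String` stands in for `serde_json::Value`.

```rust
/// Simplified stand-in for the `AudioChunk` trait mentioned in the review.
pub trait AudioChunk {
    /// Returns an owned copy, so the caller is not tied to the chunk's lifetime.
    fn meta(&self) -> Option<String>;
}

#[derive(Debug, Default)]
pub struct SimpleAudioChunk {
    pub samples: Vec<f32>,
    pub meta: Option<String>,
}

impl AudioChunk for SimpleAudioChunk {
    fn meta(&self) -> Option<String> {
        // `&self` only permits borrowing the field, so an owned return needs a clone.
        self.meta.clone()
    }
}
```

The trade-off is one clone per call in exchange for a value that can outlive the chunk it came from.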


156-156: LGTM: Updated method call reflects API changes.

The change from chunk.metadata() to chunk.meta() correctly uses the renamed method from the AudioChunk trait.


180-180: LGTM: Parameter rename maintains consistency.

The parameter rename from metadata to meta aligns with the broader naming convention changes throughout the codebase.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f8a37ea and e9afcb4.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (11)
  • apps/app/server/src/native/listen/realtime.rs (1 hunks)
  • crates/audio-utils/src/lib.rs (1 hunks)
  • crates/whisper-cloud/src/client.rs (1 hunks)
  • crates/ws-utils/Cargo.toml (1 hunks)
  • crates/ws-utils/src/lib.rs (2 hunks)
  • crates/ws/src/client.rs (2 hunks)
  • plugins/listener-interface/Cargo.toml (1 hunks)
  • plugins/listener-interface/src/lib.rs (4 hunks)
  • plugins/listener/src/client.rs (7 hunks)
  • plugins/listener/src/fsm.rs (2 hunks)
  • plugins/local-stt/src/server.rs (4 hunks)
✅ Files skipped from review due to trivial changes (1)
  • crates/ws-utils/Cargo.toml
🚧 Files skipped from review as they are similar to previous changes (3)
  • plugins/listener-interface/Cargo.toml
  • plugins/listener/src/fsm.rs
  • plugins/local-stt/src/server.rs
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{js,ts,tsx,rs}`: 1. No error handling. 2. No unused imports, variables, or functions. 3. For comments, keep it minimal. It should be about "Why", not "What".


⚙️ Source: CodeRabbit Configuration File

List of files the instruction was applied to:

  • apps/app/server/src/native/listen/realtime.rs
  • crates/ws-utils/src/lib.rs
  • crates/audio-utils/src/lib.rs
  • crates/whisper-cloud/src/client.rs
  • crates/ws/src/client.rs
  • plugins/listener/src/client.rs
  • plugins/listener-interface/src/lib.rs
🧬 Code Graph Analysis (2)
crates/ws-utils/src/lib.rs (1)
crates/audio-utils/src/lib.rs (1)
  • bytes_to_f32_samples (46-53)
crates/ws/src/client.rs (2)
crates/whisper-cloud/src/client.rs (1)
  • to_input (94-96)
plugins/listener/src/client.rs (2)
  • to_input (119-123)
  • to_input (147-152)
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: ci (macos, macos-latest)
  • GitHub Check: ci (windows, windows-latest)
  • GitHub Check: ci
🔇 Additional comments (11)
crates/whisper-cloud/src/client.rs (1)

90-96: LGTM! Clean trait implementation update.

The addition of the associated type Data and corresponding method signature update properly implements the generalized WebSocketIO trait while maintaining the same functionality.

crates/ws-utils/src/lib.rs (1)

37-37: LGTM! Good use of the utility function.

Replacing the manual conversion with bytes_to_f32_samples improves code reuse and maintainability.
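
For context, a helper of this kind might look like the sketch below. Only the name `bytes_to_f32_samples` comes from the PR. The signature, the 16-bit little-endian PCM input format, and the division by 32768 are assumptions, since the function body is not shown in this conversation.

```rust
/// Converts raw PCM bytes into normalized f32 samples in [-1.0, 1.0).
/// Assumption: input is 16-bit signed little-endian PCM.
/// A trailing odd byte, if any, is ignored.
pub fn bytes_to_f32_samples(data: &[u8]) -> Vec<f32> {
    data.chunks_exact(2)
        .map(|pair| i16::from_le_bytes([pair[0], pair[1]]) as f32 / 32768.0)
        .collect()
}
```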

crates/ws/src/client.rs (2)

9-17: LGTM! Excellent trait generalization.

The addition of the associated type Data creates a flexible, type-safe interface that can handle different input data types while maintaining the same WebSocket communication pattern.
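
A minimal sketch of how an associated `Data` type lets one trait serve both single and dual audio clients. The trait below is a simplified stand-in, not the PR's actual `WebSocketIO` definition: `InputChunk`, `SingleClient`, and `DualClient` are hypothetical names, and `Vec<u8>` stands in for `bytes::Bytes`.

```rust
/// Simplified stand-in for a WebSocket I/O trait with an associated payload type.
pub trait WebSocketIO {
    /// Payload accepted by this client (one buffer, or a mic/speaker pair).
    type Data;
    /// Wire-level message built from one payload.
    type Input;

    fn to_input(data: Self::Data) -> Self::Input;
}

/// Hypothetical input message with one variant per audio mode.
#[derive(Debug, PartialEq)]
pub enum InputChunk {
    Single(Vec<u8>),
    Dual { mic: Vec<u8>, speaker: Vec<u8> },
}

pub struct SingleClient;
pub struct DualClient;

impl WebSocketIO for SingleClient {
    type Data = Vec<u8>;
    type Input = InputChunk;

    fn to_input(data: Vec<u8>) -> InputChunk {
        InputChunk::Single(data)
    }
}

impl WebSocketIO for DualClient {
    // A tuple keeps the two channels separate until the message is built.
    type Data = (Vec<u8>, Vec<u8>);
    type Input = InputChunk;

    fn to_input((mic, speaker): (Vec<u8>, Vec<u8>)) -> InputChunk {
        InputChunk::Dual { mic, speaker }
    }
}
```

With this shape, generic code written against `T: WebSocketIO` takes `T::Data`, so passing a tuple to the single-audio client is a compile-time error rather than a runtime one.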


28-31: LGTM! Consistent method signature update.

The from_audio method correctly uses the new generic T::Data type, enabling support for both single audio (bytes::Bytes) and dual audio (tuple of byte buffers) inputs.

plugins/listener/src/client.rs (3)

32-48: LGTM: Clear method naming and correct audio mode handling.

The renaming to build_single and explicit setting of AudioMode::Single provides good clarity for the dual audio feature.


137-164: LGTM: Correct dual audio WebSocketIO implementation.

The tuple data type and conversion to DualAudio variant properly handles dual audio streams.


181-191: LGTM: Correct dual stream handling.

The zip operation properly synchronizes mic and speaker streams for dual audio processing.
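
The pairing behavior of zip can be sketched with plain iterators. The PR presumably zips async streams; standard-library iterators are used here as an assumption to keep the example self-contained, and `pair_chunks` is a hypothetical name.

```rust
/// Pairs mic and speaker chunks in lockstep.
/// Pairing stops as soon as either side runs out, so surplus chunks on the
/// longer side are dropped.
pub fn pair_chunks<M, S>(mic: M, speaker: S) -> impl Iterator<Item = (Vec<u8>, Vec<u8>)>
where
    M: IntoIterator<Item = Vec<u8>>,
    S: IntoIterator<Item = Vec<u8>>,
{
    mic.into_iter().zip(speaker)
}
```

One consequence worth keeping in mind: zip pairs items by position, not by timestamp, so it stays aligned only while both sources emit chunks at the same cadence.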

plugins/listener-interface/src/lib.rs (4)

19-19: LGTM: Appropriate Default derive for data structure.

Adding Default to the Word struct enables easier instantiation and testing.


40-42: LGTM: Good extensibility with optional metadata field.

The optional meta field provides flexibility for additional metadata while maintaining backward compatibility.


55-61: LGTM: Correct dual audio variant definition.

The DualAudio variant properly separates mic and speaker channels with appropriate binary serialization.


73-94: LGTM: Well-designed AudioMode enum with sensible defaults.

The AudioMode enum properly supports both single and dual audio modes with Single as a backward-compatible default.
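
A minimal sketch of an enum with `Single` as its default. The string values and the shape of `ListenParams` are assumptions. The PR is described as deriving `AsRefStr` via strum; the impl is written by hand here so the example needs only the standard library.

```rust
/// Hypothetical stand-in for the `AudioMode` enum described in the review.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum AudioMode {
    /// Existing callers that never set a mode get single-channel behavior.
    #[default]
    Single,
    Dual,
}

impl AsRef<str> for AudioMode {
    // The lowercase strings are illustrative; the real values are not shown in the PR.
    fn as_ref(&self) -> &str {
        match self {
            AudioMode::Single => "single",
            AudioMode::Dual => "dual",
        }
    }
}

/// Hypothetical parameter struct; only `audio_mode` mirrors the PR's `ListenParams`.
#[derive(Debug, Default)]
pub struct ListenParams {
    pub audio_mode: AudioMode,
}
```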

