Implement binary diarization by yujonglee · Pull Request #1015 · fastrepl/char

yujonglee · 2025-06-28T01:11:44Z

Resolves #1013

coderabbitai · 2025-06-28T01:12:37Z

Walkthrough

This change introduces support for dual audio (microphone and speaker) input streams throughout the listening and transcription pipeline. New enum variants, struct fields, and methods are added to represent and handle dual audio. The WebSocket and client infrastructure is refactored to support both single and dual audio modes, including new builder methods and client types. Metadata handling is also updated for consistency.

Changes

| File(s) | Change Summary |
|---|---|
| plugins/listener-interface/src/lib.rs | Added `AudioMode` enum, `DualAudio` variant to `ListenInputChunk`, `meta` field to `ListenOutputChunk`, `audio_mode` to `ListenParams`, and derived `Default` for relevant structs. |
| plugins/listener-interface/Cargo.toml, plugins/listener/Cargo.toml, crates/ws-utils/Cargo.toml | Added/updated dependencies: `strum`, `voice_activity_detector`, `hypr-audio-utils`, and updated features for `specta`. |
| crates/audio-utils/src/lib.rs | Added `bytes_to_f32_samples` function for converting raw audio bytes to normalized `f32` samples. |
| crates/ws-utils/src/lib.rs | Replaced manual audio conversion with `bytes_to_f32_samples`; added support for `DualAudio` input by mixing mic and speaker streams. |
| crates/ws/src/client.rs, crates/whisper-cloud/src/client.rs | Extended `WebSocketIO` trait with associated type `Data`; updated method signatures for more flexible data handling. |
| plugins/listener/src/client.rs | Added `ListenClientDual` struct and builder; implemented dual audio WebSocket client logic; updated single audio client build method and trait implementations. |
| plugins/listener/src/fsm.rs | Changed client builder usage to `.build_single()`; extracted `meta` from results for future use. |
| plugins/local-stt/src/server.rs | Split websocket handler into `websocket_single_channel` and `websocket_dual_channel`; dispatched based on `audio_mode`; included `meta` in output. |
| apps/app/server/src/native/listen/realtime.rs | Added match arm for `ListenInputChunk::DualAudio` (currently unimplemented). |
| crates/stt/src/realtime/clova.rs, crates/stt/src/realtime/deepgram.rs, crates/stt/src/realtime/whisper.rs | Used struct update syntax (`..Default::default()`) for `ListenOutputChunk` to ensure all fields are initialized. |
| crates/whisper-local/src/model.rs, crates/whisper-local/src/stream.rs | Renamed `metadata` to `meta` in structs and methods; changed metadata handling from reference to owned value and simplified assignment logic. |
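
For readers unfamiliar with the conversion that `bytes_to_f32_samples` is described as doing, here is a minimal sketch. It assumes the raw bytes are 16-bit little-endian PCM and that normalization means dividing by 32768; neither detail is stated in this PR summary, so treat the body as an illustration rather than the crate's actual code.

```rust
/// Sketch only: assumes 16-bit little-endian PCM input (not confirmed by the PR).
/// Each pair of bytes becomes one sample scaled into the range [-1.0, 1.0).
pub fn bytes_to_f32_samples(data: &[u8]) -> Vec<f32> {
    data.chunks_exact(2) // a trailing odd byte, if any, is ignored
        .map(|pair| {
            let sample = i16::from_le_bytes([pair[0], pair[1]]);
            sample as f32 / 32768.0
        })
        .collect()
}
```

Under these assumptions, the bytes `00 40` decode to 16384 and normalize to 0.5.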

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant ListenerInterface
    participant Listener
    participant WSUtils
    participant LocalSTT

    Client->>ListenerInterface: Send ListenParams (audio_mode: Single/Dual)
    Client->>Listener: Open WebSocket (audio_mode param)
    Client->>Listener: Stream ListenInputChunk::SingleAudio or DualAudio
    Listener->>WSUtils: Convert input chunk(s) to f32 samples
    alt SingleAudio
        WSUtils->>LocalSTT: Forward single audio samples
    else DualAudio
        WSUtils->>LocalSTT: Mix mic/speaker samples, forward mixed samples
    end
    LocalSTT->>Listener: Stream ListenOutputChunk (with meta)
    Listener->>Client: Send ListenOutputChunk (with meta)
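
The "mix mic/speaker samples" step in the diagram can be pictured roughly as below. The summary does not say how the mixing is done, so this sketch assumes a simple sample-wise sum clamped to [-1.0, 1.0], with the shorter input padded with silence. The function name `mix_mic_and_speaker` is hypothetical.

```rust
/// Hypothetical helper: sums two channels sample by sample.
/// Assumption: samples are already normalized f32 values in [-1.0, 1.0].
fn mix_mic_and_speaker(mic: &[f32], speaker: &[f32]) -> Vec<f32> {
    let len = mic.len().max(speaker.len());
    (0..len)
        .map(|i| {
            // Treat a missing sample on either side as silence.
            let m = mic.get(i).copied().unwrap_or(0.0);
            let s = speaker.get(i).copied().unwrap_or(0.0);
            // Clamp so the sum stays inside the valid sample range.
            (m + s).clamp(-1.0, 1.0)
        })
        .collect()
}
```

Averaging the two channels instead of summing them would avoid clipping at the cost of lower volume; which trade-off the PR makes is not visible from this page.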

Assessment against linked issues

| Objective (Issue #) | Addressed | Explanation |
|---|---|---|
| Support dual audio (mic/speaker) input handling (#1013) | ✅ | |
| Update interfaces and data structures for dual audio (#1013) | ✅ | |
| Ensure metadata is handled and propagated (#1013) | ✅ | |
| Refactor WebSocket and client logic for audio modes (#1013) | ✅ | |

Assessment against linked issues: Out-of-scope changes

| Code Change | Explanation |
|---|---|
| Addition of `voice_activity_detector` dependency in root and plugin Cargo.toml files | The linked issue does not mention voice activity detection or related functionality; this appears preparatory or unrelated. |
| Addition of `strum` dependency and `AudioMode` enum deriving `AsRefStr` (plugins/listener-interface/Cargo.toml, src/lib.rs) | The `AsRefStr` trait derivation is not required by the linked issue, but is a minor, non-functional addition. |
| Addition of `hypr-audio-utils` dependency (crates/ws-utils/Cargo.toml) | The linked issue does not mention this utility; its addition is not directly tied to the stated objectives. |

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

.vscode/settings.json (1)

16-16: Consider the impact of disabling macro-error diagnostics.

While this change may suppress noise from the structural changes, disabling diagnostics can hide real macro-related issues. Ensure this is truly necessary and consider re-enabling once the structural changes stabilize.

crates/whisper-local/src/stream.rs (1)

191-193: Consider the performance impact of cloning metadata for each segment.

The current implementation clones the metadata for every segment in the batch. If metadata objects are large or segments are numerous, this could impact performance.

If performance becomes an issue, consider a reference-counted approach. Note that this only avoids the deep copies if the `meta` field itself holds the shared pointer, for example Option<Rc<serde_json::Value>> (or Arc if segments cross threads). Wrapping the value in an Rc and then cloning the inner value for each segment would copy just as much as before:
-                for segment in &mut segments {
-                    segment.meta = meta.clone();
-                }
+                let shared_meta = meta.map(std::rc::Rc::new);
+                for segment in &mut segments {
+                    segment.meta = shared_meta.clone();
+                }
However, the current approach is simpler and likely acceptable for most use cases.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1bc7d01 and f8a37ea.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (12)

.vscode/settings.json (1 hunks)
Cargo.toml (1 hunks)
crates/stt/src/realtime/clova.rs (1 hunks)
crates/stt/src/realtime/deepgram.rs (1 hunks)
crates/stt/src/realtime/whisper.rs (1 hunks)
crates/whisper-local/src/model.rs (2 hunks)
crates/whisper-local/src/stream.rs (4 hunks)
plugins/listener-interface/Cargo.toml (1 hunks)
plugins/listener-interface/src/lib.rs (2 hunks)
plugins/listener/Cargo.toml (1 hunks)
plugins/listener/src/fsm.rs (1 hunks)
plugins/local-stt/src/server.rs (3 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

`**/*.{js,ts,tsx,rs}`: 1. No error handling. 2. No unused imports, variables, or functions. 3. For comments, keep it minimal. It should be about "Why", not "What".

⚙️ Source: CodeRabbit Configuration File

List of files the instruction was applied to:

crates/stt/src/realtime/clova.rs
plugins/listener/src/fsm.rs
crates/stt/src/realtime/whisper.rs
crates/stt/src/realtime/deepgram.rs
plugins/local-stt/src/server.rs
crates/whisper-local/src/model.rs
plugins/listener-interface/src/lib.rs
crates/whisper-local/src/stream.rs

🧬 Code Graph Analysis (3)

crates/whisper-local/src/model.rs (1)
- crates/whisper-local/src/stream.rs (2): meta (22-22), meta (36-38)

plugins/listener-interface/src/lib.rs (3)
- crates/whisper-local/src/model.rs (1): meta (266-268)
- crates/whisper-local/src/stream.rs (2): meta (22-22), meta (36-38)
- plugins/listener/src/client.rs (1): serde_json (87-87)

crates/whisper-local/src/stream.rs (1)
- crates/whisper-local/src/model.rs (1): meta (266-268)

⏰ Context from checks skipped due to timeout of 90000ms (3)

GitHub Check: ci (macos, macos-latest)
GitHub Check: ci (windows, windows-latest)
GitHub Check: ci

🔇 Additional comments (18)

Cargo.toml (1)

191-191: LGTM - New voice activity detection dependency added.

The addition of voice_activity_detector aligns with the PR's binary diarization implementation. The dependency follows the workspace pattern correctly.

Please verify that version 0.2.0 is the latest stable version of the voice_activity_detector Rust crate.

crates/whisper-local/src/model.rs (2)

242-242: LGTM - Field renamed for consistency.

The rename from metadata to meta aligns with the broader refactoring across the codebase.

266-268: LGTM - Method updated to return owned value.

The method rename and change to return Option<serde_json::Value> (owned) instead of Option<&serde_json::Value> (reference) is consistent with the trait definition in crates/whisper-local/src/stream.rs. The clone() correctly creates an owned copy.

Note: This is a breaking change for any external consumers of this method.
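
To make the borrowed-versus-owned change concrete, here is a rough sketch of a trait method that returns an owned `Option` by cloning a stored field. `Meta` is a plain `String` standing in for `serde_json::Value` so the example needs no external crates, and the `samples` method and `take_meta` helper are hypothetical.

```rust
// Stand-in for `serde_json::Value` (assumption, keeps the sketch dependency-free).
type Meta = String;

trait AudioChunk {
    fn samples(&self) -> &[f32]; // hypothetical accessor, for illustration
    // Owned return value: the caller may keep it after the chunk is gone.
    fn meta(&self) -> Option<Meta>;
}

#[derive(Default)]
struct SimpleAudioChunk {
    samples: Vec<f32>,
    meta: Option<Meta>,
}

impl AudioChunk for SimpleAudioChunk {
    fn samples(&self) -> &[f32] {
        &self.samples
    }

    fn meta(&self) -> Option<Meta> {
        // The clone is what turns the stored field into an owned value.
        self.meta.clone()
    }
}

/// Hypothetical consumer: the metadata outlives the chunk it came from.
fn take_meta(chunk: impl AudioChunk) -> Option<Meta> {
    let meta = chunk.meta();
    drop(chunk);
    meta
}
```

With a borrowed `Option<&Meta>` return type, `take_meta` would not compile, because the reference could not outlive the dropped chunk. That is the flexibility the owned signature buys, at the cost of one clone per call.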

plugins/listener/Cargo.toml (1)

57-59: LGTM - Voice activity detector dependency added.

The workspace dependency addition is correctly implemented and aligns with the new functionality being introduced.

plugins/listener-interface/Cargo.toml (1)

16-16: LGTM - Added serde_json feature for metadata support.

Adding the "serde_json" feature to specta is necessary to support serialization of the new meta field of type Option<serde_json::Value>.

crates/stt/src/realtime/clova.rs (1)

39-39: LGTM: Proper default initialization pattern.

The addition of ..Default::default() ensures all fields of ListenOutputChunk are properly initialized, which is especially important with the new meta field. This aligns with similar changes across other STT implementations.
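
As a small illustration of the pattern being praised here, the sketch below shows how struct update syntax lets existing construction sites keep compiling when a new optional field is added. The field names other than `meta` are assumptions, and `meta` uses `String` in place of `serde_json::Value` to stay dependency-free.

```rust
#[derive(Debug, Default, Clone, PartialEq)]
struct Word {
    text: String,  // hypothetical field
    start_ms: u64, // hypothetical field
    end_ms: u64,   // hypothetical field
}

#[derive(Debug, Default, Clone, PartialEq)]
struct ListenOutputChunk {
    words: Vec<Word>,     // assumed field name
    meta: Option<String>, // stand-in for Option<serde_json::Value>
}

/// Hypothetical constructor: only sets what it knows about.
fn chunk_from_text(text: &str) -> ListenOutputChunk {
    ListenOutputChunk {
        words: vec![Word {
            text: text.to_string(),
            ..Default::default() // timestamps fall back to 0
        }],
        ..Default::default() // `meta` falls back to None
    }
}
```

A provider that has no metadata to report simply never mentions `meta`, and later field additions do not require touching this function again.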

crates/stt/src/realtime/whisper.rs (1)

33-33: LGTM: Consistent default initialization.

The addition of ..Default::default() maintains consistency with other STT implementations and ensures proper field initialization for the ListenOutputChunk struct.

crates/stt/src/realtime/deepgram.rs (1)

85-88: LGTM: Consistent default initialization pattern.

The restructured ListenOutputChunk construction using ..Default::default() maintains consistency with other STT implementations and ensures all fields are properly initialized.

plugins/local-stt/src/server.rs (2)

155-155: LGTM: Consistent default initialization.

Using ..Default::default() for SimpleAudioChunk construction follows the same good pattern applied throughout the codebase.

170-182: LGTM: Proper metadata extraction and propagation.

The metadata is correctly extracted from the chunk and properly included in the ListenOutputChunk, completing the metadata handling pipeline from transcription to output.

plugins/listener-interface/src/lib.rs (3)

19-19: LGTM: Default derivation enables easier struct initialization.

The addition of Default trait derivation allows for convenient initialization using ..Default::default() pattern, which aligns with the metadata handling improvements across the codebase.

40-40: LGTM: Consistent Default derivation for output structure.

Adding Default to ListenOutputChunk enables the same initialization pattern and supports the new optional meta field.

42-42: LGTM: Optional metadata field maintains backward compatibility.

The new meta field is appropriately optional, ensuring existing code continues to work while enabling metadata propagation through the transcription pipeline.

crates/whisper-local/src/stream.rs (5)

22-22: LGTM: Consistent naming and ownership model for metadata.

The method rename from metadata() to meta() and the change to return owned Option<serde_json::Value> instead of borrowed values creates a cleaner, more consistent API that aligns with the pattern used in crates/whisper-local/src/model.rs.

28-28: LGTM: Field rename maintains consistency.

The field rename from metadata to meta aligns with the trait method rename and creates consistent naming throughout the codebase.

36-38: LGTM: Proper implementation of owned metadata return.

The implementation correctly returns a cloned value to match the new trait signature. The cloning is necessary to convert from the stored Option<serde_json::Value> to the owned return type.

156-156: LGTM: Updated method call reflects API changes.

The change from chunk.metadata() to chunk.meta() correctly uses the renamed method from the AudioChunk trait.

180-180: LGTM: Parameter rename maintains consistency.

The parameter rename from metadata to meta aligns with the broader naming convention changes throughout the codebase.

plugins/listener/src/fsm.rs

coderabbitai

Actionable comments posted: 5

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f8a37ea and e9afcb4.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (11)

apps/app/server/src/native/listen/realtime.rs (1 hunks)
crates/audio-utils/src/lib.rs (1 hunks)
crates/whisper-cloud/src/client.rs (1 hunks)
crates/ws-utils/Cargo.toml (1 hunks)
crates/ws-utils/src/lib.rs (2 hunks)
crates/ws/src/client.rs (2 hunks)
plugins/listener-interface/Cargo.toml (1 hunks)
plugins/listener-interface/src/lib.rs (4 hunks)
plugins/listener/src/client.rs (7 hunks)
plugins/listener/src/fsm.rs (2 hunks)
plugins/local-stt/src/server.rs (4 hunks)

✅ Files skipped from review due to trivial changes (1)

crates/ws-utils/Cargo.toml

🚧 Files skipped from review as they are similar to previous changes (3)

plugins/listener-interface/Cargo.toml
plugins/listener/src/fsm.rs
plugins/local-stt/src/server.rs

🧰 Additional context used

📓 Path-based instructions (1)

`**/*.{js,ts,tsx,rs}`: 1. No error handling. 2. No unused imports, variables, or functions. 3. For comments, keep it minimal. It should be about "Why", not "What".

⚙️ Source: CodeRabbit Configuration File

List of files the instruction was applied to:

apps/app/server/src/native/listen/realtime.rs
crates/ws-utils/src/lib.rs
crates/audio-utils/src/lib.rs
crates/whisper-cloud/src/client.rs
crates/ws/src/client.rs
plugins/listener/src/client.rs
plugins/listener-interface/src/lib.rs

🧬 Code Graph Analysis (2)

crates/ws-utils/src/lib.rs (1)
- crates/audio-utils/src/lib.rs (1): bytes_to_f32_samples (46-53)

crates/ws/src/client.rs (2)
- crates/whisper-cloud/src/client.rs (1): to_input (94-96)
- plugins/listener/src/client.rs (2): to_input (119-123), to_input (147-152)

⏰ Context from checks skipped due to timeout of 90000ms (3)

GitHub Check: ci (macos, macos-latest)
GitHub Check: ci (windows, windows-latest)
GitHub Check: ci

🔇 Additional comments (11)

crates/whisper-cloud/src/client.rs (1)

90-96: LGTM! Clean trait implementation update.

The addition of the associated type Data and corresponding method signature update properly implements the generalized WebSocketIO trait while maintaining the same functionality.

crates/ws-utils/src/lib.rs (1)

37-37: LGTM! Good use of the utility function.

Replacing the manual conversion with bytes_to_f32_samples improves code reuse and maintainability.

crates/ws/src/client.rs (2)

9-17: LGTM! Excellent trait generalization.

The addition of the associated type Data creates a flexible, type-safe interface that can handle different input data types while maintaining the same WebSocket communication pattern.

28-31: LGTM! Consistent method signature update.

The from_audio method correctly uses the new generic T::Data type, enabling support for both single audio (bytes::Bytes) and dual audio (tuple of byte buffers) inputs.
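
The shape of this generalization might look roughly like the sketch below: a trait with an associated `Data` type, implemented once for single-channel bytes and once for a (mic, speaker) tuple. Only `WebSocketIO`, `Data`, and `to_input` come from the review; `InputChunk`, `SingleClient`, `DualClient`, the `Input` associated type, and `encode_all` are hypothetical names, and `Vec<u8>` stands in for `bytes::Bytes`.

```rust
#[derive(Debug, PartialEq)]
enum InputChunk {
    Single { data: Vec<u8> },
    Dual { mic: Vec<u8>, speaker: Vec<u8> },
}

trait WebSocketIO {
    type Data;  // what the caller feeds in
    type Input; // what goes over the wire (assumed associated type)
    fn to_input(data: Self::Data) -> Self::Input;
}

struct SingleClient;
struct DualClient;

impl WebSocketIO for SingleClient {
    type Data = Vec<u8>;
    type Input = InputChunk;
    fn to_input(data: Self::Data) -> Self::Input {
        InputChunk::Single { data }
    }
}

impl WebSocketIO for DualClient {
    type Data = (Vec<u8>, Vec<u8>);
    type Input = InputChunk;
    fn to_input((mic, speaker): Self::Data) -> Self::Input {
        InputChunk::Dual { mic, speaker }
    }
}

/// Generic over the client type, so one code path serves both modes.
fn encode_all<T: WebSocketIO>(items: Vec<T::Data>) -> Vec<T::Input> {
    items.into_iter().map(T::to_input).collect()
}
```

Because `Data` is chosen per implementation, passing a tuple to the single-channel client (or raw bytes to the dual one) is rejected at compile time.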

plugins/listener/src/client.rs (3)

32-48: LGTM: Clear method naming and correct audio mode handling.

The renaming to build_single and explicit setting of AudioMode::Single provides good clarity for the dual audio feature.

137-164: LGTM: Correct dual audio WebSocketIO implementation.

The tuple data type and conversion to DualAudio variant properly handles dual audio streams.

181-191: LGTM: Correct dual stream handling.

The zip operation properly synchronizes mic and speaker streams for dual audio processing.
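
To show what zipping two channels implies, the sketch below pairs mic and speaker chunks with the standard library's iterator `zip`. This is a stand-in: the PR operates on async streams, and the helper name `pair_chunks` is hypothetical.

```rust
/// Hypothetical helper: pairs the i-th mic chunk with the i-th speaker chunk.
fn pair_chunks(
    mic: impl IntoIterator<Item = Vec<u8>>,
    speaker: impl IntoIterator<Item = Vec<u8>>,
) -> Vec<(Vec<u8>, Vec<u8>)> {
    // `zip` stops as soon as either side runs out of items.
    mic.into_iter().zip(speaker).collect()
}
```

One consequence worth keeping in mind: a zip only produces a pair when both sides have an item, so if one source yields fewer chunks than the other, the surplus chunks are never paired.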

plugins/listener-interface/src/lib.rs (4)

19-19: LGTM: Appropriate Default derive for data structure.

Adding Default to the Word struct enables easier instantiation and testing.

40-42: LGTM: Good extensibility with optional metadata field.

The optional meta field provides flexibility for additional metadata while maintaining backward compatibility.

55-61: LGTM: Correct dual audio variant definition.

The DualAudio variant properly separates mic and speaker channels with appropriate binary serialization.

73-94: LGTM: Well-designed AudioMode enum with sensible defaults.

The AudioMode enum properly supports both single and dual audio modes with Single as a backward-compatible default.
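
A minimal sketch of an enum with a backward-compatible default, plus a dispatch on it, is shown below. The variant names follow the review, but the string values, the hand-written `AsRef<str>` impl (standing in for the `strum` derive), and the `handler_for` function are assumptions for illustration.

```rust
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
enum AudioMode {
    #[default]
    Single,
    Dual,
}

// Hand-written stand-in for strum's `AsRefStr` derive; the strings are assumed.
impl AsRef<str> for AudioMode {
    fn as_ref(&self) -> &str {
        match self {
            AudioMode::Single => "single",
            AudioMode::Dual => "dual",
        }
    }
}

#[derive(Debug, Default)]
struct ListenParams {
    audio_mode: AudioMode,
}

/// Hypothetical dispatcher returning the name of the handler that would run.
fn handler_for(params: &ListenParams) -> &'static str {
    match params.audio_mode {
        AudioMode::Single => "websocket_single_channel",
        AudioMode::Dual => "websocket_dual_channel",
    }
}
```

Callers that never set `audio_mode` get `Single` through `Default`, which is what keeps older clients working unchanged.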

crates/audio-utils/src/lib.rs

apps/app/server/src/native/listen/realtime.rs

crates/ws-utils/src/lib.rs

plugins/listener/src/client.rs

metadata stuffs

f8a37ea

yujonglee changed the title ~~Implement binarydiarization~~ Implement binary diarization Jun 28, 2025

coderabbitai bot reviewed Jun 28, 2025

plugins/listener/src/fsm.rs: comment resolved

yujonglee added 2 commits June 27, 2025 21:45

single audio works

afc1f2c

rename

e9afcb4

coderabbitai bot reviewed Jun 28, 2025

yujonglee merged commit e9afcb4 into main Jul 7, 2025
5 checks passed

yujonglee deleted the binary-diarization branch July 7, 2025 17:08

coderabbitai bot mentioned this pull request Jul 7, 2025

Binary diarization #1102

Merged

yujonglee mentioned this pull request Jul 12, 2025

speaker diarization #956

Closed

coderabbitai bot mentioned this pull request Aug 29, 2025

Whisper custom vocab #1417

Closed

This was referenced Nov 7, 2025

Batch transcribe support #1638

Merged

feat: Linux Support #1659

Closed

coderabbitai bot mentioned this pull request Nov 21, 2025

save per-channel when DEBUG=1 #1766

Merged

yujonglee merged 3 commits into main from binary-diarization
