Fix OpenAI Realtime API transcription test #2127
Conversation
- Add intent=transcription to WebSocket URL for transcription-only sessions
- Add session.type = transcription in session.update payload
- Implement audio_to_message method to wrap audio in base64-encoded JSON events
- Add InputAudioBufferAppend struct for proper audio event serialization
- Update live.rs to transform audio stream before passing to WebSocket client
- Add configurable sample rate support (OpenAI requires 24kHz PCM)
- Add speech_started and speech_stopped event handlers for better debugging
- Add base64 dependency for audio encoding

Co-Authored-By: yujonglee <yujonglee.dev@gmail.com>
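For illustration, each PCM chunk ends up wrapped in a JSON event roughly like the following (a sketch only; `pcm_chunk_to_event` is a made-up helper, and the adapter builds the event via a serde struct rather than `serde_json::json!`):

```rust
use base64::Engine;
use serde_json::json;

// Wrap one raw PCM chunk in an input_audio_buffer.append event.
fn pcm_chunk_to_event(chunk: &[u8]) -> String {
    let audio = base64::engine::general_purpose::STANDARD.encode(chunk);
    json!({ "type": "input_audio_buffer.append", "audio": audio }).to_string()
}
```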
✅ Deploy Preview for hyprnote ready!
✅ Deploy Preview for hyprnote-storybook ready!
📝 Walkthrough

Adds an audio-to-message conversion hook to RealtimeSttAdapter, updates the OpenAI adapter to emit base64 JSON audio payloads and intent-based WS URLs, refactors the live client to send transformed Message objects (TransformedInput/TransformedDualInput), and adds the base64 dependency and rate-aware test helpers.

Changes
Sequence Diagram(s): mermaid diagram omitted.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 1
🧹 Nitpick comments (1)
owhisper/owhisper-client/src/adapter/openai/live.rs (1)
40-48: Consider handling the serialization error instead of .unwrap().

While InputAudioBufferAppend serialization is unlikely to fail, using .unwrap() on line 47 could panic. Consider graceful error handling or document why panic is acceptable here.

```diff
 fn audio_to_message(&self, audio: bytes::Bytes) -> Message {
     use base64::Engine;
     let base64_audio = base64::engine::general_purpose::STANDARD.encode(&audio);
     let event = InputAudioBufferAppend {
         event_type: "input_audio_buffer.append".to_string(),
         audio: base64_audio,
     };
-    Message::Text(serde_json::to_string(&event).unwrap().into())
+    // Safe: InputAudioBufferAppend contains only String fields which always serialize
+    Message::Text(serde_json::to_string(&event).expect("InputAudioBufferAppend serialization").into())
 }
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
- owhisper/owhisper-client/Cargo.toml (1 hunks)
- owhisper/owhisper-client/src/adapter/mod.rs (1 hunks)
- owhisper/owhisper-client/src/adapter/openai/live.rs (11 hunks)
- owhisper/owhisper-client/src/adapter/openai/mod.rs (4 hunks)
- owhisper/owhisper-client/src/live.rs (8 hunks)
- owhisper/owhisper-client/src/test_utils.rs (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
owhisper/owhisper-client/src/adapter/mod.rs (1)
- owhisper/owhisper-client/src/adapter/openai/live.rs (1): audio_to_message (40-48)

owhisper/owhisper-client/src/adapter/openai/live.rs (3)
- owhisper/owhisper-client/src/adapter/mod.rs (2): build_ws_url (41-41), audio_to_message (59-61)
- owhisper/owhisper-client/src/adapter/openai/mod.rs (1): build_ws_url_from_base (24-53)
- owhisper/owhisper-client/src/test_utils.rs (2): run_dual_test_with_rate (124-174), run_single_test_with_rate (74-115)

owhisper/owhisper-client/src/live.rs (2)
- crates/ws/tests/client_tests.rs (2): to_input (24-26), to_message (28-30)
- crates/ws/src/client.rs (2): to_input (41-41), to_message (42-42)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Devin
- GitHub Check: Redirect rules - hyprnote
- GitHub Check: Header rules - hyprnote
- GitHub Check: Pages changed - hyprnote
- GitHub Check: desktop_ci (linux, depot-ubuntu-22.04-8)
- GitHub Check: desktop_ci (macos, depot-macos-14)
- GitHub Check: desktop_ci (linux, depot-ubuntu-24.04-8)
- GitHub Check: fmt
🔇 Additional comments (20)
owhisper/owhisper-client/Cargo.toml (1)
1-35: Dependency placement and ordering look appropriate.

The base64 dependency is correctly placed in the runtime [dependencies] section rather than [dev-dependencies], and maintains alphabetical ordering within the dependency list. This aligns with the audio-to-message transformation requirements described in the PR summary.

owhisper/owhisper-client/src/adapter/mod.rs (1)
59-61: LGTM! Clean trait extension with a sensible default.

The default implementation returning Message::Binary(audio) maintains backward compatibility for adapters that send raw binary audio, while allowing adapters like OpenAI to override with custom encoding (base64 text).
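As a rough sketch, the hook described here could look like the following, assuming a tungstenite-style Message whose Binary variant takes Bytes; the repo's actual trait carries more methods and may use different crate paths:

```rust
use bytes::Bytes;
use tokio_tungstenite::tungstenite::Message;

pub trait RealtimeSttAdapter {
    // ...other adapter methods elided...

    /// Default: forward audio as a raw binary frame, so existing adapters
    /// (Deepgram, AssemblyAI, Soniox) keep their current behavior.
    fn audio_to_message(&self, audio: Bytes) -> Message {
        Message::Binary(audio)
    }
}
```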
owhisper/owhisper-client/src/adapter/openai/mod.rs (3)

6-6: LGTM! Clearer constant naming.

Renaming to DEFAULT_TRANSCRIPTION_MODEL better describes its purpose in the transcription context.

60-84: LGTM! Tests properly updated.

Tests cover the main scenarios: empty base (default URL), proxy path, and localhost handling. All assertions correctly reflect the new intent=transcription parameter behavior.

24-53: The intent=transcription parameter correctly aligns with OpenAI Realtime API requirements for transcription sessions.

The implementation is correct. OpenAI's Realtime API officially supports the intent query parameter with transcription as the value for transcription-only sessions (as opposed to conversation mode). The code properly constructs WebSocket URLs with this parameter and prevents duplicate intent parameters. The change from the model parameter to intent=transcription is the appropriate approach for this use case.
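For illustration, the duplicate-parameter guarding amounts to something like the following; with_transcription_intent is a made-up name, and the real logic in build_ws_url_from_base also handles proxy paths and defaults:

```rust
use url::Url;

// Append intent=transcription only if the caller has not already set it.
fn with_transcription_intent(base: &str) -> Url {
    let mut url = Url::parse(base).expect("valid base URL");
    let already_set = url.query_pairs().any(|(k, _)| k == "intent");
    if !already_set {
        url.query_pairs_mut().append_pair("intent", "transcription");
    }
    url
}

// with_transcription_intent("wss://api.openai.com/v1/realtime")
//   -> "wss://api.openai.com/v1/realtime?intent=transcription"
```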
owhisper/owhisper-client/src/adapter/openai/live.rs (6)

19-30: LGTM! Simplified URL construction.

Ignoring unused parameters and delegating to build_ws_url_from_base keeps the implementation clean. Query parameters are properly appended.

61-74: LGTM! Dynamic sample rate configuration.

Using params.sample_rate instead of a hardcoded value allows proper rate handling. Fallback to DEFAULT_TRANSCRIPTION_MODEL is appropriate.

130-137: LGTM! New event handling for VAD events.

Handling InputAudioBufferSpeechStarted and InputAudioBufferSpeechStopped events with debug tracing improves observability of the transcription flow.

250-255: LGTM! New struct for audio append events.

InputAudioBufferAppend correctly models the OpenAI event structure with a proper serde rename for the type field.

274-277: LGTM! Enum variants for speech detection events.

New variants properly map to OpenAI's input_audio_buffer.speech_started and input_audio_buffer.speech_stopped event types.
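The serde shapes under review are roughly the following (a sketch: the repo's event enum is larger, ServerEvent is a placeholder name, and the actual definitions may carry additional fields):

```rust
use serde::{Deserialize, Serialize};

// Client event: one audio chunk appended to the input buffer.
#[derive(Serialize)]
struct InputAudioBufferAppend {
    #[serde(rename = "type")]
    event_type: String, // "input_audio_buffer.append"
    audio: String,      // base64-encoded PCM
}

// Server-side VAD events, matched by their "type" tag.
#[derive(Deserialize)]
#[serde(tag = "type")]
enum ServerEvent {
    #[serde(rename = "input_audio_buffer.speech_started")]
    InputAudioBufferSpeechStarted,
    #[serde(rename = "input_audio_buffer.speech_stopped")]
    InputAudioBufferSpeechStopped,
}
```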
359-395: LGTM! Tests properly parameterized with sample rate.

Using the OPENAI_SAMPLE_RATE = 24000 constant and passing it consistently to ListenParams and the run_*_test_with_rate functions ensures OpenAI's required 24kHz sample rate is used.

owhisper/owhisper-client/src/test_utils.rs (4)
26-33: LGTM! Clean backward-compatible refactor.

Renaming to default_sample_rate() and having existing functions delegate to rate-aware variants maintains API compatibility while adding flexibility.

35-48: LGTM! Rate-aware audio stream generation.

Parameterizing sample_rate allows tests to match provider-specific requirements (e.g., OpenAI's 24kHz).
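Purely for illustration, a rate-parameterized audio helper boils down to something like this (the actual test_utils may source its audio differently, e.g. from a fixture file):

```rust
// Generate `secs` seconds of 16-bit mono PCM sine audio at `sample_rate`.
fn sine_pcm(sample_rate: u32, secs: f32, freq_hz: f32) -> Vec<u8> {
    let n = (sample_rate as f32 * secs) as usize;
    let mut out = Vec::with_capacity(n * 2);
    for i in 0..n {
        let t = i as f32 / sample_rate as f32;
        let sample = (t * freq_hz * std::f32::consts::TAU).sin();
        out.extend_from_slice(&((sample * i16::MAX as f32) as i16).to_le_bytes());
    }
    out
}

// sine_pcm(24_000, 1.0, 440.0) gives one second of audio at OpenAI's 24 kHz.
```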
74-115: LGTM! Rate-aware test runner for single-channel tests.

Properly uses the sample_rate parameter to generate the appropriate test audio stream.

124-174: LGTM! Rate-aware test runner for dual-channel tests.

Mirrors the single-channel implementation correctly for dual-channel scenarios.
owhisper/owhisper-client/src/live.rs (5)
116-117: LGTM! Clear type alias for transformed messages.

TransformedInput properly represents the audio data after adapter transformation, maintaining the MixedMessage pattern for audio vs control messages.

146-147: LGTM! TransformedDualInput captures both raw and transformed data.

The tuple (bytes::Bytes, bytes::Bytes, Message) allows passing both raw mic/speaker audio (for potential interleaving) and the pre-transformed message.

205-216: LGTM! Audio stream transformation centralized through adapter.

Cloning the adapter for the closure and mapping each audio message through audio_to_message cleanly centralizes the encoding logic.

257-270: LGTM! Native multichannel transformation with interleaving.

The transformation correctly interleaves mic/speaker audio before calling audio_to_message, then packages all three pieces into TransformedDualInput.
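Conceptually, the interleave step looks like this (a sketch assuming equal-length 16-bit little-endian buffers; the repo's interleave_audio is the authoritative version):

```rust
use bytes::Bytes;

// Alternate one mic sample and one speaker sample into a stereo buffer,
// which is then handed to adapter.audio_to_message as a single chunk.
fn interleave_audio(mic: &[u8], speaker: &[u8]) -> Bytes {
    let mut out = Vec::with_capacity(mic.len() + speaker.len());
    for (m, s) in mic.chunks_exact(2).zip(speaker.chunks_exact(2)) {
        out.extend_from_slice(m); // left channel: mic
        out.extend_from_slice(s); // right channel: speaker
    }
    Bytes::from(out)
}
```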
356-376: LGTM! Split-channel forwarding with per-channel transformation.

Each channel's audio is independently transformed via adapter.audio_to_message, correctly handling the split WebSocket case.
Co-Authored-By: yujonglee <yujonglee.dev@gmail.com>
Actionable comments posted: 0
🧹 Nitpick comments (2)
owhisper/owhisper-client/src/live.rs (2)
146-162: Consider removing unused tuple elements from TransformedDualInput.

The TransformedDualInput type carries the original mic and speaker bytes in a tuple alongside the transformed Message (line 146), but these bytes are immediately discarded in to_input at line 157. This means every dual audio chunk carries unnecessary data through the channel.

Simplify the type to avoid the overhead:

```diff
-pub type TransformedDualInput = MixedMessage<(bytes::Bytes, bytes::Bytes, Message), ControlMessage>;
+pub type TransformedDualInput = MixedMessage<Message, ControlMessage>;
```

Then update line 259 in from_realtime_audio_native:

```diff
-    TransformedDualInput::Audio((mic, speaker, msg))
+    TransformedDualInput::Audio(msg)
```

And simplify lines 157-159:

```diff
-    TransformedDualInput::Audio((_, _, transform_fn_result)) => {
-        TransformedInput::Audio(transform_fn_result)
+    TransformedDualInput::Audio(msg) => {
+        TransformedInput::Audio(msg)
```
357-371: Consider consistent error handling or logging for dropped messages.

The function silently ignores send errors at lines 362-363 (try_send) and 366-367 (send). While acceptable in a spawned task with no error propagation path, you might want to:

- Use consistent send methods (all try_send or all send().await)
- Log dropped audio or control messages for debugging
- Add a comment explaining the intentional silent drop
Current behavior is acceptable since channel closure typically indicates WebSocket termination, making message delivery moot.
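If logging ever becomes desirable, a minimal shape could be the following (placeholder names, not the repo's):

```rust
use bytes::Bytes;
use tokio::sync::mpsc;

// Make the drop explicit instead of silently discarding the error.
fn forward_audio(tx: &mpsc::Sender<Bytes>, chunk: Bytes) {
    if let Err(e) = tx.try_send(chunk) {
        tracing::debug!(error = %e, "dropping audio chunk: channel closed or full");
    }
}
```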
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
owhisper/owhisper-client/src/live.rs (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
owhisper/owhisper-client/src/live.rs (3)
- crates/ws/src/client.rs (2): to_input (41-41), to_message (42-42)
- owhisper/owhisper-client/src/adapter/parsing.rs (1): speaker (87-90)
- owhisper/owhisper-client/src/lib.rs (1): adapter (56-63)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: Redirect rules - hyprnote-storybook
- GitHub Check: Header rules - hyprnote-storybook
- GitHub Check: Pages changed - hyprnote-storybook
- GitHub Check: Redirect rules - hyprnote
- GitHub Check: Header rules - hyprnote
- GitHub Check: Pages changed - hyprnote
- GitHub Check: fmt
- GitHub Check: desktop_ci (linux, depot-ubuntu-24.04-8)
- GitHub Check: desktop_ci (macos, depot-macos-14)
- GitHub Check: desktop_ci (linux, depot-ubuntu-22.04-8)
- GitHub Check: Devin
🔇 Additional comments (6)
owhisper/owhisper-client/src/live.rs (6)
116-116: LGTM!

The TransformedInput type alias clearly represents the post-transformation state where audio data has been converted to a Message by the adapter.

121-136: LGTM!

The ListenClientIO implementation correctly handles the new TransformedInput type. Audio messages are already transformed by the adapter, so they're passed through directly, while control messages are serialized to JSON.

200-211: LGTM!

The transformation logic correctly routes audio through adapter.audio_to_message() before WebSocket transmission, while control messages pass through unchanged. The adapter clone is necessary for the closure.
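In essence, the transformation is a map over the outgoing audio stream; a simplified stand-in, using a closure instead of the adapter trait and assuming a tungstenite-style Message:

```rust
use bytes::Bytes;
use futures_util::{Stream, StreamExt};
use tokio_tungstenite::tungstenite::Message;

// Each raw PCM chunk becomes a ready-to-send WebSocket message before it
// reaches the sink; `encode` stands in for adapter.audio_to_message.
fn transform_audio<S>(
    encode: impl Fn(Bytes) -> Message,
    audio: S,
) -> impl Stream<Item = Message>
where
    S: Stream<Item = Bytes>,
{
    audio.map(move |chunk| encode(chunk))
}
```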
252-265: LGTM!

The native multichannel transformation correctly interleaves mic and speaker audio before passing it through adapter.audio_to_message(). The past review concern about redundant interleave_audio calls has been addressed: interleaving now occurs only once here (line 257), not in to_input.

290-311: LGTM!

The split-path channel setup correctly uses TransformedInput types and passes the adapter to forward_dual_to_single for audio transformation.

351-356: LGTM!

The updated signature correctly parameterizes the function with the adapter and uses TransformedInput channel types, enabling audio transformation in the split-path scenario.
Fix OpenAI Realtime API transcription test
Summary
Fixes the failing OpenAI Realtime API transcription test by implementing the correct API protocol for transcription-only sessions.
Key changes:
- Use the intent=transcription URL parameter instead of the model parameter for transcription sessions
- Send base64-encoded JSON audio events (input_audio_buffer.append) instead of raw binary WebSocket messages
- Add an audio_to_message method to the RealtimeSttAdapter trait with default binary passthrough for backward compatibility
- Make the sample rate configurable via params.sample_rate
- Add speech_started/speech_stopped event handlers for debugging

Updates since last revision
- Removed the redundant interleave_audio call in ListenClientDualIO::to_input (code review feedback)

Review & Testing Checklist for Human
- Changes in live.rs modify how audio streams are transformed before sending to WebSocket. Run tests for Deepgram, AssemblyAI, and Soniox adapters to ensure the default audio_to_message implementation (raw binary) maintains backward compatibility
- The TransformedDualInput type in live.rs:145-162 carries pre-transformed messages - verify this logic is correct for native multichannel providers
- Verify the input_audio_buffer.append JSON structure with base64 audio matches OpenAI's expected format per their docs

Recommended test plan:
Notes
- Tests use TEST_TIMEOUT_SECS=30 because OpenAI's VAD needs time to detect speech boundaries before returning transcription
- Added the base64 dependency (v0.22.1, matching the workspace version)

Link to Devin run: https://app.devin.ai/sessions/0e8cdca88bb14e52a1b645f66978d1f7
Requested by: yujonglee (@yujonglee)