Fix OpenAI Realtime API transcription test #2127
Conversation
- Add intent=transcription to WebSocket URL for transcription-only sessions
- Add session.type = transcription in session.update payload
- Implement audio_to_message method to wrap audio in base64-encoded JSON events
- Add InputAudioBufferAppend struct for proper audio event serialization
- Update live.rs to transform audio stream before passing to WebSocket client
- Add configurable sample rate support (OpenAI requires 24kHz PCM)
- Add speech_started and speech_stopped event handlers for better debugging
- Add base64 dependency for audio encoding

Co-Authored-By: yujonglee <yujonglee.dev@gmail.com>
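For illustration, each PCM chunk ends up wrapped in a JSON event roughly like the following (a sketch only; `pcm_chunk_to_event` is a made-up helper, and the adapter builds the event via a serde struct rather than `serde_json::json!`):

```rust
use base64::Engine;
use serde_json::json;

// Wrap one raw PCM chunk in an input_audio_buffer.append event.
fn pcm_chunk_to_event(chunk: &[u8]) -> String {
    let audio = base64::engine::general_purpose::STANDARD.encode(chunk);
    json!({ "type": "input_audio_buffer.append", "audio": audio }).to_string()
}
```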
✅ Deploy Preview for hyprnote ready!
✅ Deploy Preview for hyprnote-storybook ready!
📝 Walkthrough

Adds an audio-to-message conversion hook to RealtimeSttAdapter, updates the OpenAI adapter to emit base64 JSON audio payloads and intent-based WS URLs, refactors the live client to send transformed Message objects (TransformedInput/TransformedDualInput), and adds the base64 dependency and rate-aware test helpers.

Changes
Sequence Diagram(s): mermaid diagram omitted.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 1
🧹 Nitpick comments (1)
owhisper/owhisper-client/src/adapter/openai/live.rs (1)
40-48: Consider handling the serialization error instead of .unwrap().

While InputAudioBufferAppend serialization is unlikely to fail, using .unwrap() on line 47 could panic. Consider graceful error handling or document why panic is acceptable here.

```diff
 fn audio_to_message(&self, audio: bytes::Bytes) -> Message {
     use base64::Engine;
     let base64_audio = base64::engine::general_purpose::STANDARD.encode(&audio);
     let event = InputAudioBufferAppend {
         event_type: "input_audio_buffer.append".to_string(),
         audio: base64_audio,
     };
-    Message::Text(serde_json::to_string(&event).unwrap().into())
+    // Safe: InputAudioBufferAppend contains only String fields which always serialize
+    Message::Text(serde_json::to_string(&event).expect("InputAudioBufferAppend serialization").into())
 }
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
- owhisper/owhisper-client/Cargo.toml (1 hunks)
- owhisper/owhisper-client/src/adapter/mod.rs (1 hunks)
- owhisper/owhisper-client/src/adapter/openai/live.rs (11 hunks)
- owhisper/owhisper-client/src/adapter/openai/mod.rs (4 hunks)
- owhisper/owhisper-client/src/live.rs (8 hunks)
- owhisper/owhisper-client/src/test_utils.rs (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
owhisper/owhisper-client/src/adapter/mod.rs (1)
- owhisper/owhisper-client/src/adapter/openai/live.rs (1): audio_to_message (40-48)

owhisper/owhisper-client/src/adapter/openai/live.rs (3)
- owhisper/owhisper-client/src/adapter/mod.rs (2): build_ws_url (41-41), audio_to_message (59-61)
- owhisper/owhisper-client/src/adapter/openai/mod.rs (1): build_ws_url_from_base (24-53)
- owhisper/owhisper-client/src/test_utils.rs (2): run_dual_test_with_rate (124-174), run_single_test_with_rate (74-115)

owhisper/owhisper-client/src/live.rs (2)
- crates/ws/tests/client_tests.rs (2): to_input (24-26), to_message (28-30)
- crates/ws/src/client.rs (2): to_input (41-41), to_message (42-42)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Devin
- GitHub Check: Redirect rules - hyprnote
- GitHub Check: Header rules - hyprnote
- GitHub Check: Pages changed - hyprnote
- GitHub Check: desktop_ci (linux, depot-ubuntu-22.04-8)
- GitHub Check: desktop_ci (macos, depot-macos-14)
- GitHub Check: desktop_ci (linux, depot-ubuntu-24.04-8)
- GitHub Check: fmt
🔇 Additional comments (20)
owhisper/owhisper-client/Cargo.toml (1)
1-35: Dependency placement and ordering look appropriate.

The base64 dependency is correctly placed in the runtime [dependencies] section rather than [dev-dependencies], and maintains alphabetical ordering within the dependency list. This aligns with the audio-to-message transformation requirements described in the PR summary.

owhisper/owhisper-client/src/adapter/mod.rs (1)
59-61: LGTM! Clean trait extension with a sensible default.

The default implementation returning Message::Binary(audio) maintains backward compatibility for adapters that send raw binary audio, while allowing adapters like OpenAI to override with custom encoding (base64 text).
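As a rough sketch, the hook described here could look like the following, assuming a tungstenite-style Message whose Binary variant takes Bytes; the repo's actual trait carries more methods and may use different crate paths:

```rust
use bytes::Bytes;
use tokio_tungstenite::tungstenite::Message;

pub trait RealtimeSttAdapter {
    // ...other adapter methods elided...

    /// Default: forward audio as a raw binary frame, so existing adapters
    /// (Deepgram, AssemblyAI, Soniox) keep their current behavior.
    fn audio_to_message(&self, audio: Bytes) -> Message {
        Message::Binary(audio)
    }
}
```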
owhisper/owhisper-client/src/adapter/openai/mod.rs (3)

6-6: LGTM! Clearer constant naming.

Renaming to DEFAULT_TRANSCRIPTION_MODEL better describes its purpose in the transcription context.

60-84: LGTM! Tests properly updated.

Tests cover the main scenarios: empty base (default URL), proxy path, and localhost handling. All assertions correctly reflect the new intent=transcription parameter behavior.

24-53: The intent=transcription parameter correctly aligns with OpenAI Realtime API requirements for transcription sessions.

The implementation is correct. OpenAI's Realtime API officially supports the intent query parameter with transcription as the value for transcription-only sessions (as opposed to conversation mode). The code properly constructs WebSocket URLs with this parameter and prevents duplicate intent parameters. The change from the model parameter to intent=transcription is the appropriate approach for this use case.
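For illustration, the duplicate-parameter guarding amounts to something like the following; with_transcription_intent is a made-up name, and the real logic in build_ws_url_from_base also handles proxy paths and defaults:

```rust
use url::Url;

// Append intent=transcription only if the caller has not already set it.
fn with_transcription_intent(base: &str) -> Url {
    let mut url = Url::parse(base).expect("valid base URL");
    let already_set = url.query_pairs().any(|(k, _)| k == "intent");
    if !already_set {
        url.query_pairs_mut().append_pair("intent", "transcription");
    }
    url
}

// with_transcription_intent("wss://api.openai.com/v1/realtime")
//   -> "wss://api.openai.com/v1/realtime?intent=transcription"
```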
owhisper/owhisper-client/src/adapter/openai/live.rs (6)

19-30: LGTM! Simplified URL construction.

Ignoring unused parameters and delegating to build_ws_url_from_base keeps the implementation clean. Query parameters are properly appended.

61-74: LGTM! Dynamic sample rate configuration.

Using params.sample_rate instead of a hardcoded value allows proper rate handling. Fallback to DEFAULT_TRANSCRIPTION_MODEL is appropriate.

130-137: LGTM! New event handling for VAD events.

Handling InputAudioBufferSpeechStarted and InputAudioBufferSpeechStopped events with debug tracing improves observability of the transcription flow.

250-255: LGTM! New struct for audio append events.

InputAudioBufferAppend correctly models the OpenAI event structure with a proper serde rename for the type field.

274-277: LGTM! Enum variants for speech detection events.

New variants properly map to OpenAI's input_audio_buffer.speech_started and input_audio_buffer.speech_stopped event types.
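The serde shapes under review are roughly the following (a sketch: the repo's event enum is larger, ServerEvent is a placeholder name, and the actual definitions may carry additional fields):

```rust
use serde::{Deserialize, Serialize};

// Client event: one audio chunk appended to the input buffer.
#[derive(Serialize)]
struct InputAudioBufferAppend {
    #[serde(rename = "type")]
    event_type: String, // "input_audio_buffer.append"
    audio: String,      // base64-encoded PCM
}

// Server-side VAD events, matched by their "type" tag.
#[derive(Deserialize)]
#[serde(tag = "type")]
enum ServerEvent {
    #[serde(rename = "input_audio_buffer.speech_started")]
    InputAudioBufferSpeechStarted,
    #[serde(rename = "input_audio_buffer.speech_stopped")]
    InputAudioBufferSpeechStopped,
}
```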
359-395: LGTM! Tests properly parameterized with sample rate.

Using the OPENAI_SAMPLE_RATE = 24000 constant and passing it consistently to ListenParams and the run_*_test_with_rate functions ensures OpenAI's required 24kHz sample rate is used.

owhisper/owhisper-client/src/test_utils.rs (4)
26-33: LGTM! Clean backward-compatible refactor.

Renaming to default_sample_rate() and having existing functions delegate to rate-aware variants maintains API compatibility while adding flexibility.

35-48: LGTM! Rate-aware audio stream generation.

Parameterizing sample_rate allows tests to match provider-specific requirements (e.g., OpenAI's 24kHz).
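Purely for illustration, a rate-parameterized audio helper boils down to something like this (the actual test_utils may source its audio differently, e.g. from a fixture file):

```rust
// Generate `secs` seconds of 16-bit mono PCM sine audio at `sample_rate`.
fn sine_pcm(sample_rate: u32, secs: f32, freq_hz: f32) -> Vec<u8> {
    let n = (sample_rate as f32 * secs) as usize;
    let mut out = Vec::with_capacity(n * 2);
    for i in 0..n {
        let t = i as f32 / sample_rate as f32;
        let sample = (t * freq_hz * std::f32::consts::TAU).sin();
        out.extend_from_slice(&((sample * i16::MAX as f32) as i16).to_le_bytes());
    }
    out
}

// sine_pcm(24_000, 1.0, 440.0) gives one second of audio at OpenAI's 24 kHz.
```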
74-115: LGTM! Rate-aware test runner for single-channel tests.

Properly uses the sample_rate parameter to generate the appropriate test audio stream.

124-174: LGTM! Rate-aware test runner for dual-channel tests.

Mirrors the single-channel implementation correctly for dual-channel scenarios.
owhisper/owhisper-client/src/live.rs (5)
116-117: LGTM! Clear type alias for transformed messages.

TransformedInput properly represents the audio data after adapter transformation, maintaining the MixedMessage pattern for audio vs control messages.

146-147: LGTM! TransformedDualInput captures both raw and transformed data.

The tuple (bytes::Bytes, bytes::Bytes, Message) allows passing both raw mic/speaker audio (for potential interleaving) and the pre-transformed message.

205-216: LGTM! Audio stream transformation centralized through adapter.

Cloning the adapter for the closure and mapping each audio message through audio_to_message cleanly centralizes the encoding logic.

257-270: LGTM! Native multichannel transformation with interleaving.

The transformation correctly interleaves mic/speaker audio before calling audio_to_message, then packages all three pieces into TransformedDualInput.
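Conceptually, the interleave step looks like this (a sketch assuming equal-length 16-bit little-endian buffers; the repo's interleave_audio is the authoritative version):

```rust
use bytes::Bytes;

// Alternate one mic sample and one speaker sample into a stereo buffer,
// which is then handed to adapter.audio_to_message as a single chunk.
fn interleave_audio(mic: &[u8], speaker: &[u8]) -> Bytes {
    let mut out = Vec::with_capacity(mic.len() + speaker.len());
    for (m, s) in mic.chunks_exact(2).zip(speaker.chunks_exact(2)) {
        out.extend_from_slice(m); // left channel: mic
        out.extend_from_slice(s); // right channel: speaker
    }
    Bytes::from(out)
}
```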
356-376: LGTM! Split-channel forwarding with per-channel transformation.

Each channel's audio is independently transformed via adapter.audio_to_message, correctly handling the split WebSocket case.
Co-Authored-By: yujonglee <yujonglee.dev@gmail.com>
Actionable comments posted: 0
🧹 Nitpick comments (2)
owhisper/owhisper-client/src/live.rs (2)
146-162: Consider removing unused tuple elements from TransformedDualInput.

The TransformedDualInput type carries the original mic and speaker bytes in a tuple alongside the transformed Message (line 146), but these bytes are immediately discarded in to_input at line 157. This means every dual audio chunk carries unnecessary data through the channel.

Simplify the type to avoid the overhead:

```diff
-pub type TransformedDualInput = MixedMessage<(bytes::Bytes, bytes::Bytes, Message), ControlMessage>;
+pub type TransformedDualInput = MixedMessage<Message, ControlMessage>;
```

Then update line 259 in from_realtime_audio_native:

```diff
-    TransformedDualInput::Audio((mic, speaker, msg))
+    TransformedDualInput::Audio(msg)
```

And simplify lines 157-159:

```diff
-    TransformedDualInput::Audio((_, _, transform_fn_result)) => {
-        TransformedInput::Audio(transform_fn_result)
+    TransformedDualInput::Audio(msg) => {
+        TransformedInput::Audio(msg)
```
357-371: Consider consistent error handling or logging for dropped messages.

The function silently ignores send errors at lines 362-363 (try_send) and 366-367 (send). While acceptable in a spawned task with no error propagation path, you might want to:

- Use consistent send methods (all try_send or all send().await)
- Log dropped audio or control messages for debugging
- Add a comment explaining the intentional silent drop
Current behavior is acceptable since channel closure typically indicates WebSocket termination, making message delivery moot.
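If logging ever becomes desirable, a minimal shape could be the following (placeholder names, not the repo's):

```rust
use bytes::Bytes;
use tokio::sync::mpsc;

// Make the drop explicit instead of silently discarding the error.
fn forward_audio(tx: &mpsc::Sender<Bytes>, chunk: Bytes) {
    if let Err(e) = tx.try_send(chunk) {
        tracing::debug!(error = %e, "dropping audio chunk: channel closed or full");
    }
}
```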
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
owhisper/owhisper-client/src/live.rs (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
owhisper/owhisper-client/src/live.rs (3)
- crates/ws/src/client.rs (2): to_input (41-41), to_message (42-42)
- owhisper/owhisper-client/src/adapter/parsing.rs (1): speaker (87-90)
- owhisper/owhisper-client/src/lib.rs (1): adapter (56-63)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: Redirect rules - hyprnote-storybook
- GitHub Check: Header rules - hyprnote-storybook
- GitHub Check: Pages changed - hyprnote-storybook
- GitHub Check: Redirect rules - hyprnote
- GitHub Check: Header rules - hyprnote
- GitHub Check: Pages changed - hyprnote
- GitHub Check: fmt
- GitHub Check: desktop_ci (linux, depot-ubuntu-24.04-8)
- GitHub Check: desktop_ci (macos, depot-macos-14)
- GitHub Check: desktop_ci (linux, depot-ubuntu-22.04-8)
- GitHub Check: Devin
🔇 Additional comments (6)
owhisper/owhisper-client/src/live.rs (6)
116-116: LGTM!

The TransformedInput type alias clearly represents the post-transformation state where audio data has been converted to a Message by the adapter.

121-136: LGTM!

The ListenClientIO implementation correctly handles the new TransformedInput type. Audio messages are already transformed by the adapter, so they're passed through directly, while control messages are serialized to JSON.

200-211: LGTM!

The transformation logic correctly routes audio through adapter.audio_to_message() before WebSocket transmission, while control messages pass through unchanged. The adapter clone is necessary for the closure.
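In essence, the transformation is a map over the outgoing audio stream; a simplified stand-in, using a closure instead of the adapter trait and assuming a tungstenite-style Message:

```rust
use bytes::Bytes;
use futures_util::{Stream, StreamExt};
use tokio_tungstenite::tungstenite::Message;

// Each raw PCM chunk becomes a ready-to-send WebSocket message before it
// reaches the sink; `encode` stands in for adapter.audio_to_message.
fn transform_audio<S>(
    encode: impl Fn(Bytes) -> Message,
    audio: S,
) -> impl Stream<Item = Message>
where
    S: Stream<Item = Bytes>,
{
    audio.map(move |chunk| encode(chunk))
}
```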
252-265: LGTM!

The native multichannel transformation correctly interleaves mic and speaker audio before passing it through adapter.audio_to_message(). The past review concern about redundant interleave_audio calls has been addressed: interleaving now occurs only once here (line 257), not in to_input.

290-311: LGTM!

The split-path channel setup correctly uses TransformedInput types and passes the adapter to forward_dual_to_single for audio transformation.

351-356: LGTM!

The updated signature correctly parameterizes the function with the adapter and uses TransformedInput channel types, enabling audio transformation in the split-path scenario.
Fix OpenAI Realtime API transcription test
Summary
Fixes the failing OpenAI Realtime API transcription test by implementing the correct API protocol for transcription-only sessions.
Key changes:
- Use the intent=transcription URL parameter instead of the model parameter for transcription sessions
- Send base64-encoded JSON audio events (input_audio_buffer.append) instead of raw binary WebSocket messages
- Add an audio_to_message method to the RealtimeSttAdapter trait with default binary passthrough for backward compatibility
- Make the sample rate configurable via params.sample_rate
- Add speech_started/speech_stopped event handlers for debugging

Updates since last revision
- Removed the redundant interleave_audio call in ListenClientDualIO::to_input (code review feedback)

Review & Testing Checklist for Human
- Changes in live.rs modify how audio streams are transformed before sending to WebSocket. Run tests for Deepgram, AssemblyAI, and Soniox adapters to ensure the default audio_to_message implementation (raw binary) maintains backward compatibility
- The TransformedDualInput type in live.rs:145-162 carries pre-transformed messages - verify this logic is correct for native multichannel providers
- Verify the input_audio_buffer.append JSON structure with base64 audio matches OpenAI's expected format per their docs

Recommended test plan:
Notes
- Tests use TEST_TIMEOUT_SECS=30 because OpenAI's VAD needs time to detect speech boundaries before returning transcription
- Added the base64 dependency (v0.22.1, matching the workspace version)

Link to Devin run: https://app.devin.ai/sessions/0e8cdca88bb14e52a1b645f66978d1f7
Requested by: yujonglee (@yujonglee)