Skip to content

feat: Add voice input with pluggable backend (Gemini zero-install + Whisper)#18499

Open
fayerman-source wants to merge 62 commits intogoogle-gemini:mainfrom
fayerman-source:feat/voice-input-clean
Open

feat: Add voice input with pluggable backend (Gemini zero-install + Whisper)#18499
fayerman-source wants to merge 62 commits intogoogle-gemini:mainfrom
fayerman-source:feat/voice-input-clean

Conversation

@fayerman-source
Copy link

@fayerman-source fayerman-source commented Feb 7, 2026

Summary

This PR adds native voice input to Gemini CLI with a pluggable backend architecture — defaulting to zero-install Gemini transcription (addressing @jacob314's request), with optional local Whisper support for offline use.

Update: Gemini Backend Added (addresses @jacob314's feedback)

Per @jacob314's comment requesting a zero-install option using the Gemini API, we've added a GeminiRestBackend that:

  • Records audio locally via sox or arecord (standard Linux/macOS tools, no separate install)
  • Converts raw PCM to WAV in-memory (no temp files)
  • Transcribes via generateContent using the user's existing Gemini auth (OAuth or API key)
  • Defaults to gemini-2.0-flash — no additional setup required

Note: We used generateContent with audio rather than the Live API streaming approach. The Live API is designed for real-time bidirectional sessions; generateContent is simpler and sufficient for push-to-talk transcription. Happy to discuss if the Live API approach is preferred.

Features

Feature Description
🎤 Voice Recording Press Alt+R or type /voice to toggle recording
Zero-install backend Gemini transcription via existing auth — no Python, no model downloads
🔌 Pluggable backends voice.provider: "gemini" (default) or "whisper" (local)
🌍 Auto Language Detection Whisper/Gemini auto-detects language
Visual Feedback 🎤 (recording), ⏳ (transcribing)
⌨️ Alternative Shortcut Ctrl+Q for conflicting keybindings
🛠️ Slash commands /voice toggle, /voice set-path <path>, /voice help

Settings

{
  "voice": {
    "enabled": true,                   // Enable voice input (default: false)
    "provider": "gemini",              // "gemini" (default, zero-install) or "whisper"
    "whisperPath": "/usr/bin/whisper"  // Optional: path to Whisper binary
  }
}

Architecture

useVoiceInput (hook)
  ├── GeminiRestBackend   ← default: sox/arecord → WAV buffer → generateContent
  └── LocalWhisperBackend ← optional: sox/arecord → WAV file → whisper binary

Key design decisions:

  • Event-based transcript delivery: EventEmitter instead of React state to prevent re-render cascades
  • In-memory audio: Raw PCM chunks concatenated into WAV buffer — no disk I/O for Gemini path
  • Existing auth: Uses config.getContentGenerator() — works with both OAuth and API key
  • Race condition protection: isTogglingRef guard prevents overlapping toggle calls

Audio Pipeline (Gemini backend)

  1. Record raw PCM via sox (cross-platform) or arecord (Linux fallback)
  2. Concatenate PCM chunks into buffer, prepend 44-byte RIFF/WAV header
  3. Send as audio/wav to generateContent with gemini-2.0-flash
  4. Emit transcript text via EventEmitter → inserted at cursor position

Testing

  • ✅ 8 unit tests covering backend selection, delegation, state, transcript events, error surfacing
  • VoiceContext tests pass
  • ✅ All existing tests unaffected

Closes #18067

- Resolve key binding conflict: Restore Alt+V for paste, use Alt+R for voice (Fixes google-gemini#1234)
- Fix import violation: Use @google/gemini-cli-core for tmpdir
- Add missing documentation for key bindings and settings
- Add tests for VoiceContext and fix types in useVoiceInput tests
- Fix Ctrl+C handling in InputPrompt (regression fix)
- Fix infinite render loop in Voice Input (AppContainer/Composer fix)
…render loop

Replace context-based transcript delivery with EventEmitter pattern:
- Remove transcript from VoiceInputState (delivered via events instead)
- Remove clearTranscript function (no longer needed)
- Add onVoiceTranscript subscription function
- InputPrompt listens for transcript events instead of reading from context
- Add useMemo to prevent context value churn

This eliminates the '6-8 frames rendered while idle' error by:
- Preventing context updates when transcript changes
- Eliminating the set/clear cycle that caused render loops
- Keeping recording state (isRecording, isTranscribing, error) in context

Fixes: VOICE_INFINITE_LOOP_ANALYSIS.md
- Add replication test to verify no infinite render loop
- Add stress test for rapid toggle recording calls
- Add log volume test to ensure no excessive logging
Remove hardcoded --language English from whisper commands to allow
auto-detection of spoken language. This prevents transliteration
(Russian spoken → English letters) and correctly transcribes in
the original language (Russian → Russian text).

Affected whisper variants:
- Configured whisper path
- whisper-faster
- whisper (Python package)
@fayerman-source fayerman-source requested review from a team as code owners February 7, 2026 01:05
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @fayerman-source, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Gemini CLI by integrating a native voice input capability. This feature aims to streamline user interaction by enabling natural speech-to-text conversion, making the CLI more accessible and efficient. The implementation focuses on a robust, event-driven design to maintain performance and provides clear feedback to the user throughout the voice input process.

Highlights

  • Native Voice Input: Introduces a native voice input feature for the Gemini CLI, allowing users to speak commands and input text directly.
  • Whisper Integration: Leverages OpenAI Whisper (or compatible alternatives like faster-whisper, whisper-cli) for accurate speech-to-text transcription.
  • Flexible Activation: Users can toggle voice recording using keyboard shortcuts (Alt+R or Ctrl+Q) or the new /voice slash command.
  • Dynamic Language Detection: The system automatically detects the spoken language, removing the need for hardcoded language settings.
  • User Feedback: Provides clear visual indicators (🎤 for recording, ⏳ for transcribing) to inform the user of the voice input status.
  • Optimized Architecture: Implements an event-based architecture for transcript delivery, utilizing EventEmitter to prevent unnecessary React re-renders and ensure UI responsiveness.
  • Robustness: Includes race condition protection (isTogglingRef) and process guards to ensure stable recording and transcription operations.
  • Configurable Backend: Supports various Whisper backends and allows users to specify a custom whisperPath in settings for advanced configurations.
Changelog
  • docs/cli/keyboard-shortcuts.md
    • Added a new 'Voice Input' section detailing the Alt + R and Ctrl + Q keybindings for toggling voice recording.
  • docs/cli/settings.md
    • Introduced a new 'Voice' section with the voice.whisperPath setting, allowing users to specify the path to their Whisper executable.
  • docs/get-started/configuration.md
    • Added documentation for the new voice.whisperPath setting under the voice category.
  • package-lock.json
    • Updated several dependency entries to include "peer": true.
  • packages/cli/src/config/keyBindings.ts
    • Defined a new Command.VOICE_INPUT enum member.
    • Added default keybindings (Alt+R, Ctrl+Q) for the VOICE_INPUT command.
    • Included 'Voice Input' in the commandCategories for better organization.
  • packages/cli/src/config/settingsSchema.ts
    • Added a new voice object to the SETTINGS_SCHEMA to manage voice input configurations, including the whisperPath property.
  • packages/cli/src/services/BuiltinCommandLoader.ts
    • Imported the new voiceCommand.
    • Registered voiceCommand in the list of built-in slash commands.
  • packages/cli/src/test-utils/render.tsx
    • Imported VoiceContext and VoiceInputReturn for testing purposes.
    • Added mockVoiceReturn to provide a default mock for the voice input hook in tests.
    • Modified renderWithProviders to accept and provide a voice mock to the VoiceContext.Provider.
  • packages/cli/src/ui/AppContainer.tsx
    • Imported VoiceContext and useVoiceInput.
    • Integrated the useVoiceInput hook to manage voice recording state and actions.
    • Updated the UIActionsContext value to include toggleVoice.
    • Added keybinding handling for Command.VOICE_INPUT to trigger voice.toggleRecording().
    • Wrapped the main App component with VoiceContext.Provider to make voice input functionality available throughout the UI.
  • packages/cli/src/ui/commands/types.ts
    • Added toggleVoice: () => void; to the CommandContext['ui'] interface.
  • packages/cli/src/ui/commands/voiceCommand.ts
    • New file: Implemented the /voice slash command, which toggles voice recording and provides help documentation when /voice help is used.
  • packages/cli/src/ui/components/Composer.tsx
    • Added a key="input-prompt" prop to the InputPrompt component.
  • packages/cli/src/ui/components/InputPrompt.tsx
    • Imported useVoiceContext and onVoiceTranscript.
    • Integrated useVoiceContext to access voice input state.
    • Added a useEffect hook to subscribe to onVoiceTranscript events, inserting transcribed text into the buffer.
    • Modified keybinding handling for Escape and Quit to stop voice recording if active.
    • Added keybinding handling for Command.VOICE_INPUT to toggle recording.
    • Updated status text and indicator logic to display voice input states (recording, transcribing, error) and their corresponding icons (🎤, ⏳).
  • packages/cli/src/ui/components/snapshots/AskUserDialog.test.tsx.snap
    • Updated snapshots due to changes in the rendering tree, likely related to the VoiceContext integration.
  • packages/cli/src/ui/components/snapshots/InputPrompt.test.tsx.snap
    • Updated snapshots due to changes in the rendering tree and the integration of VoiceContext.
  • packages/cli/src/ui/components/snapshots/RewindViewer.test.tsx.snap
    • Updated snapshots due to changes in the rendering tree.
  • packages/cli/src/ui/components/shared/snapshots/BaseSelectionList.test.tsx.snap
    • Updated snapshots due to changes in the rendering tree.
  • packages/cli/src/ui/contexts/VoiceContext.test.tsx
    • New file: Added unit tests for VoiceContext to ensure it provides the correct voice input state.
  • packages/cli/src/ui/contexts/VoiceContext.tsx
    • New file: Created VoiceContext and useVoiceContext for managing and accessing voice input state, and re-exported onVoiceTranscript.
  • packages/cli/src/ui/hooks/slashCommandProcessor.test.tsx
    • Updated the mock ui object in tests to include the new toggleVoice function.
  • packages/cli/src/ui/hooks/slashCommandProcessor.ts
    • Added toggleVoice: () => void; to the SlashCommandProcessorActions interface.
    • Included actions.toggleVoice in the CommandContext['ui'] object passed to slash commands.
  • packages/cli/src/ui/hooks/useVoiceInput.log-volume.test.ts
    • New file: Added tests to verify that useVoiceInput remains silent (no visible logs) during normal operations and when processing noisy sox stderr output.
  • packages/cli/src/ui/hooks/useVoiceInput.replication.test.tsx
    • New file: Added a replication test to confirm that the event-based transcript delivery works as intended without causing excessive React re-renders in consumer components.
  • packages/cli/src/ui/hooks/useVoiceInput.stress.test.ts
    • New file: Added stress tests to ensure the useVoiceInput hook handles rapid sox progress logs and multiple toggleRecording calls efficiently without excessive re-renders.
  • packages/cli/src/ui/hooks/useVoiceInput.test.ts
    • New file: Comprehensive unit tests for useVoiceInput, covering initialization, starting/stopping recording, sox/arecord fallback, error handling, and event-based transcript emission.
  • packages/cli/src/ui/hooks/useVoiceInput.ts
    • New file: Implemented the core useVoiceInput hook, managing recording processes (sox/arecord), temporary audio file creation, Whisper transcription, and event-based transcript emission to prevent React re-render loops. Includes state management for recording/transcribing status and error handling.
  • packages/cli/src/ui/keyMatchers.test.ts
    • Updated tests for Command.PASTE_CLIPBOARD to include Cmd + V and Alt + V.
    • Added tests for the new Command.VOICE_INPUT keybindings (Alt + R).
  • packages/cli/src/ui/noninteractive/nonInteractiveUi.ts
    • Added a no-op toggleVoice function to the non-interactive UI mock.
  • schemas/settings.schema.json
    • Added the voice property to the root of the settings schema, defining its structure and the whisperPath sub-property.
Activity
  • The author, fayerman-source, initiated this pull request to introduce a significant new feature: native voice input.
  • A detailed summary, demo video, feature list, and technical implementation overview were provided in the PR description.
  • Comprehensive testing was performed, including passing existing tests, adding a new test suite (7 tests), stress tests for rapid toggling, and a clean slate verification.
  • Security and quality considerations were addressed, ensuring no secrets in code, proper temp file cleanup, and adherence to existing code style.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a fantastic new feature for native voice input using Whisper. The implementation is well-structured, leveraging an event-based architecture to prevent UI re-render loops, and includes comprehensive documentation and tests. However, it introduces critical Command Injection vulnerabilities by using child_process.exec with user-controlled settings and unvalidated file paths, specifically in the handling of the user-configurable whisperPath setting. This poses a high security risk, potentially leading to Remote Code Execution (RCE), and must be addressed before merging.

@gemini-cli gemini-cli bot added the area/core Issues related to User Interface, OS Support, Core Functionality label Feb 7, 2026
@fayerman-source
Copy link
Author

🔒 Security Fix Applied

Fixed the command injection vulnerability identified by @gemini-code-assist:

Changes:

  1. **Replaced with ** - No shell spawned, no shell interpretation
  2. Arguments passed as arrays - Never concatenate user input into command strings
  3. **Added ** - Rejects paths with shell metacharacters (, , , , , etc.)
  4. Removed all usage - Including the checks, now using

Before (Vulnerable):

After (Secure):

The fix follows Node.js security best practices (2024) and prevents RCE even with malicious settings.

Replace exec() with execFile() to prevent command injection:
- Use execFile() with argument arrays instead of exec() with string concatenation
- Add validateBinaryPath() to reject paths with shell metacharacters
- Remove promisify(exec) usage entirely
- All arguments now passed as arrays, preventing shell interpretation

This addresses the security review feedback about RCE vulnerability
in the whisperPath setting.
@fayerman-source fayerman-source force-pushed the feat/voice-input-clean branch 2 times, most recently from 92776cf to 1df335c Compare February 7, 2026 06:05
@gemini-cli gemini-cli bot added priority/p2 Important but can be addressed in a future release. help wanted We will accept PRs from all issues marked as "help wanted". Thanks for your support! labels Feb 7, 2026
@fayerman-source
Copy link
Author

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a significant and well-architected voice input feature. The use of an event-based system for transcript delivery is a great choice to prevent performance issues, and the implementation shows good consideration for security and cross-platform support. I've identified two high-severity issues in the core useVoiceInput hook concerning platform portability and a potential race condition that should be addressed to ensure the feature's robustness. Overall, this is a strong contribution.

@fayerman-source fayerman-source force-pushed the feat/voice-input-clean branch 3 times, most recently from 4bed5dd to 5881c3e Compare February 14, 2026 03:04
…sitivity

The hardcoded RMS threshold of 300 was cutting off whispered speech
(~100-400 RMS). Lower the default to 80 (blocks near-silence, allows
whispers in quiet environments) and expose it as a user setting.

- VoiceBackendOptions.silenceThreshold: configurable per-backend option;
  0 disables silence detection entirely
- GeminiRestBackend: reads threshold from options (default 80) with RMS
  guide comment documenting typical amplitude ranges
- settingsSchema: new voice.silenceThreshold number setting (0–1000)
- VoiceInputConfig.silenceThreshold: flows from settings through hook
- AppContainer: passes voice.silenceThreshold into voiceConfig
- /voice sensitivity <0-1000>: new subcommand to set the threshold with
  contextual hints (whisper/moderate/loud)
- /voice (status) and /voice help: both show current sensitivity level
Extract each action handler into a named SlashCommand object and register
help, enable, disable, provider, and sensitivity as subCommands on the
parent voiceCommand. This surfaces autocomplete hints while the user types
/voice, with 'help' listed first as the primary discovery pathway.

set-path remains functional but is intentionally excluded from the hints
list since it is rarely used.
…tion.md

Add entries for voice.enabled, voice.provider, voice.silenceThreshold,
and voice.whisperPath with descriptions, defaults, and runtime command
references (/voice enable, /voice sensitivity, /voice set-path).
@fayerman-source
Copy link
Author

fayerman-source commented Feb 25, 2026

@jacob314 - flagging a few updates since the last review:

Architecture changes (major):

  • Voice backends moved from packages/clipackages/core (GeminiRestBackend, LocalWhisperBackend, VoiceBackend interface)
  • Transcripts now delivered via coreEvents bus (CoreEvent.VoiceTranscript) to avoid React re-render cascades
  • /voice is now a pure settings command — recording is keyboard-only (Alt+R / Ctrl+Q). No CLI activation path.

UX fixes:

  • Placeholder text updates live during recording ("Speak now...") and transcription ("Transcribing your speech...")
  • isTranscribing indicator only shows when an actual API call is made (silent recordings are discarded before hitting the network)
  • Fixed React/Ink render batching issue — added setImmediate yield so the ⏳ indicator actually renders before the Gemini call

New: configurable silence detection

  • voice.silenceThreshold setting (0–1000 RMS, default 80) — allows whispered speech at default, adjustable for noisy environments
  • /voice sensitivity <value> runtime command
  • /voice now shows autocomplete hints for subcommands (help, enable, disable, provider, sensitivity)

Docs: docs/cli/settings.md and docs/reference/configuration.md updated with all four voice settings.

Happy to make iterate further based on your feedback.

Keep VOICE_INPUT test entry from our branch alongside upstream's
removal of the trailing blank line before the shell commands section.
@jacob314
Copy link
Contributor

Thanks for the updates! The event-driven architecture is a big improvement. I ran a detailed review (using the /review-frontend command powered by Gemini 1.5 Pro, manually reviewed by Jacob) and found a few issues that need addressing before we can merge:

1. Failing Tests (Blocker)

packages/cli/src/ui/hooks/useVoiceInput.log-volume.test.ts, useVoiceInput.replication.test.tsx, and useVoiceInput.stress.test.ts are currently failing. They call renderHook(() => useVoiceInput()) without providing voiceConfig.config. Since config is required to initialize the default GeminiRestBackend, the backend resolves to null, causing a "useVoiceInput: no backend initialized" error. This fails the log-volume assertions and prevents state transitions in the stress and replication tests. (It looks like config: mockConfig was added to useVoiceInput.test.ts but missed in these files).

2. Test Stability / Rules Violations

The log-volume, replication, and stress tests contain multiple instances of fixed-time setTimeout calls (e.g., setTimeout(() => mockProcess.emit('exit', 0), 5);, await new Promise((resolve) => setTimeout(resolve, 50));). Our strict-development-rules.md explicitly forbids fixed waits to prevent flakiness. Please update these to use waitFor() from packages/cli/src/test-utils/async.ts with a predicate, or utilize vi.useFakeTimers().

3. Architectural Boundary Violation in Core

In packages/core/src/services/voice/GeminiRestBackend.ts, there is an explicit yield to the Node event loop:

// Yield one macrotask so Ink can flush the isTranscribing:true state
// update and render ⏳ before the network call begins. Without this,
// React batches both state changes and the indicator never renders.
await new Promise<void>((resolve) => setImmediate(resolve));

While this solves the rendering issue, it leaks UI/React batching concerns into packages/core. The core services should not know about Ink rendering semantics. Consider handling the rendering loop/yield within the UI hook or component (e.g., using a separate useEffect to trigger the network call after state updates, or flushSync).

4. File I/O Polling in LocalWhisperBackend.ts

After sending a SIGINT to the recording process, the backend waits up to 2000ms for an exit event and then performs a manual polling loop over 1 second (await new Promise((r) => setTimeout(r, 50)); x 20) to ensure the stat size > 0. This is a bit of a code smell. Relying on the child process's close event (which indicates that the stdio streams of a child process have been closed) rather than just exit usually guarantees that all file handles have been flushed.

Let me know if you need any clarification or help with these!

@jacob314
Copy link
Contributor

Nit: voice input shows up as an Object in the settings dialog. Please change the root object to not be shown in the dialog to fix.
image

@jacob314
Copy link
Contributor

Colors for the voice input help commands should be tweaked.
image
Align them with the colors you see used in /help

requiresRestart: false,
default: {},
description: 'Settings for voice input.',
showInDialog: true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove showInDialog for line 776. The object itself should not be shown in the dialog just children.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — removed showInDialog: true from the parent voice object in settingsSchema.ts. Child properties that should appear in the settings dialog retain their individual flags.


if (keyMatchers[Command.ESCAPE](key)) {
if (voiceState.isRecording) {
void toggleRecording();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this toggle? would think it should cancel

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — added cancel() to the VoiceBackend interface and implemented it in both GeminiRestBackend and LocalWhisperBackend. It kills the recording process and discards audio without transcribing. useVoiceInput now exposes cancelRecording, and InputPrompt calls it on Escape when recording is active. Unit tests added for both the happy path and no-op case.

}

// Voice input status
if (voiceState.isRecording) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these messages aren't showing for me. I simply see a red > and no other indicator. I've repeatedly pressed Q

Image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I look closely it appears the microphone rendered for 1 frame and then immediately went away. However there are not any errors in my console and I am on a vanilla macbook pro

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The recording indicator should now render reliably. The root issue was that Ink needs a macrotask yield to flush state updates before the network call begins — setImmediate was previously in core (a layer boundary violation) and has since been moved to the UI hook's onStateChange wrapper. If you can still reproduce on Mac, please let us know.


| Action | Keys |
| ----------------------------------------------- | ------------------------- |
| Toggle voice input recording (Alt+R or Ctrl+Q). | `Alt + R`<br />`Ctrl + Q` |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alt+R does not work on mac. if you want it to work you would have to make additional changes.

Once the rest of the PR is reviewed we will need to circle back with @scidomino to figure out viable keyboard shortcuts for this if any. we may need to launch with just /voice as there are not many shortcuts available at least until we support custom shortcuts.

pressing option-r on mac just generates ®

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On mac I just use the built in dictation feature which by default is mapped to hitting the fn key twice.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mac users: Ctrl+Q should work in iTerm2/Warp.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fayerman-source Pretty much every key combo is available in some environments and already taken in others.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something to consider is that a competitor recently adopted "press-and-hold of spacebar" for a very similar feature.

- Revert unrelated DebugProfiler hunk (extensionsStarting/Stopping
  listeners belong in PR google-gemini#20101, not voice PR)
- Remove showInDialog from parent voice object in settingsSchema
  (only child properties should have it)
- Escape key now calls cancelRecording() instead of toggleRecording()
  so it discards audio without transcribing; adds cancel() to
  VoiceBackend interface and both backends
- Move setImmediate yield out of GeminiRestBackend (core) into
  useVoiceInput (UI): onStateChange now returns Promise<void> so the
  UI layer can insert the yield after isTranscribing:true
- Replace LocalWhisperBackend stat polling loop with close event
  listener for cleaner process lifecycle handling
- /voice help now renders VoiceHelp component styled with theme
  colors matching /help (accent for commands, primary for labels)
- Fix 3 failing test files (log-volume, stress, replication): rewrite
  to mock @google/gemini-cli-core backends, use waitFor() from
  test-utils, fix afterEach spy cleanup to not reset module mocks
- Change debugLogger.log() to .debug() in useVoiceInput for internal
  tracing (prevents logSpy from catching them in tests)
@fayerman-source
Copy link
Author

PR Review: LGTM 🚀

The native voice input feature is well-architected and follows the project's strict development rules.

Key Strengths:

  • Event-Based Architecture: The use of coreEvents for transcript delivery effectively prevents React re-render cascades and infinite loops.
  • Clean Separation of Concerns: Core logic correctly resides in packages/core, while packages/cli handles UI and keypress orchestration.
  • Robust Testing: The inclusion of stress, replication, and log-volume tests ensures stability and performance.
  • Proactive Refinements: Recent changes correctly move UI-specific yields (setImmediate) out of core and improve process lifecycle management in LocalWhisperBackend.
  • Security: Use of execFile and path validation mitigates command injection risks.

Final Verification:

  • Verified that cancelRecording is now fully covered by unit tests.
  • Backend mocks in useVoiceInput.test.ts have been updated to include cancel().
  • ESLint no-floating-promises issues have been resolved.

Great work on this feature!

@SandyTao520 SandyTao520 force-pushed the main branch 2 times, most recently from 4a78a96 to 0bf3a01 Compare February 26, 2026 03:03
@fayerman-source
Copy link
Author

Fixed (re: settings dialog) — removed showInDialog: true from the root voice object in settingsSchema.ts. The object itself no longer appears in the settings dialog; only the individual child properties do.

@fayerman-source
Copy link
Author

Fixed (re: colors) — /voice help now renders via a VoiceHelp component using theme.text.accent and theme.text.primary, matching /help. The /voice status output also received the same treatment via a new VoiceStatus component — both use the themed bordered-box style.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/core Issues related to User Interface, OS Support, Core Functionality help wanted We will accept PRs from all issues marked as "help wanted". Thanks for your support! priority/p2 Important but can be addressed in a future release.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Proposal: Unified Native Voice Input Architecture (Local-First)

5 participants