feat: Add voice input with pluggable backend (Gemini zero-install + Whisper) #18499
fayerman-source wants to merge 62 commits into google-gemini:main
- Resolve key binding conflict: Restore Alt+V for paste, use Alt+R for voice (Fixes google-gemini#1234)
- Fix import violation: Use @google/gemini-cli-core for tmpdir
- Add missing documentation for key bindings and settings
- Add tests for VoiceContext and fix types in useVoiceInput tests
- Fix Ctrl+C handling in InputPrompt (regression fix)
- Fix infinite render loop in Voice Input (AppContainer/Composer fix)
…render loop

Replace context-based transcript delivery with EventEmitter pattern:
- Remove transcript from VoiceInputState (delivered via events instead)
- Remove clearTranscript function (no longer needed)
- Add onVoiceTranscript subscription function
- InputPrompt listens for transcript events instead of reading from context
- Add useMemo to prevent context value churn

This eliminates the '6-8 frames rendered while idle' error by:
- Preventing context updates when transcript changes
- Eliminating the set/clear cycle that caused render loops
- Keeping recording state (isRecording, isTranscribing, error) in context

Fixes: VOICE_INFINITE_LOOP_ANALYSIS.md
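The event-based delivery described in this commit can be sketched roughly as follows. This is only an illustration, not the PR's code: `onVoiceTranscript` is named in the commit message, but the event name, the module-level emitter, and `emitVoiceTranscript` are assumptions made for the example.

```typescript
import { EventEmitter } from "node:events";

// Hypothetical event name; the PR does not show the real one.
const TRANSCRIPT_EVENT = "transcript";

// Module-level emitter: transcripts bypass React context entirely,
// so delivering one never triggers a context update.
const voiceEvents = new EventEmitter();

/** Subscribe to transcripts; returns an unsubscribe function. */
function onVoiceTranscript(listener: (text: string) => void): () => void {
  voiceEvents.on(TRANSCRIPT_EVENT, listener);
  return () => {
    voiceEvents.off(TRANSCRIPT_EVENT, listener);
  };
}

/** Hypothetical helper the recording layer would call when a transcript is ready. */
function emitVoiceTranscript(text: string): void {
  voiceEvents.emit(TRANSCRIPT_EVENT, text);
}

// Usage: a consumer such as InputPrompt subscribes, then cleans up.
const received: string[] = [];
const unsubscribe = onVoiceTranscript((text) => {
  received.push(text);
});
emitVoiceTranscript("hello world");
unsubscribe();
emitVoiceTranscript("ignored after unsubscribe");
```

Because there is no transcript value stored in state, there is also nothing to clear afterwards, which is why the set/clear cycle disappears.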
- Add replication test to verify no infinite render loop
- Add stress test for rapid toggle recording calls
- Add log volume test to ensure no excessive logging
Remove hardcoded --language English from whisper commands to allow auto-detection of spoken language. This prevents transliteration (Russian spoken → English letters) and correctly transcribes in the original language (Russian → Russian text).

Affected whisper variants:
- Configured whisper path
- whisper-faster
- whisper (Python package)
Summary of Changes

Hello @fayerman-source, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Gemini CLI by integrating a native voice input capability. This feature aims to streamline user interaction by enabling natural speech-to-text conversion, making the CLI more accessible and efficient. The implementation focuses on a robust, event-driven design to maintain performance and provides clear feedback to the user throughout the voice input process.
Code Review
This pull request introduces a fantastic new feature for native voice input using Whisper. The implementation is well-structured, leveraging an event-based architecture to prevent UI re-render loops, and includes comprehensive documentation and tests. However, it introduces critical Command Injection vulnerabilities by using child_process.exec with user-controlled settings and unvalidated file paths, specifically in the handling of the user-configurable whisperPath setting. This poses a high security risk, potentially leading to Remote Code Execution (RCE), and must be addressed before merging.
🔒 Security Fix Applied

Fixed the command injection vulnerability identified by @gemini-code-assist.

Changes:

Before (Vulnerable):

After (Secure):

The fix follows Node.js security best practices (2024) and prevents RCE even with malicious settings.
Replace exec() with execFile() to prevent command injection:
- Use execFile() with argument arrays instead of exec() with string concatenation
- Add validateBinaryPath() to reject paths with shell metacharacters
- Remove promisify(exec) usage entirely
- All arguments now passed as arrays, preventing shell interpretation

This addresses the security review feedback about RCE vulnerability in the whisperPath setting.
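A minimal sketch of the pattern this commit describes, under stated assumptions: `validateBinaryPath` is named in the commit message, but its body, the exact metacharacter set, and the `transcribe` wrapper below are guesses made for illustration. Only `execFile` from `node:child_process` is a real API here.

```typescript
import { execFile } from "node:child_process";

// Assumed metacharacter set; the PR's actual list is not shown.
const SHELL_METACHARACTERS = /[;&|`$<>\n\r]/;

/** Reject empty paths and paths containing shell metacharacters. */
function validateBinaryPath(path: string): string {
  if (path.length === 0 || SHELL_METACHARACTERS.test(path)) {
    throw new Error(`Refusing suspicious binary path: ${JSON.stringify(path)}`);
  }
  return path;
}

/** Hypothetical wrapper: run the configured binary on one audio file. */
function transcribe(
  whisperPath: string,
  audioFile: string,
  onDone: (err: Error | null, stdout: string) => void,
): void {
  // execFile does not spawn a shell by default, and the arguments are
  // passed as an array, so neither value is ever parsed as shell syntax.
  execFile(validateBinaryPath(whisperPath), [audioFile], (err, stdout) => {
    onDone(err, stdout);
  });
}
```

With `exec()`, a setting such as `whisper; rm -rf /` would be handed to a shell as one command string. Here it fails validation, and even a value that passed validation would only ever be treated as a single executable path.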
92776cf to 1df335c
/gemini review
Code Review
This pull request introduces a significant and well-architected voice input feature. The use of an event-based system for transcript delivery is a great choice to prevent performance issues, and the implementation shows good consideration for security and cross-platform support. I've identified two high-severity issues in the core useVoiceInput hook concerning platform portability and a potential race condition that should be addressed to ensure the feature's robustness. Overall, this is a strong contribution.
7a6a42c to c6a9409
4bed5dd to 5881c3e
6bd08ba to c6c7329
…sitivity

The hardcoded RMS threshold of 300 was cutting off whispered speech (~100-400 RMS). Lower the default to 80 (blocks near-silence, allows whispers in quiet environments) and expose it as a user setting.
- VoiceBackendOptions.silenceThreshold: configurable per-backend option; 0 disables silence detection entirely
- GeminiRestBackend: reads threshold from options (default 80) with RMS guide comment documenting typical amplitude ranges
- settingsSchema: new voice.silenceThreshold number setting (0–1000)
- VoiceInputConfig.silenceThreshold: flows from settings through hook
- AppContainer: passes voice.silenceThreshold into voiceConfig
- /voice sensitivity <0-1000>: new subcommand to set the threshold with contextual hints (whisper/moderate/loud)
- /voice (status) and /voice help: both show current sensitivity level
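To make the threshold semantics concrete, here is a small sketch of RMS-based silence detection over signed 16-bit PCM samples. The function names and the strict `<` comparison are assumptions; the commit message only states the default of 80, the old value of 300, and that 0 disables detection.

```typescript
const DEFAULT_SILENCE_THRESHOLD = 80;

/** Root-mean-square amplitude of signed 16-bit PCM samples. */
function rms(samples: Int16Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumSquares / samples.length);
}

/** True when the clip is quieter than the threshold (assumed comparison). */
function isSilent(
  samples: Int16Array,
  threshold: number = DEFAULT_SILENCE_THRESHOLD,
): boolean {
  if (threshold === 0) return false; // 0 disables silence detection entirely
  return rms(samples) < threshold;
}
```

A whispered clip with an RMS around 200 would be classified as silence under the old threshold of 300, but passes with the new default of 80.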
Extract each action handler into a named SlashCommand object and register help, enable, disable, provider, and sensitivity as subCommands on the parent voiceCommand. This surfaces autocomplete hints while the user types /voice, with 'help' listed first as the primary discovery pathway. set-path remains functional but is intentionally excluded from the hints list since it is rarely used.
…tion.md

Add entries for voice.enabled, voice.provider, voice.silenceThreshold, and voice.whisperPath with descriptions, defaults, and runtime command references (/voice enable, /voice sensitivity, /voice set-path).
@jacob314 - flagging a few updates since the last review:

Architecture changes (major):

UX fixes:

New: configurable silence detection

Docs:

Happy to iterate further based on your feedback.
Keep VOICE_INPUT test entry from our branch alongside upstream's removal of the trailing blank line before the shell commands section.
Thanks for the updates! The event-driven architecture is a big improvement. I ran a detailed review (using the …

1. Failing Tests (Blocker)

2. Test Stability / Rules Violations

The …

3. Architectural Boundary Violation in Core

In …

```ts
// Yield one macrotask so Ink can flush the isTranscribing:true state
// update and render ⏳ before the network call begins. Without this,
// React batches both state changes and the indicator never renders.
await new Promise<void>((resolve) => setImmediate(resolve));
```

While this solves the rendering issue, it leaks UI/React batching concerns into …

4. File I/O Polling in …
```ts
requiresRestart: false,
default: {},
description: 'Settings for voice input.',
showInDialog: true,
```
Remove showInDialog on line 776. The object itself should not be shown in the dialog, just its children.
Fixed — removed showInDialog: true from the parent voice object in settingsSchema.ts. Child properties that should appear in the settings dialog retain their individual flags.
```ts
if (keyMatchers[Command.ESCAPE](key)) {
  if (voiceState.isRecording) {
    void toggleRecording();
```
Why does this toggle? I would think it should cancel.
Done — added cancel() to the VoiceBackend interface and implemented it in both GeminiRestBackend and LocalWhisperBackend. It kills the recording process and discards audio without transcribing. useVoiceInput now exposes cancelRecording, and InputPrompt calls it on Escape when recording is active. Unit tests added for both the happy path and no-op case.
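The difference between stopping and cancelling can be sketched as below. Only the `VoiceBackend` name and the `cancel()` method come from the reply above; `start()`, `stop()`, the `RecorderProcess` stand-in, and the `SketchBackend` class are hypothetical and exist only to illustrate "kill the recorder and discard audio without transcribing".

```typescript
// Minimal stand-in for a recorder child process (hypothetical).
interface RecorderProcess {
  kill(): void;
}

// Hypothetical shape; only cancel() is confirmed by the PR discussion.
interface VoiceBackend {
  start(): void;
  stop(): Promise<string>; // stop recording and transcribe
  cancel(): void; // stop recording and discard the audio
}

class SketchBackend implements VoiceBackend {
  private recorder: RecorderProcess | null = null;
  private audio: Uint8Array[] = [];
  private readonly spawnRecorder: () => RecorderProcess;
  private readonly transcribe: (audio: Uint8Array[]) => Promise<string>;

  constructor(
    spawnRecorder: () => RecorderProcess,
    transcribe: (audio: Uint8Array[]) => Promise<string>,
  ) {
    this.spawnRecorder = spawnRecorder;
    this.transcribe = transcribe;
  }

  start(): void {
    this.audio = [];
    this.recorder = this.spawnRecorder();
  }

  stop(): Promise<string> {
    this.recorder?.kill();
    this.recorder = null;
    return this.transcribe(this.audio);
  }

  cancel(): void {
    if (this.recorder === null) return; // no-op when nothing is recording
    this.recorder.kill();
    this.recorder = null;
    this.audio = []; // discard: transcribe() is never called
  }
}
```

An Escape handler would then call `cancel()` while recording is active, whereas the toggle shortcut would keep calling `stop()`.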
```ts
}
```

```ts
// Voice input status
if (voiceState.isRecording) {
```
When I look closely, it appears the microphone rendered for 1 frame and then immediately went away. However, there are no errors in my console, and I am on a vanilla MacBook Pro.
The recording indicator should now render reliably. The root issue was that Ink needs a macrotask yield to flush state updates before the network call begins — setImmediate was previously in core (a layer boundary violation) and has since been moved to the UI hook's onStateChange wrapper. If you can still reproduce on Mac, please let us know.
| Action | Keys |
| ----------------------------------------------- | ------------------------- |
| Toggle voice input recording (Alt+R or Ctrl+Q). | `Alt + R`<br />`Ctrl + Q` |
Alt+R does not work on Mac. If you want it to work, you would have to make additional changes.
Once the rest of the PR is reviewed, we will need to circle back with @scidomino to figure out viable keyboard shortcuts for this, if any. We may need to launch with just /voice, as there are not many shortcuts available, at least until we support custom shortcuts.
Pressing Option+R on Mac just generates ®.
On Mac I just use the built-in dictation feature, which by default is mapped to pressing the fn key twice.
Mac users: Ctrl+Q should work in iTerm2/Warp.
@fayerman-source Pretty much every key combo is available in some environments and already taken in others.
Something to consider is that a competitor recently adopted "press-and-hold of spacebar" for a very similar feature.
- Revert unrelated DebugProfiler hunk (extensionsStarting/Stopping listeners belong in PR google-gemini#20101, not voice PR)
- Remove showInDialog from parent voice object in settingsSchema (only child properties should have it)
- Escape key now calls cancelRecording() instead of toggleRecording() so it discards audio without transcribing; adds cancel() to VoiceBackend interface and both backends
- Move setImmediate yield out of GeminiRestBackend (core) into useVoiceInput (UI): onStateChange now returns Promise<void> so the UI layer can insert the yield after isTranscribing:true
- Replace LocalWhisperBackend stat polling loop with close event listener for cleaner process lifecycle handling
- /voice help now renders VoiceHelp component styled with theme colors matching /help (accent for commands, primary for labels)
- Fix 3 failing test files (log-volume, stress, replication): rewrite to mock @google/gemini-cli-core backends, use waitFor() from test-utils, fix afterEach spy cleanup to not reset module mocks
- Change debugLogger.log() to .debug() in useVoiceInput for internal tracing (prevents logSpy from catching them in tests)
PR Review: LGTM 🚀

The native voice input feature is well-architected and follows the project's strict development rules.

Key Strengths:

Final Verification:

Great work on this feature!
4a78a96 to 0bf3a01
Fixed (re: settings dialog): removed …

Fixed (re: colors): …



Summary
This PR adds native voice input to Gemini CLI with a pluggable backend architecture — defaulting to zero-install Gemini transcription (addressing @jacob314's request), with optional local Whisper support for offline use.
Update: Gemini Backend Added (addresses @jacob314's feedback)
Per @jacob314's comment requesting a zero-install option using the Gemini API, we've added a `GeminiRestBackend` that:
- Records audio with `sox` or `arecord` (standard Linux/macOS tools, no separate install)
- Calls `generateContent` using the user's existing Gemini auth (OAuth or API key)
- Uses `gemini-2.0-flash`: no additional setup required

Features
- `Alt+R` or type `/voice` to toggle recording
- `voice.provider`: `"gemini"` (default) or `"whisper"` (local)
- `Ctrl+Q` for conflicting keybindings
- `/voice` toggle, `/voice set-path <path>`, `/voice help`

Settings
```jsonc
{
  "voice": {
    "enabled": true, // Enable voice input (default: false)
    "provider": "gemini", // "gemini" (default, zero-install) or "whisper"
    "whisperPath": "/usr/bin/whisper" // Optional: path to Whisper binary
  }
}
```

Architecture
Key design decisions:
- `EventEmitter` instead of React state to prevent re-render cascades
- `config.getContentGenerator()`: works with both OAuth and API key
- `isTogglingRef` guard prevents overlapping toggle calls

Audio Pipeline (Gemini backend)
- `sox` (cross-platform) or `arecord` (Linux fallback)
- `audio/wav` to `generateContent` with `gemini-2.0-flash`

Testing
- `VoiceContext` tests pass

Closes #18067
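For reviewers unfamiliar with the `isTogglingRef` guard listed under the key design decisions, the idea can be sketched as follows. This is an assumption-laden illustration, not the hook's code: `guardToggle` is a hypothetical helper, and the plain `{ current: false }` object stands in for what would be `useRef(false)` inside a React hook.

```typescript
/**
 * Wrap an async toggle so that calls arriving while one is still in
 * flight are dropped instead of overlapping (hypothetical helper).
 */
function guardToggle(doToggle: () => Promise<void>): () => Promise<void> {
  // Stand-in for React's useRef(false); a ref is used rather than state
  // so that flipping the flag never causes a re-render.
  const isTogglingRef = { current: false };

  return async () => {
    if (isTogglingRef.current) return; // a toggle is already running
    isTogglingRef.current = true;
    try {
      await doToggle();
    } finally {
      isTogglingRef.current = false; // always release, even on error
    }
  };
}
```

Two rapid key presses therefore start the recorder once rather than starting and immediately stopping it, which is the kind of scenario the stress test for rapid toggle calls is meant to exercise.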