Context & Problem
There is significant community interest in Voice Input for Gemini CLI, currently fragmented across multiple discussions:
- Enhance Gemini CLI with voice interaction capabilities. #13798: Core feature request for STT/TTS (currently suggests "wrapper scripts").
- Audio input instead of typing #1982: General "Voice Mode" discussion (leans towards external MCP servers).
- MAJOR UPDATE ( I DID ) #16461: Requests for voice loops and complex backend overhauls.
Current Gap: The ecosystem lacks a lightweight, privacy-first, and native solution. Users currently have to choose between complex MCP setups, cloud-based dependencies (OpenAI API), or external wrapper scripts that break the CLI's native UX.
The Proposal
I propose adding a native Voice Input hook directly into the core CLI (packages/cli).
Architecture:
- Local-First: Uses standard system binaries (`sox` on macOS, `arecord` on Linux) for capture.
- Privacy: Transcription is handled by a local `whisper` binary (configurable path), ensuring no audio leaves the user's machine.
- Integrated UX:
  - Toggles via `Alt+V`, `Ctrl+Q`, or `/voice` slash command.
  - Visual status indicator (`🎤 Recording...`) directly in the `InputPrompt` header.
  - Inserts text at cursor position (maintaining editability).
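To make the capture and insertion steps concrete, here is a minimal sketch of how they could look. It is not the code from my branch: the helper names `recorderCommand` and `insertAtCursor` are hypothetical, and the specific `sox`/`arecord` flags (16 kHz, mono, 16-bit WAV) are an assumption about a reasonable default for speech transcription.

```typescript
// Hypothetical helpers for illustration only; names and flag choices are
// assumptions, not taken from the proposed branch.
interface RecorderCommand {
  cmd: string;
  args: string[];
}

// Pick a capture binary for the given platform (16 kHz, mono, 16-bit WAV).
function recorderCommand(platform: string, outFile: string): RecorderCommand {
  switch (platform) {
    case "darwin":
      // sox: -d reads from the default audio device; the format options
      // that follow apply to the output file.
      return { cmd: "sox", args: ["-d", "-r", "16000", "-c", "1", "-b", "16", outFile] };
    case "linux":
      // arecord: -f sample format, -r sample rate, -c channels, -t file type.
      return {
        cmd: "arecord",
        args: ["-f", "S16_LE", "-r", "16000", "-c", "1", "-t", "wav", outFile],
      };
    default:
      throw new Error(`Voice capture is not supported on platform: ${platform}`);
  }
}

// Insert transcribed text at the cursor, returning the new buffer and the
// cursor position just after the inserted text.
function insertAtCursor(
  buffer: string,
  cursor: number,
  text: string,
): { buffer: string; cursor: number } {
  // Clamp the cursor so an out-of-range value cannot corrupt the buffer.
  const pos = Math.max(0, Math.min(cursor, buffer.length));
  return {
    buffer: buffer.slice(0, pos) + text + buffer.slice(pos),
    cursor: pos + text.length,
  };
}
```

In a real integration the returned command would presumably be passed to `child_process.spawn` and stopped when the user toggles recording off. Keeping both helpers pure makes them straightforward to unit test without a microphone.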
Implementation Status
I have fully implemented and verified this architecture.
My branch includes:
- ✅ `useVoiceInput` React hook for process management.
- ✅ `VoiceContext` for global state.
- ✅ Unit tests for recording logic and shortcut bindings.
- ✅ Optimized low-latency polling for transcription.
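For reviewers who want a mental model of the toggle flow before seeing the branch, the state handling can be sketched as a small reducer. This is an illustrative assumption about the shape of the state, not the actual contents of `useVoiceInput` or `VoiceContext`; the names `VoiceState`, `VoiceEvent`, and `voiceReducer` are hypothetical.

```typescript
// Hypothetical state model for illustration; not the branch's actual types.
type VoiceStatus = "idle" | "recording" | "transcribing";

interface VoiceState {
  status: VoiceStatus;
  transcript: string | null;
  error: string | null;
}

type VoiceEvent =
  | { type: "toggle" }
  | { type: "transcribed"; text: string }
  | { type: "failed"; message: string };

const initialVoiceState: VoiceState = { status: "idle", transcript: null, error: null };

function voiceReducer(state: VoiceState, event: VoiceEvent): VoiceState {
  switch (event.type) {
    case "toggle":
      // First toggle starts recording; second stops it and begins transcription.
      if (state.status === "idle") {
        return { status: "recording", transcript: null, error: null };
      }
      if (state.status === "recording") {
        return { ...state, status: "transcribing" };
      }
      return state; // ignore toggles while a transcription is in flight
    case "transcribed":
      // Only accept a transcript if one is actually pending.
      if (state.status !== "transcribing") return state;
      return { status: "idle", transcript: event.text.trim(), error: null };
    case "failed":
      // Any failure (capture or transcription) returns to idle with a message.
      return { status: "idle", transcript: null, error: event.message };
  }
}
```

A React hook could wrap this with `useReducer` and drive the `🎤 Recording...` indicator from `status`, while the recorder and transcription processes dispatch `transcribed` or `failed` when they finish.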
Request
I believe this implementation solves the core requirement of #13798 while adhering to the privacy and performance standards of the project.
Is the team open to a PR for this native integration? I am ready to push the branch immediately.