Closed
9 of 9 issues completed
Labels
enhancement (New feature or request), epic (Milestone-level tracking issue)
Description
Overview
Add support for sending commands to the agent via audio — both recorded audio files (Telegram voice messages, Discord/Slack file uploads) and real-time audio streaming where supported by provider APIs.
Motivation
Voice input is a natural interaction mode for AI agents. Telegram, Discord, and Slack all support voice/audio messages natively, but Zeph currently ignores all non-text content. Adding audio support unlocks hands-free interaction and expands accessibility.
Architecture
Audio input follows a two-stage pipeline:
Audio Source → Transcription (STT) → Text → Agent Loop (unchanged)
Transcription backends (pluggable):
- OpenAI Whisper API — cloud, high accuracy, 25 MB file limit
- Local Whisper (candle/whisper.cpp) — offline, no API costs, GPU optional
- Native multimodal — Claude and GPT-4o accept audio natively (no separate STT step)
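As a rough illustration of what "pluggable" could mean here, the sketch below picks a backend for a single audio file. The names `SttBackend`, `choose_backend`, and `WHISPER_API_MAX_BYTES` are hypothetical and do not come from the Zeph codebase. The fallback policy (switch to local Whisper when a file exceeds the cloud limit) is an assumption, as is reading "25 MB" as 25 × 1024 × 1024 bytes.

```rust
/// Assumed byte value for the 25 MB file limit listed for the OpenAI Whisper API.
const WHISPER_API_MAX_BYTES: u64 = 25 * 1024 * 1024;

/// Hypothetical enumeration of the three backends named in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SttBackend {
    WhisperApi,
    LocalWhisper,
    NativeMultimodal,
}

/// Pick a backend for one audio file of `size_bytes`.
///
/// Assumed policy: honor the preferred backend, fall back from the cloud API
/// to local Whisper when the file is too large, and return `None` when no
/// usable backend remains.
pub fn choose_backend(
    preferred: SttBackend,
    size_bytes: u64,
    local_available: bool,
) -> Option<SttBackend> {
    match preferred {
        // File is over the cloud limit: only local Whisper can take it.
        SttBackend::WhisperApi if size_bytes > WHISPER_API_MAX_BYTES => {
            if local_available {
                Some(SttBackend::LocalWhisper)
            } else {
                None
            }
        }
        // Local Whisper was requested but is not compiled in or not loaded.
        SttBackend::LocalWhisper if !local_available => None,
        // Everything else is accepted as requested.
        other => Some(other),
    }
}
```

A real implementation would more likely drive this choice from the configuration section tracked in the sub-issues than from a hard-coded policy.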
Key design decisions:
- Audio is transcribed to text before entering the agent loop — minimal changes to core
- ChannelMessage gains an attachments: Vec<Attachment> field
- A new SpeechToText trait abstracts transcription backends
- Channel adapters extract audio from platform-specific message types
- Transcription happens at the channel boundary, not in the agent
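The design decisions above can be sketched roughly as follows. Only `ChannelMessage`, `Attachment`, the `attachments: Vec<Attachment>` field, and the `SpeechToText` trait name come from this issue. Every other field, method signature, and helper (`AttachmentKind`, `SttError`, `resolve_text`, `FixedTranscript`) is a hypothetical placeholder. The trait is kept synchronous for brevity, whereas a real backend that calls a network API would probably be async.

```rust
use std::fmt;

/// Hypothetical attachment classification; the actual variants are not specified here.
#[derive(Debug, Clone, PartialEq)]
pub enum AttachmentKind {
    Audio,
    Other,
}

/// Assumed shape of an attachment: raw bytes plus a MIME type.
#[derive(Debug, Clone)]
pub struct Attachment {
    pub kind: AttachmentKind,
    pub mime_type: String,
    pub data: Vec<u8>,
}

/// `ChannelMessage` with the new `attachments` field; `text` is an assumed existing field.
#[derive(Debug, Clone)]
pub struct ChannelMessage {
    pub text: String,
    pub attachments: Vec<Attachment>,
}

/// Hypothetical error type for transcription failures.
#[derive(Debug)]
pub struct SttError(pub String);

impl fmt::Display for SttError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "transcription failed: {}", self.0)
    }
}

impl std::error::Error for SttError {}

/// Abstraction over transcription backends (signature is an assumption).
pub trait SpeechToText {
    fn transcribe(&self, audio: &[u8], mime_type: &str) -> Result<String, SttError>;
}

/// Runs at the channel boundary: transcribes every audio attachment and
/// returns plain text, so the agent loop only ever sees a string.
pub fn resolve_text(msg: &ChannelMessage, stt: &dyn SpeechToText) -> Result<String, SttError> {
    let mut parts = Vec::new();
    if !msg.text.is_empty() {
        parts.push(msg.text.clone());
    }
    for att in msg.attachments.iter().filter(|a| a.kind == AttachmentKind::Audio) {
        parts.push(stt.transcribe(&att.data, &att.mime_type)?);
    }
    Ok(parts.join("\n"))
}

/// Stand-in backend that returns a fixed transcript; useful only for tests.
pub struct FixedTranscript(pub &'static str);

impl SpeechToText for FixedTranscript {
    fn transcribe(&self, audio: &[u8], _mime_type: &str) -> Result<String, SttError> {
        if audio.is_empty() {
            return Err(SttError("empty audio".to_string()));
        }
        Ok(self.0.to_string())
    }
}
```

A channel adapter would build a `ChannelMessage` from the platform payload and call `resolve_text` before handing the result to the agent loop. Because the agent only receives the returned string, the core stays unchanged.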
Sub-issues
- #521 multimodal: extend ChannelMessage and MessagePart with attachment support
- #522 audio: add SpeechToText trait and OpenAI Whisper backend
- #523 audio: add local Whisper backend via candle (feature-gated)
- #524 audio: Discord voice message and audio attachment handling
- #525 audio: Slack audio file upload handling
- #526 audio: Telegram voice message and audio file handling
- #527 audio: real-time streaming STT support
- #528 audio: configuration section and documentation
- #529 audio: CLI audio file input support
Non-goals (v1)
- Text-to-speech (TTS) output
- Voice call support (phone, WebRTC)
- Video input