epic: audio input support for agent commands #520

@bug-ops

Overview

Add support for sending commands to the agent via audio — both recorded audio files (Telegram voice messages, Discord/Slack file uploads) and real-time audio streaming where supported by provider APIs.

Motivation

Voice input is a natural interaction mode for AI agents. Telegram, Discord, and Slack all support voice/audio messages natively, but Zeph currently ignores all non-text content. Adding audio support unlocks hands-free interaction and expands accessibility.

Architecture

Audio input follows a two-stage pipeline:

Audio Source → Transcription (STT) → Text → Agent Loop (unchanged)

Transcription backends (pluggable):

  1. OpenAI Whisper API — cloud, high accuracy, 25 MB file limit
  2. Local Whisper (candle/whisper.cpp) — offline, no API costs, GPU optional
  3. Native multimodal — models that accept audio input directly, e.g. GPT-4o audio models (no separate STT step); availability varies by provider and model
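One way the pluggable backends could be selected per message is sketched below. This is a minimal illustration, not Zeph's actual code: `Backend`, `Capabilities`, `choose_backend`, and the fallback order are all hypothetical names and policy. The only figure taken from the list above is the 25 MB cloud limit, and reading "MB" as 1024 * 1024 bytes is an assumption.

```rust
/// Upload ceiling for the cloud backend, taken from the 25 MB limit above.
/// Interpreting "MB" as 1024 * 1024 bytes is an assumption.
pub const CLOUD_MAX_BYTES: usize = 25 * 1024 * 1024;

/// The three backend families named in the issue (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    WhisperApi,
    LocalWhisper,
    NativeMultimodal,
}

/// What the current deployment can actually do (hypothetical struct).
pub struct Capabilities {
    pub local_whisper_available: bool,
    pub provider_accepts_audio: bool,
}

/// Return the preferred backend if it can handle this audio, otherwise the
/// first usable one in a fixed fallback order. Returns `None` when no
/// backend can take the input. The fallback order is an illustrative policy.
pub fn choose_backend(
    preferred: Backend,
    audio_len: usize,
    caps: &Capabilities,
) -> Option<Backend> {
    let usable = |b: Backend| match b {
        Backend::WhisperApi => audio_len <= CLOUD_MAX_BYTES,
        Backend::LocalWhisper => caps.local_whisper_available,
        Backend::NativeMultimodal => caps.provider_accepts_audio,
    };
    let order = [
        preferred,
        Backend::WhisperApi,
        Backend::LocalWhisper,
        Backend::NativeMultimodal,
    ];
    order.iter().copied().find(|b| usable(*b))
}
```

With this policy, a file over the cloud limit falls through to local Whisper when it is available, and the caller gets `None` (and can reply with an error) when nothing can handle the audio.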

Key design decisions:

  • Audio is transcribed to text before entering the agent loop — minimal changes to core
  • ChannelMessage gains an attachments: Vec<Attachment> field
  • A new SpeechToText trait abstracts transcription backends
  • Channel adapters extract audio from platform-specific message types
  • Transcription happens at the channel boundary, not in the agent
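The design decisions above could look roughly like the following. Only `ChannelMessage`, `attachments: Vec<Attachment>`, and `SpeechToText` come from this issue; every other field, type, and function name (`AttachmentKind`, `SttError`, `resolve_text`, `FixedTranscript`) is a placeholder for illustration. The trait is synchronous here to keep the sketch self-contained; a real implementation would likely be async.

```rust
use std::fmt;

/// Kind of payload attached to a message (hypothetical; only audio matters here).
#[derive(Debug, Clone, PartialEq)]
pub enum AttachmentKind {
    Audio,
    Other,
}

/// A file carried alongside the message text (field names are assumptions).
#[derive(Debug, Clone)]
pub struct Attachment {
    pub kind: AttachmentKind,
    pub mime_type: String,
    pub data: Vec<u8>,
}

/// Channel message with the new `attachments` field proposed above.
#[derive(Debug, Clone)]
pub struct ChannelMessage {
    pub text: String,
    pub attachments: Vec<Attachment>,
}

/// Error type for transcription failures (hypothetical).
#[derive(Debug)]
pub struct SttError(pub String);

impl fmt::Display for SttError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "transcription failed: {}", self.0)
    }
}

impl std::error::Error for SttError {}

/// Abstraction over transcription backends.
pub trait SpeechToText {
    fn transcribe(&self, audio: &Attachment) -> Result<String, SttError>;
}

/// Runs at the channel boundary: merges typed text with transcripts of any
/// audio attachments, so the agent loop only ever receives a plain string.
pub fn resolve_text(
    msg: &ChannelMessage,
    stt: &dyn SpeechToText,
) -> Result<String, SttError> {
    let mut parts: Vec<String> = Vec::new();
    let typed = msg.text.trim();
    if !typed.is_empty() {
        parts.push(typed.to_string());
    }
    for att in msg.attachments.iter().filter(|a| a.kind == AttachmentKind::Audio) {
        parts.push(stt.transcribe(att)?);
    }
    Ok(parts.join("\n"))
}

/// Stand-in backend for tests: always returns the same transcript.
pub struct FixedTranscript(pub &'static str);

impl SpeechToText for FixedTranscript {
    fn transcribe(&self, _audio: &Attachment) -> Result<String, SttError> {
        Ok(self.0.to_string())
    }
}
```

Because `resolve_text` returns a `String`, the agent loop's signature does not need to change, which matches the "minimal changes to core" goal; non-audio attachments are simply skipped in this sketch.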

Non-goals (v1)

  • Text-to-speech (TTS) output
  • Voice call support (phone, WebRTC)
  • Video input

Labels: enhancement, epic