feat(google): add streaming support for Gemini TTS models #4189
Conversation
- Add model_name parameter to TTS for Gemini model support
- Add prompt parameter for style control (applied to first chunk only)
- Update model property to return actual model name when set
- Bump google-cloud-texttospeech minimum version to 2.32 (required for model_name field)
- Maintain backward compatibility with existing Chirp3 usage

Fixes livekit#3864
Force-pushed from `f7313e5` to `c756741`.
---
Update: Tested all model/voice combinations locally. Below is the test script and results:

```python
"""
Test matrix for Google TTS - Gemini and Chirp3 models
Run each test case and record results
"""
import asyncio

from livekit.plugins.google import TTS
from google.cloud import texttospeech

# Credentials file path
CREDENTIALS_FILE = "your/credential/file.json"


async def test_tts(name: str, **kwargs):
    """Test TTS configuration and print result"""
    try:
        # Always use credentials_file
        kwargs["credentials_file"] = CREDENTIALS_FILE
        tts = TTS(**kwargs)
        # Use stream() to test streaming
        stream = tts.stream()
        stream.push_text("Hello, this is a test of the text to speech system.")
        stream.end_input()
        chunks = 0
        async for event in stream:
            chunks += 1
        print(f"[PASS] {name} - {chunks} chunks received")
        return True
    except Exception as e:
        print(f"[FAIL] {name} - {e}")
        return False


async def main():
    results = []

    # === CHIRP3 (Backward Compatibility) ===
    print("\n=== CHIRP3 BACKWARD COMPATIBILITY ===")

    # Test 1: Default (no model_name) - should work as before
    results.append(await test_tts(
        "Chirp3 Default",
        # No model_name - uses Chirp3 by default
    ))

    # Test 2: Explicit Chirp3 voice
    results.append(await test_tts(
        "Chirp3 Explicit Voice",
        voice_name="en-US-Chirp3-HD-Charon",
        language="en-US",
    ))

    # Test 3: Chirp3 with speaking_rate
    results.append(await test_tts(
        "Chirp3 with speaking_rate",
        voice_name="en-US-Chirp3-HD-Charon",
        speaking_rate=1.2,
    ))

    # === GEMINI 2.5 FLASH TTS ===
    print("\n=== GEMINI 2.5 FLASH TTS ===")

    # Test 4: Gemini Flash basic
    results.append(await test_tts(
        "Gemini Flash - Kore",
        model_name="gemini-2.5-flash-tts",
        voice_name="Kore",
        language="en-US",
        audio_encoding=texttospeech.AudioEncoding.PCM,
    ))

    # Test 5: Gemini Flash with prompt
    results.append(await test_tts(
        "Gemini Flash - Puck with prompt",
        model_name="gemini-2.5-flash-tts",
        voice_name="Puck",
        language="en-US",
        prompt="Speak in a friendly and casual tone",
        audio_encoding=texttospeech.AudioEncoding.PCM,
    ))

    # Test 6: Gemini Flash different voice
    results.append(await test_tts(
        "Gemini Flash - Charon",
        model_name="gemini-2.5-flash-tts",
        voice_name="Charon",
        language="en-US",
        audio_encoding=texttospeech.AudioEncoding.PCM,
    ))

    # Test 7: Gemini Flash - Aoede
    results.append(await test_tts(
        "Gemini Flash - Aoede",
        model_name="gemini-2.5-flash-tts",
        voice_name="Aoede",
        language="en-US",
        audio_encoding=texttospeech.AudioEncoding.PCM,
    ))

    # === GEMINI 2.5 PRO TTS ===
    print("\n=== GEMINI 2.5 PRO TTS ===")

    # Test 8: Gemini Pro basic
    results.append(await test_tts(
        "Gemini Pro - Callirrhoe",
        model_name="gemini-2.5-pro-tts",
        voice_name="Callirrhoe",
        language="en-US",
        audio_encoding=texttospeech.AudioEncoding.PCM,
    ))

    # Test 9: Gemini Pro with detailed prompt
    results.append(await test_tts(
        "Gemini Pro - with detailed prompt",
        model_name="gemini-2.5-pro-tts",
        voice_name="Kore",
        language="en-US",
        prompt="You are a professional narrator. Speak clearly and confidently.",
        audio_encoding=texttospeech.AudioEncoding.PCM,
    ))

    # === MULTI-LANGUAGE ===
    print("\n=== MULTI-LANGUAGE ===")

    # Test 10: Korean
    results.append(await test_tts(
        "Gemini Flash - Korean",
        model_name="gemini-2.5-flash-tts",
        voice_name="Kore",
        language="ko-KR",
        audio_encoding=texttospeech.AudioEncoding.PCM,
    ))

    # Test 11: Japanese
    results.append(await test_tts(
        "Gemini Flash - Japanese",
        model_name="gemini-2.5-flash-tts",
        voice_name="Aoede",
        language="ja-JP",
        audio_encoding=texttospeech.AudioEncoding.PCM,
    ))

    # === EDGE CASES ===
    print("\n=== EDGE CASES ===")

    # Test 12: Update options at runtime
    try:
        tts = TTS(model_name="gemini-2.5-flash-tts", voice_name="Kore", credentials_file=CREDENTIALS_FILE)
        tts.update_options(voice_name="Puck", prompt="Speak slowly")
        print("[PASS] update_options() - voice changed to Puck")
        results.append(True)
    except Exception as e:
        print(f"[FAIL] update_options() - {e}")
        results.append(False)

    # Test 13: Model property returns correct value
    try:
        tts1 = TTS(credentials_file=CREDENTIALS_FILE)  # Default
        tts2 = TTS(model_name="gemini-2.5-flash-tts", voice_name="Kore", credentials_file=CREDENTIALS_FILE)
        assert tts1.model == "Chirp3", f"Expected 'Chirp3', got '{tts1.model}'"
        assert tts2.model == "gemini-2.5-flash-tts", f"Expected 'gemini-2.5-flash-tts', got '{tts2.model}'"
        print(f"[PASS] model property - Chirp3={tts1.model}, Gemini={tts2.model}")
        results.append(True)
    except Exception as e:
        print(f"[FAIL] model property - {e}")
        results.append(False)

    # Summary
    print("\n=== SUMMARY ===")
    passed = sum(results)
    total = len(results)
    print(f"Passed: {passed}/{total}")


if __name__ == "__main__":
    asyncio.run(main())
```

Output section headers:

```
=== CHIRP3 BACKWARD COMPATIBILITY ===
=== GEMINI 2.5 FLASH TTS ===
=== GEMINI 2.5 PRO TTS ===
=== MULTI-LANGUAGE ===
=== EDGE CASES ===
=== SUMMARY ===
```
---
Hi, thank you for your support and PR 😄! We'll take a look. Could you also sign the Contributor License Agreement?
---
Thanks for the quick response! Just signed the CLA. Let me know if you need anything else.
theomonnom left a comment:
lgtm! thanks!
---
When will this feature go live?
* main: (267 commits)
  - AGT-2328: negative threshold in silero (livekit#4228)
  - disable interruptions for agent greeting (livekit#4223)
  - feature: GPT-5.2 support (livekit#4235)
  - turn-detector: remove english model from readme (livekit#4233)
  - add keep alive task for liveavatar plugin (livekit#4231)
  - feat(warm-transfer): add sip_number parameter for outbound caller ID (livekit#4216)
  - fix blocked send task in liveavatar plugin (livekit#4214)
  - clear _q_updated right after await to avoid race conditions (livekit#4209)
  - ensure playback_segments_count is consistent in the audio output chain (livekit#4211)
  - fix inworld punctuation handling (livekit#4215)
  - Inference: Rename fallback model name param (livekit#4202)
  - fix race condition when stop background audio play handle (livekit#4197)
  - fix watchfiles prevent agent prcoess exit on sigterm (livekit#4194)
  - feat(google): add streaming support for Gemini TTS models (livekit#4189)
  - Add LiveAvatar Stop Session API Call + README Fix (livekit#4195)
  - Fallback API for Inference (livekit#4099)
  - feat(rime): expand update_options to accept all TTS parameters (livekit#4095)
  - mistralai models update (livekit#4156)
  - fix record.exc_info is not pickable when using LogQueueHandler (livekit#4185)
  - Restore otel chat message (livekit#4118)
  - ...
---
Thanks for the contribution, but when I used the stream fix, I got this error: `❌ Critical error in entrypoint: TTS.__init__() got an unexpected keyword argument 'model_name'`. Is there something I did wrong? tts=TTS(
---
Thanks for trying it out, @james-intallaga! Two things:
I'll open a follow-up PR to auto-set PCM when a Gemini model is detected.
---
Sorry, but it still does not work. Do we need a Google Cloud API key instead of just a Gemini API key?
---
I set up the Cloud API and it works now, thanks very much. The latency is still quite high (about 1-3 seconds; the LLM is gemini 3 flash). I am not sure if that is normal.
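For reference, the Cloud TTS client authenticates with Google Cloud service-account credentials rather than a Gemini API key. A typical setup looks like the following (the file path is a placeholder):

```shell
# Point the Google Cloud client libraries at a service-account key file.
# The path is a placeholder; use a key downloaded from your Cloud console project.
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
```

Alternatively, credentials can be passed explicitly via `credentials_file=...`, as in the test script earlier in this thread.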
---
As far as I know, 1-3 seconds is typical for a full voice pipeline. The latency accumulates across multiple stages:
This is expected behavior for STT -> LLM -> TTS pipelines. Here's an example from my logs: in this run, TTS started only after the LLM completed. Once TTS starts, it streams audio chunks correctly (ttfb=0.810s, streamed=True), so 1-3 seconds end-to-end is expected.
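To make the stage breakdown concrete, here is a rough latency budget. Only the `tts_ttfb` figure (0.810 s) comes from the log quoted above; the STT and LLM numbers are assumed, illustrative values.

```python
# Illustrative per-stage latency budget for one STT -> LLM -> TTS turn.
# Only tts_ttfb (0.810 s) is from the log above; the other values are assumptions.
stages = {
    "stt_endpointing": 0.50,  # waiting for end-of-speech + final transcript
    "llm_ttft": 0.70,         # time to first LLM token
    "tts_ttfb": 0.810,        # time to first TTS audio byte (from the log)
}
total = sum(stages.values())
print(f"estimated end-to-end: {total:.2f} s")  # ~2.01 s
```

Since the stages run mostly sequentially, their delays add up, which is why even a fast TTS TTFB still lands the whole turn in the 1-3 second range.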
---
Thanks for the detailed analysis, solid! It does make sense that the industry's current pipeline design carries this latency, but hopefully a new architecture will emerge in 2026 and bring it under 1 second. A voice chatbot with more than 1 second of latency per response will drive users away. Realtime conversation will then be truly realtime.
Summary
Add streaming support for Gemini TTS via the Google Cloud TTS API's `model_name` parameter. Fixes #3864
Motivation
My company is launching a product using Gemini TTS in the next two weeks.
We initially implemented a separate module internally to work around the streaming limitation.
However, I noticed the discussion in #3864 about potentially integrating Gemini streaming into `google.TTS` rather than keeping it separate, so I decided to implement it this way and contribute back to the community. I'm a big fan of LiveKit Agents and would love to see this feature land. If there are any concerns about backward compatibility, additional testing needs, or code style adjustments, I'm more than willing to iterate and do the work required to get this merged.
Approach
This PR adds Gemini TTS streaming support through the Cloud TTS API (not the Gemini AI API).

- `beta.GeminiTTS` uses `google.genai` (no streaming)
- This PR uses `google.cloud.texttospeech` with the `model_name` parameter (streaming supported)

Both approaches can coexist; users can choose based on their needs:

- `beta.GeminiTTS` for the Gemini AI API with instruction-based style control
- `TTS(model_name="gemini-2.5-flash-tts")` for streaming support

Backward Compatibility
- New parameters are optional (defaulting to `NOT_GIVEN`)
- `google.TTS()` continues to work unchanged
- The `model` property returns `"Chirp3"` when `model_name` is not set

Changes
- Add `model_name` parameter to TTS for Gemini model support (e.g., `gemini-2.5-flash-tts`)
- Add `prompt` parameter for style control (applied to first input chunk only per Google TTS API spec)
- Update `model` property to return the actual model name when set
- Extend `update_options()` to support dynamic model/prompt changes

Usage Example
```python
from livekit.plugins.google import TTS

# Gemini TTS with streaming
tts = TTS(
    model_name="gemini-2.5-flash-tts",
    voice_name="zephyr",
    language="ko-KR",
    prompt="Speak in a friendly, conversational tone",
)
```
Testing
Tested with the `gemini-2.5-flash-tts` model and the zephyr voice in our production environment.

Happy to address any feedback or make adjustments as needed!