
Conversation

@plumber0
Contributor

@plumber0 plumber0 commented Dec 6, 2025

Summary

Add streaming support for Gemini TTS via Google Cloud TTS API's model_name parameter.

Fixes #3864

Motivation

My company is launching a product using Gemini TTS in the next two weeks.
We initially implemented a separate module internally to work around the streaming limitation.
However, I noticed the discussion in #3864 about potentially integrating Gemini streaming into google.TTS rather than keeping it separate, so I decided to implement it this way to contribute back to the community.

I'm a big fan of LiveKit Agents and would love to see this feature land. If there are any concerns about backward compatibility, additional testing needs, or code style adjustments, I'm more than willing to iterate and do the work required to get this merged.

Approach

This PR adds Gemini TTS streaming support through the Cloud TTS API (not the Gemini AI API).

  • The existing beta.GeminiTTS uses google.genai (no streaming)
  • This enhancement uses google.cloud.texttospeech with model_name parameter (streaming supported)

Both approaches can coexist - users can choose based on their needs:

  • Use beta.GeminiTTS for Gemini AI API with instruction-based style control
  • Use TTS(model_name="gemini-2.5-flash-tts") for streaming support

Backward Compatibility

  • All new parameters have default values (NOT_GIVEN)
  • Existing code using google.TTS() continues to work unchanged
  • The model property returns "Chirp3" when model_name is not set
  • No breaking changes
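
The fallback described above can be sketched as a tiny standalone model (illustrative only, not the plugin's actual source; `NOT_GIVEN` here is a local stand-in for the plugin's sentinel):

```python
# Illustrative sketch of the described fallback: the `model` property
# reports "Chirp3" unless a model_name was supplied.
NOT_GIVEN = object()  # stand-in for the plugin's NOT_GIVEN sentinel

class TTSModelFallback:
    def __init__(self, model_name=NOT_GIVEN):
        self._model_name = model_name

    @property
    def model(self) -> str:
        # Default to the Chirp3 label when no Gemini model was requested
        return "Chirp3" if self._model_name is NOT_GIVEN else self._model_name

print(TTSModelFallback().model)                        # Chirp3
print(TTSModelFallback("gemini-2.5-flash-tts").model)  # gemini-2.5-flash-tts
```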

Changes

  • Add a model_name parameter to TTS for Gemini model support (e.g., gemini-2.5-flash-tts)
  • Add a prompt parameter for style control (applied to the first input chunk only, per the Google TTS API spec)
  • Update the model property to return the actual model name when set
  • Update update_options() to support dynamic model/prompt changes
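
The "first input chunk only" behavior for prompt can be sketched as follows (a hypothetical illustration of the described behavior; `build_requests` and its dict shape are assumptions, not the plugin's API):

```python
# Hypothetical sketch: attach the style prompt to the first streaming
# request only, as described for the Google TTS streaming API.
def build_requests(chunks, prompt=None):
    requests = []
    for i, text in enumerate(chunks):
        req = {"text": text}
        if i == 0 and prompt is not None:
            req["prompt"] = prompt  # prompt rides along with the first chunk only
        requests.append(req)
    return requests

reqs = build_requests(["Hello, ", "world."], prompt="Speak warmly")
print(reqs[0])  # {'text': 'Hello, ', 'prompt': 'Speak warmly'}
print(reqs[1])  # {'text': 'world.'}
```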

Usage Example

from livekit.plugins.google import TTS

# Gemini TTS with streaming
tts = TTS(
    model_name="gemini-2.5-flash-tts",
    voice_name="zephyr",
    language="ko-KR",
    prompt="Speak in a friendly, conversational tone",
)

Testing

Tested with gemini-2.5-flash-tts model and zephyr voice in our production environment.


Happy to address any feedback or make adjustments as needed!

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

@plumber0
Contributor Author

plumber0 commented Dec 6, 2025

CI Note: The test_summarize failure is unrelated to this PR; it requires OPENAI_API_KEY, which is not available to fork PRs. All other tests pass.

- Add model_name parameter to TTS for Gemini model support
- Add prompt parameter for style control (applied to first chunk only)
- Update model property to return actual model name when set
- Bump google-cloud-texttospeech minimum version to 2.32 (required for model_name field)
- Maintain backward compatibility with existing Chirp3 usage

Fixes livekit#3864
@plumber0 plumber0 force-pushed the feature/gemini-tts-streaming branch from f7313e5 to c756741 on December 8, 2025 at 00:56
@agent-felix
Contributor

Update: Tested all model/voice combinations locally. Below are the test script and results:

"""
Test matrix for Google TTS - Gemini and Chirp3 models
Run each test case and record results
"""
import asyncio
from livekit.plugins.google import TTS
from google.cloud import texttospeech

# Credentials file path
CREDENTIALS_FILE = "your/credential/file.json"

async def test_tts(name: str, **kwargs):
    """Test TTS configuration and print result"""
    try:
        # Always use credentials_file
        kwargs["credentials_file"] = CREDENTIALS_FILE
        tts = TTS(**kwargs)
        # Use stream() to test streaming
        stream = tts.stream()
        stream.push_text("Hello, this is a test of the text to speech system.")
        stream.end_input()
        
        chunks = 0
        async for event in stream:
            chunks += 1
        
        print(f"[PASS] {name} - {chunks} chunks received")
        return True
    except Exception as e:
        print(f"[FAIL] {name} - {e}")
        return False

async def main():
    results = []
    
    # === CHIRP3 (Backward Compatibility) ===
    print("\n=== CHIRP3 BACKWARD COMPATIBILITY ===")
    
    # Test 1: Default (no model_name) - should work as before
    results.append(await test_tts(
        "Chirp3 Default",
        # No model_name - uses Chirp3 by default
    ))
    
    # Test 2: Explicit Chirp3 voice
    results.append(await test_tts(
        "Chirp3 Explicit Voice",
        voice_name="en-US-Chirp3-HD-Charon",
        language="en-US",
    ))
    
    # Test 3: Chirp3 with speaking_rate
    results.append(await test_tts(
        "Chirp3 with speaking_rate",
        voice_name="en-US-Chirp3-HD-Charon",
        speaking_rate=1.2,
    ))
    
    # === GEMINI 2.5 FLASH TTS ===
    print("\n=== GEMINI 2.5 FLASH TTS ===")
    
    # Test 4: Gemini Flash basic
    results.append(await test_tts(
        "Gemini Flash - Kore",
        model_name="gemini-2.5-flash-tts",
        voice_name="Kore",
        language="en-US",
        audio_encoding=texttospeech.AudioEncoding.PCM,
    ))
    
    # Test 5: Gemini Flash with prompt
    results.append(await test_tts(
        "Gemini Flash - Puck with prompt",
        model_name="gemini-2.5-flash-tts",
        voice_name="Puck",
        language="en-US",
        prompt="Speak in a friendly and casual tone",
        audio_encoding=texttospeech.AudioEncoding.PCM,
    ))
    
    # Test 6: Gemini Flash different voice
    results.append(await test_tts(
        "Gemini Flash - Charon",
        model_name="gemini-2.5-flash-tts",
        voice_name="Charon",
        language="en-US",
        audio_encoding=texttospeech.AudioEncoding.PCM,
    ))
    
    # Test 7: Gemini Flash - Aoede
    results.append(await test_tts(
        "Gemini Flash - Aoede",
        model_name="gemini-2.5-flash-tts",
        voice_name="Aoede",
        language="en-US",
        audio_encoding=texttospeech.AudioEncoding.PCM,
    ))
    
    # === GEMINI 2.5 PRO TTS ===
    print("\n=== GEMINI 2.5 PRO TTS ===")
    
    # Test 8: Gemini Pro basic
    results.append(await test_tts(
        "Gemini Pro - Callirrhoe",
        model_name="gemini-2.5-pro-tts",
        voice_name="Callirrhoe",
        language="en-US",
        audio_encoding=texttospeech.AudioEncoding.PCM,
    ))
    
    # Test 9: Gemini Pro with detailed prompt
    results.append(await test_tts(
        "Gemini Pro - with detailed prompt",
        model_name="gemini-2.5-pro-tts",
        voice_name="Kore",
        language="en-US",
        prompt="You are a professional narrator. Speak clearly and confidently.",
        audio_encoding=texttospeech.AudioEncoding.PCM,
    ))
    
    # === MULTI-LANGUAGE ===
    print("\n=== MULTI-LANGUAGE ===")
    
    # Test 10: Korean
    results.append(await test_tts(
        "Gemini Flash - Korean",
        model_name="gemini-2.5-flash-tts",
        voice_name="Kore",
        language="ko-KR",
        audio_encoding=texttospeech.AudioEncoding.PCM,
    ))
    
    # Test 11: Japanese
    results.append(await test_tts(
        "Gemini Flash - Japanese",
        model_name="gemini-2.5-flash-tts",
        voice_name="Aoede",
        language="ja-JP",
        audio_encoding=texttospeech.AudioEncoding.PCM,
    ))
    
    # === EDGE CASES ===
    print("\n=== EDGE CASES ===")
    
    # Test 12: Update options at runtime
    try:
        tts = TTS(model_name="gemini-2.5-flash-tts", voice_name="Kore", credentials_file=CREDENTIALS_FILE)
        tts.update_options(voice_name="Puck", prompt="Speak slowly")
        print(f"[PASS] update_options() - voice changed to Puck")
        results.append(True)
    except Exception as e:
        print(f"[FAIL] update_options() - {e}")
        results.append(False)
    
    # Test 13: Model property returns correct value
    try:
        tts1 = TTS(credentials_file=CREDENTIALS_FILE)  # Default
        tts2 = TTS(model_name="gemini-2.5-flash-tts", voice_name="Kore", credentials_file=CREDENTIALS_FILE)
        assert tts1.model == "Chirp3", f"Expected 'Chirp3', got '{tts1.model}'"
        assert tts2.model == "gemini-2.5-flash-tts", f"Expected 'gemini-2.5-flash-tts', got '{tts2.model}'"
        print(f"[PASS] model property - Chirp3={tts1.model}, Gemini={tts2.model}")
        results.append(True)
    except Exception as e:
        print(f"[FAIL] model property - {e}")
        results.append(False)
    
    # Summary
    print(f"\n=== SUMMARY ===")
    passed = sum(results)
    total = len(results)
    print(f"Passed: {passed}/{total}")

if __name__ == "__main__":
    asyncio.run(main())

=== CHIRP3 BACKWARD COMPATIBILITY ===
[PASS] Chirp3 Default - 14 chunks received
[PASS] Chirp3 Explicit Voice - 14 chunks received
[PASS] Chirp3 with speaking_rate - 13 chunks received

=== GEMINI 2.5 FLASH TTS ===
[PASS] Gemini Flash - Kore - 27 chunks received
[PASS] Gemini Flash - Puck with prompt - 22 chunks received
[PASS] Gemini Flash - Charon - 29 chunks received
[PASS] Gemini Flash - Aoede - 25 chunks received

=== GEMINI 2.5 PRO TTS ===
[PASS] Gemini Pro - Callirrhoe - 19 chunks received
[PASS] Gemini Pro - with detailed prompt - 24 chunks received

=== MULTI-LANGUAGE ===
[PASS] Gemini Flash - Korean - 30 chunks received
[PASS] Gemini Flash - Japanese - 26 chunks received

=== EDGE CASES ===
[PASS] update_options() - voice changed to Puck
[PASS] model property - Chirp3=Chirp3, Gemini=gemini-2.5-flash-tts

=== SUMMARY ===
Passed: 13/13

@tinalenguyen
Member

Hi, thank you for your support and PR 😄! We'll take a look, could you also sign the Contributor License Agreement?

@plumber0
Contributor Author

plumber0 commented Dec 8, 2025

Thanks for the quick response! Just signed the CLA. Let me know if you need anything else.

Member

@theomonnom theomonnom left a comment


lgtm! thanks!

@theomonnom theomonnom merged commit 41fc8fd into livekit:main Dec 8, 2025
7 of 9 checks passed
@singhkushank

When will this feature go live?

meetakshay99 added a commit to meetakshay99/agents that referenced this pull request Dec 12, 2025
* main: (267 commits)
  AGT-2328: negative threshold in silero (livekit#4228)
  disable interruptions for agent greeting (livekit#4223)
  feature: GPT-5.2 support (livekit#4235)
  turn-detector: remove english model from readme (livekit#4233)
  add keep alive task for liveavatar plugin (livekit#4231)
  feat(warm-transfer): add sip_number parameter for outbound caller ID (livekit#4216)
  fix blocked send task in liveavatar plugin (livekit#4214)
  clear _q_updated right after await to avoid race conditions (livekit#4209)
  ensure playback_segments_count is consistent in the audio output chain (livekit#4211)
  fix inworld punctuation handling (livekit#4215)
  Inference: Rename fallback model name param (livekit#4202)
  fix race condition when stop background audio play handle (livekit#4197)
  fix watchfiles prevent agent prcoess exit on sigterm (livekit#4194)
  feat(google): add streaming support for Gemini TTS models (livekit#4189)
  Add LiveAvatar Stop Session API Call + README Fix (livekit#4195)
  Fallback API for Inference (livekit#4099)
  feat(rime): expand update_options to accept all TTS parameters (livekit#4095)
  mistralai models update (livekit#4156)
  fix record.exc_info is not pickable when using LogQueueHandler (livekit#4185)
  Restore otel chat message (livekit#4118)
  ...
@james-intallaga

Thanks for the contribution, but when I used the stream fix, I got this error: ❌ Critical error in entrypoint: TTS.__init__() got an unexpected keyword argument 'model_name'. Is there something I did wrong?

tts = TTS(
    model_name="gemini-2.5-flash-tts",
    voice_name="leda",
    prompt="naturally conversational",
)

The import is: from livekit.plugins.google import TTS

@agent-felix
Contributor

Thanks for trying it out, @james-intallaga! Two things:

  1. Install the latest release - this feature shipped in 1.3.7 (released today):
    pip install --upgrade livekit-plugins-google
  2. Specify PCM encoding for Gemini - Per Google's docs, Gemini streaming defaults to PCM, but the plugin defaults to OGG_OPUS (for Chirp3 compatibility). When using Gemini models, specify:
   from google.cloud import texttospeech
   
   tts = TTS(
       model_name="gemini-2.5-flash-tts",
       voice_name="Leda",
       prompt="naturally conversational",
       audio_encoding=texttospeech.AudioEncoding.PCM,
   )

I'll open a follow-up PR to auto-set PCM when a Gemini model is detected.
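
That auto-detection could look roughly like this (purely a sketch of the idea; the function names and string encodings here are assumptions, not the plugin's code):

```python
# Sketch of "auto-set PCM when a Gemini model is detected".
def is_gemini_model(model_name):
    return bool(model_name) and model_name.startswith("gemini-")

def pick_encoding(model_name, requested=None):
    # An explicitly requested encoding always wins
    if requested is not None:
        return requested
    # Gemini streaming wants PCM; Chirp3 keeps the OGG_OPUS default
    return "PCM" if is_gemini_model(model_name) else "OGG_OPUS"

print(pick_encoding("gemini-2.5-flash-tts"))  # PCM
print(pick_encoding(None))                    # OGG_OPUS
```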

@james-intallaga

Sorry, but it still does not work. Do we need a Cloud API key instead of just a Gemini API key?

@james-intallaga

I set up the Cloud API and it works now, thanks very much. The latency is still quite high (about 1-3 seconds; the LLM model is gemini 3 flash). I am not sure if that is normal.

@plumber0
Contributor Author

@james-intallaga

As far as I know, 1-3 seconds is typical for a full voice pipeline. The latency comes from multiple stages:

  • Turn Detector: ~0.5-1s (waiting to confirm user stopped speaking)
  • LLM Completion: ~1-2s (full response generation before TTS starts)
  • TTS TTFB: ~0.5-1s (time to first audio byte)

This is expected behavior for STT -> LLM -> TTS pipelines.
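
Summing the per-stage ranges above gives the sequential worst case (a back-of-the-envelope check; real pipelines overlap stages, so observed end-to-end latency can land below this sum):

```python
# Naive sequential sum of the stage latency ranges quoted above (seconds).
stages = {
    "turn_detector": (0.5, 1.0),
    "llm_completion": (1.0, 2.0),
    "tts_ttfb": (0.5, 1.0),
}
low = sum(lo for lo, _ in stages.values())
high = sum(hi for _, hi in stages.values())
print(f"sequential end-to-end: {low:.1f}s - {high:.1f}s")  # 2.0s - 4.0s
```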

Here's an example from my logs:

2025-12-19 11:53:30,152 - [METRICS:LLM] ttft=1.367s tokens=3883 (prompt=3785, completion=98, cached=2017) tokens/s=48.0 duration=2.04s
2025-12-19 11:53:30,152 - [TIMELINE:LLM] started -> +1.367s first_token -> +0.673s completed (total 2.040s)
2025-12-19 11:53:48,443 - [METRICS:TTS] ttfb=0.810s audio=25.03s chars=150 duration=18.29s streamed=True
2025-12-19 11:53:48,443 - [TIMELINE:TTS] started -> +0.810s first_audio -> +17.478s completed (total 18.288s)

In this example, TTS started after LLM completion. Once TTS starts, it streams audio chunks correctly (ttfb=0.810s, streamed=True). 1-3 seconds end-to-end is expected for voice pipelines.

@james-intallaga

Thanks for the detailed analysis, solid! It does make sense that the industry's current pipeline design carries this much latency, but hopefully a new architecture will emerge in 2026 and push it under 1 second. A voice chatbot with more than a second of latency per response will drive users away; realtime conversation will then be truly realtime.



Development

Successfully merging this pull request may close these issues.

Google TTS (Gemini) now supports streaming — any plans to integrate it into LiveKit?

7 participants