
Speaker Consistency Issue in TTS Workflow (NotebookLlama) #870

Open
Rqcker opened this issue Jan 31, 2025 · 5 comments

Rqcker commented Jan 31, 2025

I have tested the TTS workflow in Step-4-TTS-Workflow.ipynb while generating a podcast with two speakers. However, I noticed that the voices generated for the same speaker are not consistent across utterances: the voice characteristics fluctuate, making it difficult to maintain a consistent tone for each speaker.
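One common way to reduce this fluctuation is to pin a single fixed voice preset per speaker and reuse it for every utterance, rather than letting each generation pick its own voice. Below is a minimal sketch of that idea; the `synthesize` helper is a hypothetical stand-in for the actual Bark/Parler-TTS call, and the preset names are illustrative Bark-style identifiers, not something taken from the NotebookLlama code:

```python
# One preset per speaker, chosen once and never changed mid-podcast.
SPEAKER_PRESETS = {
    "Speaker 1": "v2/en_speaker_6",  # Bark-style preset name (assumption)
    "Speaker 2": "v2/en_speaker_9",
}

def synthesize(text, voice_preset):
    """Stand-in for the real TTS call; returns a label instead of audio."""
    return f"<audio:{voice_preset}:{text}>"

def render_podcast(transcript):
    """transcript: ordered list of (speaker, utterance) pairs."""
    segments = []
    for speaker, utterance in transcript:
        preset = SPEAKER_PRESETS[speaker]  # same preset on every turn
        segments.append(synthesize(utterance, preset))
    return segments

transcript = [
    ("Speaker 1", "Welcome to the show."),
    ("Speaker 2", "Thanks for having me."),
    ("Speaker 1", "Let's dive in."),
]
print(render_podcast(transcript))
```

Pinning the preset keeps the speaker identity stable across turns, though as discussed below it does not fully remove tone drift within a single model's generations.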


Rqcker commented Jan 31, 2025

Although I know that NotebookLlama uses two external TTS models, Bark and Parler-TTS, I think that for basic podcast generation, two people with two fixed, distinct voices are enough, rather than a new voice for each sentence. This kind of non-fixed voice usage in the TTS configuration is indeed problematic.


init27 commented Jan 31, 2025

@Rqcker thanks for checking out the project and the flag!

This is a great catch and is indeed a problem with the current TTS models. I would recommend checking here for some notes on TTS model exploration, or trying the latest ones; Bark and Parler are not SOTA, as you might know.

Please let me know if you've any questions!


Rqcker commented Jan 31, 2025

@init27 Thanks for confirming. Maybe we can discuss how to unify the tone across the two models Meta uses. I also saw that Meta conducted a comprehensive analysis of these two models.

These two models are black boxes to us. Maybe we could concatenate each speaker's words into one long string, feed each string into its black box in a single pass while marking the conversation timestamps, and then interleave the two speakers' outputs accordingly. Maybe that way we could achieve a consistent tone?
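The batching idea above can be sketched roughly as follows. This is only an illustration of the proposed approach under my own assumptions: `synthesize_long` is a hypothetical stand-in for one long TTS pass per speaker, and nothing here is NotebookLlama's actual API.

```python
def synthesize_long(text, speaker):
    """Stand-in for one long TTS pass per speaker.

    Returns one 'audio' chunk per input line instead of real audio.
    """
    return [f"<{speaker}-audio:{line}>" for line in text.split("\n")]

def batched_render(transcript):
    """transcript: ordered list of (speaker, utterance) pairs."""
    # 1. Group utterances per speaker, remembering each one's
    #    original position in the conversation (the "timestamp").
    grouped = {}
    for idx, (speaker, utterance) in enumerate(transcript):
        grouped.setdefault(speaker, []).append((idx, utterance))

    # 2. One long synthesis pass per speaker ("one input per black box").
    timeline = [None] * len(transcript)
    for speaker, lines in grouped.items():
        long_text = "\n".join(utterance for _, utterance in lines)
        audio_chunks = synthesize_long(long_text, speaker)
        # 3. Insert each chunk back at its original position.
        for (idx, _), chunk in zip(lines, audio_chunks):
            timeline[idx] = chunk
    return timeline

transcript = [
    ("A", "Hello and welcome."),
    ("B", "Great to be here."),
    ("A", "First question."),
]
print(batched_render(transcript))
```

Whether this actually yields a consistent tone depends on the model keeping a stable voice across one long generation, which, as noted in the reply below, current models do not guarantee.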


init27 commented Jan 31, 2025

This is a great point, but unfortunately the models cannot produce long text in a consistent tone regardless of the approach; this is something I observed as well when testing them out.

Newer TTS models are slightly better at this but still don't solve the issue completely.


Rqcker commented Feb 1, 2025

@init27 Could you let me know the names if you have tried any SOTA or newer models (excluding TTSNOTES)? I saw the effort you put into Bark and Parler, but the need for consistent voice quality in podcasts still can't be met. That would help me out.

Also, the podcasts that NotebookLM generates on Google's side really are consistent. I think they use their own TTS.
