Speaker Consistency Issue in TTS Workflow (NotebookLlama) #870
Comments
Although I know that NotebookLlama uses two external TTS models,
@Rqcker thanks for checking out the project and for the flag! This is a great catch and is indeed a problem with the current TTS models. I would recommend checking here for some notes on TTS model exploration, or trying the latest models: Bark and Parler are not the SOTA, as you might know. Please let me know if you have any questions!
@init27 Thanks for confirming. Maybe we can discuss how to unify the tone of the two models Meta uses. I also saw that Meta conducted a comprehensive analysis of these two models. These two are black boxes to us. Maybe we can turn each person's words into a long
This is a great point, but unfortunately the models cannot produce long text in a consistent tone regardless of the approach. This is something I observed as well when testing them out. Newer TTS models are slightly better at this but still don't solve the issue completely.
@init27 Could you let me know the names of any SOTA or newer models you have tried (excluding the ones in TTSNOTES)? I saw the effort you put into Bark and Parler, but they cannot deliver the consistent voice quality a podcast needs. That would help me out. Also, the podcasts generated by NotebookLM on Google's side really are consistent. I think they use their own TTS.
I have tested the TTS workflow in Step-4-TTS-Workflow.ipynb while generating a podcast with two speakers. However, I noticed that the voices generated for the same speaker are not consistent across different utterances. The voice characteristics fluctuate from one utterance to the next, making it difficult to maintain the same tone for each speaker.
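To put a number on the fluctuation rather than judging it by ear, one option is to compare speaker embeddings across utterances. The snippet below is only a rough sketch, not something taken from the notebook: it assumes each generated utterance has already been turned into a fixed-length embedding by some speaker-encoder of your choice (that step is not shown), and the names `speaker_consistency`, `"Speaker 1"`, and `"Speaker 2"` are placeholders. It reports the mean pairwise cosine similarity per speaker; values well below 1.0 would suggest the voice is drifting between utterances.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def speaker_consistency(embeddings_by_speaker):
    """Mean pairwise cosine similarity of utterance embeddings, per speaker.

    `embeddings_by_speaker` maps a speaker label to a list of embeddings,
    one per generated utterance. Speakers with fewer than two utterances
    are skipped because there is nothing to compare.
    """
    scores = {}
    for speaker, embeddings in embeddings_by_speaker.items():
        pairs = list(combinations(embeddings, 2))
        if not pairs:
            continue
        scores[speaker] = sum(cosine(a, b) for a, b in pairs) / len(pairs)
    return scores


# Toy embeddings (assumed data, for illustration only).
toy = {
    "Speaker 1": [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],  # identical voice
    "Speaker 2": [[1.0, 0.0], [0.0, 1.0]],              # completely different
}
scores = speaker_consistency(toy)
```

With real audio, the same function could be run on the clips produced for each speaker, which would make it easier to compare different TTS models on this issue.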