
Speaker Consistency Issue in TTS Workflow (NotebookLlama) #870

Open
Rqcker opened this issue Jan 31, 2025 · 5 comments

Rqcker commented Jan 31, 2025

I have tested the TTS workflow in Step-4-TTS-Workflow.ipynb while generating a podcast with two speakers. However, I noticed that the voices generated for the same speaker are not consistent across utterances: the voice characteristics fluctuate, making it difficult to maintain a consistent tone for each speaker.
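One common way to reduce this fluctuation is to pin a single fixed voice preset per speaker and reuse it for every utterance, rather than letting each generation pick its own voice. Below is a minimal sketch of that idea; the `synthesize` helper is a hypothetical stand-in for the actual Bark/Parler-TTS call, and the preset names are illustrative Bark-style identifiers, not something taken from the NotebookLlama code:

```python
# One preset per speaker, chosen once and never changed mid-podcast.
SPEAKER_PRESETS = {
    "Speaker 1": "v2/en_speaker_6",  # Bark-style preset name (assumption)
    "Speaker 2": "v2/en_speaker_9",
}

def synthesize(text, voice_preset):
    """Stand-in for the real TTS call; returns a label instead of audio."""
    return f"<audio:{voice_preset}:{text}>"

def render_podcast(transcript):
    """transcript: ordered list of (speaker, utterance) pairs."""
    segments = []
    for speaker, utterance in transcript:
        preset = SPEAKER_PRESETS[speaker]  # same preset on every turn
        segments.append(synthesize(utterance, preset))
    return segments

transcript = [
    ("Speaker 1", "Welcome to the show."),
    ("Speaker 2", "Thanks for having me."),
    ("Speaker 1", "Let's dive in."),
]
print(render_podcast(transcript))
```

Pinning the preset keeps the speaker identity stable across turns, though as discussed below it does not fully remove tone drift within a single model's generations.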


Rqcker commented Jan 31, 2025

Although I know that NotebookLlama uses two external TTS models, Bark and Parler-TTS, I think that for basic podcast generation, two people with two fixed, distinct voices are enough, rather than a new voice for each sentence. This kind of non-fixed voice usage in the TTS configuration is indeed problematic.


init27 commented Jan 31, 2025

@Rqcker thanks for checking out the project and the flag!

This is a great catch and is indeed a problem with the current TTS models. I would recommend checking here for some notes on TTS model exploration, or trying the latest ones; Bark and Parler are not SOTA, as you might know.

Please let me know if you've any questions!


Rqcker commented Jan 31, 2025

@init27 Thanks for confirming. Maybe we can discuss how to unify the tone across the two models Meta uses. I also saw that Meta conducted a comprehensive analysis of these two models.

These two models are black boxes to us. Maybe we could concatenate each speaker's words into one long string, feed each string into its black box in a single pass while marking the conversation timestamps, and then interleave the two speakers' outputs accordingly. Maybe that way we could achieve a consistent tone?
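The batching idea above can be sketched roughly as follows. This is only an illustration of the proposed approach under my own assumptions: `synthesize_long` is a hypothetical stand-in for one long TTS pass per speaker, and nothing here is NotebookLlama's actual API.

```python
def synthesize_long(text, speaker):
    """Stand-in for one long TTS pass per speaker.

    Returns one 'audio' chunk per input line instead of real audio.
    """
    return [f"<{speaker}-audio:{line}>" for line in text.split("\n")]

def batched_render(transcript):
    """transcript: ordered list of (speaker, utterance) pairs."""
    # 1. Group utterances per speaker, remembering each one's
    #    original position in the conversation (the "timestamp").
    grouped = {}
    for idx, (speaker, utterance) in enumerate(transcript):
        grouped.setdefault(speaker, []).append((idx, utterance))

    # 2. One long synthesis pass per speaker ("one input per black box").
    timeline = [None] * len(transcript)
    for speaker, lines in grouped.items():
        long_text = "\n".join(utterance for _, utterance in lines)
        audio_chunks = synthesize_long(long_text, speaker)
        # 3. Insert each chunk back at its original position.
        for (idx, _), chunk in zip(lines, audio_chunks):
            timeline[idx] = chunk
    return timeline

transcript = [
    ("A", "Hello and welcome."),
    ("B", "Great to be here."),
    ("A", "First question."),
]
print(batched_render(transcript))
```

Whether this actually yields a consistent tone depends on the model keeping a stable voice across one long generation, which, as noted in the reply below, current models do not guarantee.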


init27 commented Jan 31, 2025

This is a great point, but unfortunately the models cannot produce long text in a consistent tone regardless of the approach; this is something I observed as well when testing them out.

Newer TTS models are slightly better at this but still don't solve the issue completely.


Rqcker commented Feb 1, 2025

@init27 Could you let me know the names if you have tried any SOTA or newer models (excluding TTSNOTES)? I saw the effort you put into Bark and Parler, but the need for consistent voice quality in podcasts still can't be met. That would help me out.

Also, the podcasts that NotebookLM generates on Google's side really are consistent. I think they use their own TTS.
