tts : add basic example for text-to-speech #10173

Closed
ggerganov opened this issue Nov 4, 2024 · 5 comments · Fixed by #10784
Assignees
Labels
good first issue Good for newcomers tts Text-to-speech

Comments

@ggerganov
Member

This new model seems suitable for integration: https://github.com/edwko/OuteTTS

We should add a very minimalistic example for generating audio with it. Ideally, we will implement the (audio tokens) -> (wav) from scratch.
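The final (samples) -> (wav) step of such an example could be sketched as below. This is a minimal sketch, not code from llama.cpp: the helper names `put_le`, `put_tag`, `wav_encode_pcm16` and `wav_save` are hypothetical, and it assumes mono float samples in [-1, 1] written as 16-bit PCM in the standard 44-byte RIFF/WAVE layout. It does not cover decoding audio tokens into samples.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Append the lowest `n_bytes` bytes of `value` in little-endian order.
void put_le(std::vector<uint8_t> & out, uint32_t value, int n_bytes) {
    for (int i = 0; i < n_bytes; ++i) {
        out.push_back(static_cast<uint8_t>((value >> (8 * i)) & 0xFF));
    }
}

// Append a 4-character chunk tag such as "RIFF" or "data".
void put_tag(std::vector<uint8_t> & out, const char * tag) {
    out.insert(out.end(), tag, tag + 4);
}

// Hypothetical helper: encode mono float samples as an in-memory 16-bit PCM WAV file.
// Samples outside [-1, 1] are clamped.
std::vector<uint8_t> wav_encode_pcm16(const std::vector<float> & samples, uint32_t sample_rate) {
    assert(sample_rate > 0);

    const uint32_t n_channels  = 1;
    const uint32_t bits        = 16;
    const uint32_t block_align = n_channels * bits / 8;
    const uint32_t data_size   = static_cast<uint32_t>(samples.size()) * block_align;

    std::vector<uint8_t> out;
    out.reserve(44 + data_size);

    put_tag(out, "RIFF");
    put_le (out, 36 + data_size, 4);            // size of everything after this field
    put_tag(out, "WAVE");

    put_tag(out, "fmt ");
    put_le (out, 16, 4);                        // fmt chunk size for plain PCM
    put_le (out, 1, 2);                         // audio format 1 = PCM
    put_le (out, n_channels, 2);
    put_le (out, sample_rate, 4);
    put_le (out, sample_rate * block_align, 4); // byte rate
    put_le (out, block_align, 2);
    put_le (out, bits, 2);

    put_tag(out, "data");
    put_le (out, data_size, 4);

    for (float s : samples) {
        const float   c = std::max(-1.0f, std::min(1.0f, s));
        const int16_t q = static_cast<int16_t>(std::lround(c * 32767.0f));
        put_le(out, static_cast<uint16_t>(q), 2);
    }

    return out;
}

// Hypothetical helper: write the encoded bytes to disk; returns false on I/O failure.
bool wav_save(const std::string & path, const std::vector<float> & samples, uint32_t sample_rate) {
    const std::vector<uint8_t> bytes = wav_encode_pcm16(samples, sample_rate);
    std::ofstream file(path, std::ios::binary);
    if (!file) {
        return false;
    }
    file.write(reinterpret_cast<const char *>(bytes.data()), static_cast<std::streamsize>(bytes.size()));
    return file.good();
}
```

A call such as `wav_save("output.wav", samples, sample_rate)` would then write the file, with `sample_rate` set to whatever rate the audio decoder produces.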

@ggerganov ggerganov added good first issue Good for newcomers tts Text-to-speech labels Nov 4, 2024
@ggerganov ggerganov moved this to Todo in ggml : roadmap Nov 4, 2024
@JohannesGaessler
Collaborator

Do you have any opinions regarding if and how TTS should be integrated into the server? Directly make it part of the HTTP server? Run another server which the llama.cpp server in turn sends requests to? (The first approach would be more suitable for multimodal models I think, the second one would be more modular.)

@ggerganov
Member Author

> Do you have any opinions regarding if and how TTS should be integrated into the server?

Not yet. Seems like the biggest question is how to implement the WavTokenizer. If it's too complex, it might have to live in a separate project? With its own server? Not sure.

Pinging @PABannier as they have experience with encodec.cpp and to my understanding, WavTokenizer is something similar to Encodec?

@ngxson
Collaborator

ngxson commented Nov 4, 2024

A bit off-topic, but having some kind of audio-tokenizer.cpp inside llama.cpp would be a huge deal. It could potentially unlock whole pipelines such as TTS, speech-to-text (ASR), and speech-to-speech.

@bachittle
Contributor

> Pinging @PABannier as they have experience with encodec.cpp and to my understanding, WavTokenizer is something similar to Encodec?

The paper (https://arxiv.org/pdf/2408.16532) mentions Encodec a lot, and says it follows the same paradigm in using a VQ-GAN. It is definitely feasible to implement an audio tokenizer here.
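For illustration, the first decoding step in a vector-quantized codec is usually a codebook lookup: each audio token id selects one learned embedding row, and the resulting frames are then fed to the decoder network. The sketch below shows only that lookup. The `vq_codebook` struct and `vq_lookup` function are hypothetical, and a single row-major codebook is an assumption, not a description of WavTokenizer's or Encodec's actual layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical codebook: n_codes() rows of `dim` floats each, stored row-major.
struct vq_codebook {
    size_t             dim;
    std::vector<float> data;

    size_t n_codes() const { return dim == 0 ? 0 : data.size() / dim; }
};

// Map each token id to its codebook row.
// Returns a row-major [tokens.size() x dim] buffer of latent frames.
std::vector<float> vq_lookup(const vq_codebook & cb, const std::vector<int32_t> & tokens) {
    std::vector<float> out;
    out.reserve(tokens.size() * cb.dim);

    for (int32_t t : tokens) {
        // Token ids must index a valid codebook row.
        assert(t >= 0 && static_cast<size_t>(t) < cb.n_codes());
        const float * row = cb.data.data() + static_cast<size_t>(t) * cb.dim;
        out.insert(out.end(), row, row + cb.dim);
    }

    return out;
}
```

The convolutional decoder that turns these frames into waveform samples is the part that would need the actual model weights and is not sketched here.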

@PABannier
Contributor

PABannier commented Nov 6, 2024

+1 to @ngxson, tokens to WAV is a big step that bridges the gap between LLMs and TTS models. Encodec is one of those models, and a lot of neural codecs are derived from Encodec (see Vocos for example).

Happy to explain in greater detail what I did and help integrate Encodec (or a similar model) into llama.cpp. As an example of how Encodec integrates after LLMs, you can check Bark.cpp.

FYI, I'm in the process of upstreaming a bunch of Metal kernels to ggml which come in very handy for supporting Encodec (ggml_conv_transpose_1d, ggml_elu, ggml_argmax, ggml_set_i32, etc.).

@ggerganov ggerganov self-assigned this Dec 11, 2024
@ggerganov ggerganov moved this from Todo to In Progress in ggml : roadmap Dec 11, 2024
@ggerganov ggerganov moved this from In Progress to Done in ggml : roadmap Dec 23, 2024