
Added the ability to use guide tokens for OuteTTS, greatly improving TTS recitation accuracy over long input sequences. #11186

Merged
3 commits merged into master on Jan 18, 2025

Conversation

LostRuins (Collaborator)

@ggerganov @edwko

ref: #10784 (comment)

Hi, I have found a technique that greatly improves the accuracy of OuteTTS when handling long inputs by using guide tokens.

Initial Problem: When generating TTS for a long input, the model starts to get confused and outputs gibberish. The longer the input prompt gets, the worse the output. This occurs because, as a small LLM, the model has imperfect recall - it's required to recite back the input tokens in sequence but occasionally gets them wrong - it simply predicts the wrong token every so often, which cascades down into incoherence the longer the output gets.

Consider this example, taken from the readme.

The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud. Plain C/C++ implementation without any dependencies. Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks. AVX, AVX2, AVX512 and AMX support for x86 architectures. 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use. Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads MTT GPUs via MUSA). Vulkan and SYCL backend support. CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity.

before.mp4

The model gets the beginning correct, but eventually stumbles, and then rapidly degrades into rubbish output.

Solution: We add an optional flag --tts-use-guide-tokens. When enabled, the narrated text is split into individual words, and the start token ID of each word is "enforced" onto the output at word boundaries. This essentially forms a sort of dynamic grammar, steering the first text token of each output word to be correct and leading to correct audio data tokens as a result. Even if the model stumbles, it has the capacity to self-correct at the next word instead of going delulu.

Here's the same text narrated with --tts-use-guide-tokens enabled.

after.mp4

The end result is nearly perfect recitation over any length of text.
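To make the mechanism concrete, here is a minimal sketch of the guide-token idea in llama.cpp-style C++. This is an illustration rather than the exact code in examples/tts/tts.cpp; it assumes llama.cpp's common_tokenize and llama_sampler_sample helpers, and 198 stands for the OuteTTS word-boundary (newline) token discussed further down.

```cpp
// Minimal sketch of the guide-token approach (illustrative; not the exact code in examples/tts/tts.cpp).
// Assumes llama.cpp's common_tokenize() and llama_sampler_sample() helpers;
// 198 is the OuteTTS word-boundary (newline) token.
#include "common.h"
#include "llama.h"

#include <string>
#include <vector>

// Collect the first token of every word in the narrated text.
static std::vector<llama_token> build_guide_tokens(const llama_vocab * vocab, const std::string & text) {
    std::vector<llama_token> guides;
    size_t pos = 0;
    while (pos < text.size()) {
        size_t end = text.find(' ', pos);
        if (end == std::string::npos) {
            end = text.size();
        }
        const std::string word = text.substr(pos, end - pos);
        if (!word.empty()) {
            const std::vector<llama_token> toks = common_tokenize(vocab, word, false, true);
            if (!toks.empty()) {
                guides.push_back(toks[0]); // only the first token of each word is enforced
            }
        }
        pos = end + 1;
    }
    return guides;
}

// During generation: if the previously emitted token was the word-boundary token,
// override whatever was sampled with the next guide token; otherwise keep the sampled token.
static llama_token sample_with_guidance(llama_sampler * smpl, llama_context * ctx,
                                        std::vector<llama_token> & guides, llama_token prev) {
    llama_token tok = llama_sampler_sample(smpl, ctx, -1);
    if (prev == 198 && !guides.empty()) {
        tok = guides.front();
        guides.erase(guides.begin());
    }
    return tok;
}
```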

@LostRuins LostRuins requested a review from ggerganov January 11, 2025 09:10
@LostRuins LostRuins mentioned this pull request Jan 11, 2025
LostRuins added a commit to LostRuins/koboldcpp that referenced this pull request Jan 11, 2025
@LostRuins LostRuins added the Review Complexity : Medium label Jan 11, 2025
@LostRuins LostRuins requested a review from ngxson January 11, 2025 09:28
@ggerganov (Owner)

On first look this seems very clever. I was originally thinking that the model inherently re-generating the input tokens is a limitation as it reduces the effective context size. But now it seems that it allows to do this kind of forced-guidance which might be a big advantage. I'm also curious about @edwko's thoughts on this.

I think the current implementation assumes that each word consists of a single token, which is not the case, and the forced-guidance logic would break after the first multi-token word.

@LostRuins (Collaborator, Author)

LostRuins commented Jan 11, 2025

On first look this seems very clever. I was originally thinking that the model inherently re-generating the input tokens is a limitation as it reduces the effective context size. But now it seems that it allows to do this kind of forced-guidance which might be a big advantage. I'm also curious about @edwko's thoughts on this.

I think the current implementation assumes that each word consists of a single token, which is not the case, and the forced-guidance logic would break after the first multi-token word.

It doesn't break, and in fact there are multiple words in the above example that comprise more than one token. The idea is just to give the model a gentle "nudge" in the right direction.

For example, the snippet CUDA kernels for has the word kernels, which tokenizes to [74 (k), 42329 (ernels)]. The forced guide token in this case is only 74 (k), and the LLM is free to continue after that (it's much more likely to be correct). It might guess k + in (258) = "kin" instead, but that's fine because you just end up with one wrong word - the next guide token will still be 1958 (for), which will lead the output back on track. The guide token only ever triggers at the start of a word, right after the word-boundary token (id==198), not after every token. So multi-token sequences will work fine.

Illustrating another example

...
151670 = <|code_end|>
198 = [boundary] (trigger guide token next)
265 = re (forced token)
36369 = placed
155823 = <|t_0.51|>
151669 = <|code_start|>
152429 = <|757|>
152338 = <|666|>
...

A more complex algorithm might queue all the forced tokens of a word instead of just the first, but either approach will work reasonably well.
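For completeness, here is a rough sketch of that queued variant, under the same assumptions as the earlier sketch (word-boundary token 198, llama.cpp sampling helpers); the word_tokens and pending containers are hypothetical bookkeeping, not part of this PR.

```cpp
// Sketch of the "queue every token of the word" variant (illustrative only; not what this PR implements).
// word_tokens holds the full tokenization of each remaining word; pending holds the
// tokens still to be forced for the word currently being recited.
#include "llama.h"

#include <deque>
#include <vector>

static llama_token sample_with_full_word_guidance(llama_sampler * smpl, llama_context * ctx,
                                                  std::vector<std::vector<llama_token>> & word_tokens,
                                                  std::deque<llama_token> & pending, llama_token prev) {
    if (prev == 198 && !word_tokens.empty()) {
        // a new word is starting: queue all of its tokens, not just the first
        pending.assign(word_tokens.front().begin(), word_tokens.front().end());
        word_tokens.erase(word_tokens.begin());
    }
    if (!pending.empty()) {
        const llama_token tok = pending.front(); // force the next queued token
        pending.pop_front();
        return tok;
    }
    return llama_sampler_sample(smpl, ctx, -1);  // nothing pending: sample normally
}
```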

@edwko
edwko commented Jan 11, 2025

This is really clever, never actually thought of doing forced insertion like this! :)

I think the current implementation assumes that each word consists of a single token, which is not the case and the forced-guidance logic would break after the first multi-token word.

No, the model can and does handle multi-token words; it uses the exact same tokenizer as the base model. At the word level, it works like this: a single word (or multi-token word) → [word tokens, e.g., 100, 101, 102] → [time token] → [audio codes]. I also added a special space token between words because, for example, "hello" and " hello" have different tokens. This keeps things consistent and avoids confusing the model.
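A small sketch of this per-word prompt assembly with an explicit separator (space_token_id is a placeholder; OuteTTS defines its own special token for it, and common_tokenize is assumed from llama.cpp's helpers):

```cpp
// Illustrative sketch of building a per-word text prompt with a dedicated space token
// instead of leading spaces, so "hello" never becomes the differently-tokenized " hello".
// space_token_id is a placeholder for OuteTTS's actual special token.
#include "common.h"
#include "llama.h"

#include <string>
#include <vector>

static std::vector<llama_token> build_word_prompt(const llama_vocab * vocab,
                                                  const std::vector<std::string> & words,
                                                  llama_token space_token_id) {
    std::vector<llama_token> prompt;
    for (size_t i = 0; i < words.size(); ++i) {
        const std::vector<llama_token> toks = common_tokenize(vocab, words[i], false, true);
        prompt.insert(prompt.end(), toks.begin(), toks.end());
        if (i + 1 < words.size()) {
            prompt.push_back(space_token_id); // explicit separator keeps word tokenization consistent
        }
    }
    // per word, the model then produces: [time token] followed by [audio codes]
    return prompt;
}
```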

As I mentioned in another PR comment, in your example the model goes off at around 15 seconds, plus the speaker reference, so around 25-30 seconds. That aligns with that middle range. Just as a test, you can see how it performs without a speaker reference:

With speaker:
"The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud. Plain C/C++ implementation without any dependencies. Apple silicon"

Without a speaker reference it gets this far correctly:
"The main goal of llama.cpp is to enable LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware - locally and in the cloud. Plain C/C++ implementation without any dependencies. Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks. AVX, AVX2, AVX512 and AMX support"

output_test.mp4

@LostRuins (Collaborator, Author)

Btw @edwko, another issue I'm facing right now is the lack of punctuation causing awkward pauses in the wrong places - is that a limitation of OuteTTS, or is it just not yet implemented here?

@edwko
edwko commented Jan 11, 2025

Btw @edwko, another issue I'm facing right now is the lack of punctuation causing awkward pauses in the wrong places - is that a limitation of OuteTTS, or is it just not yet implemented here?

Versions 0.1 and 0.2 don't have punctuation support, but it's coming soon in the next release, 0.3, where I've added punctuation support, so this issue will be resolved. :)

@LostRuins (Collaborator, Author)

Is there still any interest in this PR? I can update it for the vocab refactor if desired.

@ggerganov (Owner) left a comment


Yes, let's merge it after updating to latest master.

examples/tts/tts.cpp: review threads (outdated, resolved)
@LostRuins LostRuins requested a review from ggerganov January 18, 2025 04:53
@LostRuins
Copy link
Collaborator Author

LostRuins commented Jan 18, 2025

Alright, made the vocab updates and linting changes; I think it's ready to merge if all is good.

@ggerganov ggerganov merged commit 6390a99 into master Jan 18, 2025
50 checks passed
@ggerganov ggerganov deleted the cedo/tts-guide-tokens branch January 18, 2025 10:21
anagri pushed a commit to BodhiSearch/llama.cpp that referenced this pull request Jan 26, 2025
* Added the ability to use guide tokens for OuteTTS, greatly improving TTS recitation accuracy over long input sequences.

* applied linting suggestions, updated to latest llama_vocab changes, added a safety check, added newline to guide token start