Added the ability to use guide tokens for OuteTTS, greatly improving TTS recitation accuracy over long input sequences. #11186
Conversation
…TTS recitation accuracy over long input sequences.
On first look this seems very clever. I was originally thinking that the model inherently re-generating the input tokens is a limitation as it reduces the effective context size. But now it seems that it allows for this kind of forced guidance, which might be a big advantage. I'm also curious about @edwko's thoughts on this. I think the current implementation assumes that each word consists of a single token, which is not the case, and the forced-guidance logic would break after the first multi-token word.
It doesn't break, and in fact there are multiple words in the above example that consist of more than one token. The idea is to just give the model a gentle "nudge" in the right direction: only the first token of each word is forced, and the model generates the remaining tokens of the word on its own.
A more complex algorithm might queue all the forced tokens of a word instead of just the first, but either approach works reasonably well.
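To make the "nudge" concrete, here is a minimal, hedged sketch of how the guide tokens could be collected. The helper name `build_guide_tokens` and the `tokenize_word` callback are illustrative and not taken from the actual PR code:

```cpp
#include <functional>
#include <sstream>
#include <string>
#include <vector>

// Collect the first token id of every word in the narrated text.
// `tokenize_word` stands in for whatever tokenizer call the project exposes.
static std::vector<int> build_guide_tokens(
        const std::string & text,
        const std::function<std::vector<int>(const std::string &)> & tokenize_word) {
    std::vector<int> guide;
    std::istringstream iss(text);
    std::string word;
    while (iss >> word) {
        // tokenize the leading-space variant so " hello" and "hello" are not mixed up
        const std::vector<int> toks = tokenize_word(" " + word);
        if (!toks.empty()) {
            guide.push_back(toks.front()); // only the first token of each word is forced
        }
    }
    return guide;
}
```

The "queue all forced tokens" variant mentioned above would push the whole `toks` vector instead of just `toks.front()`.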
This is really clever, never actually thought of doing forced insertion like this! :)
No, the model can and does handle multi-token words; it uses the exact same tokenizer as the base model. At the word level, it works like this: a single word (or multi-token word) → [word tokens, e.g., 100, 101, 102] → [time token] → [audio codes]. I also added a special space token between words because, for example, "hello" and " hello" have different tokens. This keeps things consistent and avoids confusing the model. As I mentioned in another PR comment, in your example the model goes off at around 15 seconds plus the speaker reference, so around 25-30 seconds. That aligns with that middle range. Just as a test, you can see how it performs without a speaker reference: With speaker: Without a speaker reference it gets it right: output_test.mp4
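Purely for illustration, the word-level layout described above could be written down like this; the struct and the example ids are invented here, not taken from OuteTTS or the PR:

```cpp
#include <vector>

// Hypothetical view of one word's region in the generated stream.
struct word_entry {
    std::vector<int> text_tokens; // e.g. {100, 101, 102} for a multi-token word like " hello"
    int              time_token;  // single token encoding the word's duration
    std::vector<int> audio_codes; // audio codes decoded to the waveform for this word
};
```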
Btw @edwko, another issue I'm facing right now is that the lack of punctuation causes awkward pauses in the wrong places - is that a limitation of OuteTTS, or is it just not yet implemented here?
Versions 0.1 and 0.2 don't have punctuation support, but it's coming soon: I've added punctuation support in the next release (0.3), so this issue will be resolved. :)
Is there still any interest in this PR? I can update it for the vocab refactor if desired. |
Yes, let's merge it after updating to latest master.
…dded a safety check, added newline to guide token start
Alright, made the vocab updates and linting changes; I think it's ready to merge if all is good.
* Added the ability to use guide tokens for OuteTTS, greatly improving TTS recitation accuracy over long input sequences.
* Applied linting suggestions, updated to latest llama_vocab changes, added a safety check, added newline to guide token start.
@ggerganov @edwko
ref: #10784 (comment)
Hi, I have found a technique that greatly improves the accuracy of OuteTTS when handling long inputs by using guide tokens.
Initial Problem: When generating TTS for a long input, the model starts to get confused and outputs gibberish, and the longer the input prompt gets, the worse the output. This occurs because, as a small LLM, the model has imperfect recall: it is required to recite back the input tokens in sequence, but every so often it predicts the wrong token, and those errors cascade into incoherence the longer the output gets.
Consider this example, taken from the readme.
before.mp4
The model gets the start correct, but eventually stumbles, and then rapidly degrades into rubbish output.
Solution: We add an optional flag --tts-use-guide-tokens. When enabled, the narrated text is split into individual words, and the start token ID of each word is "enforced" onto the output at word boundaries. This essentially forms a sort of dynamic grammar, steering the start of each output text token to be correct and leading to correct audio data tokens as a result. Even if the model stumbles, it has the capacity to self-correct at the next word instead of going delulu.

Here's the same text narrated with --tts-use-guide-tokens enabled.

after.mp4
The end result is nearly perfect recitation over any length of text.
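For anyone reading along without the diff open, here is a rough, hedged sketch of the enforcement step during sampling. The identifiers (`guide_state`, `apply_guide`, `space_token_id`) are illustrative only; the real logic lives in the tts example and may differ in detail:

```cpp
#include <cstddef>
#include <vector>

// Illustrative state for the guide-token "dynamic grammar".
struct guide_state {
    std::vector<int> guide_tokens;      // first token of each word, in order
    size_t           next_word  = 0;    // index of the next word to force
    bool             force_next = true; // force the very first word as well
};

// Called once per sampled text token: if we just crossed a word boundary,
// overwrite the sampled token with the pre-computed guide token.
static int apply_guide(guide_state & st, int sampled_token, int space_token_id) {
    int out = sampled_token;
    if (st.force_next && st.next_word < st.guide_tokens.size()) {
        out = st.guide_tokens[st.next_word++]; // steer the start of this word
        st.force_next = false;
    }
    if (out == space_token_id) {
        st.force_next = true; // the next token starts a new word
    }
    return out;
}
```

Even when the model's own prediction drifts mid-word, the next forced word start pulls it back on track, which is why recitation stays intact over long inputs.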