Added the ability to use guide tokens for OuteTTS, greatly improving TTS recitation accuracy over long input sequences. #11186
Conversation
…TTS recitation accuracy over long input sequences.
On first look this seems very clever. I was originally thinking that the model inherently re-generating the input tokens is a limitation as it reduces the effective context size. But now it seems that it allows for this kind of forced guidance, which might be a big advantage. I'm also curious about @edwko's thoughts on this. I think the current implementation assumes that each word consists of a single token, which is not the case, and the forced-guidance logic would break after the first multi-token word.
It doesn't break, and in fact there are multiple words in the above example that consist of more than one token. The idea is to just give the model a gentle "nudge" in the right direction: only the first token of each word is forced, and the model generates the remaining tokens of the word on its own.
A more complex algorithm might queue all the forced tokens of a word instead of just the first, but either approach works reasonably well.
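To make the "nudge" concrete, here is a minimal, hedged sketch of how the guide tokens could be collected. The helper name `build_guide_tokens` and the `tokenize_word` callback are illustrative and not taken from the actual PR code:

```cpp
#include <functional>
#include <sstream>
#include <string>
#include <vector>

// Collect the first token id of every word in the narrated text.
// `tokenize_word` stands in for whatever tokenizer call the project exposes.
static std::vector<int> build_guide_tokens(
        const std::string & text,
        const std::function<std::vector<int>(const std::string &)> & tokenize_word) {
    std::vector<int> guide;
    std::istringstream iss(text);
    std::string word;
    while (iss >> word) {
        // tokenize the leading-space variant so " hello" and "hello" are not mixed up
        const std::vector<int> toks = tokenize_word(" " + word);
        if (!toks.empty()) {
            guide.push_back(toks.front()); // only the first token of each word is forced
        }
    }
    return guide;
}
```

The "queue all forced tokens" variant mentioned above would push the whole `toks` vector instead of just `toks.front()`.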
This is really clever, never actually thought of doing forced insertion like this! :)
No, the model can and does handle multi-token words; it uses the exact same tokenizer as the base model. At the word level, it works like this: a single word (or multi-token word) → [word tokens, e.g., 100, 101, 102] → [time token] → [audio codes]. I also added a special space token between words because, for example, "hello" and " hello" have different tokens. This keeps things consistent and avoids confusing the model. As I mentioned in another PR comment, in your example the model goes off at around 15 seconds plus the speaker reference, so around 25-30 seconds. That aligns with that middle range. Just as a test, you can see how it performs without a speaker reference: With speaker: Without a speaker reference it gets it right: output_test.mp4
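Purely for illustration, the word-level layout described above could be written down like this; the struct and the example ids are invented here, not taken from OuteTTS or the PR:

```cpp
#include <vector>

// Hypothetical view of one word's region in the generated stream.
struct word_entry {
    std::vector<int> text_tokens; // e.g. {100, 101, 102} for a multi-token word like " hello"
    int              time_token;  // single token encoding the word's duration
    std::vector<int> audio_codes; // audio codes decoded to the waveform for this word
};
```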
Btw @edwko, another issue I'm facing right now is that the lack of punctuation causes awkward pauses in the wrong places - is that a limitation of OuteTTS, or is it just not yet implemented here?
Versions 0.1 and 0.2 don't have punctuation support, but it's coming soon: I've added punctuation support in the next release (0.3), so this issue will be resolved. :)
Is there still any interest in this PR? I can update it for the vocab refactor if desired. |
Yes, let's merge it after updating to latest master.
…dded a safety check, added newline to guide token start
Alright, made the vocab updates and linting changes; I think it's ready to merge if all is good.
* Added the ability to use guide tokens for OuteTTS, greatly improving TTS recitation accuracy over long input sequences.
* Applied linting suggestions, updated to latest llama_vocab changes, added a safety check, added newline to guide token start.
@ggerganov @edwko
ref: #10784 (comment)
Hi, I have found a technique that greatly improves the accuracy of OuteTTS when handling long inputs by using guide tokens.
Initial Problem: When generating TTS for a long input, the model starts to get confused and outputs gibberish, and the longer the input prompt gets, the worse the output. This occurs because, as a small LLM, the model has imperfect recall: it is required to recite back the input tokens in sequence, but every so often it predicts the wrong token, and those errors cascade into incoherence the longer the output gets.
Consider this example, taken from the readme.
before.mp4
The model gets the start correct, but eventually stumbles, and then rapidly degrades into rubbish output.
Solution: We add an optional flag --tts-use-guide-tokens. When enabled, the narrated text is split into individual words, and the start token ID of each word is "enforced" onto the output at word boundaries. This essentially forms a sort of dynamic grammar, steering the start of each output text token to be correct and leading to correct audio data tokens as a result. Even if the model stumbles, it has the capacity to self-correct at the next word instead of going delulu.

Here's the same text narrated with --tts-use-guide-tokens enabled.

after.mp4
The end result is nearly perfect recitation over any length of text.
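For anyone reading along without the diff open, here is a rough, hedged sketch of the enforcement step during sampling. The identifiers (`guide_state`, `apply_guide`, `space_token_id`) are illustrative only; the real logic lives in the tts example and may differ in detail:

```cpp
#include <cstddef>
#include <vector>

// Illustrative state for the guide-token "dynamic grammar".
struct guide_state {
    std::vector<int> guide_tokens;      // first token of each word, in order
    size_t           next_word  = 0;    // index of the next word to force
    bool             force_next = true; // force the very first word as well
};

// Called once per sampled text token: if we just crossed a word boundary,
// overwrite the sampled token with the pre-computed guide token.
static int apply_guide(guide_state & st, int sampled_token, int space_token_id) {
    int out = sampled_token;
    if (st.force_next && st.next_word < st.guide_tokens.size()) {
        out = st.guide_tokens[st.next_word++]; // steer the start of this word
        st.force_next = false;
    }
    if (out == space_token_id) {
        st.force_next = true; // the next token starts a new word
    }
    return out;
}
```

Even when the model's own prediction drifts mid-word, the next forced word start pulls it back on track, which is why recitation stays intact over long inputs.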