
TTS Generator is broken #506

Open
unifirer opened this issue Jan 30, 2025 · 2 comments

unifirer commented Jan 30, 2025

diagnostics.log

Text using the main window runs fine, but in TTS Generator anything longer than two sentences outputs silence and screeches.

The transcribe feature inside TTS Generator also doesn't work reliably; more than half the time, for a book-length text, everything has a forbidden sign after TTS Generator completes.

To Reproduce
Put a paragraph into TTS Generator, set Chunk Sizes to any size, and generate.

Logs are normal.

Desktop (please complete the following information):
AllTalk was updated: 30/01/25
Custom Python environment: no
Text-generation-webUI was updated: using standalone

Additional context
Please bring back the unlimited character length for normal generation; it actually didn't have major problems.

@unifirer (Author)

Edit: added the log file, I forgot.

erew123 (Owner) commented Jan 31, 2025

@unifirer

TTS Engines & TTS Generation

  1. Each of the TTS engines inside AllTalk has certain limits built in. These are nothing to do with limits that I set; these are manufacturer limits. E.g. the XTTS model has a tokenizer limit of 250 characters in English per TTS generation before quality drops off. For references, please see "250 character Limit - How to get over it?" coqui-ai/TTS#3548, or generally search the Coqui GitHub or Google.

So, for example, with XTTS in English, sending a block of text longer than 250 characters for a single generation can result in drop-offs in quality, strange sounds, etc. With XTTS this also varies by language; e.g. I think the limit for Chinese is 200 characters. Some TTS engines have internal code to handle text splitting for blocks of TTS longer than X, but that varies in quality and in what it can do.
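
As an illustration only (this is not AllTalk's actual splitter), here is a minimal sketch of splitting text on sentence boundaries so each chunk stays under an assumed per-engine character limit:

```python
import re

# Assumed limit for XTTS English; other engines/languages differ.
MAX_CHARS = 250

def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks no longer than max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence  # an over-long single sentence is kept whole
    if current:
        chunks.append(current)
    return chunks
```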

As mentioned, I am not the manufacturer of any of the underlying TTS engines themselves. You can find links to each of the TTS engine manufacturers' sites within the AllTalk interface against each engine, OR you can find a current list of links here where you can research each TTS engine and the manufacturer's specifications.

  2. Because TTS engines may have limits, this is part of the reason there is a "chunk" size setting within the TTS generator. This gives you a way to:

    • Mostly ensure you don't hit those limits for your chosen TTS engine on a per-sentence TTS generation.
    • Regenerate a line of text if required.
  3. Both the main window and the TTS generator send their TTS generations to the same API endpoint within AllTalk, which hands them over to the underlying TTS engine. There is no actual difference here: text sent from the main window or from the TTS Generator is handled and generated in exactly the same way. The only code difference is that the TTS generator can split large blocks of text up and send them as individual generations to whatever underlying TTS engine you have selected (see the sketch below). Beyond that, the only difference is the underlying TTS engine and its capabilities, which are manufacturer specific.
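
To make that concrete, here is a hedged sketch of what both UI paths amount to: one HTTP POST per chunk. The port, endpoint path, and field names below are assumptions based on AllTalk's documented API; verify them against your install's API docs before relying on them.

```python
import requests

# Assumed local AllTalk address and endpoint; verify against your install's API docs.
ALLTALK_URL = "http://127.0.0.1:7851/api/tts-generate"

def generate_tts(chunk: str, voice: str = "female_01.wav", language: str = "en") -> dict:
    """POST one chunk of text for generation, as both UI paths effectively do."""
    response = requests.post(ALLTALK_URL, data={
        "text_input": chunk,           # the text to speak (assumed field name)
        "character_voice_gen": voice,  # assumed voice file name
        "language": language,
        "output_file_name": "chunk",
    })
    response.raise_for_status()
    return response.json()

# The TTS Generator's only extra step is looping: chunk the text first
# (e.g. with chunk_text above), then send each piece as its own generation.
generate_tts("Anything longer than the engine limit should be chunked first.")
```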

Analysis

I assume by "transcribe" and "Forbidden" you mean items being highlighted red in the window with the analysis option?

The Analysis option is just a guide to say "hey, you might want to check this was generated as TTS correctly". There is no way to confirm, 1-to-1 and 100%, that the input text sent for TTS generation will match the output TTS. Here is an example...

[Image: Analysis example]

The input text is "...Sorry, uh...allergies," and has multiple periods before the "Sorry" and "allergies". The "..." will be removed by the underlying TTS engine / not pronounced as a sound that can be generated as TTS (as with all punctuation).

So when Whisper transcribes the audio of the generated TTS and the original text is compared you get:

Original Text: Think I left a lotta stuff behind that day. ...Sorry, uh...allergies,
Transcribed Text: Think I left a lot of stuff behind that day. Sorry, allergies.

And therefore there is a difference between the two texts, because the original has the multiple periods and the transcribed TTS has removed them, since they are not pronounced. As such, the Analysis marks the line red, because there is a difference that may or may not matter. It may or may not sound right. It may be because you have hit the token length limit of an underlying TTS engine, so the output is garbled to some degree; or it may be that there was additional punctuation in the original text that isn't going to appear in the generated audio TTS and therefore cannot be transcribed by Whisper.

Hence, marking things red is just a guideline.
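
As a rough sketch of the kind of comparison involved (not AllTalk's actual analysis code), stripping punctuation and casing before comparing shows why the example above is still flagged: the wording itself differs, not just the periods.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and drop punctuation, which TTS does not pronounce anyway."""
    return re.sub(r"[^\w\s]", "", text.lower()).strip()

original = "Think I left a lotta stuff behind that day. ...Sorry, uh...allergies,"
transcribed = "Think I left a lot of stuff behind that day. Sorry, allergies."

# Even normalized, "lotta"/"uh" vs "lot of" still differ, so the line gets marked.
print(normalize(original) == normalize(transcribed))  # False
```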

There is a % accuracy setting:

[Image: % accuracy setting]

This can be used to allow more things to pass. Take the example of the word "there" being in the original text. Whisper may transcribe that as:

- there
- their
- they're

These all sound the same but have different spellings. So if the text has it spelt one way and Whisper transcribes it another way, all I can do is say "hey, the original text and the transcribed text are spelt differently, maybe this sounds wrong". Please bear in mind that Whisper doesn't understand sentence context when transcribing, so it will have no clue which variation of "there" is the correct one.

This issue obviously extends to many other words and languages.

So you can make the check more flexible by lowering the % accuracy. At a lower % accuracy it will allow through words that sound similar but may not be spelt exactly the same, e.g. "there", "their", "they're".
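
As a hedged illustration of thresholded matching (the actual metric AllTalk uses may differ), difflib's SequenceMatcher gives a similarity ratio that can be compared against an accuracy threshold:

```python
from difflib import SequenceMatcher

def passes(original: str, transcribed: str, accuracy: float = 0.85) -> bool:
    """Pass the line if the two texts are at least `accuracy` similar."""
    ratio = SequenceMatcher(None, original.lower(), transcribed.lower()).ratio()
    return ratio >= accuracy

# A homophone-level difference passes at 85%; a real mismatch does not.
print(passes("I think they're here", "I think their here"))         # True
print(passes("I think they're here", "completely different text"))  # False
```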

You can also choose different Whisper models, which may or may not work better. You can find all the information about the Whisper models here: https://github.com/openai/whisper
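
For reference, loading and running one of those models with the openai/whisper Python package looks like the following (the audio file name is a placeholder); larger models such as "medium" generally transcribe more accurately but run slower:

```python
import whisper

# Available sizes include tiny, base, small, medium, large;
# bigger models are slower but usually more accurate.
model = whisper.load_model("base")

# Transcribe a generated TTS clip (placeholder file name).
result = model.transcribe("generated_tts_chunk.wav")
print(result["text"])
```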

Hopefully this gives you a better understanding and covers what you need to know.

Please also read my statement on my support availability, as I am now severely limited in any interaction I can make on AllTalk.

Thanks
