
TTS Generator is broken #506

Open
unifirer opened this issue Jan 30, 2025 · 2 comments

unifirer commented Jan 30, 2025

diagnostics.log

Text using the main window runs fine, but in TTS Generator anything longer than two sentences outputs silence and screeches.

The transcribe feature inside TTS Generator also doesn't work reliably; more than half the time, for a book-length text, everything has a forbidden sign after TTS Generator completes.

To Reproduce
Put a paragraph into TTS Generator, set Chunk Sizes to any size, and generate.

Logs are normal.

Desktop (please complete the following information):
AllTalk was updated: 30/01/25
Custom Python environment: no
Text-generation-webUI was updated: using standalone

Additional context
Please bring back the unlimited character length for normal generation; it actually didn't have major problems.

@unifirer (Author)

Edit: added the log file, I forgot.

erew123 (Owner) commented Jan 31, 2025

@unifirer

TTS Engines & TTS Generation

  1. Each of the TTS engines inside AllTalk has certain limits built in. These are nothing to do with limits that I set; these are manufacturer limits. E.g. the XTTS model has a tokenizer limit of 250 characters in English per TTS generation before quality drops off. For references, please see "250 character Limit - How to get over it?" coqui-ai/TTS#3548, or generally search the Coqui GitHub or Google.

So, for example, with XTTS in English, sending a block of text longer than 250 characters for a single generation can result in drop-offs in quality, strange sounds, etc. With XTTS this also varies by language; e.g. I think the limit for Chinese is 200 characters. Some TTS engines have internal code to handle text splitting for blocks of TTS longer than X, but that varies in quality and in what it can do.
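
As an illustration only (this is not AllTalk's actual splitter), here is a minimal sketch of splitting text on sentence boundaries so each chunk stays under an assumed per-engine character limit:

```python
import re

# Assumed limit for XTTS English; other engines/languages differ.
MAX_CHARS = 250

def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks no longer than max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence  # an over-long single sentence is kept whole
    if current:
        chunks.append(current)
    return chunks
```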

As mentioned, I am not the manufacturer of any of the underlying TTS engines themselves. You can find links to each of the TTS engine manufacturers' sites within the AllTalk interface against each engine, OR you can find a current list of links here where you can research each TTS engine and the manufacturer's specifications.

  2. Because TTS engines may have limits, this is part of the reason there is a "chunk" size setting within the TTS generator. This gives you a way to:

    • Mostly ensure you don't hit those limits for your chosen TTS engine on a per-sentence TTS generation.
    • Regenerate a line of text if required.
  3. Both the main window and the TTS generator send their TTS generations to the same API endpoint within AllTalk, which hands them over to the underlying TTS engine. There is no actual difference here: text sent from the main window or from the TTS Generator is handled and generated in exactly the same way. The only code difference is that the TTS generator can split large blocks of text up and send them as individual generations to whatever underlying TTS engine you have selected (see the sketch below). Beyond that, the only difference is the underlying TTS engine and its capabilities, which are manufacturer specific.
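
To make that concrete, here is a hedged sketch of what both UI paths amount to: one HTTP POST per chunk. The port, endpoint path, and field names below are assumptions based on AllTalk's documented API; verify them against your install's API docs before relying on them.

```python
import requests

# Assumed local AllTalk address and endpoint; verify against your install's API docs.
ALLTALK_URL = "http://127.0.0.1:7851/api/tts-generate"

def generate_tts(chunk: str, voice: str = "female_01.wav", language: str = "en") -> dict:
    """POST one chunk of text for generation, as both UI paths effectively do."""
    response = requests.post(ALLTALK_URL, data={
        "text_input": chunk,           # the text to speak (assumed field name)
        "character_voice_gen": voice,  # assumed voice file name
        "language": language,
        "output_file_name": "chunk",
    })
    response.raise_for_status()
    return response.json()

# The TTS Generator's only extra step is looping: chunk the text first
# (e.g. with chunk_text above), then send each piece as its own generation.
generate_tts("Anything longer than the engine limit should be chunked first.")
```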

Analysis

I assume by "transcribe" and "Forbidden" you mean items being highlighted red in the window with the analysis option?

The Analysis option is just a guide to say "hey, you might want to check this was generated as TTS correctly". There is no way to confirm, 1-to-1 and 100%, that the input text sent for TTS generation will match the output TTS. Here is an example...

[Image: Analysis example]

The input text is "...Sorry, uh...allergies," and has multiple periods before the "Sorry" and "allergies". The "..." will be removed by the underlying TTS engine / not pronounced as a sound that can be generated as TTS (as with all punctuation).

So when Whisper transcribes the audio of the generated TTS and the original text is compared you get:

Original Text: Think I left a lotta stuff behind that day. ...Sorry, uh...allergies,
Transcribed Text: Think I left a lot of stuff behind that day. Sorry, allergies.

And therefore there is a difference between the two texts, because the original has the multiple periods and the transcribed TTS has removed them, since they are not pronounced. As such, the Analysis marks the line red, because there is a difference that may or may not matter. It may or may not sound right. It may be because you have hit the token length limit of an underlying TTS engine, so the output is garbled to some degree; or it may be that there was additional punctuation in the original text that isn't going to appear in the generated audio TTS and therefore cannot be transcribed by Whisper.

Hence, marking things red is just a guideline.
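
As a rough sketch of the kind of comparison involved (not AllTalk's actual analysis code), stripping punctuation and casing before comparing shows why the example above is still flagged: the wording itself differs, not just the periods.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and drop punctuation, which TTS does not pronounce anyway."""
    return re.sub(r"[^\w\s]", "", text.lower()).strip()

original = "Think I left a lotta stuff behind that day. ...Sorry, uh...allergies,"
transcribed = "Think I left a lot of stuff behind that day. Sorry, allergies."

# Even normalized, "lotta"/"uh" vs "lot of" still differ, so the line gets marked.
print(normalize(original) == normalize(transcribed))  # False
```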

There is a % accuracy setting:

[Image: % accuracy setting]

This can be used to allow more things to pass. Take the example of the word "there" being in the original text. Whisper may transcribe that as:

- there
- their
- they're

These all sound the same but have different spellings. So if the text has it spelt one way and Whisper transcribes it another way, all I can do is say "hey, the original text and the transcribed text are spelt differently, maybe this sounds wrong". Please bear in mind that Whisper doesn't understand sentence context when transcribing, so it will have no clue which variation of "there" is the correct one.

This issue obviously extends to many other words and languages.

So you can make the check more flexible by lowering the % accuracy. At a lower % accuracy it will allow through words that sound similar but may not be spelt exactly the same, e.g. "there", "their", "they're".
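
As a hedged illustration of thresholded matching (the actual metric AllTalk uses may differ), difflib's SequenceMatcher gives a similarity ratio that can be compared against an accuracy threshold:

```python
from difflib import SequenceMatcher

def passes(original: str, transcribed: str, accuracy: float = 0.85) -> bool:
    """Pass the line if the two texts are at least `accuracy` similar."""
    ratio = SequenceMatcher(None, original.lower(), transcribed.lower()).ratio()
    return ratio >= accuracy

# A homophone-level difference passes at 85%; a real mismatch does not.
print(passes("I think they're here", "I think their here"))         # True
print(passes("I think they're here", "completely different text"))  # False
```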

You can also choose different Whisper models, which may or may not work better. You can find all the information about the Whisper models here: https://github.com/openai/whisper
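
For reference, loading and running one of those models with the openai/whisper Python package looks like the following (the audio file name is a placeholder); larger models such as "medium" generally transcribe more accurately but run slower:

```python
import whisper

# Available sizes include tiny, base, small, medium, large;
# bigger models are slower but usually more accurate.
model = whisper.load_model("base")

# Transcribe a generated TTS clip (placeholder file name).
result = model.transcribe("generated_tts_chunk.wav")
print(result["text"])
```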

Hopefully this gives you a better understanding and covers what you need to know.

Please also read my statement on my support availability, as I am now severely limited in any interaction I can make on AllTalk.

Thanks
