
AI Voices #2691

Open
Tkael opened this issue Jan 18, 2025 · 2 comments
Labels: 9. enhancement (The behaviour is as specified, but we would like to modify or extend the spec.), significant work (Just sayin')

Comments

Tkael (Member) commented Jan 18, 2025

Discussed in #2688

Originally posted by Transcan January 18, 2025
Hello,

I'm currently working on a script that reads the messages NPCs send you, each of them with a different voice.
I use Spanish and there aren't a lot of voices to choose from, so the messages repeat the same voices too often.

I came across the Piper project. It is a project for using locally generated AI voices for TTS.
These new AI voices open up a range of possibilities to choose from.
It looks promising and I wonder if it can be used with EDDI.

Also, the quality of these voices is greater than that of most of the native Windows voices (the only one that is decent enough, and the one that I use for my personality, is Cortana's voice).

At the moment it doesn't create a system voice (SAPI) that can be used directly with EDDI, and I don't know if that feature will ever exist, but I think it could be used with EDDI with minor changes (just my guess, I'm not a professional programmer).

I'll leave the link to the project here for the masters to take a look:
https://github.com/rhasspy/piper

Have a nice day. o7
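
For context, Piper is normally driven by piping text into its command-line tool together with a downloaded .onnx voice model. Below is a minimal sketch of calling it from another process, assuming the `piper` executable is on PATH and using a hypothetical Spanish model filename; the `--model` and `--output_file` flags follow the project README and may differ between releases:

```python
import subprocess

PIPER_EXE = "piper"                      # the piper executable, assumed to be on PATH
MODEL = "es_ES-example-medium.onnx"      # hypothetical name for a downloaded Piper voice model

def speak_to_wav(text: str, out_path: str) -> None:
    """Render `text` to a .wav file by piping it to the piper CLI."""
    subprocess.run(
        [PIPER_EXE, "--model", MODEL, "--output_file", out_path],
        input=text.encode("utf-8"),
        check=True,
    )

speak_to_wav("Mensaje recibido, comandante.", "npc_message.wav")
```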

Tkael added the 9. enhancement label on Jan 18, 2025
Tkael added the significant work label on Jan 26, 2025
Tkael (Member, Author) commented Jan 26, 2025

I apologize, but I've given this a good deal of effort and I have not been successful in implementing it. Voices are complex, and incorporating these requires much more than a minor change.

I have not found any simple conversion to allow these voices to be streamed in EDDI (it might be possible to generate the entire speech and then play it as a .wav file, but this would be significantly slower than streaming the speech as it is generated). Further, these voices don't contain the same metadata (things like voice name, culture, etc.) and generally do not support SSML (which significantly limits our ability to influence or correct bad pronunciations).
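
To illustrate the streaming-versus-whole-file distinction, here is a rough sketch of consuming Piper's raw output as it is generated (Python for brevity; EDDI itself would need the equivalent in its own audio pipeline). The `--output-raw` mode and the 22,050 Hz 16-bit mono PCM format are assumptions taken from the Piper README, and the flag spelling may differ by version:

```python
import subprocess

MODEL = "es_ES-example-medium.onnx"   # hypothetical Piper voice model
SAMPLE_RATE = 22050                   # medium-quality Piper models typically emit 22.05 kHz 16-bit mono PCM

def stream_speech(text: str, play_chunk) -> None:
    """Feed raw PCM from piper to `play_chunk` as it is generated,
    instead of waiting for a complete .wav file."""
    proc = subprocess.Popen(
        ["piper", "--model", MODEL, "--output-raw"],   # flag spelling assumed from the README
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    proc.stdin.write(text.encode("utf-8"))
    proc.stdin.close()
    while True:
        chunk = proc.stdout.read(4096)
        if not chunk:
            break
        play_chunk(chunk)              # e.g. hand each chunk to an audio output stream
    proc.wait()
```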

Tkael changed the title from Piper AI Voices to AI Voices on Jan 26, 2025
Tkael (Member, Author) commented Jan 26, 2025

I do think that if we were to implement this, we would need to base it on voice models using the .onnx file format. We would also need to know the source of the file so that we could configure the correct inputs to generate a waveform from that model, and prepare a package of metadata (like a friendly human name, culture, etc.) for each voice that each voice model can generate. We would then need to be able to take that waveform and stream it to EDDI (so that we can begin speaking while speech is still being rendered, and so that we can apply our own audio modifications to the output, e.g. radio effects).
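
As a hypothetical sketch of the kind of per-voice metadata package described above (the type and field names are illustrative only and do not correspond to any existing EDDI structure):

```python
from dataclasses import dataclass

@dataclass
class AiVoiceInfo:
    """Illustrative metadata EDDI would need for each voice a model can produce;
    none of these field names exist in EDDI today."""
    friendly_name: str    # human-readable name shown in the voice picker
    culture: str          # BCP-47 culture tag, e.g. "es-ES"
    model_path: str       # path to the .onnx voice model
    model_source: str     # e.g. "piper", so we know how to drive the model
    speaker_id: int       # index within a multi-speaker model, if any
    sample_rate: int      # output sample rate in Hz, needed for playback and effects

# One entry per voice; a multi-speaker model would contribute several entries.
catalogue = [
    AiVoiceInfo("Example Spanish Voice", "es-ES",
                "voices/es_ES-example-medium.onnx", "piper", 0, 22050),
]
```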
