
AI Voices #2691

Open
Tkael opened this issue Jan 18, 2025 · 2 comments
Labels: 9. enhancement (The behaviour is as specified, but we would like to modify or extend the spec.), significant work (Just sayin')

Comments

Tkael (Member) commented Jan 18, 2025

Discussed in #2688

Originally posted by Transcan January 18, 2025
Hello,

I'm currently working on a script that reads the messages NPCs send you, each of them with a different voice.
I use Spanish and there aren't a lot of voices to choose from, so the messages repeat the same voices too often.

I came across the Piper project. It is a project for using locally generated AI voices for TTS.
These new AI voices open up a range of possibilities to choose from.
It looks promising and I wonder if it can be used with EDDI.

Also, the quality of these voices is greater than that of most of the native Windows voices (the only one that is decent enough, and the one that I use for my personality, is Cortana's voice).

At the moment it doesn't create a system voice (SAPI) that can be used directly with EDDI, and I don't know if that feature will ever exist, but I think it could be used with EDDI with minor changes (just my guess, I'm not a professional programmer).

I'll leave the link to the project here for the masters to take a look:
https://github.com/rhasspy/piper

Have a nice day. o7
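
For context, Piper is normally driven by piping text into its command-line tool together with a downloaded .onnx voice model. Below is a minimal sketch of calling it from another process, assuming the `piper` executable is on PATH and using a hypothetical Spanish model filename; the `--model` and `--output_file` flags follow the project README and may differ between releases:

```python
import subprocess

PIPER_EXE = "piper"                      # the piper executable, assumed to be on PATH
MODEL = "es_ES-example-medium.onnx"      # hypothetical name for a downloaded Piper voice model

def speak_to_wav(text: str, out_path: str) -> None:
    """Render `text` to a .wav file by piping it to the piper CLI."""
    subprocess.run(
        [PIPER_EXE, "--model", MODEL, "--output_file", out_path],
        input=text.encode("utf-8"),
        check=True,
    )

speak_to_wav("Mensaje recibido, comandante.", "npc_message.wav")
```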

Tkael added the 9. enhancement label on Jan 18, 2025
Tkael added the significant work label on Jan 26, 2025
Tkael (Member, Author) commented Jan 26, 2025

I apologize, but I've given this a good deal of effort and I have not been successful in implementing it. Voices are complex, and incorporating these requires much more than a minor change.

I have not found any simple conversion to allow these voices to be streamed in EDDI (it might be possible to generate the entire speech and then play it as a .wav file, but this would be significantly slower than streaming the speech as it is generated). Further, these voices don't contain the same metadata (things like voice name, culture, etc.) and generally do not support SSML (which significantly limits our ability to influence or correct bad pronunciations).
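
To illustrate the streaming-versus-whole-file distinction, here is a rough sketch of consuming Piper's raw output as it is generated (Python for brevity; EDDI itself would need the equivalent in its own audio pipeline). The `--output-raw` mode and the 22,050 Hz 16-bit mono PCM format are assumptions taken from the Piper README, and the flag spelling may differ by version:

```python
import subprocess

MODEL = "es_ES-example-medium.onnx"   # hypothetical Piper voice model
SAMPLE_RATE = 22050                   # medium-quality Piper models typically emit 22.05 kHz 16-bit mono PCM

def stream_speech(text: str, play_chunk) -> None:
    """Feed raw PCM from piper to `play_chunk` as it is generated,
    instead of waiting for a complete .wav file."""
    proc = subprocess.Popen(
        ["piper", "--model", MODEL, "--output-raw"],   # flag spelling assumed from the README
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    proc.stdin.write(text.encode("utf-8"))
    proc.stdin.close()
    while True:
        chunk = proc.stdout.read(4096)
        if not chunk:
            break
        play_chunk(chunk)              # e.g. hand each chunk to an audio output stream
    proc.wait()
```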

Tkael changed the title from Piper AI Voices to AI Voices on Jan 26, 2025
Tkael (Member, Author) commented Jan 26, 2025

I do think that if we were to implement this, we would need to base it on voice models using the .onnx file format. We would also need to know the source of the file so that we could configure the correct inputs to generate a waveform from that model, and prepare a package of metadata (like a friendly human name, culture, etc.) for each voice that each voice model can generate. We would then need to be able to take that waveform and stream it to EDDI (so that we can begin speaking while speech is still being rendered, and so that we can apply our own audio modifications to the output, e.g. radio effects).
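
As a hypothetical sketch of the kind of per-voice metadata package described above (the type and field names are illustrative only and do not correspond to any existing EDDI structure):

```python
from dataclasses import dataclass

@dataclass
class AiVoiceInfo:
    """Illustrative metadata EDDI would need for each voice a model can produce;
    none of these field names exist in EDDI today."""
    friendly_name: str    # human-readable name shown in the voice picker
    culture: str          # BCP-47 culture tag, e.g. "es-ES"
    model_path: str       # path to the .onnx voice model
    model_source: str     # e.g. "piper", so we know how to drive the model
    speaker_id: int       # index within a multi-speaker model, if any
    sample_rate: int      # output sample rate in Hz, needed for playback and effects

# One entry per voice; a multi-speaker model would contribute several entries.
catalogue = [
    AiVoiceInfo("Example Spanish Voice", "es-ES",
                "voices/es_ES-example-medium.onnx", "piper", 0, 22050),
]
```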
