About speechT5 is trainable? #4

Open
SuperiorDtj opened this issue Jun 12, 2024 · 8 comments

Comments

@SuperiorDtj

I found that in the training code, SpeechT5 can be trained.
However, in the inference code, SpeechT5 is loaded with Microsoft's public weights.
Could you please clarify whether training SpeechT5 affects the results?

@glory20h
Owner

Hi, in the inference code, SpeechT5 is initially loaded with the public weights, but its parameters are then overwritten with the state_dict from the VoiceLDM checkpoint.
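
The load order described above can be sketched in PyTorch roughly as follows. This is only an illustration: a small `nn.Linear` stands in for the SpeechT5 encoder and an in-memory dict stands in for the VoiceLDM checkpoint, so none of the names below come from the repository's actual loading code.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for the text encoder as first built from the public weights.
encoder = nn.Linear(4, 4)
public_weight = encoder.weight.detach().clone()

# Stand-in for a fine-tuned checkpoint (hypothetical, randomly initialised).
trained = nn.Linear(4, 4)
checkpoint = {"state_dict": trained.state_dict()}

# load_state_dict copies the checkpoint tensors over the initial parameters.
encoder.load_state_dict(checkpoint["state_dict"])

overwritten = torch.equal(encoder.weight, trained.weight)
still_public = torch.equal(encoder.weight, public_weight)
print(overwritten, still_public)
```

After the second load, the module holds the checkpoint values, so whatever was learned during training is what runs at inference time.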

@SuperiorDtj
Author

Thanks for your quick reply!

@SuperiorDtj
Author

> Hi, in the inference code, SpeechT5 is initially loaded with the public weights, but its parameters are then overwritten with the state_dict from the VoiceLDM checkpoint.

I have another question, if you don't mind answering.
Can using a regular phoneme sequence embedding network instead of SpeechT5 achieve the same effect?
In other words, is SpeechT5 necessary for modeling duration information?
Or can a regular nn.embedder + Durator achieve similar results?

@glory20h
Owner

No, using SpeechT5 isn't strictly necessary; any form of 'text encoder' would likely do the job.
Also, regarding using a single nn.embedder before Durator, I believe it's possible, but the linguistic modeling performance would likely be quite poor.
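
To make the comparison concrete, here is a minimal sketch of the two options. `EmbeddingOnlyEncoder` and `ContextualEncoder` are hypothetical names with arbitrary sizes, and neither is taken from VoiceLDM or SpeechT5. Both map token ids of shape `(batch, seq)` to features of shape `(batch, seq, dim)`, so either could in principle feed a downstream duration module.

```python
import torch
from torch import nn


class EmbeddingOnlyEncoder(nn.Module):
    """Hypothetical baseline: one lookup per token, no context."""

    def __init__(self, vocab_size=64, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, tokens):
        return self.embed(tokens)


class ContextualEncoder(nn.Module):
    """Hypothetical text encoder: embeddings mixed by self-attention.

    Positional encodings are omitted to keep the sketch short.
    """

    def __init__(self, vocab_size=64, dim=32, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=64,
            dropout=0.0, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, tokens):
        return self.encoder(self.embed(tokens))


torch.manual_seed(0)
a = torch.tensor([[5, 7, 9]])
b = torch.tensor([[5, 8, 9]])  # same first token, different neighbour

flat = EmbeddingOnlyEncoder().eval()
ctx = ContextualEncoder().eval()
with torch.no_grad():
    # Compare the features produced for the shared first token.
    flat_same = torch.allclose(flat(a)[0, 0], flat(b)[0, 0])
    ctx_same = torch.allclose(ctx(a)[0, 0], ctx(b)[0, 0])
    out_shape = tuple(ctx(a).shape)
print(flat_same, ctx_same, out_shape)
```

The embedding-only variant gives a token the same vector regardless of its neighbours, which is one plausible reason its linguistic modeling would be weaker.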

@SuperiorDtj
Author

> No, using SpeechT5 isn't strictly necessary; any form of 'text encoder' would likely do the job. Also, regarding using a single nn.embedder before Durator, I believe it's possible, but the linguistic modeling performance would likely be quite poor.

Thanks for your advice! It's very helpful for my research!

@SuperiorDtj
Author

> No, using SpeechT5 isn't strictly necessary; any form of 'text encoder' would likely do the job. Also, regarding using a single nn.embedder before Durator, I believe it's possible, but the linguistic modeling performance would likely be quite poor.

Have you tried freezing the parameters of SpeechT5? Or, is it necessary to update the text encoder parameters in this TTS modeling approach?

@glory20h
Owner

I have tried both, and found that updating the text encoder's parameters led to better performance.
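
For anyone wanting to run the same comparison, a minimal sketch of switching between the frozen and trainable setups in PyTorch might look like this. The two `nn.Linear` modules are stand-ins for the text encoder and the rest of the model, and `set_trainable` is a hypothetical helper, not something from the VoiceLDM code.

```python
import torch
from torch import nn


def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Enable or disable gradient updates for every parameter of a module."""
    for p in module.parameters():
        p.requires_grad_(trainable)


def trainable_params(*modules: nn.Module):
    """Collect only the parameters that will receive gradients."""
    return [p for m in modules for p in m.parameters() if p.requires_grad]


text_encoder = nn.Linear(8, 8)  # stand-in for the text encoder
rest_of_model = nn.Linear(8, 2)  # stand-in for everything else

# Setup 1: frozen text encoder, only the rest of the model is optimised.
set_trainable(text_encoder, False)
frozen_count = sum(p.numel() for p in trainable_params(text_encoder, rest_of_model))

# Setup 2: text encoder updated together with the rest of the model.
set_trainable(text_encoder, True)
params = trainable_params(text_encoder, rest_of_model)
trainable_count = sum(p.numel() for p in params)

# The learning rate is an arbitrary placeholder value.
optimizer = torch.optim.AdamW(params, lr=1e-4)
print(frozen_count, trainable_count)
```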

@SuperiorDtj
Author

> I have tried both, and found that updating the text encoder's parameters led to better performance.

Thanks for your reply! It's very helpful for my research!
