Is SpeechT5 trainable? #4

SuperiorDtj · 2024-06-12T07:14:53Z

I found that in the training code, SpeechT5 can be trained.
However, in the inference code, SpeechT5 is loaded with Microsoft's public weights.
Could you please clarify whether training SpeechT5 affects the results?

glory20h · 2024-06-12T08:10:11Z

Hi, in the inference code, SpeechT5 is initially loaded with the public weights, but its parameters are then overwritten with the state_dict from the VoiceLDM checkpoint.
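
The load order described in this reply can be sketched with plain PyTorch. This is a minimal illustration, not the VoiceLDM code: `TinyTextEncoder` and `TinyModel` are hypothetical stand-ins for SpeechT5 and the full model, and the `"state_dict"` checkpoint key is an assumption.

```python
import torch
from torch import nn

class TinyTextEncoder(nn.Module):
    """Hypothetical stand-in for the SpeechT5 text encoder."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)

class TinyModel(nn.Module):
    """Hypothetical stand-in for the full model that owns the encoder."""
    def __init__(self):
        super().__init__()
        self.text_encoder = TinyTextEncoder()

torch.manual_seed(0)

# Training side: the encoder is fine-tuned and saved with the rest of the model.
trained = TinyModel()
with torch.no_grad():
    trained.text_encoder.proj.weight.fill_(0.5)  # pretend these are fine-tuned values
checkpoint = {"state_dict": trained.state_dict()}  # key name is an assumption

# Inference side, step 1: the encoder starts from its "public" initialization
# (a fresh random init stands in for the public pretrained weights here).
model = TinyModel()
public_weight = model.text_encoder.proj.weight.detach().clone()

# Inference side, step 2: the checkpoint overwrites every parameter,
# including those of the text encoder.
model.load_state_dict(checkpoint["state_dict"])

overwritten = torch.equal(model.text_encoder.proj.weight,
                          trained.text_encoder.proj.weight)
still_public = torch.equal(model.text_encoder.proj.weight, public_weight)
print(overwritten, still_public)
```

Under these assumptions, the weights used at inference are the fine-tuned ones from the checkpoint, so the initial public weights only serve as a placeholder.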

SuperiorDtj · 2024-06-12T08:32:37Z

Thanks for your quick reply!

SuperiorDtj · 2024-06-12T09:35:31Z

> Hi, in the inference code, SpeechT5 is initially loaded with the public weights, but its parameters are then overwritten with the state_dict from the VoiceLDM checkpoint.

I have another question, if you don't mind answering.
Can using a regular phoneme sequence embedding network instead of SpeechT5 achieve the same effect?
In other words, is SpeechT5 necessary for modeling duration information?
Or can a regular nn.embedder + Durator achieve similar results?

glory20h · 2024-06-12T10:38:38Z

No, using SpeechT5 isn't strictly necessary, any form of 'text encoder' would likely do the job.
Also, regarding using a single nn.embedder before Durator, I believe it's possible, but the linguistic modeling performance would likely be quite poor.
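
One possible reading of why a bare embedding layer would model language poorly is that it encodes each token independently of its neighbors. The sketch below illustrates that reading only; the sizes are arbitrary, and a single `nn.TransformerEncoderLayer` is a hypothetical stand-in for a contextual text encoder such as SpeechT5.

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab_size, dim = 50, 16  # arbitrary illustrative sizes

# Context-free encoder: one vector per token id.
embedding = nn.Embedding(vocab_size, dim)

# Contextual encoder: a single self-attention layer on top of the embeddings.
contextual = nn.TransformerEncoderLayer(
    d_model=dim, nhead=4, dim_feedforward=32, dropout=0.0, batch_first=True
).eval()

# Token 7 appears in the middle of both sequences, with different neighbors.
seq_a = torch.tensor([[3, 7, 9]])
seq_b = torch.tensor([[5, 7, 2]])

with torch.no_grad():
    emb_a, emb_b = embedding(seq_a), embedding(seq_b)
    ctx_a, ctx_b = contextual(emb_a), contextual(emb_b)

# The embedding of token 7 is identical in both sequences...
same_without_context = torch.equal(emb_a[0, 1], emb_b[0, 1])
# ...while the contextual representation depends on the neighbors.
same_with_context = torch.allclose(ctx_a[0, 1], ctx_b[0, 1])
print(same_without_context, same_with_context)
```

With an embedding-only front end, any context modeling would have to be learned by the modules that come after it.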

SuperiorDtj · 2024-06-12T12:02:51Z

> No, using SpeechT5 isn't strictly necessary, any form of 'text encoder' would likely do the job. Also, regarding using a single nn.embedder before Durator, I believe it's possible, but the linguistic modeling performance would likely be quite poor.

Thanks for your advice! It's very helpful for my research!

SuperiorDtj · 2024-06-26T03:15:31Z

> No, using SpeechT5 isn't strictly necessary, any form of 'text encoder' would likely do the job. Also, regarding using a single nn.embedder before Durator, I believe it's possible, but the linguistic modeling performance would likely be quite poor.

Have you tried freezing the parameters of SpeechT5? Or, is it necessary to update the text encoder parameters in this TTS modeling approach?

glory20h · 2024-06-26T08:23:37Z

I have tried both, and found that updating the text encoder's parameters led to better performance.
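
For readers who want to run the same comparison, freezing versus updating an encoder in PyTorch can be sketched as follows. The two `nn.Linear` modules are hypothetical stand-ins for the text encoder and the rest of the model, and `set_trainable` and `train_step` are illustrative helpers, not part of VoiceLDM.

```python
import torch
from torch import nn

torch.manual_seed(0)
text_encoder = nn.Linear(8, 8)  # hypothetical stand-in for the text encoder
decoder = nn.Linear(8, 1)       # hypothetical stand-in for the rest of the model

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Enable or disable gradient computation for all parameters of a module."""
    for p in module.parameters():
        p.requires_grad_(trainable)

def train_step(freeze_encoder: bool) -> bool:
    """Run one SGD step and report whether the encoder weights changed."""
    set_trainable(text_encoder, not freeze_encoder)
    # Only hand parameters that require gradients to the optimizer.
    params = [p for m in (text_encoder, decoder)
              for p in m.parameters() if p.requires_grad]
    opt = torch.optim.SGD(params, lr=0.1)
    before = text_encoder.weight.detach().clone()
    loss = decoder(text_encoder(torch.ones(2, 8))).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return not torch.equal(before, text_encoder.weight)

changed_when_frozen = train_step(freeze_encoder=True)
changed_when_trainable = train_step(freeze_encoder=False)
print(changed_when_frozen, changed_when_trainable)
```

Which setting works better is an empirical question for a given model and dataset; the sketch only shows how the two configurations differ mechanically.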

SuperiorDtj · 2024-06-26T08:26:57Z

> I have tried both, and found that updating the text encoder's parameters led to better performance.

Thanks for your reply! It's very helpful for my research!

SuperiorDtj closed this as completed Jun 12, 2024

SuperiorDtj reopened this Jun 12, 2024
