This issue was moved to a discussion.

Mandarin results sharing and training support #139

Closed
liuhuang31 opened this issue Dec 8, 2023 · 15 comments

Comments

@liuhuang31

Share your Chinese synthesis results or Mandarin model training questions.

@liuhuang31
Author

liuhuang31 commented Dec 8, 2023

With StyleTTS 1:
The reference audio files are attached: ref.zip
@GuangChen2016 Hi, I used the reference audio you provided to generate the StyleTTS results. If you are uncomfortable with that, I will take them down immediately.
ref text is: 杭州亚运会即将在9月开幕,这是继北京冬奥会之后,我国再次承办的一项国际大型体育赛事。然而,在这场盛会上,我们将看不到来自俄罗斯和白俄罗斯的运动员的身影。他们被国际奥委会以“技术原因”为由拒之门外,无缘参加杭州亚运会。 这一决定引起了我国的不满和反对。我国一直主张欢迎符合条件的俄罗斯和白俄罗斯运动员参加杭州亚运会,而不是对他们进行歧视和限制。我国认为,运动员是否参赛应该由他们自己的体育表现决定,而不是其他因素,包括战争等。我国还表示,愿意为他们搭建一个良好的参赛平台,让他们以中立身份参赛,并且不会影响奖牌的分配。
The generated results are in: ref_gen.zip

With StyleTTS 2:
The reference audio is the same as for StyleTTS 1.
ref text is:
00000001 Stay cool he added with a smile.
00000002 你从楼梯跑上了二楼的走廊,深邃,阴暗,空气中散发着沉闷的味道。你拄着膝盖大口的喘着粗气。
00000003 一股深入毛孔的恐惧感,围绕着你,似乎有什么可怕的东西,正在从楼梯向上爬。幽暗的走廊里,唯一的光源,是一个昏暗的灯泡,就在你前方的不远处。
The generated results are in: ref_styletts2_gen.zip

@zhouyong64

Was this trained on AISHELL-3? The prosody of the Chinese synthesis sounds quite poor; there are basically no pauses.

@yl4579
Owner

yl4579 commented Dec 9, 2023

@zhouyong64 AISHELL is a poor dataset in general. Like VCTK, it has no emotion and flat prosody.

@blldd

blldd commented Dec 10, 2023

Hi liuhuang, thanks for sharing. How did you generate the 48 kHz audio in ref_gen.zip? When I generate 48 kHz audio, it sounds very high-pitched, but yours sounds great at 48 kHz. Thanks in advance for your help :P

@liuhuang31
Author

@blldd Hi, the ref_gen.zip file was generated by the StyleTTS 1 model, which is an acoustic model that generates 24 kHz mel spectrograms. A super-resolution HiFi-GAN vocoder then converts the 24 kHz mel to a 48 kHz waveform. StyleTTS 1 and the vocoder use the same mel extraction parameters.
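
The pipeline described above can be sketched as simple sample-rate arithmetic. This is a minimal illustration, not the author's code: the `MelConfig` class, the helper names, and all numeric values (FFT size, window, hop, mel bins) are assumptions chosen for the example. It shows two things: how many output samples a super-resolution vocoder has to emit per 24 kHz mel frame, and why audio sounds high-pitched if 24 kHz samples are merely labeled as 48 kHz (one plausible cause of the symptom reported above, not confirmed in this thread).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MelConfig:
    """Mel extraction parameters (all values below are illustrative assumptions)."""
    sample_rate: int
    n_fft: int
    win_length: int
    hop_length: int
    n_mels: int


# Hypothetical settings: the acoustic model and the vocoder input must agree.
ACOUSTIC_MEL = MelConfig(sample_rate=24000, n_fft=2048, win_length=1200,
                         hop_length=300, n_mels=80)
VOCODER_INPUT_MEL = MelConfig(sample_rate=24000, n_fft=2048, win_length=1200,
                              hop_length=300, n_mels=80)
VOCODER_OUTPUT_RATE = 48000


def output_samples_per_frame(mel: MelConfig, output_rate: int) -> int:
    """Waveform samples the vocoder must emit for each input mel frame."""
    numerator = mel.hop_length * output_rate
    if numerator % mel.sample_rate != 0:
        raise ValueError("output rate does not give a whole number of samples per frame")
    return numerator // mel.sample_rate


def mislabeled_pitch_ratio(true_rate: int, header_rate: int) -> float:
    """Pitch (and speed) factor when samples at true_rate are played at header_rate."""
    return header_rate / true_rate


# The acoustic model and vocoder must share mel parameters.
assert ACOUSTIC_MEL == VOCODER_INPUT_MEL
samples_per_frame = output_samples_per_frame(VOCODER_INPUT_MEL, VOCODER_OUTPUT_RATE)
pitch_ratio = mislabeled_pitch_ratio(24000, 48000)
```

With these assumed values, the vocoder's total upsampling factor is twice the mel hop length, whereas writing 24 kHz samples with a 48 kHz header doubles the playback pitch and halves the duration.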

@blldd

blldd commented Dec 11, 2023

Great, thanks for your help! I am also curious about multilingual capability. I tried the StyleTTS2 model trained on LibriTTS and found that it does not work for French text: the generated audio is spoken with English pronunciation.
So how do you get the model to speak Chinese well?

@liuhuang31
Author

@blldd Hi, blldd. First, I retrained the ASR model using Chinese phonemes. Second, since no Chinese PL-BERT exists, I removed the PL-BERT module. I then trained the resulting styletts2_removed_pl-bert_retrain_ASR model on Chinese data.
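
Retraining the ASR model on Chinese phonemes implies a symbol table that covers the new phoneme set. The sketch below is only an illustration under stated assumptions: it assumes transcripts are already converted to space-separated tokens (here, pinyin syllables with tone digits), and the function names and the `<pad>`/`<unk>` tokens are hypothetical, not taken from the StyleTTS2 code base.

```python
from typing import Dict, Iterable, List


def build_symbol_table(transcripts: Iterable[str],
                       pad: str = "<pad>", unk: str = "<unk>") -> Dict[str, int]:
    """Map every phoneme token seen in the transcripts to a stable integer id."""
    symbols = sorted({tok for line in transcripts for tok in line.split()})
    table = {pad: 0, unk: 1}  # reserved ids (an assumption of this sketch)
    for sym in symbols:
        if sym not in table:
            table[sym] = len(table)
    return table


def encode(line: str, table: Dict[str, int], unk: str = "<unk>") -> List[int]:
    """Convert one phoneme transcript to ids, mapping unseen tokens to unk."""
    return [table.get(tok, table[unk]) for tok in line.split()]


# Hypothetical pinyin-with-tone transcripts, for illustration only.
table = build_symbol_table(["ni3 hao3", "hao3 de5"])
ids = encode("ni3 hao3 ma5", table)
```

Sorting the symbols keeps the ids reproducible across runs, which matters because the text aligner and the TTS model must share the same mapping.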

@zhouyong64

Which SLM model did you use for Chinese? I guess it's not microsoft/wavlm-base-plus.

@liuhuang31
Author

@zhouyong64 Hi, for now I am still using the English-only microsoft/wavlm-base-plus. Switching to another SLM may require changes to the model structure, so I have left it unchanged.
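
One reason a different SLM may need structural changes is that the feature dimensions differ. The sketch below is an assumption-laden illustration: it supposes the SLM discriminator consumes all layer outputs concatenated along the channel axis, and it uses hidden sizes and layer counts that I believe match a base-size and a large-size model (768 with 13 layer outputs, 1024 with 25 layer outputs). Check the actual model configs before relying on these numbers; the function name is hypothetical.

```python
def slm_discriminator_in_channels(hidden_size: int, n_layer_outputs: int) -> int:
    """Input channels if all SLM layer outputs are stacked along the channel axis."""
    if hidden_size <= 0 or n_layer_outputs <= 0:
        raise ValueError("hidden_size and n_layer_outputs must be positive")
    return hidden_size * n_layer_outputs


# Assumed dimensions, for illustration only.
base_channels = slm_discriminator_in_channels(hidden_size=768, n_layer_outputs=13)
large_channels = slm_discriminator_in_channels(hidden_size=1024, n_layer_outputs=25)
```

Under these assumptions, the first layer of the discriminator would need a different input width after the swap, so the corresponding config values and any pretrained discriminator weights could not be reused as-is.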

@mayfool

mayfool commented Dec 16, 2023

Hi, since you removed the PL-BERT module, did you replace it with the text encoder in the second training stage?

@liuhuang31
Author

@mayfool Hi, yes, I simply replaced it with the text_encoder.

@mayfool

mayfool commented Dec 16, 2023

Thanks for the reply. I have a few questions: 1. Did you use the text encoder pretrained in the first stage, or a new text encoder without pretraining? 2. Will this modification affect zero-shot ability?

@liuhuang31
Author

@mayfool hi,

  1. I use the text encoder pretrained in the first stage. Originally the flow was pl_bert_output -> diffusion; with PL-BERT removed, we changed it to text_encoder_output -> diffusion.
  2. In my view, PL-BERT relates to the text, so it may help with prosody or naturalness. Removing it should not have much impact on zero-shot ability.
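
The conditioning swap in point 1 can be sketched as follows. This is a toy illustration, not the repository's code: `select_diffusion_condition` and both toy encoders are hypothetical stand-ins, and real encoders would return tensors rather than nested lists.

```python
from typing import Callable, List, Optional, Sequence

Features = List[List[float]]
Encoder = Callable[[Sequence[int]], Features]


def select_diffusion_condition(tokens: Sequence[int],
                               text_encoder: Encoder,
                               pl_bert: Optional[Encoder] = None) -> Features:
    """Return the text features that condition the diffusion model.

    With PL-BERT available, its output is used (the original flow).
    Without it, the stage-one text encoder output is used instead.
    """
    source = pl_bert if pl_bert is not None else text_encoder
    return source(tokens)


def toy_text_encoder(tokens: Sequence[int]) -> Features:
    # Stand-in for the pretrained stage-one text encoder.
    return [[float(t), 0.0] for t in tokens]


def toy_pl_bert(tokens: Sequence[int]) -> Features:
    # Stand-in for PL-BERT; the second value marks the feature source.
    return [[float(t), 1.0] for t in tokens]


original = select_diffusion_condition([3, 7], toy_text_encoder, toy_pl_bert)
modified = select_diffusion_condition([3, 7], toy_text_encoder)
```

In a real model the two feature sources may differ in dimensionality, so the layer that consumes the conditioning may also need to be resized; the thread does not say how that was handled.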

@mayfool

mayfool commented Dec 16, 2023

@liuhuang31 Thanks a lot!

@Moonmore

@mayfool Regarding the SLM question above: I use the chinese_hubert_large model.

Repository owner locked and limited conversation to collaborators Mar 7, 2024
@yl4579 yl4579 converted this issue into discussion #210 Mar 7, 2024
