Mandarin results sharing and training support #139

liuhuang31 · 2023-12-08T07:43:15Z

Share your Chinese synthesis results or Mandarin model training questions.

liuhuang31 · 2023-12-08T07:57:06Z

StyleTTS 1:
ref audio is attached below: ref.zip
@GuangChen2016 Hi, I used the ref_audio you provided to generate the StyleTTS results. If that makes you uncomfortable, I will take it down immediately.
ref text is: 杭州亚运会即将在9月开幕，这是继北京冬奥会之后，我国再次承办的一项国际大型体育赛事。然而，在这场盛会上，我们将看不到来自俄罗斯和白俄罗斯的运动员的身影。他们被国际奥委会以“技术原因”为由拒之门外，无缘参加杭州亚运会。这一决定引起了我国的不满和反对。我国一直主张欢迎符合条件的俄罗斯和白俄罗斯运动员参加杭州亚运会，而不是对他们进行歧视和限制。我国认为，运动员是否参赛应该由他们自己的体育表现决定，而不是其他因素，包括战争等。我国还表示，愿意为他们搭建一个良好的参赛平台，让他们以中立身份参赛，并且不会影响奖牌的分配。
generated results: ref_gen.zip

StyleTTS 2:
ref audio: same as for StyleTTS 1.
ref text is:
00000001 Stay cool he added with a smile.
00000002 你从楼梯跑上了二楼的走廊，深邃，阴暗，空气中散发着沉闷的味道。你拄着膝盖大口的喘着粗气。
00000003 一股深入毛孔的恐惧感，围绕着你，似乎有什么可怕的东西，正在从楼梯向上爬。幽暗的走廊里，唯一的光源，是一个昏暗的灯泡，就在你前方的不远处。
generated results: ref_styletts2_gen.zip

zhouyong64 · 2023-12-09T13:26:38Z

Was this trained on aishell3? The prosody of the Chinese synthesis sounds quite poor, with basically no pauses.

yl4579 · 2023-12-09T16:44:30Z

@zhouyong64 aishell is bad in general. It is just like VCTK: no emotion and flat prosody.

blldd · 2023-12-10T09:32:07Z

Hi liuhuang, thanks for sharing. I have a question: how did you generate the 48 kHz audio in ref_gen.zip? When I generate 48 kHz audio, it sounds very high-pitched, but your 48 kHz output sounds great. Thanks in advance for your help :P

liuhuang31 · 2023-12-11T02:50:46Z

@blldd Hi, ref_gen.zip was generated by the StyleTTS 1 model, which is an acoustic model that outputs a 24 kHz mel spectrogram. A super-resolution HiFi-GAN vocoder then converts the 24 kHz mel to a 48 kHz wav. This works because StyleTTS 1 and the vocoder use the same mel extraction parameters.
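The requirement that the acoustic model and the vocoder share mel extraction parameters can be checked mechanically. The following is a minimal sketch, not the code used in this thread: `MelConfig`, its field names, and every numeric value are illustrative placeholders. It also shows how many samples a 24 kHz to 48 kHz super-resolution vocoder has to emit per utterance, and why audio plays back fast and high-pitched if 24 kHz samples are simply labelled as 48 kHz (one possible cause of the symptom described above, which the thread does not confirm).

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MelConfig:
    """Mel extraction settings (hypothetical field names)."""
    sample_rate: int
    n_fft: int
    hop_length: int
    win_length: int
    n_mels: int


def mel_mismatches(a: MelConfig, b: MelConfig) -> dict:
    """Return {field: (value_in_a, value_in_b)} for every differing field."""
    da, db = asdict(a), asdict(b)
    return {key: (da[key], db[key]) for key in da if da[key] != db[key]}


def vocoder_output_samples(n_frames: int, hop_length: int,
                           in_rate: int, out_rate: int) -> int:
    """Samples the vocoder must emit at out_rate for n_frames mel frames
    that were extracted with hop_length at in_rate."""
    if (hop_length * out_rate) % in_rate != 0:
        raise ValueError("hop_length does not scale to a whole sample count")
    return n_frames * hop_length * out_rate // in_rate


def playback_speed_ratio(true_rate: int, header_rate: int) -> float:
    """Speed (and pitch) factor when samples recorded at true_rate are
    played back as if they were header_rate."""
    return header_rate / true_rate


# Placeholder values only; use the values from your own training configs.
acoustic = MelConfig(sample_rate=24000, n_fft=2048, hop_length=300,
                     win_length=1200, n_mels=80)
vocoder_input = MelConfig(sample_rate=24000, n_fft=2048, hop_length=300,
                          win_length=1200, n_mels=80)

if mel_mismatches(acoustic, vocoder_input):
    raise SystemExit("mel parameters differ; the vocoder input will not match")
```

With these placeholder settings, each mel frame (300 samples at 24 kHz) corresponds to 600 output samples at 48 kHz, so the vocoder itself must perform the upsampling rather than relying on a changed sample-rate header.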

blldd · 2023-12-11T03:22:03Z

Great! Thanks for your help! I am also curious about the multilingual capability. I tried the StyleTTS2 model trained on LibriTTS and found that it does not work for French text: the generated audio is spoken with English pronunciation.
So how did you get the model to speak Chinese well?

liuhuang31 · 2023-12-11T03:49:51Z

@blldd Hi, blldd. First, I retrained the ASR model using Chinese phonemes. Second, since no Chinese PL-BERT exists, I removed the PL-BERT module. I then used Chinese data to train the resulting model (styletts2_removed_pl-bert_retrain_ASR).
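Retraining the ASR model on Chinese requires a phoneme inventory and a symbol-to-index table. The thread does not say which phoneme set was used, so the sketch below is only one common choice, stated as an assumption: pinyin syllables with tone digits (produced by some grapheme-to-phoneme step that is not shown) are split into an initial and a toned final. `INITIALS`, `split_syllable`, and `build_symbol_table` are hypothetical names.

```python
# Pinyin initials, two-letter ones first so "zh" wins over "z".
# Treating "y" and "w" as initials is a simplification.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")
VOWELS = "aeiouvü"


def split_syllable(syllable: str) -> list:
    """Split a toned pinyin syllable such as 'zhong1' into
    [initial, final+tone]; syllables without an initial stay whole."""
    for initial in INITIALS:
        rest = syllable[len(initial):]
        if syllable.startswith(initial) and rest and rest[0] in VOWELS:
            return [initial, rest]
    return [syllable]


def build_symbol_table(sentences: list) -> dict:
    """Map every phoneme symbol seen in the data to an integer id,
    reserving id 0 for padding."""
    symbols = sorted({phone
                      for sentence in sentences
                      for syllable in sentence
                      for phone in split_syllable(syllable)})
    return {symbol: index for index, symbol in enumerate(["<pad>"] + symbols)}


# Each sentence is a list of toned pinyin syllables.
table = build_symbol_table([["ni3", "hao3"], ["zhong1", "guo2"]])
ids = [table[p] for syl in ["ni3", "hao3"] for p in split_syllable(syl)]
```

Whatever inventory is chosen, the same table has to be used for the ASR model and for the text encoder input, since both consume the same phoneme ids.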

zhouyong64 · 2023-12-14T10:46:51Z

Which SLM model did you use for Chinese? I guess it's not microsoft/wavlm-base-plus.

liuhuang31 · 2023-12-14T10:53:34Z

@zhouyong64 Hi, for now I am still using the English-only microsoft/wavlm-base-plus. Switching to another model may require changes to the model structure, so I have left it unchanged.

mayfool · 2023-12-16T07:26:27Z

Hi, when you removed the pl_bert module, did you replace it with the text encoder in the second training stage?

liuhuang31 · 2023-12-16T07:32:23Z

@mayfool Hi, yes, I simply replaced it with the text_encoder.

mayfool · 2023-12-16T08:04:36Z

Thanks for the reply. Here are a few questions:
1. Did you use the text encoder pretrained in the first stage, or a new text encoder without pretraining?
2. Will this modification affect the zero-shot ability?

liuhuang31 · 2023-12-16T08:28:44Z

@mayfool Hi,

1. I use the text encoder pretrained in the first stage. Originally the flow was pl_bert_output -> diffusion; to remove pl_bert, we changed it to text_encoder_output -> diffusion.
2. In my view, pl_bert relates to the text, so it may help with prosody or naturalness. As for zero-shot ability, removing it shouldn't have much impact.
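The rewiring described here (text_encoder_output -> diffusion instead of pl_bert_output -> diffusion) amounts to choosing a different source for the features that condition the diffusion module. The sketch below only illustrates that selection with plain Python stand-ins; `diffusion_conditioning` and `toy_text_encoder` are hypothetical and are not the StyleTTS 2 code, where both encoders are neural networks.

```python
from typing import Callable, List, Optional, Sequence

Features = List[List[float]]
Encoder = Callable[[Sequence[int]], Features]


def diffusion_conditioning(tokens: Sequence[int],
                           text_encoder: Encoder,
                           pl_bert: Optional[Encoder] = None) -> Features:
    """Return the per-token features handed to the diffusion module.

    With a PL-BERT available, its output is used (the original wiring).
    Without one, the stage-1 pretrained text encoder takes its place.
    """
    encoder = pl_bert if pl_bert is not None else text_encoder
    return encoder(tokens)


def toy_text_encoder(tokens: Sequence[int]) -> Features:
    """Stand-in encoder: one small feature vector per phoneme id."""
    return [[float(token), 0.0] for token in tokens]


# No Chinese PL-BERT exists, so only the text encoder is passed in.
features = diffusion_conditioning([3, 7, 1], toy_text_encoder)
```

In a real model the two encoders may produce features of different widths, so the layer that consumes them would need to match the text encoder's output size; that detail is not covered in the thread.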

mayfool · 2023-12-16T08:33:50Z

@liuhuang31 Thanks a lot!

Moonmore · 2023-12-18T12:41:33Z

Quoting zhouyong64: "Which SLM model did you use for Chinese? I guess it's not microsoft/wavlm-base-plus."

@mayfool I use the chinese_hubert_large model.

yl4579 mentioned this issue Dec 12, 2023

Use XPhoneBERT instead of the provided PL-BERT checkpoints. #140

Closed

yl4579 mentioned this issue Jan 8, 2024

Train a zero-shot voice adaptation model for a different accent/language #179

Closed

jarred1989 mentioned this issue Jan 24, 2024

slmadv using differentiable duration modeling may not be helpful and even bad #146

Closed

Repository owner locked and limited conversation to collaborators Mar 7, 2024

yl4579 converted this issue into discussion #210 Mar 7, 2024

This issue was moved to a discussion.
