Controllable and Interpretable Singing Voice Decomposition via Assem-VC #27

AK391 · 2021-10-26T03:00:27Z

Just saw this paper: https://arxiv.org/abs/2110.12676. When will the repo be updated for it? Thanks.

980202006 · 2021-10-26T03:43:28Z

Are there any details about the speaker embedding, such as which model is used to generate it, whether it is pre-trained, and which dataset is used?

wookladin · 2021-10-26T07:02:35Z

@AK391
Thanks for your interest!
Currently, we don't have a specific plan to release the code for that paper.
We will add links to the paper and demo page to the README soon.

@980202006
We just used nn.Embedding without pre-training. Thanks!
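
For readers unfamiliar with this setup, here is a minimal sketch of a speaker lookup table built on `torch.nn.Embedding` and trained from scratch. The speaker count, the embedding size, and the helper name `speaker_vector` are placeholders for illustration, not values or names confirmed by the authors.

```python
import torch
import torch.nn as nn

NUM_SPEAKERS = 13   # placeholder speaker count (assumption)
EMBED_DIM = 256     # placeholder embedding size (assumption)

# Randomly initialised lookup table; its rows are learned jointly with the
# rest of the model, so no separate pre-trained speaker encoder is involved.
speaker_table = nn.Embedding(NUM_SPEAKERS, EMBED_DIM)

def speaker_vector(speaker_id: int) -> torch.Tensor:
    """Return the (1, EMBED_DIM) embedding row for one integer speaker ID."""
    return speaker_table(torch.tensor([speaker_id]))

source = speaker_vector(0)
target = speaker_vector(5)  # switching speakers means looking up another row
```

With a table like this, changing the speaker identity at inference time amounts to passing a different integer ID.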

980202006 · 2021-10-26T07:03:22Z

Thank you!

iehppp2010 · 2021-12-28T09:42:05Z

@wookladin

I have tried to reproduce this paper.
My training mel MSE loss reaches 0.18, while the dev mel loss stalls at 0.75.

After training the 'Decoder' model, I used it to do GTA fine-tuning on the HiFi-GAN model you provide.
Below is the HiFi-GAN fine-tuning loss:

After that, I tried to control speaker identity by simply switching the speaker embedding to that of the target speaker, which is the method described in the paper.
I used a training-set audio clip of the CSD female speaker as the reference audio (link below).
https://drive.google.com/file/d/1QCGlfREai1AgkKnrLhdvZm-jt_k50R79/view?usp=sharing
I used the speaker PAMR from the NUS-48E dataset as the target speaker.
https://drive.google.com/file/d/19eL1XgAjR4eWTFv7M5jaMJMCWIC17m36/view?usp=sharing
The resulting audio is:
https://drive.google.com/file/d/1XsaWrSQ2xtiohbjpm6fFU-V28o4pp2wM/view?usp=sharing

I found that the lyrics are hard to make out.

My dataset config:
devset:
CSD speaker: the three songs en48, en49, and en50 were chosen;
NUS-48E speakers: ADIZ's 13 and JLEE's 05 were chosen.
trainset: the other songs in CSD and NUS-48E.
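
A split like the one above holds out whole songs, so no dev song leaks into training. A minimal sketch of that idea follows; the `(speaker, song_id)` key format and the non-dev song IDs are hypothetical, and the real file naming may differ.

```python
# Held-out songs from the config above, as hypothetical (speaker, song_id) keys.
DEV_KEYS = {
    ("CSD", "en48"), ("CSD", "en49"), ("CSD", "en50"),
    ("ADIZ", "13"), ("JLEE", "05"),
}

def split_train_dev(items, dev_keys=DEV_KEYS):
    """Split (speaker, song_id) pairs into train and dev lists by whole song."""
    train = [item for item in items if item not in dev_keys]
    dev = [item for item in items if item in dev_keys]
    return train, dev

# Illustrative input; "en01" and "09" are made-up IDs.
songs = [("CSD", "en01"), ("CSD", "en48"), ("ADIZ", "13"),
         ("ADIZ", "09"), ("JLEE", "05")]
train, dev = split_train_dev(songs)
```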

My speaker embedding dimension is 256. (Perhaps 256 is too large?)

What could be the problem with my model?
Could you share your Decoder model's train/dev loss?
My Decoder model has a considerably larger mel MSE loss on the dev set than on the training set.

wookladin · 2021-12-29T02:46:52Z

@iehppp2010
Hi. It seems your alignment encoder, 'Cotatron', is not working properly.
As explained in the paper, we transferred Cotatron from pre-trained weights, which are trained with LibriTTS and VCTK.
Did you transfer from those weights?
You can find pre-trained weights in this Google Drive link.

iehppp2010 · 2021-12-29T06:10:09Z

@wookladin
Thanks for your quick reply.
I did use the pre-trained weights.
When I trained the 'Decoder' model, the 'Cotatron' aligner model was frozen.
I found that the plotted alignment is not as good as that of other TTS models, e.g. Tacotron2.
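
For reference, freezing one module while training another is usually done in PyTorch roughly as follows. This is a generic sketch: the two `nn.Linear` layers are stand-ins for the real Cotatron and Decoder networks, and the sizes and learning rate are arbitrary.

```python
import torch
import torch.nn as nn

# Stand-ins for the real networks; layer sizes are arbitrary placeholders.
aligner = nn.Linear(80, 80)  # plays the role of the frozen aligner
decoder = nn.Linear(80, 80)  # plays the role of the trainable decoder

# Freeze the aligner: disable gradients and switch to eval mode.
aligner.requires_grad_(False)
aligner.eval()

# Only the decoder's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-4)

mel = torch.randn(4, 80)          # dummy batch standing in for mel frames
with torch.no_grad():
    features = aligner(mel)       # frozen forward pass, no graph is built
loss = nn.functional.mse_loss(decoder(features), mel)
loss.backward()
optimizer.step()
```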

Do I need to fine-tune the 'Cotatron' model on the singing dataset to get better alignment?
Looking forward to your reply.

wookladin · 2021-12-29T06:22:41Z

@iehppp2010
Yes.
You first have to fine-tune the Cotatron model on the singing dataset, because the average duration of each phoneme is much longer in singing data.
This should produce better alignment and sample quality.

iehppp2010 · 2022-01-06T07:55:32Z

@wookladin
Thanks for your quick reply.
After I fine-tuned the Cotatron model, train.loss_reconstruction converged to about 0.2, while val.loss_reconstruction reached its minimum of about 0.5 at step 3893.
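
Picking the checkpoint at the minimum validation loss can be sketched as below. The logged values are made up for illustration and are not the actual training curve; `best_checkpoint` is a hypothetical helper.

```python
def best_checkpoint(val_log):
    """Return the (step, val_loss) entry with the lowest validation loss."""
    return min(val_log, key=lambda entry: entry[1])

# Made-up (step, val_loss) pairs, for illustration only.
val_log = [(1000, 0.81), (2000, 0.62), (3000, 0.55), (4000, 0.50), (5000, 0.58)]
step, loss = best_checkpoint(val_log)
```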

I used that checkpoint to train the Decoder model and fine-tune the HiFi-GAN vocoder.

I found that when I test with audio the fine-tuned Cotatron model has never seen, I can't get good sample quality.
I guess the reason is that the Cotatron model does not produce a good alignment...

So, how can I get the Cotatron model to produce better alignment on unseen singing audio?
Also, could you provide more training details?

betty97 · 2022-05-11T12:25:21Z

@iehppp2010, I am also trying to reproduce the results of this paper. I have one question about the dataset preparation: how did you split the files? The paper says that "all singing voices are split between 1-12 seconds". Did you do this manually for both CSD and NUS-48E, or in some other way? Thanks!!
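
The thread does not say how the splitting was done. One possible automated approach, offered only as an assumption, is to detect voiced regions first and then greedily merge them into chunks of 1 to 12 seconds. The sketch below covers only the merging step; the voiced intervals are assumed to come from some silence detector, and `chunk_intervals` is a hypothetical helper.

```python
def chunk_intervals(voiced, min_len=1.0, max_len=12.0):
    """Greedily merge sorted (start, end) voiced intervals, in seconds,
    into chunks whose duration lies within [min_len, max_len]."""
    chunks = []
    cur_start = cur_end = None
    for start, end in voiced:
        if cur_start is None:
            cur_start, cur_end = start, end
        elif end - cur_start <= max_len:
            cur_end = end  # extend the current chunk across the pause
        else:
            chunks.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    if cur_start is not None:
        chunks.append((cur_start, cur_end))
    # Drop chunks that are too short, or single intervals that are too long.
    return [(s, e) for s, e in chunks if min_len <= e - s <= max_len]

# Illustrative voiced regions (made-up timestamps).
voiced = [(0.0, 4.0), (4.5, 9.0), (9.6, 15.0), (30.0, 30.4)]
chunks = chunk_intervals(voiced)
```

Here the first two regions merge into one 9-second chunk, the third stays on its own, and the final 0.4-second region is discarded as too short. This variant merges across pauses of any length as long as the chunk stays under the maximum, so a cap on the pause length may be worth adding.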
