Add NaturalSpeech2 #35
Conversation
Add NaturalSpeech2 models, training, and inference. NS2 predicts the latents of EnCodec and uses the EnCodec decoder to generate the waveform. We also offer a pretrained checkpoint (trained on LibriTTS) for users to run inference.
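The two-stage generation described above (predict codec latents, then decode them to audio) can be sketched as follows. This is a toy illustration with made-up names, dimensions, and random stand-in weights, not the actual Amphion or EnCodec API:

```python
import numpy as np

# Hypothetical sketch of the NS2 pipeline described above: a model predicts
# continuous codec latents, and a (frozen) codec decoder turns latents into
# a waveform. Shapes and names are illustrative assumptions only.

LATENT_DIM = 128   # EnCodec-like latent channels (assumed)
UPSAMPLE = 320     # audio samples per latent frame (assumed)

def predict_latents(num_frames: int, latent_dim: int = LATENT_DIM) -> np.ndarray:
    """Stand-in for the NS2 latent predictor: returns (frames, latent_dim)."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((num_frames, latent_dim))

def codec_decode(latents: np.ndarray, upsample: int = UPSAMPLE) -> np.ndarray:
    """Stand-in for the codec decoder: maps each latent frame to audio samples."""
    rng = np.random.default_rng(1)
    proj = rng.standard_normal((latents.shape[1], upsample)) / latents.shape[1]
    return np.tanh(latents @ proj).reshape(-1)  # flatten frames into one waveform

latents = predict_latents(num_frames=50)
waveform = codec_decode(latents)
print(latents.shape, waveform.shape)  # (50, 128) (16000,)
```

The point of the split is that the codec decoder is pretrained and fixed, so NS2 only has to learn the mapping into the latent space.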
Where is the pretrained checkpoint? And are there any generated samples?
Please merge run_inference.sh and run_trian.sh into a single file, i.e., run.sh, and provide a recipe for NaturalSpeech2.
Same suggestions.
What's the difference between this trainer and the TTS trainer in models/tts/base/tts_trainer.py?
Because some of the TTS trainer's initialization is useless for NS2, I don't want to inherit from the TTS trainer.
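The trade-off being debated can be illustrated with toy classes (hypothetical stand-ins, not Amphion's real trainers): inheriting runs base initialization the subclass may not need, while a standalone trainer avoids that at the cost of duplicating the shared setup.

```python
# Toy illustration of the trainer-inheritance trade-off discussed above.
# These classes are hypothetical, not the actual Amphion trainers.

class TTSTrainer:
    def __init__(self):
        self.vocoder = "loaded"       # base setup an NS2-style model may not need
        self.common = "shared setup"  # setup every trainer does need

class InheritingTrainer(TTSTrainer):
    # Inherits everything, including initialization it does not use.
    pass

class StandaloneTrainer:
    # Re-implements only the needed setup, duplicating code with TTSTrainer.
    def __init__(self):
        self.common = "shared setup"

a, b = InheritingTrainer(), StandaloneTrainer()
print(hasattr(a, "vocoder"), hasattr(b, "vocoder"))  # True False
```

A common middle ground is to keep the inheritance but override `__init__` (or split the base `__init__` into smaller hooks) so the subclass can skip the unneeded steps.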
Why doesn't the NS2 trainer directly inherit from the TTS trainer (defined in models/tts/base/tts_trainer.py), but instead inherit from a newly defined trainer that is similar to the TTS trainer?
Same question
It would be better to move this module to the modules/encoder or modules/naturalspeech2 directory.
Why? The prior encoder and diffusion are also parts of the NS2 model.
I'm not sure about your discussion. General advice: if prior_encoder.py can be used by models other than NS2, then it should be moved into modules.
I think prior_encoder is designed specifically for NS2 (so far), so I will put it under models/tts/ns2.
I think the models folder should contain only the models themselves (e.g., fastspeech2, vits, valle); the related modules should be placed in the modules folder, especially since you have already created the modules/naturalspeech2 folder.
models/tts/naturalspeech2/wavenet.py
Outdated
Future improvement: merge this wavenet with Amphion wavenet vocoder (https://github.com/open-mmlab/Amphion/blob/main/models/vocoders/autoregressive/wavenet/wavenet.py)
I approve of the "merge" idea. The name wavenet.py is somewhat confusing: it is not a vocoder. I think it is more like a diffusion WaveNet, which already exists in Amphion:
class BiDilConv(nn.Module):
@HeCheng0625 You can merge this wavenet.py with the existing one.
I will do it in the future. For now, the WaveNet is designed only for NS2, and its inputs differ considerably from BiDilConv's.
Add the copyright information to all the newly added files, including the .py and .sh files.
from models.base.base_sampler import build_samplers
...
class TTSTrainer:
There is no inheritance? Although Line27 says "it inherits..."
models/tts/naturalspeech2/ns2.py
Outdated
Add copyright information for all the newly added files.
Fixed.
Paper: https://arxiv.org/abs/2304.09116
CKPT: https://huggingface.co/amphion/naturalspeech2_libritts
Demo: https://huggingface.co/spaces/amphion/NaturalSpeech2