Tips for training a speaker embeddings model for fast convergence? #10039
gabitza-tech started this conversation in General
Hello everybody,
I would like to train an ECAPA-TDNN model on VoxCeleb2, and on VoxCeleb2 + CN-Celeb2, for speaker verification/identification. I don't need it to be state of the art (e.g. complicated pretraining followed by large-margin fine-tuning), but I do want it to be moderately performant as quickly as possible, since model performance is not the primary focus.
I can train on an A100 GPU. I was thinking of randomly cropping the training audio to 3 s and then applying RIR + MUSAN augmentation, speed perturbation (factors 0.95-1.05), and SpecAugment. I would also use the AAM-Softmax loss with s=30, m=0.2. How many epochs would I need to train this model, and what is the expected training time (on VoxCeleb2, for example)? Is it possible to obtain good performance with such a simple configuration (under 1% EER on VoxCeleb1-O, for example)?
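For context, here is roughly what I have in mind for the cropping, speed perturbation, and margin loss. This is just a minimal PyTorch/torchaudio sketch to illustrate the setup, not code from any particular toolkit; the function and class names are my own placeholders, and the 16 kHz sample rate is an assumption (RIR + MUSAN mixing and SpecAugment are omitted for brevity):

```python
import math
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio

SR = 16000  # assumed sample rate for VoxCeleb-style data


def random_crop(wav: torch.Tensor, seconds: float = 3.0) -> torch.Tensor:
    """Randomly crop a waveform to a fixed length, zero-padding short clips."""
    target = int(seconds * SR)
    if wav.size(-1) <= target:
        return F.pad(wav, (0, target - wav.size(-1)))
    start = random.randint(0, wav.size(-1) - target)
    return wav[..., start:start + target]


def speed_perturb(wav: torch.Tensor) -> torch.Tensor:
    """Speed perturbation by a random factor in [0.95, 1.05] via resampling;
    the output is then treated as if it were still sampled at SR."""
    factor = random.uniform(0.95, 1.05)
    return torchaudio.functional.resample(wav, orig_freq=SR, new_freq=int(SR / factor))


class AAMSoftmax(nn.Module):
    """Additive angular margin softmax (ArcFace-style) head, s=30, m=0.2."""

    def __init__(self, embed_dim: int, n_speakers: int, s: float = 30.0, m: float = 0.2):
        super().__init__()
        self.s = s
        self.cos_m, self.sin_m = math.cos(m), math.sin(m)
        self.weight = nn.Parameter(torch.empty(n_speakers, embed_dim))
        nn.init.xavier_normal_(self.weight)

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalised embeddings and class centres.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        sine = (1.0 - cosine.pow(2)).clamp(0.0, 1.0).sqrt()
        # cos(theta + m) = cos(theta)cos(m) - sin(theta)sin(m),
        # applied to the target class only.
        phi = cosine * self.cos_m - sine * self.sin_m
        one_hot = F.one_hot(labels, cosine.size(1)).to(cosine.dtype)
        logits = self.s * (one_hot * phi + (1.0 - one_hot) * cosine)
        return F.cross_entropy(logits, labels)
```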
I saw that some people apply offline augmentation to create more speakers, or do two-stage training with pretraining followed by fine-tuning, etc. But I would like to train both models in just a couple of days (an A100 is a pretty big GPU, after all), and I don't need to reach the peak performance of the more complicated pipelines; decent performance is enough.
Any other tips would be greatly appreciated! I would mostly like to know whether I can achieve this performance in a reasonable time, plus any tips for fast convergence in as few epochs as possible. :D
Thank you in advance!