Tips for training a speaker embeddings model for fast convergence? #10039
gabitza-tech started this conversation in General
Hello everybody,
I would like to train an ECAPA-TDNN model on VoxCeleb2, and on VoxCeleb2 + CN-Celeb2, for speaker verification/identification. I don't need it to be state of the art (e.g. complicated pretraining followed by large-margin fine-tuning), but I do want it to be moderately performant as quickly as possible, since model performance is not the primary focus.
I can train on an A100 GPU. I was thinking of randomly cropping the training audio to 3 s and then applying RIR + MUSAN augmentation, speed perturbation (factors 0.95-1.05), and SpecAugment. I would also use the AAM-Softmax loss with s=30, m=0.2. How many epochs would I need to train this model, and what is the expected training time (on VoxCeleb2, for example)? Is it possible to obtain good performance with such a simple configuration (under 1% EER on VoxCeleb1-O, for example)?
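For context, here is roughly what I have in mind for the cropping, speed perturbation, and margin loss. This is just a minimal PyTorch/torchaudio sketch to illustrate the setup, not code from any particular toolkit; the function and class names are my own placeholders, and the 16 kHz sample rate is an assumption (RIR + MUSAN mixing and SpecAugment are omitted for brevity):

```python
import math
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio

SR = 16000  # assumed sample rate for VoxCeleb-style data


def random_crop(wav: torch.Tensor, seconds: float = 3.0) -> torch.Tensor:
    """Randomly crop a waveform to a fixed length, zero-padding short clips."""
    target = int(seconds * SR)
    if wav.size(-1) <= target:
        return F.pad(wav, (0, target - wav.size(-1)))
    start = random.randint(0, wav.size(-1) - target)
    return wav[..., start:start + target]


def speed_perturb(wav: torch.Tensor) -> torch.Tensor:
    """Speed perturbation by a random factor in [0.95, 1.05] via resampling;
    the output is then treated as if it were still sampled at SR."""
    factor = random.uniform(0.95, 1.05)
    return torchaudio.functional.resample(wav, orig_freq=SR, new_freq=int(SR / factor))


class AAMSoftmax(nn.Module):
    """Additive angular margin softmax (ArcFace-style) head, s=30, m=0.2."""

    def __init__(self, embed_dim: int, n_speakers: int, s: float = 30.0, m: float = 0.2):
        super().__init__()
        self.s = s
        self.cos_m, self.sin_m = math.cos(m), math.sin(m)
        self.weight = nn.Parameter(torch.empty(n_speakers, embed_dim))
        nn.init.xavier_normal_(self.weight)

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalised embeddings and class centres.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        sine = (1.0 - cosine.pow(2)).clamp(0.0, 1.0).sqrt()
        # cos(theta + m) = cos(theta)cos(m) - sin(theta)sin(m),
        # applied to the target class only.
        phi = cosine * self.cos_m - sine * self.sin_m
        one_hot = F.one_hot(labels, cosine.size(1)).to(cosine.dtype)
        logits = self.s * (one_hot * phi + (1.0 - one_hot) * cosine)
        return F.cross_entropy(logits, labels)
```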
I saw that some people apply offline augmentation to create more speakers, or do two-stage training with pretraining followed by fine-tuning, etc. But I would like to train both models in just a couple of days (an A100 is a pretty big GPU, after all), and I don't need to reach the peak performance of the more complicated pipelines; decent performance is enough.
Any other tips would be greatly appreciated! I would mostly like to know whether I can achieve this performance in a reasonable time, plus any tips for fast convergence in as few epochs as possible. :D
Thank you in advance!