In this paper, we apply a modular architecture to raw-waveform speaker embedding. Specifically, the model consists of a waveform encoder followed by a deep embedding backbone.
Official inference code for Y-vector and unofficial code for wav2spk.
In our experiments, we train on the VoxCeleb2 dev set and test on VoxCeleb1.
Results of the provided pretrained model (link), reported as EER and minDCF(0.01):
| Metric | VoxCeleb1-O | VoxCeleb1-E | VoxCeleb1-H |
| --- | --- | --- | --- |
| EER | 2.35 | 2.32 | 3.89 |
| minDCF(0.01) | 0.242 | 0.235 | 0.349 |
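
For readers unfamiliar with the second metric, minDCF(0.01) is usually read as the minimum normalized detection cost at a target prior of 0.01. The following is a minimal sketch of that computation, assuming unit miss and false-alarm costs (a common choice, not confirmed by this README). The function name `min_dcf` is illustrative and not part of the released code.

```python
def min_dcf(target_scores, nontarget_scores, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all decision thresholds.

    Assumption: c_miss = c_fa = 1.0 by default; the README only states
    the target prior (0.01).
    """
    n_tar, n_non = len(target_scores), len(nontarget_scores)
    # Label each trial (1 = target, 0 = non-target), highest score first.
    trials = sorted(
        [(s, 1) for s in target_scores] + [(s, 0) for s in nontarget_scores],
        key=lambda t: t[0],
        reverse=True,
    )

    # Start with a threshold above every score: everything is rejected,
    # so P_miss = 1 and P_fa = 0.
    accepted_tar = accepted_non = 0
    best = c_miss * p_target

    i = 0
    while i < len(trials):
        # Lower the threshold to the next distinct score and accept all
        # trials that share it (ties are accepted together).
        score = trials[i][0]
        while i < len(trials) and trials[i][0] == score:
            if trials[i][1] == 1:
                accepted_tar += 1
            else:
                accepted_non += 1
            i += 1
        p_miss = 1.0 - accepted_tar / n_tar
        p_fa = accepted_non / n_non
        cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
        best = min(best, cost)

    # Normalize by the cost of the best trivial system
    # (always accept or always reject).
    return best / min(c_miss * p_target, c_fa * (1.0 - p_target))
```

A value of 0 means the two score distributions are perfectly separable, and a value of 1 means the system does no better than a trivial accept-all or reject-all decision.
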
numba==0.48 # install before librosa
pandas is only needed to reproduce the results in the table. Speaker embedding extraction itself simply loads a wav file and runs a forward pass through the pretrained model.
To reproduce the results: after installing the required packages, download the VoxCeleb1 data, then use the provided preprocessing script to save the input features as pickle files. Saved pkl files are named spkid-recid-fileid.pkl.
Then run the provided evaluation script to compute the embeddings and measure EER and minDCF.
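
The spkid-recid-fileid.pkl naming scheme can be sketched as follows. This is an illustration only: the helper names (`feature_filename`, `parse_feature_filename`, `save_feature`, `load_feature`) are hypothetical and not taken from the released scripts, and the stored object is assumed to be a plain list of samples. Because VoxCeleb recording ids are YouTube video ids, which may themselves contain a hyphen, the parser treats everything between the first and last hyphen as the recording id.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Tuple


def feature_filename(spkid: str, recid: str, fileid: str) -> str:
    """Build a file name following the spkid-recid-fileid.pkl pattern."""
    return f"{spkid}-{recid}-{fileid}.pkl"


def parse_feature_filename(name: str) -> Tuple[str, str, str]:
    """Split a feature file name back into (spkid, recid, fileid).

    The recording id is everything between the first and the last hyphen,
    so a recid that contains '-' still round-trips correctly.
    """
    stem = Path(name).stem
    spkid, rest = stem.split("-", 1)
    recid, fileid = rest.rsplit("-", 1)
    return spkid, recid, fileid


def save_feature(out_dir: Path, spkid: str, recid: str, fileid: str, samples) -> Path:
    """Pickle the input feature (assumed here: a list of samples) to out_dir."""
    path = out_dir / feature_filename(spkid, recid, fileid)
    with open(path, "wb") as f:
        pickle.dump(samples, f)
    return path


def load_feature(path: Path):
    """Load a previously saved feature file."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

For example, `parse_feature_filename("id10001-ab-cd_ef-00002.pkl")` yields the speaker id `id10001`, the recording id `ab-cd_ef`, and the file id `00002`.
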
To extract embeddings for other datasets, each input utterance should be longer than 4 seconds.
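
A simple guard for this length requirement might look like the sketch below. It assumes 16 kHz audio, as in VoxCeleb; the constant and function names are illustrative and not part of the released code.

```python
MIN_SECONDS = 4.0     # minimum utterance length stated above
SAMPLE_RATE = 16000   # assumption: 16 kHz audio, as in VoxCeleb


def is_long_enough(num_samples: int,
                   sample_rate: int = SAMPLE_RATE,
                   min_seconds: float = MIN_SECONDS) -> bool:
    """Return True if the utterance is strictly longer than min_seconds."""
    return num_samples / sample_rate > min_seconds


def filter_utterances(lengths: dict) -> list:
    """Keep the ids of utterances (id -> number of samples) that are long enough."""
    return [utt_id for utt_id, n in lengths.items() if is_long_enough(n)]
```

With these defaults, an utterance of exactly 64000 samples (4.0 s) is rejected, while anything longer is accepted.
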
Cosine similarity scoring results on the VoxCeleb1 test sets (EER):
| System | VoxCeleb1-O | VoxCeleb1-E | VoxCeleb1-H |
| --- | --- | --- | --- |
| wav2spk | 3.00 | 2.78 | 4.56 |
| Y-vector | 2.72 | 2.38 | 3.87 |
(Note that results on VoxCeleb1-O can fluctuate considerably in our experimental setting.)
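
To make the scoring step concrete, here is a minimal pure-Python sketch of cosine scoring between two embeddings and of an EER estimate from trial scores. It is not the evaluation code used for the tables: the function names are illustrative, the EER is taken at the threshold where the false-acceptance and false-rejection rates are closest (no interpolation), and the threshold sweep is quadratic, so it is only suitable for small trial lists.

```python
import math


def cosine_score(a, b):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER in percent.

    A trial is accepted when its score is >= the threshold. Every distinct
    score is tried as a threshold, and the EER is the mean of the two error
    rates at the threshold where they are closest to each other.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False rejection: target trials scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: non-target trials scoring at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return 100.0 * eer
```

In a verification trial, `cosine_score` would be applied to the enrollment and test embeddings, and the resulting scores for same-speaker and different-speaker trials passed to `equal_error_rate`.
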
Performance can likely be improved further by replacing either part with a stronger network, for example by swapping the backbone for F-TDNN, E-TDNN, or ECAPA-TDNN.
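
The modular idea can be illustrated with a toy sketch in which the encoder and backbone are interchangeable callables. The functions below are simple placeholders standing in for the neural networks; none of the names come from the released code, and the real components are trained models rather than hand-written statistics.

```python
from dataclasses import dataclass
from typing import Callable, List

Frames = List[List[float]]  # sequence of frame-level feature vectors


@dataclass
class SpeakerEmbedder:
    """Composes a waveform encoder with an embedding backbone."""
    encoder: Callable[[List[float]], Frames]    # waveform -> frame features
    backbone: Callable[[Frames], List[float]]   # frame features -> embedding

    def __call__(self, waveform: List[float]) -> List[float]:
        return self.backbone(self.encoder(waveform))


def toy_encoder(waveform: List[float], frame_len: int = 4) -> Frames:
    """Placeholder encoder: per-frame [mean, mean absolute value]."""
    frames = []
    for start in range(0, len(waveform) - frame_len + 1, frame_len):
        chunk = waveform[start:start + frame_len]
        frames.append([sum(chunk) / frame_len,
                       sum(abs(x) for x in chunk) / frame_len])
    return frames


def mean_pool_backbone(frames: Frames) -> List[float]:
    """Placeholder backbone: average each feature over time."""
    return [sum(f[d] for f in frames) / len(frames) for d in range(len(frames[0]))]


def max_pool_backbone(frames: Frames) -> List[float]:
    """Alternative placeholder backbone: maximum of each feature over time."""
    return [max(f[d] for f in frames) for d in range(len(frames[0]))]
```

Swapping the backbone then only changes one argument, for example `SpeakerEmbedder(toy_encoder, max_pool_backbone)` instead of `SpeakerEmbedder(toy_encoder, mean_pool_backbone)`, while the encoder stays untouched.
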