Speaker diarization - Extracting average-clustered speaker embeddings? #6912
Replies: 2 comments
-
I found a method that supposedly calculates the speaker embedding feature vectors: get_cluster_avg_embs_model() for MSDD models here: However, it takes four parameters (embs, clus_label_index, ms_seg_counts, scale_mapping) to calculate the embeddings, and I do not know how to get them out of my model, since they are all torch tensors. The documentation explains them, but I was not able to find these parameters by looking through the model, e.g. with model.embs. Maybe someone can help :/
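In case it helps others reading this thread: if you can already obtain per-segment embeddings and their cluster labels from the pipeline, the averaging step itself is straightforward. Below is a minimal sketch, not NeMo's internal implementation; how `embs` and `clus_label_index` are produced is an assumption here, and the names simply mirror two of the parameters quoted above.

```python
# Hypothetical sketch: average per-segment embeddings for each speaker cluster.
# `embs` and `clus_label_index` are assumed inputs, not taken from NeMo's MSDD code.
import torch

def average_cluster_embeddings(embs: torch.Tensor, clus_label_index: torch.Tensor) -> dict:
    """embs: [num_segments, emb_dim]; clus_label_index: [num_segments] integer cluster labels."""
    avg_embs = {}
    for label in torch.unique(clus_label_index):
        mask = clus_label_index == label
        # mean over all segments assigned to this speaker cluster
        avg_embs[int(label)] = embs[mask].mean(dim=0)
    return avg_embs

# Toy example: 10 segments, 192-dim (TitaNet-like) embeddings, 2 speakers
embs = torch.randn(10, 192)
labels = torch.tensor([0, 0, 1, 0, 1, 1, 0, 1, 0, 1])
speaker_embs = average_cluster_embeddings(embs, labels)
```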
-
Check out this response: #8171
-
For my use case I would like to perform speaker diarization on an audio file, extract the speaker embeddings, and compare them to a database where I have saved speaker embeddings of known speakers. What is the best way to do this? From my understanding it would not make sense to take the base-scale embeddings from TitaNet, because on their own they are a poor representation of the speaker.
I imagined a workflow where I read in the audio file, run the complete speaker diarization pipeline (VAD -> TitaNet -> MSDD), and then access the speaker embedding of every labeled speaker (for example speaker 0, speaker 1, speaker 2). Ideally I would have one embedding per speaker, compare it to the embeddings in my database, and change the label accordingly.
I just do not know how to get these embeddings, since the embeddings from the different scales are clustered and combined into an average-clustered embedding according to the scale weights and more (also see: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/speaker_diarization/models.html).
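For the matching step against the database, a plain cosine-similarity comparison between the per-speaker average embeddings and the stored reference embeddings is usually sufficient. The sketch below assumes both sides are torch tensors of the same dimension; the dictionary layout and the 0.6 threshold are assumptions to adapt, not anything prescribed by NeMo.

```python
# Hypothetical sketch: relabel diarized speakers by nearest database entry (cosine similarity).
import torch
import torch.nn.functional as F

def match_speakers(diarized: dict, database: dict, threshold: float = 0.6) -> dict:
    """diarized: {'speaker_0': tensor, ...}; database: {'alice': tensor, ...} (same emb_dim)."""
    mapping = {}
    for spk_label, emb in diarized.items():
        best_name, best_score = None, -1.0
        for name, ref_emb in database.items():
            score = F.cosine_similarity(emb.unsqueeze(0), ref_emb.unsqueeze(0)).item()
            if score > best_score:
                best_name, best_score = name, score
        # keep the anonymous diarization label if nothing in the database is similar enough
        mapping[spk_label] = best_name if best_score >= threshold else spk_label
    return mapping
```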
I would appreciate any help from you guys :) Best regards, Aaron