Speaker diarization - Extracting average-clustered speaker embeddings? #6912
Replies: 2 comments
-
I found a method that supposedly calculates the speaker embedding feature vectors: get_cluster_avg_embs_model() for MSDD models here: However, it takes four parameters (embs, clus_label_index, ms_seg_counts, scale_mapping) to calculate the embeddings, and I do not know how to get them out of my model, since they are all torch tensors. The documentation explains them, but I was not able to find these parameters by looking through the model, e.g. with model.embs. Maybe someone can help :/
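In case it helps others reading this thread: if you can already obtain per-segment embeddings and their cluster labels from the pipeline, the averaging step itself is straightforward. Below is a minimal sketch, not NeMo's internal implementation; how `embs` and `clus_label_index` are produced is an assumption here, and the names simply mirror two of the parameters quoted above.

```python
# Hypothetical sketch: average per-segment embeddings for each speaker cluster.
# `embs` and `clus_label_index` are assumed inputs, not taken from NeMo's MSDD code.
import torch

def average_cluster_embeddings(embs: torch.Tensor, clus_label_index: torch.Tensor) -> dict:
    """embs: [num_segments, emb_dim]; clus_label_index: [num_segments] integer cluster labels."""
    avg_embs = {}
    for label in torch.unique(clus_label_index):
        mask = clus_label_index == label
        # mean over all segments assigned to this speaker cluster
        avg_embs[int(label)] = embs[mask].mean(dim=0)
    return avg_embs

# Toy example: 10 segments, 192-dim (TitaNet-like) embeddings, 2 speakers
embs = torch.randn(10, 192)
labels = torch.tensor([0, 0, 1, 0, 1, 1, 0, 1, 0, 1])
speaker_embs = average_cluster_embeddings(embs, labels)
```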
-
Check out this response: #8171
-
For my use case I would like to perform speaker diarization on an audio file, extract the speaker embeddings, and compare them to a database where I have saved speaker embeddings of known speakers. What is the best way to do this? From my understanding it would not make sense to take the base-scale embeddings from TitaNet, because on their own they are a poor representation of the speaker.
I imagined a workflow where I read in the audio file, run the complete speaker diarization pipeline (VAD -> TitaNet -> MSDD), and then access the speaker embedding of every labeled speaker (for example speaker 0, speaker 1, speaker 2). Ideally I would have one embedding per speaker, compare it to the embeddings in my database, and change the label accordingly.
I just do not know how to get these embeddings, since the embeddings from the different scales are clustered and combined into an average-clustered embedding according to the scale weights and more (also see: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/speaker_diarization/models.html).
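For the matching step against the database, a plain cosine-similarity comparison between the per-speaker average embeddings and the stored reference embeddings is usually sufficient. The sketch below assumes both sides are torch tensors of the same dimension; the dictionary layout and the 0.6 threshold are assumptions to adapt, not anything prescribed by NeMo.

```python
# Hypothetical sketch: relabel diarized speakers by nearest database entry (cosine similarity).
import torch
import torch.nn.functional as F

def match_speakers(diarized: dict, database: dict, threshold: float = 0.6) -> dict:
    """diarized: {'speaker_0': tensor, ...}; database: {'alice': tensor, ...} (same emb_dim)."""
    mapping = {}
    for spk_label, emb in diarized.items():
        best_name, best_score = None, -1.0
        for name, ref_emb in database.items():
            score = F.cosine_similarity(emb.unsqueeze(0), ref_emb.unsqueeze(0)).item()
            if score > best_score:
                best_name, best_score = name, score
        # keep the anonymous diarization label if nothing in the database is similar enough
        mapping[spk_label] = best_name if best_score >= threshold else spk_label
    return mapping
```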
I would appreciate any help from you guys :) Best regards, Aaron