Ideation: Exploring New Metrics for Speaker Diarization in Unlabeled Audio Contexts #68
Comments
This sounds like an interesting idea to investigate. Do you have experiments backing the claim that this new metric is a good indicator of the actual performance of a diarization system? I would love to know more...
I've expanded my testing to include VoxConverse_test (0.3) and three internal datasets, applying the "pyannote/speaker-diarization-3.1" model with and without the …

Regarding the metric itself, I initially adapted a form of Mean Squared Error (MSE) for simplicity. The formulas are as follows:
For Speaker Number Difference, we have two versions:
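As a rough sketch (the exact variants and normalization are open to discussion), with $S_i$ the reference number of speakers in file $i$, $\hat{S}_i$ the number predicted by clustering, and $N$ the number of files, I have in mind something along the lines of an absolute-difference version and a squared-difference (MSE) version:

$$
\mathrm{SND}_{\mathrm{abs}} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{S}_i - S_i\right|,
\qquad
\mathrm{SND}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{S}_i - S_i\right)^2
$$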
I will share measured DER, JER, and SND numbers soon.
Thanks to @upskyy (my co-worker), I've added the … column. Upon analyzing the metrics, we noticed that while the DER for AVA-AVD was unexpectedly high, indicating a need for further examination, other datasets such as AISHELL-4, AliMeeting, DIHARD 3, Earnings21, and MSDWild showed promising results. In general, when DER/JER is high, SND is also high.
Sorry for the delay in coming back to this. Can you please explain the difference between v3.1 and v3.1 (ours)? I thought the point was to show that the new proposed metric correlated with the legacy DER metric. But those numbers do not really show that. We would need to add SND for v3.1 (original) as well... Or did I miss something?
Sorry for any confusion in my previous messages. To clarify, my intention was to demonstrate that the DER metric is correlated with the newly proposed SND metric when using the original v3.1 version of pyannote. The numbers listed under "pyannote v3.1 (by Ours)" were intended to show the evaluation results of both DER and SND across all datasets, to ensure reproducibility and a comprehensive understanding. For clarity, I modified it from …
I am sorry but there is still something that I don't understand. Why don't 3.1 and 3.1 (yours) reach the same DER values?
I'm not entirely sure why there is a discrepancy between the DER values of 3.1 and 3.1 (by me). I used the latest pyannote library and recently organized datasets such as AISHELL-4 and AVA-AVD, which I downloaded from the official repositories. While the DER trends between 3.1 and 3.1 (by me) seem generally similar, I too am curious about the reasons for these differences. There might be subtle variations between the two setups, but since the overall trends in DER are similar, I felt that reporting SND and DER together would make communication clearer, hence my decision to share all metrics. Here are the versions of the libraries in my environment:
The outputs I computed are available here: https://huggingface.co/pyannote/speaker-diarization-3.1/tree/main/reproducible_research. Can you please compare with yours?
Comparing against reproducible_research, I found some errors in the table above: it was evaluated with "skip_overlap=True". I am sorry for confusing you. The table below has been recomputed.
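For context, the two settings differ only in whether overlapped speech regions are scored; a minimal sketch with pyannote.metrics:

```python
from pyannote.metrics.diarization import DiarizationErrorRate

# what the first table accidentally used: overlapped speech regions are excluded
der_skip_overlap = DiarizationErrorRate(collar=0.0, skip_overlap=True)

# the intended setup: overlapped speech regions are scored as well
der_full = DiarizationErrorRate(collar=0.0, skip_overlap=False)

# reference / hypothesis would be pyannote.core.Annotation objects, e.g. loaded
# from RTTM files with pyannote.database.util.load_rttm:
# print(der_skip_overlap(reference, hypothesis), der_full(reference, hypothesis))
```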
By the way, AVA-AVD's RTTM files have a different number of segments. I downloaded AVA-AVD recently from https://github.com/zcxu-eric/AVA-AVD/tree/main/dataset using the provided script. For example, record id … . So, except for AVA-AVD, I think I'm finally on the same page with you.
That's reassuring, thanks :) I have created scatter plots with your numbers to get a better (visual) idea. That being said, my understanding (tell me if I am wrong) is that you would like to use SND to compare diarization systems when labels are not available. So, instead of scatter plots across datasets, one should rather do those plots across systems (e.g. pyannote 2.1 vs. pyannote 3.1 vs. speechbrain's vs. NeMo) and see if the correlation still holds. Also, I suspect SND_x metrics are correlated with speaker confusion error rates rather than with the more global DER. Would it be possible for you to draw those plots as a function of confusion instead of DER? This seems promising indeed.
Yes, that's correct. :) Although I haven't worked with SpeechBrain, I've tested all metrics using …
I attached the "false alarm" / "missed detection" / "confusion" scatter plots.
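For reproducibility, this is roughly how I pulled the per-component rates out of pyannote.metrics (component names as I understand them; the RTTM paths and file URI below are placeholders):

```python
from pyannote.database.util import load_rttm
from pyannote.metrics.diarization import DiarizationErrorRate

# placeholders: reference and hypothesis RTTMs for one recording
reference = load_rttm("reference.rttm")["file_id"]
hypothesis = load_rttm("hypothesis.rttm")["file_id"]

metric = DiarizationErrorRate(collar=0.0, skip_overlap=False)
components = metric(reference, hypothesis, detailed=True)

total = components["total"]  # total duration of reference speech (seconds)
print("false alarm      :", components["false alarm"] / total)
print("missed detection :", components["missed detection"] / total)
print("confusion        :", components["confusion"] / total)
print("DER              :", components["diarization error rate"])
```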
Thanks. You said you attached false alarm and missed detection plots as well but I don't see them.
Sorry, I added the two missing images. 😹
Thanks. So, as expected, SND correlates with speaker confusion, but not really with false alarm or missed detection. Overall, it means that it can be used for tuning clustering hyperparameters, but not so much for training segmentation.
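To make that concrete, a minimal sketch of what SND-driven clustering tuning could look like (the parameter layout assumes the pyannote/speaker-diarization-3.1 pipeline config, and the audio files with their reference speaker counts are hypothetical):

```python
from pyannote.audio import Pipeline

# may require use_auth_token=... depending on your Hugging Face setup
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")

# hypothetical: data where only the per-file speaker count is known
ref_num_speakers = {"meeting1.wav": 4, "call2.wav": 2}

best = None
for threshold in [0.5, 0.6, 0.7, 0.8]:
    # assumed parameter layout of the 3.1 pipeline (see its config.yaml)
    pipeline.instantiate({
        "segmentation": {"min_duration_off": 0.0},
        "clustering": {"method": "centroid", "min_cluster_size": 12, "threshold": threshold},
    })
    snd = sum(
        (len(pipeline(audio).labels()) - n_ref) ** 2
        for audio, n_ref in ref_num_speakers.items()
    ) / len(ref_num_speakers)
    if best is None or snd < best[1]:
        best = (threshold, snd)

print("best clustering threshold according to SND:", best)
```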
Description
Hello @hbredin and the pyannote community,
I hope this is the appropriate forum for this discussion; I approach with a bit of caution but much enthusiasm.
I've been extensively using pyannote, including pyannote-core, and have even contributed PRs focused on enhancing performance. Today, I'd like to share an idea for a new metric that I've been using in practice, hoping to spark some discussion and gather your thoughts.
While DER and JER are incredibly valuable metrics, they assume the presence of labeled data for speaker diarization. My recent work involves handling large volumes of unlabeled audio, prompting me to think about how we can leverage this data more effectively for speaker diarization.
An insight struck me: by using the "number of speakers" present in an audio file as a lightweight label and comparing this reference count to the number of speakers predicted through clustering, we can arrive at a fairly robust metric (e.g., an MSE over speaker counts). This metric could serve as a valuable tool, especially in scenarios where detailed labeling is unfeasible or unavailable.
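As a minimal sketch of the idea (the file names and reference counts below are purely illustrative, and this is not an existing pyannote API):

```python
from pyannote.audio import Pipeline

# illustrative only: any metadata source giving a per-file speaker count works,
# even when no frame-level diarization labels exist
ref_counts = {"meeting1.wav": 4, "call2.wav": 2}

# may require use_auth_token=... depending on your Hugging Face setup
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")

squared_errors = []
for audio, n_ref in ref_counts.items():
    diarization = pipeline(audio)          # pyannote.core.Annotation
    n_pred = len(diarization.labels())     # predicted number of speakers
    squared_errors.append((n_pred - n_ref) ** 2)

snd_mse = sum(squared_errors) / len(squared_errors)
print("SND (MSE over speaker counts):", snd_mse)
```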
I'm considering crafting a pull request to introduce this concept into pyannote and would love to hear your thoughts, feedback, and any insights you might have. Would this be of interest to the community? How might we refine or expand upon this idea to make it even more useful?