Ideation: Exploring New Metrics for Speaker Diarization in Unlabeled Audio Contexts #68
Comments
This sounds like an interesting idea to investigate. Do you have experiments backing the claim that this new metric is a good indicator of the actual performance of a diarization system? I would love to know more...
I've expanded my testing to include VoxConverse_test (0.3) and three internal datasets, applying the "pyannote/speaker-diarization-3.1" model with and without the …

Regarding the metric itself, I initially adapted a form of Mean Squared Error (MSE) for simplicity. The formulas are as follows:
For Speaker Number Difference, we have two versions:
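As a rough sketch (the exact variants and normalization are open to discussion), with $S_i$ the reference number of speakers in file $i$, $\hat{S}_i$ the number predicted by clustering, and $N$ the number of files, I have in mind something along the lines of an absolute-difference version and a squared-difference (MSE) version:

$$
\mathrm{SND}_{\mathrm{abs}} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{S}_i - S_i\right|,
\qquad
\mathrm{SND}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{S}_i - S_i\right)^2
$$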
I will share measured DER, JER, and SND numbers soon.
Thanks to @upskyy (my co-worker), I've added the … column. Upon analyzing the metrics, we noticed that while the DER for AVA-AVD was unexpectedly high, indicating a need for further examination, other datasets such as AISHELL-4, AliMeeting, DIHARD 3, Earnings21, and MSDWild showed promising results. In general, when DER/JER is high, SND is also high.
Sorry for the delay in coming back to this. Can you please explain the difference between v3.1 and v3.1 (ours)? I thought the point was to show that the new proposed metric correlated with the legacy DER metric. But those numbers do not really show that. We would need to add SND for v3.1 (original) as well... Or did I miss something?
Sorry for any confusion in my previous messages. To clarify, my intention was to demonstrate that the DER metric is correlated with the newly proposed SND metric when using the original v3.1 version of pyannote. The numbers listed under "pyannote v3.1 (by Ours)" were intended to show the evaluation results of both DER and SND across all datasets, to ensure reproducibility and a comprehensive understanding. For clarity, I modified it from …
I am sorry but there is still something that I don't understand. Why don't 3.1 and 3.1 (yours) reach the same DER values?
I'm not entirely sure why there is a discrepancy between the DER values of 3.1 and 3.1 (by me). I used the latest pyannote library and recently organized datasets such as AISHELL-4 and AVA-AVD, which I downloaded from the official repositories. While the DER trends between 3.1 and 3.1 (by me) seem generally similar, I too am curious about the reasons for these differences. There might be subtle variations between the two setups, but since the overall trends in DER are similar, I felt that reporting SND and DER together would make communication clearer, hence my decision to share all metrics. Here are the versions of the libraries in my environment:
The outputs I computed are available here: https://huggingface.co/pyannote/speaker-diarization-3.1/tree/main/reproducible_research. Can you please compare with yours?
Comparing against reproducible_research, I found some errors in the table above: it was evaluated with "skip_overlap=True". I am sorry for confusing you. The table below has been recomputed.
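For context, the two settings differ only in whether overlapped speech regions are scored; a minimal sketch with pyannote.metrics:

```python
from pyannote.metrics.diarization import DiarizationErrorRate

# what the first table accidentally used: overlapped speech regions are excluded
der_skip_overlap = DiarizationErrorRate(collar=0.0, skip_overlap=True)

# the intended setup: overlapped speech regions are scored as well
der_full = DiarizationErrorRate(collar=0.0, skip_overlap=False)

# reference / hypothesis would be pyannote.core.Annotation objects, e.g. loaded
# from RTTM files with pyannote.database.util.load_rttm:
# print(der_skip_overlap(reference, hypothesis), der_full(reference, hypothesis))
```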
By the way, AVA-AVD's RTTM files have a different number of segments. I downloaded AVA-AVD recently from https://github.com/zcxu-eric/AVA-AVD/tree/main/dataset using the provided script. For example, record id … . So, except for AVA-AVD, I think I'm finally on the same page with you.
That's reassuring, thanks :) I have created scatter plots with your numbers to get a better (visual) idea. That being said, my understanding (tell me if I am wrong) is that you would like to use SND to compare diarization systems when labels are not available. So, instead of scatter plots across datasets, one should rather do those plots across systems (e.g. pyannote 2.1 vs. pyannote 3.1 vs. speechbrain's vs. NeMo) and see if the correlation still holds. Also, I suspect SND_x metrics are correlated with speaker confusion error rates rather than with the more global DER. Would it be possible for you to draw those plots as a function of confusion instead of DER? This seems promising indeed.
Yes, that's correct. :) Although I haven't worked with SpeechBrain, I've tested all metrics using …
I attached the "false alarm" / "missed detection" / "confusion" scatter plots.
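For reproducibility, this is roughly how I pulled the per-component rates out of pyannote.metrics (component names as I understand them; the RTTM paths and file URI below are placeholders):

```python
from pyannote.database.util import load_rttm
from pyannote.metrics.diarization import DiarizationErrorRate

# placeholders: reference and hypothesis RTTMs for one recording
reference = load_rttm("reference.rttm")["file_id"]
hypothesis = load_rttm("hypothesis.rttm")["file_id"]

metric = DiarizationErrorRate(collar=0.0, skip_overlap=False)
components = metric(reference, hypothesis, detailed=True)

total = components["total"]  # total duration of reference speech (seconds)
print("false alarm      :", components["false alarm"] / total)
print("missed detection :", components["missed detection"] / total)
print("confusion        :", components["confusion"] / total)
print("DER              :", components["diarization error rate"])
```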
Thanks. You said you attached false alarm and missed detection plots as well but I don't see them.
Sorry, I added the two missing images. 😹
Thanks. So, as expected, SND correlates with speaker confusion, but not really with false alarm or missed detection. Overall, it means that it can be used for tuning clustering hyperparameters, but not so much for training segmentation.
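To make that concrete, a minimal sketch of what SND-driven clustering tuning could look like (the parameter layout assumes the pyannote/speaker-diarization-3.1 pipeline config, and the audio files with their reference speaker counts are hypothetical):

```python
from pyannote.audio import Pipeline

# may require use_auth_token=... depending on your Hugging Face setup
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")

# hypothetical: data where only the per-file speaker count is known
ref_num_speakers = {"meeting1.wav": 4, "call2.wav": 2}

best = None
for threshold in [0.5, 0.6, 0.7, 0.8]:
    # assumed parameter layout of the 3.1 pipeline (see its config.yaml)
    pipeline.instantiate({
        "segmentation": {"min_duration_off": 0.0},
        "clustering": {"method": "centroid", "min_cluster_size": 12, "threshold": threshold},
    })
    snd = sum(
        (len(pipeline(audio).labels()) - n_ref) ** 2
        for audio, n_ref in ref_num_speakers.items()
    ) / len(ref_num_speakers)
    if best is None or snd < best[1]:
        best = (threshold, snd)

print("best clustering threshold according to SND:", best)
```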
Description
Hello @hbredin and the pyannote community,
I hope this is the appropriate forum for this discussion; I approach with a bit of caution but much enthusiasm.
I've been extensively using pyannote, including pyannote-core, and have even contributed PRs focused on enhancing performance. Today, I'd like to share an idea for a new metric that I've been using in practice, hoping to spark some discussion and gather your thoughts.
While DER and JER are incredibly valuable metrics, they assume the presence of labeled data for speaker diarization. My recent work involves handling large volumes of unlabeled audio, prompting me to think about how we can leverage this data more effectively for speaker diarization.
An insight struck me: by using the "number of speakers" present in an audio file as a lightweight label and comparing this reference count to the number of speakers predicted through clustering, we can arrive at a fairly robust metric (e.g., an MSE over speaker counts). This metric could serve as a valuable tool, especially in scenarios where detailed labeling is unfeasible or unavailable.
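As a minimal sketch of the idea (the file names and reference counts below are purely illustrative, and this is not an existing pyannote API):

```python
from pyannote.audio import Pipeline

# illustrative only: any metadata source giving a per-file speaker count works,
# even when no frame-level diarization labels exist
ref_counts = {"meeting1.wav": 4, "call2.wav": 2}

# may require use_auth_token=... depending on your Hugging Face setup
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")

squared_errors = []
for audio, n_ref in ref_counts.items():
    diarization = pipeline(audio)          # pyannote.core.Annotation
    n_pred = len(diarization.labels())     # predicted number of speakers
    squared_errors.append((n_pred - n_ref) ** 2)

snd_mse = sum(squared_errors) / len(squared_errors)
print("SND (MSE over speaker counts):", snd_mse)
```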
I'm considering crafting a pull request to introduce this concept into pyannote and would love to hear your thoughts, feedback, and any insights you might have. Would this be of interest to the community? How might we refine or expand upon this idea to make it even more useful?