Diarization (speaker turn recognition) sometimes happens unexpectedly? (medium.en model) #1810

mrienstra · 2024-01-25T22:28:49Z

This is probably an upstream "issue", and it's not a problem per se, more just something unexpected.

@khimaros commented on Dec 1, 2023:

i'm not sure if this is expected, but with medium.en-q5_0, i'm seeing that speaker turns are pretty reliably marked with >>. i'm not using the --diarize or --tdrz flags.

i wasn't seeing this behavior with large-v2, large-v3, or large-v3-q5_0. any thoughts on why that would be happening?

I was curious and tried reproducing this with the a13.wav sample, which make samples downloads from https://upload.wikimedia.org/wikipedia/commons/transcoded/6/6f/Apollo13-wehaveaproblem.ogg/Apollo13-wehaveaproblem.ogg.mp3 (source page: https://commons.wikimedia.org/wiki/File:Apollo13-wehaveaproblem.ogg).

No diarization using: tiny, tiny.en, base, base.en, small, small.en, medium, large-v1, large-v2, large-v2-q5_0, large-v3-q5_0.

Diarization using: medium.en, medium.en-q5_0.

Using the latest from master, 1cf679d. M1 macOS.

[00:00:00.000 --> 00:00:07.000] SC Okay Houston, we've had a problem here.
[00:00:07.000 --> 00:00:12.000] CAPCOM This is Houston. Say again please.
[00:00:12.000 --> 00:00:15.000] SC Houston, we've had a problem. We've had a main B plus 100 volts.
[00:00:15.000 --> 00:00:20.000] CAPCOM Roger. Main B, 100 volts. Okay, standby 13. We're looking at it.
[00:00:20.000 --> 00:00:29.000] SC Okay. Right now, Houston, the voltage is looking good. And we had a pretty large bang
[00:00:29.000 --> 00:00:36.000] associated with the caution and warning amp. And as I recall, main B was the one that had
[00:00:36.000 --> 00:00:39.000] a amp spike on it once before.
[00:00:39.000 --> 00:00:42.000] CAPCOM Roger, Fred.
[00:00:42.000 --> 00:00:48.000] SC And the interim air, we're starting to go ahead and button up the tunnel again.
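
For anyone comparing models on their own audio, here is a minimal Python sketch for checking whether a transcript in the timestamped format above carries leading speaker labels. It is not part of whisper.cpp; the function names (split_speaker_labels, has_speaker_labels) are hypothetical, and the heuristic (a leading run of two or more capital letters such as SC or CAPCOM counts as a label) is an assumption that can misfire on text that merely begins with an acronym.

```python
import re

# Matches lines like "[00:00:00.000 --> 00:00:07.000] some text".
SEGMENT_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\]\s*(.*)$"
)
# Assumption: a speaker label is a leading all-caps token of 2+ letters.
LABEL_RE = re.compile(r"^([A-Z]{2,})\s+(.*)$")


def split_speaker_labels(transcript):
    """Return (start, end, speaker_or_None, text) for each timestamped line."""
    segments = []
    for line in transcript.splitlines():
        match = SEGMENT_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a timestamped segment
        start, end, text = match.groups()
        label_match = LABEL_RE.match(text)
        if label_match:
            speaker, text = label_match.groups()
        else:
            speaker = None  # continuation line or unlabeled segment
        segments.append((start, end, speaker, text))
    return segments


def has_speaker_labels(transcript):
    """True if at least one segment starts with something that looks like a label."""
    return any(seg[2] is not None for seg in split_speaker_labels(transcript))


SAMPLE = """\
[00:00:00.000 --> 00:00:07.000] SC Okay Houston, we've had a problem here.
[00:00:07.000 --> 00:00:12.000] CAPCOM This is Houston. Say again please.
[00:00:20.000 --> 00:00:29.000] SC Okay. Right now, Houston, the voltage is looking good. And we had a pretty large bang
[00:00:29.000 --> 00:00:36.000] associated with the caution and warning amp. And as I recall, main B was the one that had
"""

if __name__ == "__main__":
    for start, end, speaker, text in split_speaker_labels(SAMPLE):
        print(start, end, speaker, text[:40])
```

Running has_speaker_labels over the saved output of each model would give a quick yes/no per model, though a manual look is still worthwhile given how crude the heuristic is.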

I found this:

They may exhibit additional capabilities, particularly if fine-tuned on certain tasks like voice activity detection, speaker classification, or speaker diarization but have not been robustly evaluated in these areas.

-- https://github.com/openai/whisper/blob/main/model-card.md#evaluated-use

... So maybe this is just a weird case, perhaps the medium.en model was trained on that audio sample + a transcript? Wouldn't be too surprising. There are a number of transcripts that use the same speaker identifiers (SC = Spacecraft & CAPCOM = Capsule Communication), e.g. https://nssdc.gsfc.nasa.gov/planetary/lunar/apollo13.pdf

Mostly creating this just to have a placeholder for the topic, as I haven't encountered other discussions. I do recall reading something about how Whisper is trained to suppress this sort of thing... Oh yeah, here we go: openai/whisper#854
