Diarization (speaker turn recognition) sometimes happens unexpectedly? (medium.en model) #1810

Open
mrienstra opened this issue Jan 25, 2024 · 0 comments

@mrienstra
Contributor

This is probably an upstream "issue", and it's not a problem per se, more just something unexpected.

@khimaros commented on Dec 1, 2023:

i'm not sure if this is expected, but with medium.en-q5_0, i'm seeing that speaker turns are pretty reliably marked with >>. i'm not using the --diarize or --tdrz flags.

i wasn't seeing this behavior with large-v2, large-v3, or large-v3-q5_0. any thoughts on why that would be happening?

I was curious and tried reproducing this with the a13.wav sample, which make samples downloads from https://upload.wikimedia.org/wikipedia/commons/transcoded/6/6f/Apollo13-wehaveaproblem.ogg/Apollo13-wehaveaproblem.ogg.mp3 (https://commons.wikimedia.org/wiki/File:Apollo13-wehaveaproblem.ogg).

No diarization using: tiny, tiny.en, base, base.en, small, small.en, medium, large-v1, large-v2, large-v2-q5_0, large-v3-q5_0.

Diarization using: medium.en, medium.en-q5_0.

Using the latest from master, 1cf679d. M1 macOS.

[00:00:00.000 --> 00:00:07.000] SC Okay Houston, we've had a problem here.
[00:00:07.000 --> 00:00:12.000] CAPCOM This is Houston. Say again please.
[00:00:12.000 --> 00:00:15.000] SC Houston, we've had a problem. We've had a main B plus 100 volts.
[00:00:15.000 --> 00:00:20.000] CAPCOM Roger. Main B, 100 volts. Okay, standby 13. We're looking at it.
[00:00:20.000 --> 00:00:29.000] SC Okay. Right now, Houston, the voltage is looking good. And we had a pretty large bang
[00:00:29.000 --> 00:00:36.000] associated with the caution and warning amp. And as I recall, main B was the one that had
[00:00:36.000 --> 00:00:39.000] a amp spike on it once before.
[00:00:39.000 --> 00:00:42.000] CAPCOM Roger, Fred.
[00:00:42.000 --> 00:00:48.000] SC And the interim air, we're starting to go ahead and button up the tunnel again.
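
For anyone who wants to check other models or samples for the same behavior, here is a rough Python sketch that scans transcript output in the format above and counts lines that start with something that looks like a speaker label. The function name `find_speaker_labels` and the "leading all-caps token" heuristic are my own assumptions, not part of whisper.cpp, and the heuristic will produce false positives on lines that simply begin with an all-caps word.

```python
import re
from collections import Counter

# Matches one transcript line of the form "[start --> end] text".
LINE_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\]\s*(.*)$"
)

# Assumption: a speaker label is a leading token of two or more capital
# letters followed by whitespace (e.g. "SC", "CAPCOM"). This is a heuristic.
LABEL_RE = re.compile(r"^([A-Z]{2,})\s+(.*)$")


def find_speaker_labels(transcript: str) -> Counter:
    """Count candidate speaker labels at the start of transcript segments."""
    counts = Counter()
    for raw in transcript.splitlines():
        line = LINE_RE.match(raw.strip())
        if not line:
            continue  # skip anything that is not a timestamped segment
        label = LABEL_RE.match(line.group(3))
        if label:
            counts[label.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "[00:00:00.000 --> 00:00:07.000] SC Okay Houston, we've had a problem here.\n"
        "[00:00:07.000 --> 00:00:12.000] CAPCOM This is Houston. Say again please.\n"
        "[00:00:36.000 --> 00:00:39.000] a amp spike on it once before.\n"
    )
    print(find_speaker_labels(sample))
```

An empty counter for one model and a non-empty one for another on the same audio would be a quick way to spot which models emit labels.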

I found this:

They may exhibit additional capabilities, particularly if fine-tuned on certain tasks like voice activity detection, speaker classification, or speaker diarization but have not been robustly evaluated in these areas.

-- https://github.com/openai/whisper/blob/main/model-card.md#evaluated-use

So maybe this is just a special case: perhaps the medium.en model was trained on that audio sample plus a transcript? That wouldn't be too surprising. A number of transcripts use the same speaker identifiers (SC = Spacecraft, CAPCOM = Capsule Communicator), e.g. https://nssdc.gsfc.nasa.gov/planetary/lunar/apollo13.pdf

Mostly creating this just to have a placeholder for the topic, as I haven't encountered other discussions. I do recall reading something about how Whisper is trained to suppress this sort of thing... Oh yeah, here we go: openai/whisper#854
