Replies: 6 comments 26 replies
-
This problem exists in both v3 and v2 and comes from Whisper hallucinating, especially when it encounters periods of silence or non-speech, where it can also get stuck in a repetition loop like this. The hallucinations likely just occur in different places in v2 and v3, but they happen in both. If you like whisper.cpp, you could file a feature request to mitigate this by preprocessing the file to cut out the silent or non-speech parts first, then re-inserting those spans into the timestamps in postprocessing. Other projects already provide this feature, such as stable-ts (if you search the discussion board for "hallucinations", you can find several more projects that work on improving the problem).
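The timestamp bookkeeping described above (cut silence before transcribing, then map the timestamps back) can be sketched in a few lines. This is a minimal illustration, not code from any of the projects mentioned; the function name and the `(start, end)` interval representation are my own assumptions.

```python
def restore_timestamps(t, removed):
    """Map a timestamp t (in seconds) from the silence-trimmed audio back
    to the original timeline.

    removed: sorted, non-overlapping (start, end) silent spans, in seconds,
    measured on the ORIGINAL audio, that were cut out before transcription.
    (Both names are illustrative, not from any real library.)
    """
    offset = 0.0
    for start, end in removed:
        # A removed span shifts t only if it begins at or before the
        # original-timeline position we have mapped to so far.
        if start <= t + offset:
            offset += end - start
        else:
            break
    return t + offset
```

Running every Whisper timestamp through this after transcribing the trimmed file restores subtitle timings that line up with the original recording.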
-
Thank you for the info. I'll investigate that. In this case, though, I don't believe it's hallucination. To me, it's quite obvious that the training data is broken.
-
Does the training data contain any "negative" examples in an attempt to suppress hallucination? E.g., non-speech sounds such as applause, instrumental music, silence, chalk on a blackboard, or wind, with the desired transcription being no text at all.
-
Hi, would it be possible to share the audio file or a link to the talk show you're transcribing? We're aware that hallucination is quite annoying, but we can try some hyperparameter tuning, which might help, as @glangford mentioned.
-
I have a similar problem, but with audio files that contain no sound. More details at: #1606 (comment)
-
Yes, I'm using faster-whisper! Sorry for the confusion. faster-whisper uses the Silero VAD filter.
-
I was glad to hear that OpenAI released Whisper large-v3 and tried it right after it went online. However, I found the models are slightly poisoned.
I usually batch-transcribe talk shows in English or Mandarin on an MBP M2 Max. Some of them are 4-5 hours long, so the official Whisper is too slow. Instead I use whisper.cpp, which is insanely fast (10+ times faster).
After upgrading to large-v3, I found the SRT files are full of the following ads. As you can see, the ad repeats every 20 seconds forever. The ad text literally says to subscribe to an online channel.
I struggled for quite some time and thought it might be a whisper.cpp problem, so I switched to the official Whisper. Unfortunately, the result is the same, except it's 10+ times slower.
I had to revert to large-v2. The result is much better, but still full of ads at the beginning of each talk show, as follows.
The pattern of the ads seems clear to me. Many subtitle editors like to embed ads at the front of videos, and these videos usually open with silence or soft music. The Whisper models learned that silence or soft music should be transcribed as those ads.
Thus, I conclude that the Whisper models are poisoned by their training data. I hope OpenAI addresses and fixes this issue.
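Until the models themselves are fixed, the "same ad every 20 seconds forever" symptom can be blunted with a simple postprocessing pass that collapses runs of identical subtitle cues. This is only a rough sketch under my own assumptions (cues as `(start, end, text)` tuples, a made-up `max_repeats` threshold), not a feature of whisper.cpp or the official Whisper.

```python
def collapse_repeats(cues, max_repeats=2):
    """Drop subtitle cues once the same text has repeated more than
    max_repeats times in a row -- a crude guard against Whisper's
    looping hallucinations.

    cues: list of (start, end, text) tuples in playback order.
    Returns a new list with the excess repeats removed.
    """
    out = []
    run_text, run_len = None, 0
    for start, end, text in cues:
        key = text.strip().lower()
        if key == run_text:
            run_len += 1          # same text as the previous cue
        else:
            run_text, run_len = key, 1  # new text starts a new run
        if run_len <= max_repeats:
            out.append((start, end, text))
    return out
```

A run-length guard like this only hides the symptom; genuinely new speech that happens to repeat a phrase resets the counter, but long hallucination loops are truncated after the first couple of occurrences.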