Replies: 6 comments 26 replies
-
This problem exists in both v3 and v2 and comes from Whisper hallucinating, especially when it encounters periods of silence or non-speech, where it can also get stuck in a repetition loop like this. The hallucinations likely just occur in different places in v2 and v3, but they happen in both. If you like whisper.cpp, you could file a feature request to mitigate this by preprocessing the file to cut out the silent or non-speech parts first, then re-inserting those spans into the timestamps in postprocessing. Other projects already provide this feature, such as stable-ts (if you search the discussion board for "hallucinations", you can find several more projects that work on improving the problem).
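The timestamp bookkeeping described above (cut silence before transcribing, then map the timestamps back) can be sketched in a few lines. This is a minimal illustration, not code from any of the projects mentioned; the function name and the `(start, end)` interval representation are my own assumptions.

```python
def restore_timestamps(t, removed):
    """Map a timestamp t (in seconds) from the silence-trimmed audio back
    to the original timeline.

    removed: sorted, non-overlapping (start, end) silent spans, in seconds,
    measured on the ORIGINAL audio, that were cut out before transcription.
    (Both names are illustrative, not from any real library.)
    """
    offset = 0.0
    for start, end in removed:
        # A removed span shifts t only if it begins at or before the
        # original-timeline position we have mapped to so far.
        if start <= t + offset:
            offset += end - start
        else:
            break
    return t + offset
```

Running every Whisper timestamp through this after transcribing the trimmed file restores subtitle timings that line up with the original recording.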
-
Thank you for the info. I'll investigate that. In this case, though, I don't believe it's hallucination. To me, it's quite obvious that the training data is broken.
-
Does the training data contain any "negative" examples in an attempt to suppress hallucination? E.g., non-speech sounds such as applause, instrumental music, silence, chalk on a blackboard, or wind, with the desired transcription being no text at all.
-
Hi, would it be possible to share the audio file or a link to the talk show you're transcribing? We're aware that hallucination is quite annoying, but we can try some hyperparameter tuning, which might help, as @glangford mentioned.
-
I have a similar problem, but with audio files that contain no sound. More details at: #1606 (comment)
-
Yes, I'm using faster-whisper! Sorry for the confusion. faster-whisper uses the Silero VAD filter.
-
I was glad to hear that OpenAI released Whisper large-v3 and tried it right after it went online. However, I found the models are slightly poisoned.
I usually batch-transcribe talk shows in English or Mandarin on an MBP M2 Max. Some of them are 4-5 hours long, so the official Whisper is too slow. Instead I use whisper.cpp, which is insanely fast (10+ times faster).
After upgrading to large-v3, I found the SRT files are full of the following ads. As you can see, the ad repeats every 20 seconds forever. The ad text literally says to subscribe to an online channel.
I struggled for quite some time and thought it might be a whisper.cpp problem, so I switched to the official Whisper. Unfortunately, the result is the same, except it's 10+ times slower.
I had to revert to large-v2. The result is much better, but still full of ads at the beginning of each talk show, as follows.
The pattern of the ads seems clear to me. Many subtitle editors like to embed ads at the front of videos, and these videos usually open with silence or soft music. The Whisper models learned that silence or soft music should be transcribed as those ads.
Thus, I conclude that the Whisper models are poisoned by their training data. I hope OpenAI addresses and fixes this issue.
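Until the models themselves are fixed, the "same ad every 20 seconds forever" symptom can be blunted with a simple postprocessing pass that collapses runs of identical subtitle cues. This is only a rough sketch under my own assumptions (cues as `(start, end, text)` tuples, a made-up `max_repeats` threshold), not a feature of whisper.cpp or the official Whisper.

```python
def collapse_repeats(cues, max_repeats=2):
    """Drop subtitle cues once the same text has repeated more than
    max_repeats times in a row -- a crude guard against Whisper's
    looping hallucinations.

    cues: list of (start, end, text) tuples in playback order.
    Returns a new list with the excess repeats removed.
    """
    out = []
    run_text, run_len = None, 0
    for start, end, text in cues:
        key = text.strip().lower()
        if key == run_text:
            run_len += 1          # same text as the previous cue
        else:
            run_text, run_len = key, 1  # new text starts a new run
        if run_len <= max_repeats:
            out.append((start, end, text))
    return out
```

A run-length guard like this only hides the symptom; genuinely new speech that happens to repeat a phrase resets the counter, but long hallucination loops are truncated after the first couple of occurrences.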