Enabling timestamps changes text/reduces accuracy #30815
cc @kamilakesbi as well!
Hey @jaggzh - thanks for reporting. This is actually the intended behaviour with Whisper. To understand why, recall that Whisper predicts the distribution over the next token conditioned on all previously generated tokens. When we decode without timestamps, we generate sequences with the following format:

<|startoftranscript|><|en|><|transcribe|><|notimestamps|> The transcribed text.<|endoftext|>

Note the task token at index 4: the <|notimestamps|> token tells the model not to predict timestamps. To decode with timestamps, we ensure that the <|notimestamps|> token is not generated, so the model interleaves timestamp tokens with the text:

<|startoftranscript|><|en|><|transcribe|><|0.00|> The transcribed text.<|5.00|><|endoftext|>

=> we can see here that the sequence of token ids changes in two ways: the <|notimestamps|> token disappears from the prompt, and timestamp tokens are inserted around each segment. Since the sequence of token ids changes, the distribution predicted for every subsequent token changes too, and so the decoded text itself can differ. Generally, what we observe is that enabling timestamps gives less accurate transcriptions for short-form audio, and more accurate transcriptions for long-form audio (whether you're using the chunked or sequential decoding algorithms).
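To see the two formats side by side, here's a minimal sketch; the checkpoint and dummy dataset are illustrative choices, not taken from this thread:

```python
from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

sample = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean",
                      split="validation")[0]["audio"]
inputs = processor(sample["array"], sampling_rate=16_000, return_tensors="pt")

# Without timestamps: the decoder prompt ends in <|notimestamps|>
plain_ids = model.generate(inputs.input_features, return_timestamps=False)
# With timestamps: <|notimestamps|> is dropped and timestamp tokens are generated
ts_ids = model.generate(inputs.input_features, return_timestamps=True)

print(processor.batch_decode(plain_ids, skip_special_tokens=False)[0])
print(processor.batch_decode(ts_ids, skip_special_tokens=False,
                             decode_with_timestamps=True)[0])
```

Printing both sequences with special tokens kept makes the different conditioning visible directly in the token stream.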
Closing the issue since it is in fact the intended behaviour from Whisper, but happy to answer any follow-up questions you have! Feel free to post on this comment thread 🤗
Thank you so much for the extremely helpful and detailed explanation! Nevertheless, since it's short-form disjoint speech, I began working on a project that does some nice automatic breaking-up of audio with auto-calibrated silence detection. That module operates as a generator function, returning each clip and its time offset, so I can use it in different projects (including my data prep or prediction code). With such short utterances, I'm able to get the timestamp of each clip, and that'll be sufficient for my needs.

(It's off topic, but in case anyone's interested (not that they'll see this closed issue)...) It handles evaluating a provided audio file (file only right now; it can't yet be used on a live audio stream). It examines a requested number of seconds of audio (a chunk) and, within that, small examination windows, taking each window's max amplitude. It considers the lowest of those as the noise floor. It then takes the max it heard (discarding some, per maxamp_discard_frac) and sets the acceptable signal (voice) level at a fraction between the floor and that max. The purpose was to adjust automatically, instead of using the fixed dB thresholds of many solutions I found.

If plotting, it uses my non-blocking key module (kbnb) -- that import can just be left out if you're not using it. Otherwise it's included in the gist, along with bansi.py for some perdy colors also used in the plotting. In any case, it's also a good example of matplotlib running and updating its window in the bg, non-blocking. :)
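In rough form, that calibration logic might look like the sketch below. This is hypothetical, not the actual gist; the names `split_on_silence`, `win_s`, and `thresh_frac` are invented for illustration, with only `maxamp_discard_frac` taken from the description above:

```python
# Hypothetical sketch: auto-calibrated silence splitting as a generator
# yielding (clip, start_offset_seconds) pairs.
import numpy as np

def split_on_silence(audio: np.ndarray, sr: int, win_s: float = 0.05,
                     maxamp_discard_frac: float = 0.1,
                     thresh_frac: float = 0.25):
    win = int(win_s * sr)
    n_wins = len(audio) // win
    # Max amplitude of each small examination window
    maxes = np.abs(audio[: n_wins * win]).reshape(n_wins, win).max(axis=1)

    floor = maxes.min()  # quietest window approximates the noise floor
    # Discard the loudest fraction of windows, then take the remaining peak
    kept = np.sort(maxes)[: max(1, int(n_wins * (1 - maxamp_discard_frac)))]
    peak = kept[-1]
    # Acceptable voice level: a fraction of the way from floor to peak
    thresh = floor + thresh_frac * (peak - floor)

    start = None
    for i, is_voice in enumerate(maxes >= thresh):
        if is_voice and start is None:
            start = i
        elif not is_voice and start is not None:
            yield audio[start * win : i * win], start * win / sr
            start = None
    if start is not None:
        yield audio[start * win :], start * win / sr
```

Each yielded offset can then be added to whatever per-clip result Whisper returns, which is what makes the per-clip timestamps sufficient here.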
I have a new idea, since timestamps are useful, and accuracy is useful. Two possible variations:
By using dynamic stripping, we can choose, on each pass, which timestamp tokens we keep, the idea being that the attention head(s) can match up enough of the audio features to transcription tokens to maintain next-token accuracy. When we expect a timestamp token, we can include a prior timestamp closer to the last token. (We could also attempt to force a timestamp or token prediction, as needed, with prefix_allowed_tokens_fn, for example. But this could either be optional or an experimental part of the algorithm -- or use an adjustable token spacing.)
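To illustrate just the constrained-decoding part of that idea, here is a hypothetical sketch using transformers' generic prefix_allowed_tokens_fn hook. The `timestamp_is_due` heuristic is invented, and the token-id layout (timestamp tokens sitting directly above <|notimestamps|> in the vocab) is an assumption about the multilingual Whisper vocabulary:

```python
# Hypothetical sketch: restrict generation to timestamp tokens at positions
# where we decide one is "due". timestamp_is_due is an invented placeholder.
import numpy as np
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Assumption: timestamp token ids start right after <|notimestamps|>
timestamp_begin = model.generation_config.no_timestamps_token_id + 1
timestamp_ids = list(range(timestamp_begin, model.config.vocab_size))
all_ids = list(range(model.config.vocab_size))

def timestamp_is_due(input_ids: torch.Tensor) -> bool:
    # Invented heuristic: demand a timestamp every 20 generated tokens
    return input_ids.shape[-1] % 20 == 0

def allowed_tokens(batch_id: int, input_ids: torch.Tensor) -> list:
    return timestamp_ids if timestamp_is_due(input_ids) else all_ids

inputs = processor(np.zeros(16_000), sampling_rate=16_000, return_tensors="pt")
# Untested interaction with Whisper's own timestamp logits processor
out = model.generate(inputs.input_features, return_timestamps=True,
                     prefix_allowed_tokens_fn=allowed_tokens)
```

As the comment notes, Whisper adds its own timestamp logits processor when return_timestamps=True, so how the two constraints interact would need experimentation.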
That's correct @jaggzh - the model is trained to predict timestamps to 0.02-second precision. See page 3 of the Whisper paper for details. Changing the precision of the timestamps is unlikely to get you any improvements in transcription accuracy. In fact, you risk lower timestamp accuracy as you generate, since you deviate away from the most probable predictions.

Regarding modifying the decoding algorithm: if you want to be able to predict timestamps at a given index, you have to change which tokens the model is allowed to generate there, which again changes the conditioning for every subsequent token.

One option to try and get the best of both worlds is what they do in Whisper-X - use Whisper for the transcriptions, but wav2vec2 for the timestamps.
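For reference, that two-stage Whisper-X flow looks roughly like this; a sketch based on the whisperx package's documented entry points, with exact signatures treated as approximate:

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.wav")  # placeholder path

# 1. Transcribe with Whisper
model = whisperx.load_model("large-v2", device)
result = model.transcribe(audio)

# 2. Re-align the transcript with a wav2vec2 alignment model
#    to recover precise word-level timestamps
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

print(result["segments"])  # segments with word-level start/end times
```

The transcription quality then comes from Whisper's unmodified (no-timestamps-style) decoding, while the timestamps come from the alignment model.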
I'm so sorry -- I'm referring to the accuracy of the transcription being maintained, not the timestamp accuracy being changed.
System Info
- `transformers` version: 4.40.2
- distributed_type: NO
- mixed_precision: no
- use_cpu: False
- debug: False
- num_processes: 1
- machine_rank: 0
- num_machines: 1
- gpu_ids: all
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
- dynamo_config: {'dynamo_backend': 'INDUCTOR'}
Who can help?
@sanchit-gandhi @ArthurZucker @younesbelkada
Information

Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction

1. Transcribe with `return_timestamps=True` in the generate() call
2. Transcribe again without `return_timestamps=True` and compare the text
Expected behavior
The text with and without timestamps "should" match, no? But with timestamps it somehow interferes, changing the text and, in this case, decreasing its accuracy.
This is a fine-tuned model, with a complex voice (patient whispers, breathing on a ventilator), and so far with insufficient data for better training. My point here is that I believe the model will therefore be more susceptible to influences that can deteriorate its recognition. However, my main question is: how does `generation(..., return_timestamps=True)` end up affecting the whole process?

My code (it's a bit of a mess as I experiment):
With generate()'s `return_timestamps=True`:

Predicted id [0] text: <|startoftranscript|><|en|><|transcribe|> There is a time... ...of a subconscious development. It don't work. Bureau work. The branch, the branch.<|endoftext|>
Predicted id [0] offsets: [{'text': ' There is a time...', 'timestamp': (0.0, 2.6)}, {'text': ' ...of a subconscious development.', 'timestamp': (14.6, 17.6)}, {'text': " It don't work.", 'timestamp': (20.6, 22.6)}, {'text': ' Bureau work.', 'timestamp': (23.400000000000002, 24.400000000000002)}, {'text': ' The branch, the branch.', 'timestamp': (25.6, 27.6)}]
Without generate()'s `return_timestamps=True`:

Predicted id [0] text: <|startoftranscript|><|en|><|transcribe|><|notimestamps|> there is it time... what is that chin? round one? you know what? the brown strap is... the brown strap is...<|endoftext|>
Predicted id [0] offsets: []
Full code below. (Please don't look at it unless you have to!)