Why are the generated timestamps different for different Whisper models? #183
I generated timestamps for an audio file with two different Whisper models. Although the output transcript is the same, the output timestamps are different (i.e. they don't align with each other). What could be the reason for this, and how can it be mitigated?
w/ tiny.pt
'I' -- (0.64,0.76)
'have' -- (0.76,1.04)
'a' -- (1.04,1.14)
'[*]' -- (1.14,1.32)
'meeting' -- (1.32,1.52)
'[*]' -- (1.52,1.78)
'tomorrow' -- (1.78,1.98)
'morning' -- (1.98,2.5)
'at' -- (2.5,2.96)
'10am.' -- (2.96,3.56)
w/ small.pt
'I' -- (0.6,0.8)
'have' -- (0.8,1.04)
'a' -- (1.04,1.18)
'meeting' -- (1.18,1.54)
'tomorrow' -- (1.54,2.04)
'[*]' -- (2.04,2.08)
'morning' -- (2.08,2.56)
'at' -- (2.56,2.9)
'10am.' -- (2.9,3.56)
'Please' -- (3.82,4.08)
'remind' -- (4.08,4.56)
'me.' -- (4.56,4.88)
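For context, a minimal sketch of how word timestamps like those above can be produced for several model sizes, assuming whisper_timestamped is used (the '[*]' markers in the output suggest the detect_disfluencies option was on); "audio.wav" is a placeholder path:

```python
# Minimal sketch: same audio, two model sizes, word-level timestamps.
# Assumes the whisper_timestamped package; "audio.wav" is a placeholder.
import whisper_timestamped as whisper

audio = whisper.load_audio("audio.wav")

for size in ("tiny", "small"):
    model = whisper.load_model(size, device="cpu")
    # detect_disfluencies=True produces the '[*]' markers seen above
    result = whisper.transcribe(model, audio, detect_disfluencies=True)
    print(f"w/ {size}.pt")
    for segment in result["segments"]:
        for word in segment["words"]:
            print(f"{word['text']!r} -- ({word['start']},{word['end']})")
```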
Replies: 1 comment
-
First, you are lucky that the transcription is similar between the two models. And even with exactly the same transcription, there is no reason why two different models should produce the same alignment: the timestamps are estimated from intermediate representations of the neural network (namely the cross-attention weights), and those representations differ from model to model. The only solution I can think of is to use a single external model to do the alignment. But it would be cumbersome to run two models instead of one, and when the two models (the one used for transcription and the one used for alignment) produce significantly different transcriptions, the quality of the alignment will be poor.
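To see why attention-based timestamps are model-dependent: whisper_timestamped derives word timings by running Dynamic Time Warping over the decoder's cross-attention weights, so a different model gives a different attention matrix and hence a different alignment path. Below is a toy illustration of that alignment step, not the library's actual implementation: the attention matrix is random, and Whisper's heuristics (attention-head selection, smoothing, sub-word merging) are omitted.

```python
# Toy DTW over a (tokens x frames) "cross-attention" matrix.
# Assumes ~20 ms per encoder frame, as in Whisper's audio encoder.
import numpy as np

def dtw_path(cost):
    """Cheapest monotonic path from (0, 0) to (n-1, m-1)."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    # Backtrack from the bottom-right corner
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

rng = np.random.default_rng(0)
attention = rng.random((5, 40))   # 5 tokens x 40 audio frames
cost = 1.0 - attention            # high attention = low cost
path = dtw_path(cost)

for token in range(5):
    frames = [f for t, f in path if t == token]
    start, end = 0.02 * frames[0], 0.02 * (frames[-1] + 1)
    print(f"token {token}: ({start:.2f},{end:.2f})")
```

Since the path depends entirely on the attention matrix, and tiny and small have different attention patterns, their paths (and thus their timestamps) need not match even when the text is identical.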
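And here is a sketch of the "single external alignment model" idea: whatever Whisper model produces the transcript, the timestamps always come from the same aligner, so they stay consistent across model sizes. This uses torchaudio's MMS_FA forced-alignment pipeline as one example of such an external model (assuming torchaudio >= 2.1; "audio.wav" is a placeholder, and the transcript must be normalized to lowercase letters with numbers spelled out):

```python
# Sketch of the mitigation: transcribe with any Whisper model, then always
# align with one fixed external model so timestamps match across sizes.
# Assumes torchaudio >= 2.1 (MMS_FA pipeline); "audio.wav" is a placeholder.
import torch
import torchaudio
from torchaudio.pipelines import MMS_FA as bundle

device = "cpu"
model = bundle.get_model(with_star=False).to(device)
tokenizer = bundle.get_tokenizer()
aligner = bundle.get_aligner()

waveform, sr = torchaudio.load("audio.wav")
waveform = waveform.mean(0, keepdim=True)  # downmix to mono
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

# Transcript from any Whisper model, normalized for the aligner's
# vocabulary (lowercase letters only, so "10am" becomes "ten a m").
words = "i have a meeting tomorrow morning at ten a m please remind me".split()

with torch.inference_mode():
    emission, _ = model(waveform.to(device))
    token_spans = aligner(emission[0], tokenizer(words))

# Convert frame indices to seconds
seconds_per_frame = waveform.size(1) / emission.size(1) / bundle.sample_rate
for word, spans in zip(words, token_spans):
    start = spans[0].start * seconds_per_frame
    end = spans[-1].end * seconds_per_frame
    print(f"{word!r} -- ({start:.2f},{end:.2f})")
```

The trade-off is the one described above: a second model has to run on every file, and if the transcript and the aligner's acoustic model disagree strongly, the timestamps will degrade.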