Why are the generated timestamps different for different Whisper models? #183
I generated timestamps for an audio file with two different Whisper models. Although the output transcript is the same, the output timestamps are different (i.e. they don't align with each other). What could be the reason for this, and how can it be mitigated?
w/ tiny.pt
'I' -- (0.64,0.76)
'have' -- (0.76,1.04)
'a' -- (1.04,1.14)
'[*]' -- (1.14,1.32)
'meeting' -- (1.32,1.52)
'[*]' -- (1.52,1.78)
'tomorrow' -- (1.78,1.98)
'morning' -- (1.98,2.5)
'at' -- (2.5,2.96)
'10am.' -- (2.96,3.56)
w/ small.pt
'I' -- (0.6,0.8)
'have' -- (0.8,1.04)
'a' -- (1.04,1.18)
'meeting' -- (1.18,1.54)
'tomorrow' -- (1.54,2.04)
'[*]' -- (2.04,2.08)
'morning' -- (2.08,2.56)
'at' -- (2.56,2.9)
'10am.' -- (2.9,3.56)
'Please' -- (3.82,4.08)
'remind' -- (4.08,4.56)
'me.' -- (4.56,4.88)
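For context, a minimal sketch of how word timestamps like those above can be produced for several model sizes, assuming whisper_timestamped is used (the '[*]' markers in the output suggest the detect_disfluencies option was on); "audio.wav" is a placeholder path:

```python
# Minimal sketch: same audio, two model sizes, word-level timestamps.
# Assumes the whisper_timestamped package; "audio.wav" is a placeholder.
import whisper_timestamped as whisper

audio = whisper.load_audio("audio.wav")

for size in ("tiny", "small"):
    model = whisper.load_model(size, device="cpu")
    # detect_disfluencies=True produces the '[*]' markers seen above
    result = whisper.transcribe(model, audio, detect_disfluencies=True)
    print(f"w/ {size}.pt")
    for segment in result["segments"]:
        for word in segment["words"]:
            print(f"{word['text']!r} -- ({word['start']},{word['end']})")
```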
Replies: 1 comment
-
First, you are lucky that the transcription is similar between the two models. And even with exactly the same transcription, there is no reason why two different models should produce the same alignment: the timestamps are estimated from intermediate representations of the neural network (namely the cross-attention weights), and those representations differ from model to model. The only solution I can think of is to use a single external model to do the alignment. But it would be cumbersome to run two models instead of one, and when the two models (the one used for transcription and the one used for alignment) produce significantly different transcriptions, the quality of the alignment will be poor.
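To see why attention-based timestamps are model-dependent: whisper_timestamped derives word timings by running Dynamic Time Warping over the decoder's cross-attention weights, so a different model gives a different attention matrix and hence a different alignment path. Below is a toy illustration of that alignment step, not the library's actual implementation: the attention matrix is random, and Whisper's heuristics (attention-head selection, smoothing, sub-word merging) are omitted.

```python
# Toy DTW over a (tokens x frames) "cross-attention" matrix.
# Assumes ~20 ms per encoder frame, as in Whisper's audio encoder.
import numpy as np

def dtw_path(cost):
    """Cheapest monotonic path from (0, 0) to (n-1, m-1)."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    # Backtrack from the bottom-right corner
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

rng = np.random.default_rng(0)
attention = rng.random((5, 40))   # 5 tokens x 40 audio frames
cost = 1.0 - attention            # high attention = low cost
path = dtw_path(cost)

for token in range(5):
    frames = [f for t, f in path if t == token]
    start, end = 0.02 * frames[0], 0.02 * (frames[-1] + 1)
    print(f"token {token}: ({start:.2f},{end:.2f})")
```

Since the path depends entirely on the attention matrix, and tiny and small have different attention patterns, their paths (and thus their timestamps) need not match even when the text is identical.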
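And here is a sketch of the "single external alignment model" idea: whatever Whisper model produces the transcript, the timestamps always come from the same aligner, so they stay consistent across model sizes. This uses torchaudio's MMS_FA forced-alignment pipeline as one example of such an external model (assuming torchaudio >= 2.1; "audio.wav" is a placeholder, and the transcript must be normalized to lowercase letters with numbers spelled out):

```python
# Sketch of the mitigation: transcribe with any Whisper model, then always
# align with one fixed external model so timestamps match across sizes.
# Assumes torchaudio >= 2.1 (MMS_FA pipeline); "audio.wav" is a placeholder.
import torch
import torchaudio
from torchaudio.pipelines import MMS_FA as bundle

device = "cpu"
model = bundle.get_model(with_star=False).to(device)
tokenizer = bundle.get_tokenizer()
aligner = bundle.get_aligner()

waveform, sr = torchaudio.load("audio.wav")
waveform = waveform.mean(0, keepdim=True)  # downmix to mono
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

# Transcript from any Whisper model, normalized for the aligner's
# vocabulary (lowercase letters only, so "10am" becomes "ten a m").
words = "i have a meeting tomorrow morning at ten a m please remind me".split()

with torch.inference_mode():
    emission, _ = model(waveform.to(device))
    token_spans = aligner(emission[0], tokenizer(words))

# Convert frame indices to seconds
seconds_per_frame = waveform.size(1) / emission.size(1) / bundle.sample_rate
for word, spans in zip(words, token_spans):
    start = spans[0].start * seconds_per_frame
    end = spans[-1].end * seconds_per_frame
    print(f"{word!r} -- ({start:.2f},{end:.2f})")
```

The trade-off is the one described above: a second model has to run on every file, and if the transcript and the aligner's acoustic model disagree strongly, the timestamps will degrade.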