System Info

transformers version: 4.47.0.dev0

Who can help?

@ylacombe
@eustlb
@sanchit-gandhi

Information

Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction
Decoding the output from Whisper using the WhisperTokenizer appears to offset the timestamps incorrectly in consecutive chunks, which for long audio leads to timestamp accuracy degrading significantly over time.
I have not found any open bug report on this matter. Issue #31942 and the PR intended to fix it, #32131, are related, so I've added @sanchit-gandhi to this issue as well.
From my understanding, the above-mentioned PR fixes this under the assumption that the predicted timestamps always span the entire previous chunk, so that incrementing the timestamps of consecutive chunks based on cur_max_timestamp would be correct. However, cur_max_timestamp does not offset the timestamps correctly in general. The example described in #32131 (comment) does generate the correct output, but unfortunately slightly altering the silence leads to incorrect timestamps.
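To make the failure mode concrete, below is a minimal sketch of the offsetting arithmetic as I understand it from #32131; it is an illustration, not the actual WhisperTokenizer code, and the segment values are taken from the output further down:

```python
# Illustration of the suspected bug (not the actual WhisperTokenizer code):
# per-chunk timestamps are made absolute by adding a running offset.
# Chunk 1 ends early at 15.00 because the rest of its 30 s window is silence;
# chunk 2's first segment is (0.00, 6.76) relative to its own chunk.
chunks = [[(0.00, 6.38), (6.38, 11.32), (11.32, 15.00)], [(0.00, 6.76)]]

# Buggy: advance the offset by the previous chunk's max timestamp.
offset = 0.0
for chunk in chunks:
    for start, end in chunk:
        print(f"buggy:    {offset + start:.2f} -> {offset + end:.2f}")
    offset += chunk[-1][1]  # 15.00, so the next segment lands at 15.00 -> 21.76

# Expected: advance the offset by the 30 s of audio the chunk actually covers.
for i, chunk in enumerate(chunks):
    for start, end in chunk:
        print(f"expected: {i * 30.0 + start:.2f} -> {i * 30.0 + end:.2f}")
```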
The following snippet should reproduce the issue (simply increasing the silence from 15s to 16s):
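A minimal sketch of such a reproduction follows; the checkpoint (openai/whisper-tiny), the audio sample (distil-whisper/librispeech_long), and the exact placement of the silence are my assumptions rather than the original code:

```python
import numpy as np
from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Assumed checkpoint and audio sample; the original report may have used others.
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

sample = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")[0]["audio"]
audio, sr = sample["array"], sample["sampling_rate"]  # 16 kHz

# Insert 16 s of silence after the first 15 s of speech (15 s of silence
# decodes correctly; 16 s triggers the misaligned timestamps).
audio = np.concatenate([audio[: 15 * sr], np.zeros(16 * sr, dtype=audio.dtype), audio[15 * sr :]])

# Long-form transcription: no truncation, so the input spans several 30 s chunks.
inputs = processor(
    audio,
    sampling_rate=sr,
    return_tensors="pt",
    truncation=False,
    padding="longest",
    return_attention_mask=True,
)

output = model.generate(
    **inputs,
    return_timestamps=True,
    return_dict_in_generate=True,
    return_segments=True,
)

# Decoding with offsets is where the timestamps come out misaligned.
result = processor.batch_decode(
    output["sequences"], skip_special_tokens=True, output_offsets=True
)
for offset in result[0]["offsets"]:
    start, end = offset["timestamp"]
    print(f"{start:.2f} -> {end:.2f} : {offset['text']}")
```

which results in the following output: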
0.00 -> 6.38 : Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.
6.38 -> 11.32 : Nor is Mr. Quilter's manner less interesting than his matter.
11.32 -> 15.00 : He tells us that at this festive season of the year,
15.00 -> 21.76 : With Christmas and roast beef looming before us, similes drawn from eating and its results
21.76 -> 24.80 : occur most readily to the mind.
24.80 -> 30.38 : He has grave doubts whether Sir Frederick Layton's work is really Greek after all and
30.38 -> 34.00 : can discover in it but little of rocky Ithaca.
34.00 -> 41.28 : Lenell's pictures are a sort of up-guards-and-atom paintings, and Mason's exquisite ittles
41.28 -> 49.12 : are as national as a jingo poem. Mr. Burkett fosters landscape's smile at one much in
49.12 -> 55.76 : the same way that Mr. Karker used to flash his teeth. And Mr. John Collier gives his
55.76 -> 62.16 : sitter a cheerful slap on the back before he says, like a shampoo or in a Turkish bath,
62.16 -> 63.16 : Next Man
Meanwhile, inspecting output["segments"] gives segment timestamps close to the output https://github.com/openai/whisper generates, which in turn explains how the fourth segment becomes 15.00 -> 21.76 instead of the expected 30.00 -> 36.76.
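For reference, a short sketch of how those segment timestamps can be printed, assuming output comes from the generate() call above with return_segments=True:

```python
# output["segments"] holds one list of segment dicts per batch item; each
# segment carries "start" and "end" tensors (absolute times in seconds).
for segment in output["segments"][0]:
    print(f"{segment['start'].item():.2f} -> {segment['end'].item():.2f}")
```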
Expected behavior
I would expect the WhisperTokenizer to correctly handle timestamp offsets when decoding, so that timestamps do not become misaligned with their corresponding chunks.
There's indeed a problem with WhisperTokenizer: I confirmed it by running my forked version of the original Whisper implementation (see #34111 for more info) on input features built as you described above, then saving the generated tokens and passing them through result = processor.batch_decode(output["sequences"], skip_special_tokens=True, output_offsets=True). I see the same issue you described, with 15.00 -> 21.76 instead of the expected 30.00 -> 36.76. Let me open a PR to fix it.
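For completeness, a sketch of that verification; the file name and the way the reference tokens were saved are assumptions:

```python
import torch
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
# Hypothetical dump of the token ids generated by the reference implementation.
sequences = torch.load("whisper_generated_tokens.pt")  # shape: (batch, seq_len)

result = processor.batch_decode(
    sequences, skip_special_tokens=True, output_offsets=True
)
print(result[0]["offsets"][3])  # fourth segment: 15.00 -> 21.76, expected 30.00 -> 36.76
```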