[whisper] alternative fix for long-form timestamps #32131
Conversation
@@ -589,36 +587,33 @@ def _compute_offsets(self, token_ids, time_precision=0.02, longform_timestamps=N
        consecutive = np.append(consecutive, np.where(timestamp_tokens)[0][-1] + 1)

        last_slice = np.where(timestamp_tokens)[0][0]
        for i, current_slice in enumerate(consecutive):
            cur_max_timestamp = 0
With no change to the overall processor/tokenizer design, we can fix the original timestamp issue by keeping track of the last timestamp predicted.
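A minimal sketch of that idea (a simplified, self-contained version of the approach, not the PR's actual _compute_offsets changes; the function and variable names here are ours):

def compute_long_form_offsets(segment_timestamps):
    # segment_timestamps: (start, end) pairs decoded from the timestamp
    # tokens, where each 30s chunk restarts its clock at 0.00
    offset = 0.0    # accumulated length of all previous chunks
    last_end = 0.0  # last timestamp predicted within the current chunk
    absolute = []
    for start, end in segment_timestamps:
        # the clock jumping backwards means a new chunk has started:
        # fold the previous chunk's last timestamp into the running offset
        if start < last_end:
            offset += last_end
        absolute.append((offset + start, offset + end))
        last_end = end
    return absolute

# e.g. for the two chunks discussed further down in this thread, where
# chunk 1 ends at <|29.44|> and chunk 2 restarts at <|0.00|>:
print(compute_long_form_offsets([(0.0, 6.56), (6.56, 29.44), (0.0, 4.28)]))
# [(0.0, 6.56), (6.56, 29.44), (29.44, 33.72)]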
@sanchit-gandhi thanks for working on this, this is indeed a better solution than the one in PR #32003! Could you make sure that the slow test still passes? Thank you!
The test does indeed pass. I've updated it to use a different dataset that makes it a bit easier to see the effect of the correct timestamps (a 1-minute sample of LibriSpeech, rather than 10 concatenated samples of the same utterance).
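A rough sketch of that test-input change (the old fixture name, hf-internal-testing/librispeech_asr_dummy, is an assumption based on the usual transformers Whisper test setup; the exact test diff isn't reproduced here):

import numpy as np
from datasets import load_dataset

# before: the same short utterance concatenated 10 times
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio = np.tile(ds[0]["audio"]["array"], 10)

# after: a genuine ~1 minute recording, so each segment's text differs
# and an incorrect long-form timestamp is easy to spot
ds = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
audio = ds[0]["audio"]["array"]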
Hey @sanchit-gandhi, thanks for this PR and sorry for not having caught the issue with the previous PR!
While I understand most of the logic of the PR, I'm still struggling with an edge case that I might have misunderstood.
The way I see it, Whisper processes audio in 30s chunks, but sometimes no speech is detected in the last seconds of a chunk.
For example, in the tiny test you added, the first 29.44 seconds contain the following speech:

<|0.00|> Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.<|6.56|>
<|6.56|> Nor is Mr. Quilter's manner less interesting than his matter.<|11.24|>
<|11.24|> He tells us that at this festive season of the year, with Christmas and roast beef looming<|16.88|>
<|16.88|> before us, similarly drawn from eating and its results occur most readily to the mind.<|23.76|>
<|23.76|> He has grave doubts whether Sir Frederick Latins' work is really Greek after all, and<|29.44|>

The next 30 seconds contain something that begins with:

<|0.00|> can discover in it but little of rocky ithaka.<|4.28|>
<|4.28|> Lennils, pictures, are a sort of upguards and atom paintings, and Mason's exquisite itals<|10.88|>
<|10.88|> are as national as a jingo poem.<|15.28|>

So there's a small bit of audio in between the two chunks (30 - 29.44 = 0.56s) that doesn't seem to be taken into account when computing prev_segments_len.
For this particular example it doesn't matter much, but let's say we have a 45s audio sample with speech in the first 15s and the last 15s, and no speech in between.
In that case, prev_segments_len will correspond to 15s, so the last 15s of speech will be detected as starting at 00:15 instead of 00:30. Is that correct? If so, this is not what we want, right?
Let me know what you think!
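To make the numbers concrete, here is the arithmetic behind both cases (illustrative only, using the figures quoted above):

chunk_len = 30.0

# Case 1: the chunk from the tiny test ends with <|29.44|>
prev_segments_len = 29.44            # accumulated from predicted segment ends
gap = chunk_len - prev_segments_len  # 0.56s of trailing audio not counted

# Case 2: 45s input with speech in [0, 15] and [30, 45], silence in between.
# If prev_segments_len only reflects the 15s of transcribed speech, a segment
# starting at <|0.00|> in the final chunk is reported at 15.0s, not 30.0s:
prev_segments_len = 15.0
reported_start = prev_segments_len + 0.0  # 15.0, although speech resumes at 30.0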
Here's a script for creating such an example and running inference:

from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, AutoProcessor
import numpy as np

# load model + processor
processor = AutoProcessor.from_pretrained("openai/whisper-small.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small.en")

# load dataset
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]["array"]
sampling_rate = dataset[0]["audio"]["sampling_rate"]

# Contrived 45s audio sample:
# 1. First 15 seconds: standard speech
# 2. Next 15 seconds: silence (zeros)
# 3. Last 15 seconds: speech
sample = np.concatenate(
    [
        sample[: 15 * sampling_rate],
        np.zeros(15 * sampling_rate),
        sample[15 * sampling_rate : 30 * sampling_rate],
    ]
)

# pre-process
inputs = processor(
    sample,
    sampling_rate=16_000,
    padding="longest",
    truncation=False,
    return_attention_mask=True,
    return_tensors="pt",
)

# inference
output = model.generate(**inputs, return_timestamps=True, return_segments=True)

# pass token ids to the processor's decode method
result = processor.batch_decode(output["sequences"], skip_special_tokens=True, output_offsets=True)
print(result)

What we get for this Transformers PR (formatted in VTT style):
What we get for original Whisper:
=> the start/end timings vary slightly between implementations (as a result of numerical precision), but both handle the long period of silence in the same way: they lump it into the preceding segment, giving an end timestamp that encompasses all of the silence.
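For reference, one way to produce the VTT-style rendering mentioned above from the script's result (to_vtt is our own helper, not a transformers API; each offset entry carries "text" and a (start, end) "timestamp", as returned by output_offsets=True):

def to_vtt(offsets):
    # render each offset entry as "MM:SS.mmm --> MM:SS.mmm" plus its text
    def fmt(t):
        minutes, seconds = divmod(t, 60)
        return f"{int(minutes):02d}:{seconds:06.3f}"
    lines = []
    for chunk in offsets:
        start, end = chunk["timestamp"]
        lines.append(f"{fmt(start)} --> {fmt(end)}")
        lines.append(chunk["text"].strip())
        lines.append("")
    return "\n".join(lines)

print(to_vtt(result[0]["offsets"]))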
Thanks for the explanation @sanchit-gandhi, LGTM!
cc @ArthurZucker for review!
Thanks both 🤗
* [whisper] alternative fix for long-form timestamps
* update test
What does this PR do?
Fixes #31942 and supersedes #32003 by implementing new logic in the tokenizer method _compute_offsets. The advantages of this PR over #32003 are:
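(For context, _compute_offsets is the method that backs output_offsets=True when decoding. Below is a toy illustration of that path, assuming the usual Whisper vocab layout in which timestamp token ids start right after the last special token id; the ts helper and the hand-built token ids are ours, for illustration only.)

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny.en")

# timestamp token ids follow the last special token id, in 0.02s steps
timestamp_begin = tokenizer.all_special_ids[-1] + 1

def ts(seconds):
    return timestamp_begin + int(seconds / 0.02)

# toy sequence equivalent to "<|0.00|> Hello world.<|1.00|>"
token_ids = [ts(0.0), *tokenizer(" Hello world.", add_special_tokens=False).input_ids, ts(1.0)]

out = tokenizer.decode(token_ids, skip_special_tokens=True, output_offsets=True)
print(out["offsets"])  # expected: [{'text': ' Hello world.', 'timestamp': (0.0, 1.0)}]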