Incorrect Whisper long-form decoding timestamps #31942
Additionally, the bug does not happen when I add:

```python
results = pipe(
    sample,
    chunk_length_s=30,
    return_timestamps=True,
    generate_kwargs={
        "language": "english",
    },
)
```

However, this workaround is not applicable to me because I also would like to supply the …
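For context, a minimal setup for the `pipe` and `sample` used above might look like this (the checkpoint and audio source are assumptions, not taken from the report):

```python
# Hypothetical setup for `pipe` and `sample` (checkpoint and audio are assumed).
import numpy as np
import torch
from datasets import load_dataset
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",  # assumed checkpoint
    torch_dtype=torch.float16,
    device="cuda",
)

# Tile a short clip so the input exceeds 30 s.
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = {
    "raw": np.concatenate([dataset[0]["audio"]["array"]] * 10),
    "sampling_rate": 16_000,
}
```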
cc @kamilakesbi
Hi @Robinysh, you could use this workaround before we properly integrate the solution in Transformers:

```python
import json

import numpy as np
import torch
from datasets import load_dataset
from transformers import AutoProcessor, WhisperForConditionalGeneration

device = "cuda"
torch_dtype = torch.float16

processor = AutoProcessor.from_pretrained("distil-whisper/distil-large-v3")
model = WhisperForConditionalGeneration.from_pretrained(
    "distil-whisper/distil-large-v3", torch_dtype=torch_dtype
)
model = model.to(device)

dataset = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
sample = dataset[0]["audio"]
# Tile the short clip so the input exceeds 30 s and long-form generation is used.
sample = np.concatenate([sample["array"]] * 10)

inputs = processor(sample, return_tensors="pt", truncation=False, sampling_rate=16_000)
inputs = inputs.to(device, torch_dtype)

# return_segments=True makes generate() also return correct segment-level timestamps.
output = model.generate(**inputs, return_timestamps=True, return_segments=True)
result = processor.batch_decode(
    output["sequences"], skip_special_tokens=True, output_offsets=True
)

# Overwrite the (incorrect) decoded offsets with the segment-level timestamps.
for i in range(len(result[0]["offsets"])):
    result[0]["offsets"][i]["timestamp"] = (
        output["segments"][0][i]["start"].item(),
        output["segments"][0][i]["end"].item(),
    )

print(json.dumps(result, indent=4))
```

Explanation: when performing long-form generation with Whisper, the correct utterance-level timestamps are returned by `generate` when we specify `return_segments=True`. The problem arises at the decoding level: `batch_decode` derives its offsets from the token-level timestamps, which reset at each 30-second window. One simple solution is to replace the obtained timestamps with the ones stored in `output["segments"]`:

```python
for i in range(len(result[0]["offsets"])):
    result[0]["offsets"][i]["timestamp"] = (
        output["segments"][0][i]["start"].item(),
        output["segments"][0][i]["end"].item(),
    )
```

cc @sanchit-gandhi @ylacombe (we should integrate this properly in `batch_decode` and also handle it in the automatic speech recognition pipeline, I'll open a PR for that :) )
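As a quick sanity check (a sketch that assumes the `result` variable from the snippet above), the patched timestamps can be verified to increase monotonically across window boundaries:

```python
# Verify the patched offsets no longer wrap at the 30 s window boundary.
offsets = result[0]["offsets"]
starts = [ts["timestamp"][0] for ts in offsets]
ends = [ts["timestamp"][1] for ts in offsets]
assert all(a <= b for a, b in zip(starts, starts[1:])), "start times went backwards"
assert all(s <= e for s, e in zip(starts, ends)), "segment ends before it starts"
print(f"{len(offsets)} segments spanning {starts[0]:.2f}s to {ends[-1]:.2f}s")
```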
Any estimated time for a solution to this issue?

cc @eustlb - I'm not sure if this is fixed already?
Related to #34210 and not fixed yet for the pipeline. In the meantime, please run:

```python
import numpy as np
import torch
from datasets import Audio, load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda"
torch_dtype = torch.bfloat16

model_id = "distil-whisper/distil-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=False, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

dataset = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
sample = dataset[0]["audio"]
# Tile the short clip so the input exceeds 30 s and long-form generation is used.
sample = np.concatenate([sample["array"]] * 10)

input_features = processor(
    sample, return_tensors="pt", truncation=False, sampling_rate=16000
).input_features
input_features = input_features.to(device, torch_dtype)

# return_segments=True exposes the segment-level timestamps computed by generate().
generated_ids = model.generate(
    input_features, return_timestamps=True, return_segments=True
)
transcript = processor.batch_decode(
    generated_ids["sequences"], skip_special_tokens=True, output_offsets=True
)

for el in transcript[0]["offsets"]:
    print(el)
```
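As a usage example, here is a sketch that renders the offsets as SRT-style subtitles (the `to_srt` helper is hypothetical and assumes the `transcript` variable from the snippet above):

```python
def to_srt(offsets):
    """Render batch_decode offsets as SRT-style subtitle blocks (hypothetical helper)."""
    def fmt(t):
        # Convert seconds to the SRT HH:MM:SS,mmm format.
        h, rem = divmod(t, 3600)
        m, s = divmod(rem, 60)
        ms = int(round((s % 1) * 1000))
        return f"{int(h):02d}:{int(m):02d}:{int(s):02d},{ms:03d}"

    blocks = []
    for i, el in enumerate(offsets, start=1):
        start, end = el["timestamp"]
        blocks.append(f"{i}\n{fmt(start)} --> {fmt(end)}\n{el['text'].strip()}\n")
    return "\n".join(blocks)

print(to_srt(transcript[0]["offsets"]))
```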
Fixed in #35750, which will be merged ASAP! Thanks a lot for raising this issue, and thanks a lot for your patience 🤗
@eustlb Looks like it's not yet merged! Came across the issue today; will probably try the suggested alternate solutions for now.
System Info

`transformers` version: 4.42.4

Who can help?

@Narsil @sanchit-gandhi

Information

Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction

Output

Expected behavior
Currently the timestamp resets to zero after 30s of audio. I expect the timestamps to increase monotonically.
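To make the expectation concrete, the contrast looks roughly like this (values are illustrative, not actual model output):

```python
# Illustrative offsets for ~45 s of audio (made-up values, not actual output).
buggy = [(0.0, 6.5), (6.5, 14.0), (14.0, 29.6), (0.0, 5.9), (5.9, 13.4)]        # wraps at 30 s
expected = [(0.0, 6.5), (6.5, 14.0), (14.0, 29.6), (29.6, 35.5), (35.5, 43.0)]  # monotonic

def is_monotonic(timestamps):
    """True if segment start times never decrease."""
    starts = [s for s, _ in timestamps]
    return all(a <= b for a, b in zip(starts, starts[1:]))

assert not is_monotonic(buggy)
assert is_monotonic(expected)
```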