Whisper no_speech_threshold not applied when chunking input #29595
Comments
I was able to get around this by using the VAD implementation from faster-whisper.
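(For context, one possible reading of this workaround is to let faster-whisper filter out non-speech itself. This is a minimal sketch, not the commenter's exact code; the model size, device settings, and the audio path are assumptions.)

from faster_whisper import WhisperModel

# Load a small model; compute_type="int8" keeps it cheap on CPU.
model = WhisperModel("tiny", device="cpu", compute_type="int8")

# vad_filter=True runs a Silero-based VAD over the audio and drops
# non-speech segments before they ever reach the decoder, which avoids
# relying on no_speech_threshold in the chunked transformers pipeline.
segments, info = model.transcribe("long_audio.wav", vad_filter=True)

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")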
Related to #29508; not sure why there are different implementations for long and short audio. I'm playing around with combining them.
Gentle ping @sanchit-gandhi @ylacombe
Another ping @ylacombe @sanchit-gandhi @kashif
Indeed, as @Kimahriman mentioned in #29508, there should be no distinction between the short- and long-form algorithms. This issue will be fixed when #29508 is fixed, by merging the short- and long-form generation logic together.
Is this fixed @kamilakesbi @sanchit-gandhi now that #29508 is merged?
cc @ylacombe
Hey @amyeroberts, this is indeed fixed since #29508. @stri8ed, note that you also have to specify the temperature (as a float or a tuple of floats), e.g.:

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset
import torch

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")

librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = librispeech_dummy[0]["audio"]
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features

# Providing temperature as a tuple enables the fallback mechanism that
# makes use of no_speech_threshold and logprob_threshold.
outputs = model.generate(
    input_features,
    output_scores=True,
    return_dict_in_generate=True,
    max_new_tokens=128,
    no_speech_threshold=0.2,
    logprob_threshold=-1.0,
    temperature=(0.2, 0.8),
)

transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)

pred_text = processor.batch_decode(outputs.sequences, skip_special_tokens=True)
pred_language = processor.batch_decode(outputs.sequences[:, 1:2], skip_special_tokens=False)
lang_prob = torch.exp(transition_scores[:, 0])

print(pred_text)
print(pred_language)
print(lang_prob)
Feature request
The Whisper pipeline accepts a chunk_length_s parameter, which chunks the input so it can be used for batched inference. There is also a no_speech_threshold param, which can be used to filter out silence and helps reduce hallucinations. The problem is that for no_speech_threshold to be applied, the input must be "long", and in the case of chunked input every segment is considered short, even though the full input is long. This means no_speech_threshold can't be applied for chunked input. When attempting it, this error is given for each batch:
Audio input consists of only 3000. Short-form transcription is activated. no_speech_threshold is set to 0.2, but will be ignored.
It should be possible to keep the no speech threshold enabled for long chunked inputs.
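For reference, a minimal sketch of the kind of pipeline call that hits this warning (the model choice and the audio path are assumptions, not part of the report):

from transformers import pipeline

# Chunked long-form transcription: chunk_length_s splits the audio into
# 30-second windows that are batched through the model.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-tiny",
    chunk_length_s=30,
)

# Passing no_speech_threshold here is what triggers the
# "Short-form transcription is activated" warning for every chunk.
result = asr(
    "long_audio.wav",  # hypothetical path to a long recording
    generate_kwargs={"no_speech_threshold": 0.2, "logprob_threshold": -1.0},
)
print(result["text"])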
Motivation
Chunking brings large performance improvements for long-form transcription, but this benefit is negated if there is no way to suppress silent segments, which most long-form audio will no doubt contain.
Your contribution
I could implement a change which removes the is_shortform check when the input is chunked, though I'm unsure whether (and why) this would break things.