Whisper no_speech_threshold not applied when chunking input #29595

Closed
stri8ed opened this issue Mar 11, 2024 · 9 comments

stri8ed commented Mar 11, 2024

Feature request

The Whisper pipeline accepts a chunk_length_s parameter, which chunks the input so it can be used for batched inference. There is also a no_speech_threshold parameter, which can be used to filter out silence and helps reduce hallucinations. The problem is that for no_speech_threshold to be applied, the input must be "long", and with chunked input every segment is considered short, even though the full input is long. This means no_speech_threshold can't be applied to chunked input.

When attempting this, it emits the following warning for each batch:
Audio input consists of only 3000. Short-form transcription is activated. no_speech_threshold is set to 0.2, but will be ignored.

It should be possible to keep no_speech_threshold enabled for long, chunked inputs.
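
For reference, here is a minimal sketch of the setup described above (the model name, chunk length, and audio path are placeholders):

from transformers import pipeline

# Chunked long-form transcription: the input is split into 30 s windows for batched inference
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-tiny",
    chunk_length_s=30,
)

# no_speech_threshold is forwarded to generate(); because each chunk is treated as
# short-form, the threshold is ignored and the warning above is printed for every batch.
out = pipe("long_audio.wav", generate_kwargs={"no_speech_threshold": 0.2})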

Motivation

Chunking brings large performance improvements for long-form transcription, but this benefit is negated if there is no way to suppress silent segments, which most long-form audio will no doubt contain.

Your contribution

I could implement a change that removes the is_shortform check when the input is chunked, though I'm unsure whether, and why, this would break anything.

@amyeroberts
Collaborator

cc @sanchit-gandhi @ylacombe

@stri8ed
Author

stri8ed commented Mar 11, 2024

I was able to get around this by using the VAD implementation from faster-whisper.

from faster_whisper.vad import get_speech_timestamps, VadOptions, collect_chunks, SpeechTimestampsMap

sampling_rate = 16000  # Whisper expects 16 kHz mono audio

# load_audio: any helper returning a float32 waveform at 16 kHz (e.g. whisper.load_audio)
audio_data = load_audio(audio_file)

# Run VAD and strip the silent regions before transcription
speech_chunks = get_speech_timestamps(audio_data, VadOptions())
audio_without_silence = collect_chunks(audio_data, speech_chunks)

# pipe: the chunked Whisper ASR pipeline
prediction = pipe(audio_without_silence, ...)

# Map timestamps from the silence-stripped audio back to the original timeline
chunks = prediction["chunks"]
ts_map = SpeechTimestampsMap(speech_chunks, sampling_rate)
segments = []
for chunk in chunks:
    start, end = chunk["timestamp"]
    segments.append({
        "start": ts_map.get_original_time(start),
        "end": ts_map.get_original_time(end),
        "text": chunk["text"],
    })

@Kimahriman

Related to #29508; I'm not sure why there are different implementations for long and short audio. I'm playing around with combining them.

@amyeroberts
Collaborator

Gentle ping @sanchit-gandhi @ylacombe

@amyeroberts
Collaborator

Another ping @ylacombe @sanchit-gandhi @kashif

@sanchit-gandhi
Contributor

Indeed, as @Kimahriman mentioned in #29508, there should be no distinction between the short- and long-form algorithms. This issue will be resolved when #29508 is fixed by merging the short- and long-form generation logic.

@amyeroberts
Collaborator

Is this fixed @kamilakesbi @sanchit-gandhi now that #29508 is merged?

@amyeroberts
Collaborator

cc @ylacombe

@ylacombe
Contributor

ylacombe commented Sep 17, 2024

Hey @amyeroberts, this has indeed been fixed since #29508.

@stri8ed, note that you also have to specify logprob_threshold (a float) and temperature (a list/tuple of fallback values), as indicated in the docs!

e.g.:

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset
import torch

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")

librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = librispeech_dummy[0]["audio"]

input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features

outputs = model.generate(
    input_features, output_scores=True, return_dict_in_generate=True, max_new_tokens=128, no_speech_threshold=0.2, logprob_threshold=-1.0,
    temperature=(0.2, 0.8)
)

transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)

pred_text = processor.batch_decode(outputs.sequences, skip_special_tokens=True)
pred_language = processor.batch_decode(outputs.sequences[:, 1:2], skip_special_tokens=False)
lang_prob = torch.exp(transition_scores[:, 0])

print(pred_text)
print(pred_language)
print(lang_prob)
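
For the original pipeline use case, the same generation arguments can presumably be forwarded through generate_kwargs. A rough sketch assuming the merged long-form logic from #29508 (model name, chunk length, and audio path are placeholders):

from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-tiny",
    chunk_length_s=30,
)

# With no_speech_threshold, logprob_threshold and a temperature fallback schedule set,
# silent chunks should be suppressed instead of triggering the short-form warning.
prediction = pipe(
    "long_audio.wav",
    return_timestamps=True,
    generate_kwargs={
        "no_speech_threshold": 0.2,
        "logprob_threshold": -1.0,
        "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    },
)
print(prediction["text"])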
