
no_speech_probablity #30777

Closed
rizwanishaq opened this issue May 13, 2024 · 6 comments

@rizwanishaq

The `pipeline` is designed to be a high-level wrapper that goes from audio inputs -> text outputs. Anytime we want something more granular than that, it's best to use the `model` + `processor` API:
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset
import torch

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")

# Load a single validation sample from the dummy LibriSpeech dataset
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = librispeech_dummy[0]["audio"]

# Convert the raw audio array to log-mel input features
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features

# Request per-step scores so we can recover token probabilities afterwards
outputs = model.generate(
    input_features, output_scores=True, return_dict_in_generate=True, max_new_tokens=128
)

# Transition scores hold the (normalized) log-probability of each generated token
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)

pred_text = processor.batch_decode(outputs.sequences, skip_special_tokens=True)
# The language token is the first token generated after <|startoftranscript|>
pred_language = processor.batch_decode(outputs.sequences[:, 1:2], skip_special_tokens=False)
# exp() of the log-probability gives the probability of the language token
lang_prob = torch.exp(transition_scores[:, 0])

print(pred_text)
print(pred_language)
print(lang_prob)

Print Output:

[' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.']
['<|en|>']
tensor([1.])

Originally posted by @sanchit-gandhi in #25138 (comment)

How can we get the no-speech probability with this code?
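For context, Whisper derives its no-speech probability from the softmax mass assigned to the `<|nospeech|>` token at the first decoding step. A toy sketch of that extraction (the vocabulary and token index here are made up for illustration; in a real run the index would come from the tokenizer, e.g. `processor.tokenizer.convert_tokens_to_ids("<|nospeech|>")`, applied to the first entry of `outputs.scores`):

```python
import torch

# Toy decoder logits for one audio clip over a 5-token vocabulary
logits = torch.tensor([[1.0, 2.0, 0.5, -1.0, 0.0]])

# Softmax over the vocabulary turns logits into a probability per token
probs = torch.softmax(logits, dim=-1)

# Pretend index 3 is the <|nospeech|> token (illustrative only)
NOSPEECH_ID = 3
no_speech_prob = probs[0, NOSPEECH_ID].item()
print(no_speech_prob)
```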

@amyeroberts
Collaborator

amyeroberts commented May 13, 2024

@ylacombe
Contributor

Hey @rizwanishaq, you can simply add the no_speech_threshold argument to the generate method:

outputs = model.generate(
    input_features, output_scores=True, return_dict_in_generate=True, max_new_tokens=128, no_speech_threshold=0.2
)

Let me know if that works!

@rizwanishaq
Author

Hey @ylacombe, I get this warning: "Audio input consists of only 3000. Short-form transcription is activated. no_speech_threshold is set to 0.3, but will be ignored."

@ylacombe
Contributor

Then it's related to a shortcoming of our Whisper implementation that we hope to fix soon: some of the features used for long-form generation are not yet applied to short audios.

We should be resolving this issue quite soon; I'll keep you up to date.
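For reference, the long-form feature being discussed is a gate that only treats a segment as silence when the no-speech probability is high and the decoder is simultaneously unconfident in its transcription, mirroring the heuristic in openai/whisper. A hypothetical sketch (the function name and default thresholds are illustrative, not the transformers API):

```python
def is_silent(no_speech_prob: float, avg_logprob: float,
              no_speech_threshold: float = 0.6,
              logprob_threshold: float = -1.0) -> bool:
    """Skip a segment as silence only when the no-speech probability exceeds
    the threshold AND the decoder's average log-probability is low."""
    return no_speech_prob > no_speech_threshold and avg_logprob < logprob_threshold

print(is_silent(0.9, -2.0))  # high no-speech prob, unconfident decoder
print(is_silent(0.9, -0.5))  # a confident transcription overrides the gate
print(is_silent(0.1, -2.0))  # low no-speech prob: keep the segment
```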

@ylacombe
Contributor

This is a duplicate of #29595, so I'll close this issue; let's talk in #29595 if you have further questions!

@ylacombe
Contributor

For reference, it was fixed by #29508!
Also see this comment
