
no_speech_probablity #30777

Closed
rizwanishaq opened this issue May 13, 2024 · 6 comments

@rizwanishaq

The `pipeline` is designed to be a high-level wrapper that goes from audio inputs -> text outputs. Anytime we want something more granular than that, it's best to use the `model` + `processor` API:
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset
import torch

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")

# Load a single validation sample from the dummy LibriSpeech dataset
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = librispeech_dummy[0]["audio"]

# Convert the raw audio array to log-mel input features
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features

# Request per-step scores so we can recover token probabilities afterwards
outputs = model.generate(
    input_features, output_scores=True, return_dict_in_generate=True, max_new_tokens=128
)

# Transition scores hold the (normalized) log-probability of each generated token
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)

pred_text = processor.batch_decode(outputs.sequences, skip_special_tokens=True)
# The language token is the first token generated after <|startoftranscript|>
pred_language = processor.batch_decode(outputs.sequences[:, 1:2], skip_special_tokens=False)
# exp() of the log-probability gives the probability of the language token
lang_prob = torch.exp(transition_scores[:, 0])

print(pred_text)
print(pred_language)
print(lang_prob)

Print Output:

[' Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.']
['<|en|>']
tensor([1.])

Originally posted by @sanchit-gandhi in #25138 (comment)

How can we get the no-speech probability with this code?
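For context, Whisper derives its no-speech probability from the softmax mass assigned to the `<|nospeech|>` token at the first decoding step. A toy sketch of that extraction (the vocabulary and token index here are made up for illustration; in a real run the index would come from the tokenizer, e.g. `processor.tokenizer.convert_tokens_to_ids("<|nospeech|>")`, applied to the first entry of `outputs.scores`):

```python
import torch

# Toy decoder logits for one audio clip over a 5-token vocabulary
logits = torch.tensor([[1.0, 2.0, 0.5, -1.0, 0.0]])

# Softmax over the vocabulary turns logits into a probability per token
probs = torch.softmax(logits, dim=-1)

# Pretend index 3 is the <|nospeech|> token (illustrative only)
NOSPEECH_ID = 3
no_speech_prob = probs[0, NOSPEECH_ID].item()
print(no_speech_prob)
```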

@amyeroberts
Collaborator

amyeroberts commented May 13, 2024

@ylacombe
Contributor

Hey @rizwanishaq, you can simply add the no_speech_threshold argument to the generate method:

outputs = model.generate(
    input_features, output_scores=True, return_dict_in_generate=True, max_new_tokens=128, no_speech_threshold=0.2
)

Let me know if that works!

@rizwanishaq
Author

Hey @ylacombe, I get this warning: "Audio input consists of only 3000. Short-form transcription is activated. no_speech_threshold is set to 0.3, but will be ignored."

@ylacombe
Contributor

Then it's related to a shortcoming of our Whisper implementation that we hope to fix soon: some of the features used for long-form generation are not yet applied to short audios.

We should be resolving this issue quite soon; I'll keep you up to date.
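For reference, the long-form feature being discussed is a gate that only treats a segment as silence when the no-speech probability is high and the decoder is simultaneously unconfident in its transcription, mirroring the heuristic in openai/whisper. A hypothetical sketch (the function name and default thresholds are illustrative, not the transformers API):

```python
def is_silent(no_speech_prob: float, avg_logprob: float,
              no_speech_threshold: float = 0.6,
              logprob_threshold: float = -1.0) -> bool:
    """Skip a segment as silence only when the no-speech probability exceeds
    the threshold AND the decoder's average log-probability is low."""
    return no_speech_prob > no_speech_threshold and avg_logprob < logprob_threshold

print(is_silent(0.9, -2.0))  # high no-speech prob, unconfident decoder
print(is_silent(0.9, -0.5))  # a confident transcription overrides the gate
print(is_silent(0.1, -2.0))  # low no-speech prob: keep the segment
```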

@ylacombe
Contributor

This is a duplicate of #29595, so I'll close this issue; let's talk in #29595 if you have further questions!

@ylacombe
Contributor

For reference, it was fixed by #29508!
Also see this comment
