
Fix missing sequences_scores in the Whisper beam search output #32970

Merged: 5 commits, Sep 17, 2024

Conversation

Nik-Kras
Contributor

What does this PR do?

Fixes missing sequences_scores in the Whisper beam search output.
Transformers v4.44.1 still has a bug where sequences_scores are missing from the output even when they are explicitly requested. The code snippet below raises an error in the current version because sequences_scores is None, while it works fine with version 4.25.1.

To reproduce the experiment, download any speech audio file, install transformers 4.44.1 and 4.25.1, and compare the outputs.
If you have problems installing 4.25.1 because of the jax and jaxlib version checks, remove the version-check exceptions and change the pinned versions in setup.py to:

"jax>=0.4",
"jaxlib>=0.4.10",

Audio used for this example: link

Code to reproduce the issue

from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import torch
import librosa

# Load the processor and model
processor = AutoProcessor.from_pretrained("openai/whisper-tiny")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny")

# Load and preprocess the audio file
audio_path = "audio.mp3"
audio, sr = librosa.load(audio_path, sr=16000)  # Ensure the sample rate is 16kHz

# Preprocess the audio to get the input features
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Generate the transcription using Beam Search with the model
beam_outputs = model.generate(
    inputs["input_features"],
    num_beams=5,  # Number of beams
    num_return_sequences=5,  # Number of hypotheses to return
    early_stopping=True,
    output_scores=True, 
    return_dict_in_generate=True,
)

# Decode the generated transcriptions
hypotheses = [processor.decode(output_ids, skip_special_tokens=True) for output_ids in beam_outputs.sequences]

# Print out the hypotheses
for i, hypothesis in enumerate(hypotheses):
    print(f"Hypothesis {i + 1}: {hypothesis}. Score: {beam_outputs.sequences_scores[i]}")

Output before the fix:

TypeError: 'NoneType' object is not subscriptable

Expected output:

Hypothesis 1:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495032072067261
Hypothesis 2:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495031476020813
Hypothesis 3:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495032072067261
Hypothesis 4:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495031476020813
Hypothesis 5:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495032072067261

Fixes issue #32373.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

PS: I am not sure that beam search behaviour is working as expected. The beam search output in Transformers version 4.25.1 looks more natural to me: at least there is more diversity across beams, rather than the same hypothesis repeated. I suspect there is another bug in the beam search algorithm in the current version, but I am not confident enough to say so. Could you please review the output of the current version above and the output of the older version below and tell me whether this is expected behaviour, @patrickvonplaten, @zucchini-nlp and @gante?

Output of the same code as above, but with Transformers version 4.25.1, where the hypotheses are more diverse (which is expected from the top-k selection in beam search). The scores are also less similar than with Transformers version 4.44.1.

Hypothesis 1:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.4627407491207123
Hypothesis 2:  How is Mozilla going to handle and be with this? Thank you and Q.. Score: -0.4789799749851227
Hypothesis 3:  How is Mozilla going to handle and be with this? Thank you, and cute.. Score: -0.48414239287376404
Hypothesis 4:  How is Mozilla going to handle and be with this? Thank you and cute.. Score: -0.4972183108329773
Hypothesis 5:  How is Mozilla going to handle and be with this? Thank you, and Q.. Score: -0.5054414868354797

@gante
Member

gante commented Aug 23, 2024

@Nik-Kras Thank you for opening this issue 🤗

Regarding beam search: the core method was untouched, except for bug fixes on edge cases. In other words, a regression is possible but unlikely, since our integration tests should have caught noticeable differences in quality 🤗 We did change the input preparation for generate in Whisper, which might explain the changes you're seeing.

If you can get us a snippet where the outputs are different between an old version and the current version, that could help us pin the potential bug! 🐛

(passing the questions in the issue along to @sanchit-gandhi / @kamilakesbi, as this is a Whisper-specific question :) )

@Nik-Kras
Contributor Author

Nik-Kras commented Aug 23, 2024

@Nik-Kras Thank you for opening this issue 🤗

Happy to contribute!

If you can get us a snippet where the outputs are different between an old version and the current version, that could help us pin the potential bug! 🐛

Above, I attached a code snippet to generate 5 hypotheses with beam search and print them along with sequences_scores.

Here is the output for Transformers 4.25.1

Hypothesis 1:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.4627407491207123
Hypothesis 2:  How is Mozilla going to handle and be with this? Thank you and Q.. Score: -0.4789799749851227
Hypothesis 3:  How is Mozilla going to handle and be with this? Thank you, and cute.. Score: -0.48414239287376404
Hypothesis 4:  How is Mozilla going to handle and be with this? Thank you and cute.. Score: -0.4972183108329773
Hypothesis 5:  How is Mozilla going to handle and be with this? Thank you, and Q.. Score: -0.5054414868354797

Here is the output for Transformers 4.44.1 (after my fix to return sequences_scores)

Hypothesis 1:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495032072067261
Hypothesis 2:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495031476020813
Hypothesis 3:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495032072067261
Hypothesis 4:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495031476020813
Hypothesis 5:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495032072067261

I am only studying this topic, but according to articles that explain beam search:

At each step, you select the top num_beams continuations instead of greedily selecting the single best token, so you end up with num_beams different candidates. While the beams may converge over part of the sequence (if the other candidates have scores that are too low), in the end they should still differ according to the algorithm, and I am not sure a change in input preparation alone could explain this.

Tell me if I have misunderstood beam search, but I've seen @patrickvonplaten explain it with top-k selection, which would result in different tokens. I couldn't find the source, unfortunately.
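
To illustrate what I mean, here is a minimal, self-contained sketch of a single beam-search expansion step (my own toy example, not the Transformers implementation): every beam is scored against every vocabulary token, and the top num_beams (beam, token) pairs survive, so the surviving beams normally end in different tokens.

import torch

def beam_search_step(beams, beam_scores, log_probs, num_beams):
    # beams:       (num_beams, seq_len) token ids of the current hypotheses
    # beam_scores: (num_beams,) cumulative log-probabilities of those hypotheses
    # log_probs:   (num_beams, vocab_size) next-token log-probabilities per beam
    vocab_size = log_probs.shape[-1]
    # Every candidate continuation inherits the cumulative score of its beam.
    candidate_scores = beam_scores[:, None] + log_probs  # (num_beams, vocab_size)
    # Keep the num_beams best continuations over all (beam, token) pairs.
    top_scores, flat_idx = candidate_scores.view(-1).topk(num_beams)
    beam_idx = flat_idx // vocab_size   # which beam each winner extends
    token_idx = flat_idx % vocab_size   # which token it is extended with
    new_beams = torch.cat([beams[beam_idx], token_idx[:, None]], dim=-1)
    return new_beams, top_scores

In this sketch the num_beams winners are distinct (beam, token) pairs, which is why five identical final hypotheses with near-identical scores looked suspicious to me.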

Member

@gante gante left a comment

(the changes lgtm, but let's get the green light from one of our audio folks as well 🤗 )

@gante
Member

gante commented Aug 23, 2024

@Nik-Kras ah, an important detail: Whisper has a more complex generation algorithm, which goes beyond the traditional beam search (i.e. it has many heuristics on top) 🤗 Have a look at the original paper, starting with Table 7

@Nik-Kras
Contributor Author

@gante Thanks for sharing! I really appreciate your assistance, as it helps me to learn more!

I assumed that even if Whisper has additional algorithms on top of beam search, the results across different versions of Transformers should have been similar, if not the same.
Do you think Whisper's deviation from the original beam search implementation could be a valid explanation for the behaviour above?

(I am just learning, so might be wrong)

@gante
Member

gante commented Aug 23, 2024

@Nik-Kras

the results over different versions of Transformers should've been similar

This is a valid argument that we try to honor in general! The exceptions are a) bug fixes and b) clear performance upgrades (quality or throughput). The changes for Whisper fall under clear performance upgrades -- see the original PR for the WER improvements.

Do you think that Whisper's deviation from the original Beam Search implementation could be a valid explanation of the behaviour above?

Certainly! For instance, the generate-with-fallback modification suggested in the paper explicitly regenerates the output under certain conditions. It also overrides a few parameters in some cases, including the number of beams used for beam search.
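
For reference, these heuristics surface as extra arguments on Whisper's generate. Below is a rough sketch of how the temperature fallback and quality thresholds are typically enabled (parameter names may differ slightly across versions, so check the docs for the version you have installed):

# Sketch of Whisper generation with the paper's fallback heuristics enabled.
# If a decoded segment fails the checks below, decoding is retried with the next
# temperature in the tuple, and beam search is dropped once temperature > 0.
outputs = model.generate(
    inputs["input_features"],
    num_beams=5,
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # fallback temperatures
    compression_ratio_threshold=1.35,            # gzip compression-ratio check
    logprob_threshold=-1.0,                      # average log-probability check
    no_speech_threshold=0.6,                     # silence-detection check
    return_dict_in_generate=True,
)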

@dengchengxifrank

@Nik-Kras I wonder what exactly "scores" and "sequences_scores" mean. I also wonder, if I pass output_logits=True to the generate method, where I can check the specific computation. For example, if my label is [1, 12, 3] and I get the logits, how can I use the logits to calculate the cross-entropy loss? Thanks!

@Nik-Kras
Contributor Author

I believe you are looking for material to help with your cross-entropy question. This pull request is about something quite different: it focuses on the beam search method of generating sequences. Read more about it here:

@Nik-Kras I wonder what exactly the "scores", and "sequences_scores" mean.

Below is my personal reasoning.
Scores are essentially the logits used to select the next token for each beam during beam search. sequences_scores is a rating of each generated sequence that helps you select the best one; it is commonly computed by multiplying the probabilities of the tokens selected along the beam (equivalently, summing their log-probabilities). So if a generated sequence consists of low-probability tokens, something is probably wrong with it and it should not be selected. The sequence score reduces that choice to comparing one number against another.
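
As a concrete follow-up, here is a hedged sketch that assumes the beam_outputs object from my snippet above and that beam_indices is returned (which is exactly what this PR fixes): compute_transition_scores maps the stepwise scores back to per-token log-probabilities, and summing them approximately reconstructs sequences_scores (with the default length_penalty of 1.0).

# Per-token log-probabilities of the tokens beam search actually picked.
transition_scores = model.compute_transition_scores(
    beam_outputs.sequences,
    beam_outputs.scores,
    beam_indices=beam_outputs.beam_indices,
    normalize_logits=False,
)

# Sum of per-token log-probabilities, divided by the generated length,
# should roughly match the returned sequences_scores.
output_length = (transition_scores < 0).sum(dim=1)
reconstructed = transition_scores.sum(dim=1) / output_length
print(reconstructed)
print(beam_outputs.sequences_scores)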

@dengchengxifrank

@Nik-Kras Thanks for the pointers. I can get the logits, but at each timestep of beam search I wonder which index the algorithm picks (sometimes not the largest one). I want to calculate the cross-entropy loss at each timestep as a "metric" and finally get the N-best results. Or, to rephrase my question: how can I get the N-best results with their scores? I realise it may be inappropriate to ask such questions under this pull request; would it be better to ask them in an issue? Thanks!

@LysandreJik
Member

cc @eustlb if you have some bandwidth to take a look at this while @ylacombe is off!

Contributor

@ylacombe ylacombe left a comment

Hey @Nik-Kras, thanks for opening this PR!

You've got your reasoning right, and we are indeed missing sequences_scores. The fix looks great to me!

Do you think:

  1. You could also take care of beam_indices, as indicated in "Some Whisper beam search output (sequences_scores, etc.) is lost in _stack_split_outputs" (#32373)?
  2. Add a test here to make sure the error doesn't happen again?

Regarding the change in beam search behaviour, let's open a follow-up issue about it once your PR is merged! I believe it might be solved by #32336; don't hesitate to try it out and let us know if it solves your issue.

cc @eustlb for visibility

@ylacombe
Contributor

ylacombe commented Sep 3, 2024

Also, how long is your input audio?

@Nik-Kras
Contributor Author

Nik-Kras commented Sep 4, 2024

@ylacombe The audio is 7 seconds long. It is taken from Common Voice 17, and you can find it by its ground-truth transcription "How is Mozilla going to handle ambiguities like queue and cue?". I did put a link to the audio on Hugging Face, but it seems to be broken. Trying again: link

I am a bit distracted by other projects at the moment, but the test should be very easy to write; I will add it soon.
I am not so confident about beam_indices, as I remember there are several bugs and issues around it. It would take some time, so I would leave it for another PR.

@ylacombe
Contributor

ylacombe commented Sep 5, 2024

Hey @Nik-Kras, thanks for circling back on this!
Whisper is used by many people, so it'd be great to have the test added soon!
If you don't have the bandwidth in the coming days, let us know and we can supersede the PR to add the test and beam_indices, while keeping you as co-author of course!

@Nik-Kras
Contributor Author

Nik-Kras commented Sep 6, 2024

@ylacombe I will allocate some time early next week, if that is fine with you

@Nik-Kras
Contributor Author

Nik-Kras commented Sep 10, 2024

@ylacombe All done. It was an easy fix; most of the time was spent setting things up :)
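
For reference, here is a rough sketch of what the added regression test checks (the exact committed test differs; the model choice and shapes below are illustrative):

import torch
from transformers import WhisperForConditionalGeneration

def test_whisper_beam_search_returns_scores_and_indices():
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
    # Dummy 30-second input: (batch, num_mel_bins, frames).
    input_features = torch.zeros(1, model.config.num_mel_bins, 3000)

    out = model.generate(
        input_features,
        num_beams=2,
        num_return_sequences=2,
        output_scores=True,
        return_dict_in_generate=True,
    )

    # Before this fix, sequences_scores and beam_indices came back as None.
    assert out.sequences_scores is not None
    assert out.beam_indices is not None
    assert out.sequences_scores.shape[0] == out.sequences.shape[0]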

However, I want to draw your attention to a side issue I found: the inconsistency of Whisper's generation across different versions. I explained the comparison above, mentioning the specific versions and my observations.
To fix beam_indices, I compared the output with version 4.25.1: the type and shape are the same, but the values are significantly different.

I see that sequences_scores and beam_indices are not just computed differently; I strongly believe they are computed incorrectly. I don't expect Whisper to return 5 identical beams (with near-identical scores) from beam search. To perform LLM post-correction, I need distinct transcription variants, which is the desired behaviour.

Please have a look at the changes in how the output is generated. I am not sure how to spot the bug there, but I strongly believe there is one.

Contributor

@ylacombe ylacombe left a comment

Thanks for the fix @Nik-Kras! LGTM! You should run make fixup to fix the code-quality checks, as indicated here.

Regarding the change of behaviour, this is indeed really concerning. It might be related to #32246, for which I've opened a PR here: #32336

I'll try using #32336 on top of your PR to see if it solves your issue. I'll let you know how it goes.

If it doesn't solve the regression, then let's open another issue with a reproducer and the expected results!

@Nik-Kras
Contributor Author

@ylacombe Here are the results of an experiment.
Correct transcription: "How is Mozilla going to handle ambiguities like queue and cue?"
The audio is taken from the Common Voice 17 dataset.

code

from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import torch
import librosa

# Load the processor and model
processor = AutoProcessor.from_pretrained("openai/whisper-tiny")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny")

# Load and preprocess the audio file
audio_path = "audio.mp3"
audio, sr = librosa.load(audio_path, sr=16000)  # Ensure the sample rate is 16kHz

# Preprocess the audio to get the input features
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Generate the transcription using Beam Search with the model
beam_outputs = model.generate(
    inputs["input_features"],
    num_beams=5,  # Number of beams
    num_return_sequences=5,  # Number of hypotheses to return
    early_stopping=True,
    output_scores=True, 
    return_dict_in_generate=True,
)

# Decode the generated transcriptions
hypotheses = [processor.decode(output_ids, skip_special_tokens=True) for output_ids in beam_outputs.sequences]

# Print out the hypotheses
for i, hypothesis in enumerate(hypotheses):
    print(f"Hypothesis {i + 1}: {hypothesis}. Score: {beam_outputs.sequences_scores[i]}")

transformers v4.25.1

Hypothesis 1:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.4627407491207123
Hypothesis 2:  How is Mozilla going to handle and be with this? Thank you and Q.. Score: -0.4789799749851227
Hypothesis 3:  How is Mozilla going to handle and be with this? Thank you, and cute.. Score: -0.48414239287376404
Hypothesis 4:  How is Mozilla going to handle and be with this? Thank you and cute.. Score: -0.4972183108329773
Hypothesis 5:  How is Mozilla going to handle and be with this? Thank you, and Q.. Score: -0.5054414868354797

transformers v4.44.1 + My Fix

Hypothesis 1:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495038032531738
Hypothesis 2:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495040416717529
Hypothesis 3:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495036840438843
Hypothesis 4:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495036244392395
Hypothesis 5:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495033264160156

transformers v4.44.1 + My Fix + Your Fix

Hypothesis 1:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495038032531738
Hypothesis 2:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495040416717529
Hypothesis 3:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495036840438843
Hypothesis 4:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495036244392395
Hypothesis 5:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495033264160156

There is something else: the core beam search algorithm seems incorrect. The algorithm should NOT select the same tokens for different beams; that is the point of the top-k selection in beam search.
I guess another issue should be opened.

@ylacombe
Contributor

ylacombe commented Sep 12, 2024

@Nik-Kras, thanks for taking the time to test it. I get the same results on my side.

I managed to identify why this is happening. It was introduced in #30984 and happens because _expand_variables_for_generation artificially expands the batch size to num_return_sequences. When this is then passed to GenerationMixin.generate, since batch_size=5, the model generates batch_size*num_beams beams but only keeps the most probable one for each element of the batch.

In other words, num_return_sequences is no longer compatible with short-form and long-form generation. Now that I've identified the issue, I'll think about how best to solve it. Happy to get your thoughts here!
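
A toy, single-step simulation of that effect (not the actual Whisper code path, just to make the mechanism concrete): tiling the same input num_return_sequences times and keeping only the top-1 candidate per row returns the same result repeated, whereas a single search keeping the top num_return_sequences candidates returns distinct ones.

import torch

torch.manual_seed(0)
vocab_size, num_return_sequences = 10, 5
log_probs = torch.log_softmax(torch.randn(vocab_size), dim=-1)

# Intended behaviour: one search, keep the top num_return_sequences distinct candidates.
intended = log_probs.topk(num_return_sequences).indices            # 5 different ids

# Behaviour after the batch expansion: the same input tiled 5 times, each row
# keeps only its single best candidate, so every "returned sequence" is identical.
expanded = log_probs.repeat(num_return_sequences, 1)
collapsed = expanded.topk(1, dim=-1).indices.squeeze(-1)           # same id 5 times

print(intended)
print(collapsed)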
Also cc @eustlb

PS: don't forget to run make fixup to fix the code quality as indicated here; otherwise, your PR won't be merged.

@ylacombe
Contributor

Thanks for iterating! LGTM!

Would you like to open another issue for the problem you've identified?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@amyeroberts amyeroberts left a comment


Thanks for fixing and adding a test!

@amyeroberts amyeroberts merged commit c29a869 into huggingface:main Sep 17, 2024
21 checks passed
itazap pushed a commit to NielsRogge/transformers that referenced this pull request Sep 20, 2024
…gingface#32970)

* added sequences_scores to the output

* added beam_indices to output

* added test to check for beam_indices, sequences_scores and their shape

* removed redundant whitespaces

* make fixup
amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Oct 2, 2024
…gingface#32970)

* added sequences_scores to the output

* added beam_indices to output

* added test to check for beam_indices, sequences_scores and their shape

* removed redundant whitespaces

* make fixup