Run the code below to get 5 Beam Search hypotheses for an audio transcription:
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import torch
import librosa

# Load the processor and model
processor = AutoProcessor.from_pretrained("openai/whisper-tiny")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny")

# Load and preprocess the audio file
audio_path = "audio.mp3"
audio, sr = librosa.load(audio_path, sr=16000)  # Ensure the sample rate is 16 kHz

# Preprocess the audio to get the input features
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Generate the transcription using Beam Search with the model
beam_outputs = model.generate(
    inputs["input_features"],
    num_beams=5,              # Number of beams
    num_return_sequences=5,   # Number of hypotheses to return
    early_stopping=True,
    output_scores=True,
    return_dict_in_generate=True,
)

# Decode the generated transcriptions
hypotheses = [processor.decode(output_ids, skip_special_tokens=True) for output_ids in beam_outputs.sequences]

# Print out the hypotheses
for i, hypothesis in enumerate(hypotheses):
    print(f"Hypothesis {i+1}: {hypothesis}. Score: {beam_outputs.sequences_scores[i]}")
Expected behavior
Together with @ylacombe, we identified that after Pull Request #30984, Whisper Beam Search generation doesn't work as intended.
See the more detailed discussion in Pull Request #32970.
The code above should return 5 unique hypotheses, following the core principle of Beam Search: keep the num_beams best candidates at each step, in a top-k fashion. Instead, we get the single highest-probability result repeated. See below for how Beam Search used to work in version v4.25.1 and how it works now.
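For reference, here is a minimal, self-contained sketch of that principle over a toy three-token vocabulary (the scorer below is purely illustrative and not tied to Whisper): at every step the num_beams highest-scoring partial sequences are kept, so requesting num_return_sequences=num_beams should yield distinct hypotheses with decreasing scores.

import math

def toy_next_token_logprobs(prefix):
    # Hypothetical scorer: token 0 is preferred, but tokens 1 and 2 stay competitive.
    return {0: math.log(0.5), 1: math.log(0.3), 2: math.log(0.2)}

def beam_search(num_beams=3, max_len=4):
    beams = [((), 0.0)]  # (token tuple, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in toy_next_token_logprobs(seq).items():
                candidates.append((seq + (tok,), score + lp))
        # Keep the num_beams best candidates, not just the single best one.
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:num_beams]
    return beams

for seq, score in beam_search():
    print(seq, round(score, 3))  # three distinct sequences with decreasing scores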
transformers v4.25.1
Hypothesis 1: How is Mozilla going to handle and be with this? Thank you.. Score: -0.4627407491207123
Hypothesis 2: How is Mozilla going to handle and be with this? Thank you and Q.. Score: -0.4789799749851227
Hypothesis 3: How is Mozilla going to handle and be with this? Thank you, and cute.. Score: -0.48414239287376404
Hypothesis 4: How is Mozilla going to handle and be with this? Thank you and cute.. Score: -0.4972183108329773
Hypothesis 5: How is Mozilla going to handle and be with this? Thank you, and Q.. Score: -0.5054414868354797
transformers v4.44.1 + My Fix from #32970
Hypothesis 1: How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495038032531738
Hypothesis 2: How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495040416717529
Hypothesis 3: How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495036840438843
Hypothesis 4: How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495036244392395
Hypothesis 5: How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495033264160156
@ylacombe has found the bug in the _expand_variables_for_generation function. The function artificially expands the batch size to num_return_sequences, which causes an issue when this expanded batch size is passed to GenerationMixin.generate. Specifically, if batch_size=5 and num_return_sequences > 1, the model generates batch_size * num_beams beams but retains only the most probable beam for each element of the original batch.
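A rough sketch of the failure mode, under the assumption that the input is simply tiled along the batch dimension before generation (this is an illustration of the mechanism, not the actual transformers code):

import torch

batch_size, num_return_sequences = 1, 5
input_features = torch.randn(batch_size, 80, 3000)  # dummy Whisper-style features

# What the expansion effectively does: tile the same input num_return_sequences times.
expanded = input_features.repeat_interleave(num_return_sequences, dim=0)
print(expanded.shape)  # torch.Size([5, 80, 3000]) -- five identical copies

# Each of the five identical inputs then runs its own beam search inside generate()
# and keeps only its single most probable beam, so the caller gets the same top
# hypothesis five times instead of the five best distinct beams of the original input.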
Impact
This bug results in the num_return_sequences parameter not working with either short-form or long-form generation. Users expecting multiple return sequences will only receive the most probable sequence, which may not meet the intended use case.
cc @eustlb
Hey @Nik-Kras, indeed, that's an important issue to solve! We are working on multiple fixes to the Transformers Whisper integration (see #34135, #34111 and #33512). This issue first requires solving a bug in the avg_logprob computation (pretty simple: the removal of the eos_token_id resulted in misaligned scores and tokens), and then a design choice on how num_return_sequences should be handled with long-form generation. Indeed, in sequential long-form decoding we make multiple calls to generate, and while each beam search returns the best num_return_sequences for its window, we cannot continue a correct beam search into the next 30-second window. My take is that we could do a partial beam search, where we would return the concatenated results of each call to generate with num_return_sequences. This should come as soon as the previously cited fixes are merged!
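To make the idea concrete, here is a rough sketch of such a partial beam search. It reuses the processor, model, and audio loaded in the reproduction code above, splits the audio into fixed 30-second chunks, and concatenates the i-th hypothesis across chunks. This is only an illustration of the proposal (and assumes num_return_sequences works correctly per call), not the planned transformers implementation.

def partial_beam_search_long_form(audio, sr=16000, num_beams=5, num_return_sequences=5):
    window = 30 * sr
    # Split the audio into 30-second chunks.
    chunks = [audio[i:i + window] for i in range(0, len(audio), window)]
    per_chunk_hyps = []
    for chunk in chunks:
        features = processor(chunk, sampling_rate=sr, return_tensors="pt")
        out = model.generate(
            features["input_features"],
            num_beams=num_beams,
            num_return_sequences=num_return_sequences,
            return_dict_in_generate=True,
        )
        hyps = [processor.decode(seq, skip_special_tokens=True) for seq in out.sequences]
        per_chunk_hyps.append(hyps)
    # Concatenate the i-th hypothesis of every chunk. Note this is not a true
    # global beam search: beams are not carried across chunk boundaries.
    return [
        " ".join(chunk_hyps[i] for chunk_hyps in per_chunk_hyps)
        for i in range(num_return_sequences)
    ]

print(partial_beam_search_long_form(audio))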
System Info

Who can help?
@ylacombe @eustlb

Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
Run the reproduction code above, together with the fix from "Fix sequences_scores in the Whisper beam search output" #32970 (it allows outputting sequence_score).