Speech2TextForConditionalGeneration broken in transformers 4.51.x

### System Info

- `transformers` version: 4.51.3
- Platform: macOS-15.3.1-arm64-arm-64bit
- Python version: 3.12.9
- Huggingface_hub version: 0.30.2
- Safetensors version: 0.5.3
- Accelerate version: not installed
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (GPU?): 2.7.0 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: <fill in>

### Who can help?

Running the Speech2Text example from the documentation on transformers 4.51.x produces either a nonsensical, repeating transcription or an empty string. The code is taken verbatim from https://huggingface.co/docs/transformers/en/model_doc/speech_to_text

```python
import torch
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
from datasets import load_dataset

model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")


ds = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")

inputs = processor(ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["sampling_rate"], return_tensors="pt")
generated_ids = model.generate(inputs["input_features"], attention_mask=inputs["attention_mask"])

transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
transcription
```

On transformers 4.50.3 it gives the expected output:

```python
['mister quilter is the apostle of the middle classes and we are glad to welcome his gospel']
```

On transformers 4.51.x it gives either nonsense output or an empty transcription; which one I get differs between the two environments I tried:

With Python 3.12 & transformers 4.51.3:
```python
['that man man man man man man man man man man man man turn turn turn turn turn turn turn turn turn thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin thin']
```

With Python 3.9 & transformers 4.51.3:
```python
['']
```
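Both failure modes above (an empty string, or one word repeated until generation stops) can be flagged automatically when comparing versions. The following is a minimal sketch of such a check, not part of transformers: the helper name `looks_degenerate` and the `max_share` threshold of 0.5 are assumptions chosen for illustration.

```python
from collections import Counter


def looks_degenerate(transcription: str, max_share: float = 0.5) -> bool:
    """Heuristic (hypothetical helper): flag an empty transcription or one
    dominated by a single repeated word."""
    words = transcription.split()
    if not words:
        # Covers the [''] result seen in one environment.
        return True
    # Count of the single most frequent word in the transcription.
    top_count = Counter(words).most_common(1)[0][1]
    # Very short outputs are not judged; the threshold is an arbitrary assumption.
    return len(words) >= 4 and top_count / len(words) > max_share


expected = "mister quilter is the apostle of the middle classes and we are glad to welcome his gospel"
print(looks_degenerate(expected))
print(looks_degenerate(""))
```

Applied to the element returned by `processor.batch_decode(...)`, this would let a script fail loudly instead of relying on reading the output by eye.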


### Information

- [x] The official example scripts
- [ ] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction

`conda create --name temp python=3.12`
`conda activate temp`
`pip install torch torchaudio soundfile librosa datasets transformers sentencepiece`
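Because the `pip install` line above does not pin `transformers`, it may help to confirm which version the environment actually resolved before running the script. This is a small sketch using only the standard library; the function names are hypothetical, and the check covers only the 4.51.x range reported here (it makes no claim about other releases).

```python
from importlib.metadata import PackageNotFoundError, version


def in_reported_broken_range(version_string: str) -> bool:
    """True if the version string is a 4.51.x release (the range in this report)."""
    return version_string.split(".")[:2] == ["4", "51"]


def installed_transformers_version():
    """Return the installed transformers version string, or None if absent."""
    try:
        return version("transformers")
    except PackageNotFoundError:
        return None


installed = installed_transformers_version()
if installed is None:
    print("transformers is not installed in this environment")
elif in_reported_broken_range(installed):
    print(f"transformers {installed}: in the 4.51.x range where the bad output was observed")
else:
    print(f"transformers {installed}: outside the 4.51.x range covered by this report")
```

For comparison, the working result quoted earlier was obtained with `transformers` 4.50.3.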

```python
import torch
from transformers import Speech2TextProcessor, Speech2TextForConditionalGeneration
from datasets import load_dataset

model = Speech2TextForConditionalGeneration.from_pretrained("facebook/s2t-small-librispeech-asr")
processor = Speech2TextProcessor.from_pretrained("facebook/s2t-small-librispeech-asr")


ds = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")

inputs = processor(ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["sampling_rate"], return_tensors="pt")
generated_ids = model.generate(inputs["input_features"], attention_mask=inputs["attention_mask"])

transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
transcription
```

### Expected behavior

```python
['mister quilter is the apostle of the middle classes and we are glad to welcome his gospel']
```
