Closed

Labels: bug (Something isn't working)
Description
Your current environment
The output of `python collect_env.py` was not provided.
🐛 Describe the bug
I am serving with the vLLM v1 engine (vllm==0.7.2), using qwen32B-instruct with FP8 static quantization.

The server hits the following error:
```
ERROR 02-11 07:01:54 core.py:210] EngineCore hit an exception: Traceback (most recent call last):
ERROR 02-11 07:01:54 core.py:210] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 203, in run_engine_core
ERROR 02-11 07:01:54 core.py:210] engine_core.run_busy_loop()
ERROR 02-11 07:01:54 core.py:210] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 243, in run_busy_loop
ERROR 02-11 07:01:54 core.py:210] outputs = self.step()
ERROR 02-11 07:01:54 core.py:210] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 129, in step
ERROR 02-11 07:01:54 core.py:210] output = self.model_executor.execute_model(scheduler_output)
ERROR 02-11 07:01:54 core.py:210] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/abstract.py", line 77, in execute_model
ERROR 02-11 07:01:54 core.py:210] output = self.collective_rpc("execute_model",
ERROR 02-11 07:01:54 core.py:210] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 51, in collective_rpc
ERROR 02-11 07:01:54 core.py:210] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 02-11 07:01:54 core.py:210] File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2220, in run_method
ERROR 02-11 07:01:54 core.py:210] return func(*args, **kwargs)
ERROR 02-11 07:01:54 core.py:210] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 02-11 07:01:54 core.py:210] return func(*args, **kwargs)
ERROR 02-11 07:01:54 core.py:210] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_worker.py", line 236, in execute_model
ERROR 02-11 07:01:54 core.py:210] output = self.model_runner.execute_model(scheduler_output)
ERROR 02-11 07:01:54 core.py:210] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 02-11 07:01:54 core.py:210] return func(*args, **kwargs)
ERROR 02-11 07:01:54 core.py:210] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 804, in execute_model
ERROR 02-11 07:01:54 core.py:210] sampler_output = self.model.sample(
ERROR 02-11 07:01:54 core.py:210] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 505, in sample
ERROR 02-11 07:01:54 core.py:210] next_tokens = self.sampler(logits, sampling_metadata)
ERROR 02-11 07:01:54 core.py:210] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 02-11 07:01:54 core.py:210] return self._call_impl(*args, **kwargs)
ERROR 02-11 07:01:54 core.py:210] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 02-11 07:01:54 core.py:210] return forward_call(*args, **kwargs)
ERROR 02-11 07:01:54 core.py:210] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/sample/sampler.py", line 46, in forward
ERROR 02-11 07:01:54 core.py:210] logits = self.apply_penalties(logits, sampling_metadata)
ERROR 02-11 07:01:54 core.py:210] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/sample/sampler.py", line 130, in apply_penalties
ERROR 02-11 07:01:54 core.py:210] logits = apply_all_penalties(
ERROR 02-11 07:01:54 core.py:210] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/sample/ops/penalties.py", line 42, in apply_all_penalties
ERROR 02-11 07:01:54 core.py:210] return apply_penalties(logits, prompt_token_ids, output_tokens_t,
ERROR 02-11 07:01:54 core.py:210] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/utils.py", line 44, in apply_penalties
ERROR 02-11 07:01:54 core.py:210] _, prompt_mask = get_token_bin_counts_and_mask(prompt_tokens_tensor,
ERROR 02-11 07:01:54 core.py:210] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/utils.py", line 18, in get_token_bin_counts_and_mask
ERROR 02-11 07:01:54 core.py:210] bin_counts.scatter_add_(1, tokens, torch.ones_like(tokens))
ERROR 02-11 07:01:54 core.py:210] RuntimeError: Expected index [32, 1543] to be smaller than self [31, 152065] apart from dimension 1 and to be smaller size than src [32, 1543]
ERROR 02-11 07:01:54 core.py:210]
```
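For context, the failing call is a `scatter_add_` along the vocab dimension, and the error message says the index tensor has one more row (batch size 32) than the `bin_counts` buffer it scatters into (batch size 31). The following is a minimal standalone sketch (toy shapes, not vLLM code) reproducing the same `RuntimeError`, and showing that the call succeeds once the batch dimensions agree:

```python
import torch

vocab_size = 8  # toy vocab; the real error used 152065

# Mismatched batch sizes: bin_counts has 31 rows, tokens has 32,
# mirroring "Expected index [32, 1543] to be smaller than self [31, 152065]".
bin_counts = torch.zeros(31, vocab_size, dtype=torch.long)
tokens = torch.randint(0, vocab_size, (32, 5))

try:
    bin_counts.scatter_add_(1, tokens, torch.ones_like(tokens))
except RuntimeError as e:
    print("scatter_add_ failed:", e)

# With matching batch sizes the same call works: each row of tokens
# contributes 5 ones into that row's token bins.
bin_counts = torch.zeros(32, vocab_size, dtype=torch.long)
bin_counts.scatter_add_(1, tokens, torch.ones_like(tokens))
print("row sums:", bin_counts.sum(dim=1).tolist())  # every row sums to 5
```

This suggests the bug is a stale or off-by-one batch dimension somewhere upstream of `get_token_bin_counts_and_mask` (the prompt-token tensor reflecting 31 requests while the current step has 32), rather than a problem in `scatter_add_` itself.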
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
imkero