Your current environment
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
tcpxResult_t tcpxInit(tcpxDebugLogger_t):379 NET/GPUDirectTCPX : GPUDirectTCPX enable: 1
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
tcpxResult_t tcpxInit(tcpxDebugLogger_t):379 NET/GPUDirectTCPX : GPUDirectTCPX enable: 1
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
tcpxResult_t tcpxInit(tcpxDebugLogger_t):379 NET/GPUDirectTCPX : GPUDirectTCPX enable: 1
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
tcpxResult_t tcpxInit(tcpxDebugLogger_t):379 NET/GPUDirectTCPX : GPUDirectTCPX enable: 1
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
(VllmWorker TP0 pid=406) ERROR 08-16 01:07:56 [multiproc_executor.py:596] WorkerProc hit an exception.
(VllmWorker TP0 pid=406) ERROR 08-16 01:07:56 [multiproc_executor.py:596] Traceback (most recent call last):
(VllmWorker TP0 pid=406) ERROR 08-16 01:07:56 [multiproc_executor.py:596] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 591, in worker_busy_loop
(VllmWorker TP0 pid=406) ERROR 08-16 01:07:56 [multiproc_executor.py:596] output = func(*args, **kwargs)
(VllmWorker TP0 pid=406) ERROR 08-16 01:07:56 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=406) ERROR 08-16 01:07:56 [multiproc_executor.py:596] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP0 pid=406) ERROR 08-16 01:07:56 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP0 pid=406) ERROR 08-16 01:07:56 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=406) ERROR 08-16 01:07:56 [multiproc_executor.py:596] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 362, in execute_model
(VllmWorker TP0 pid=406) ERROR 08-16 01:07:56 [multiproc_executor.py:596] output = self.model_runner.execute_model(scheduler_output,
(VllmWorker TP0 pid=406) ERROR 08-16 01:07:56 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=406) ERROR 08-16 01:07:56 [multiproc_executor.py:596] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP0 pid=406) ERROR 08-16 01:07:56 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP0 pid=406) ERROR 08-16 01:07:56 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=406) ERROR 08-16 01:07:56 [multiproc_executor.py:596] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1726, in execute_model
(VllmWorker TP0 pid=406) ERROR 08-16 01:07:56 [multiproc_executor.py:596] valid_sampled_token_ids = sampled_token_ids.tolist()
(VllmWorker TP0 pid=406) ERROR 08-16 01:07:56 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=406) ERROR 08-16 01:07:56 [multiproc_executor.py:596] RuntimeError: CUDA error: device-side assert triggered
(VllmWorker TP0 pid=406) ERROR 08-16 01:07:56 [multiproc_executor.py:596] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorker TP0 pid=406) ERROR 08-16 01:07:56 [multiproc_executor.py:596] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorker TP0 pid=406) ERROR 08-16 01:07:56 [multiproc_executor.py:596] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorker TP0 pid=406) ERROR 08-16 01:07:56 [multiproc_executor.py:596]
🐛 Describe the bug
Running a Qwen3MoE model with the FlashInfer (FI) sampler. The error goes away after `export VLLM_USE_FLASHINFER_SAMPLER=0`.
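For reference, a minimal sketch of the workaround: disable the FlashInfer sampler via the environment variable before vLLM is imported. Only `VLLM_USE_FLASHINFER_SAMPLER` comes from this report; the model id, tensor-parallel size, and the optional `CUDA_LAUNCH_BLOCKING` debug setting below are placeholder assumptions for whatever Qwen3MoE deployment is being run.

```python
import os

# Workaround from this report: fall back to the default sampler instead of FlashInfer.
# Must be set before vllm is imported so the flag is picked up at initialization time.
os.environ["VLLM_USE_FLASHINFER_SAMPLER"] = "0"

# Optional while debugging: surface CUDA errors at the offending kernel launch,
# as suggested by the traceback above.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from vllm import LLM, SamplingParams

# Placeholder model id / TP size -- substitute the actual Qwen3MoE checkpoint and setup.
llm = LLM(model="Qwen/Qwen3-30B-A3B", tensor_parallel_size=4)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```

The same effect can be achieved from the shell (e.g. `VLLM_USE_FLASHINFER_SAMPLER=0 vllm serve ...`). This only bypasses the FlashInfer sampler path; it does not address the underlying out-of-bounds scatter/gather assert.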
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.