
[Bug]: Running Llama4 Scout 16E with 10000 input length crashes vLLM, but it runs fine with FA2 #16948


Description

@liuzijing2014

Your current environment

The output of `python collect_env.py`
8xH100
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-ml-py==12.570.86
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pynvml==12.0.0
[pip3] pyzmq==26.3.0
[pip3] sentence-transformers==3.2.1
[pip3] torch==2.6.0
[pip3] torchaudio==2.6.0
[pip3] torchvision==0.21.0
[pip3] transformers==4.52.0.dev0
[pip3] transformers-stream-generator==0.0.5
[pip3] triton==3.2.0
[pip3] tritonclient==2.51.0
[pip3] vector-quantize-pytorch==1.21.2
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-cublas-cu12        12.4.5.8                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.4.127                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.2.1.3                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.5.147               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.6.1.9                 pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.3.1.170               pypi_0    pypi
[conda] nvidia-cusparselt-cu12    0.6.2                    pypi_0    pypi
[conda] nvidia-ml-py              12.570.86                pypi_0    pypi
[conda] nvidia-nccl-cu12          2.21.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.4.127                 pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.4.127                 pypi_0    pypi
[conda] pynvml                    12.0.0                   pypi_0    pypi
[conda] pyzmq                     26.3.0                   pypi_0    pypi
[conda] sentence-transformers     3.2.1                    pypi_0    pypi
[conda] torch                     2.6.0                    pypi_0    pypi
[conda] torchaudio                2.6.0                    pypi_0    pypi
[conda] torchvision               0.21.0                   pypi_0    pypi
[conda] transformers              4.52.0.dev0              pypi_0    pypi
[conda] transformers-stream-generator 0.0.5                    pypi_0    pypi
[conda] triton                    3.2.0                    pypi_0    pypi
[conda] tritonclient              2.51.0                   pypi_0    pypi
[conda] vector-quantize-pytorch   1.21.2                   pypi_0    pypi

🐛 Describe the bug

I am using benchmark_serving.py to test against the Llama4 BF16 checkpoint, and whenever the input length exceeds 10000, the vLLM server crashes. However, if I switch back to FA2 by setting VLLM_FLASH_ATTN_VERSION=2, the vLLM server runs fine.

vLLM server cmd

VLLM_USE_MODELSCOPE=False SAFETENSORS_FAST_GPU=1 vllm serve \
meta-llama/Llama-4-Scout-17B-16E-Instruct \
--disable-log-requests -tp 8 \
--max-num-seqs 64 \
--no-enable-prefix-caching \
--max_num_batched_tokens=80000 \
--max-model-len 30000 
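
For comparison, the FA2 fallback mentioned above amounts to prepending the VLLM_FLASH_ATTN_VERSION=2 environment variable to the same serve command (a sketch of the workaround; all other flags unchanged):

VLLM_FLASH_ATTN_VERSION=2 VLLM_USE_MODELSCOPE=False SAFETENSORS_FAST_GPU=1 vllm serve \
meta-llama/Llama-4-Scout-17B-16E-Instruct \
--disable-log-requests -tp 8 \
--max-num-seqs 64 \
--no-enable-prefix-caching \
--max_num_batched_tokens=80000 \
--max-model-len 30000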

Benchmark cmd

python benchmarks/benchmark_serving.py     \
--backend vllm     \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct    \
--dataset-name random  \
--max-concurrency 64 \
--num-prompts 256 \
--random-input-len 10000 \
--random-output-len 1000
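
A simpler single-request check (hypothetical, not part of the original report) is to send one long prompt directly to the server's OpenAI-compatible completions endpoint; the port and path below are the vLLM defaults, and the prompt placeholder stands for roughly 10000 tokens of text:

# Hypothetical manual repro: one long-prefill request against the running server
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-4-Scout-17B-16E-Instruct", "prompt": "<~10000-token prompt>", "max_tokens": 100}'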

Error log

(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/v1/executor/multiproc_executor.py", line 465, in worker_busy_loop
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     output = func(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]              ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return func(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/v1/worker/gpu_worker.py", line 263, in execute_model
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     output = self.model_runner.execute_model(scheduler_output)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return func(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/v1/worker/gpu_model_runner.py", line 1077, in execute_model
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     hidden_states = self.model(
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]                     ^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return self._call_impl(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return forward_call(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/model_executor/models/mllama4.py", line 777, in forward
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return self.language_model(input_ids, positions, intermediate_tensors,
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return self._call_impl(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return forward_call(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/model_executor/models/llama.py", line 541, in forward
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     model_output = self.model(input_ids, positions, intermediate_tensors,
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/compilation/decorators.py", line 245, in __call__
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     model_output = self.forward(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/model_executor/models/llama.py", line 341, in forward
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     def forward(
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return self._call_impl(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return forward_call(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return fn(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 822, in call_wrapped
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return self._wrapped_call(self, *args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 400, in __call__
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     raise e
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 387, in __call__
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return self._call_impl(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return forward_call(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "<eval_with_key>.98", line 638, in forward
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     submod_1 = self.submod_1(getitem, s0, getitem_1, getitem_2, getitem_3);  getitem = getitem_1 = getitem_2 = submod_1 = None
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 822, in call_wrapped
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return self._wrapped_call(self, *args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 400, in __call__
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     raise e
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 387, in __call__
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return self._call_impl(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return forward_call(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "<eval_with_key>.2", line 5, in forward
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(query_2, key_2, value, output_1, 'language_model.model.layers.0.self_attn.attn');  query_2 = key_2 = value = output_1 = unified_attention_with_output = None
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/_ops.py", line 1123, in __call__
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return self._op(*args, **(kwargs or {}))
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/attention/layer.py", line 415, in unified_attention_with_output
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     self.impl.forward(self,
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/v1/attention/backends/flash_attn.py", line 553, in forward
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     flash_attn_varlen_func(
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/vllm_flash_attn/flash_attn_interface.py", line 253, in flash_attn_varlen_func
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     out, softmax_lse, _, _ = torch.ops._vllm_fa3_C.fwd(
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]                              ^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]   File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/_ops.py", line 1123, in __call__
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return self._op(*args, **(kwargs or {}))
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] RuntimeError: scheduler_metadata must have shape (metadata_size)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]
ERROR 04-21 14:16:49 [core.py:392] EngineCore encountered a fatal error.
ERROR 04-21 14:16:49 [core.py:392] Traceback (most recent call last):
ERROR 04-21 14:16:49 [core.py:392]   File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/v1/engine/core.py", line 383, in run_engine_core
ERROR 04-21 14:16:49 [core.py:392]     engine_core.run_busy_loop()
ERROR 04-21 14:16:49 [core.py:392]   File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/v1/engine/core.py", line 407, in run_busy_loop
ERROR 04-21 14:16:49 [core.py:392]     self._process_engine_step()
ERROR 04-21 14:16:49 [core.py:392]   File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/v1/engine/core.py", line 436, in _process_engine_step
ERROR 04-21 14:16:49 [core.py:392]     outputs = self.step_fn()
ERROR 04-21 14:16:49 [core.py:392]               ^^^^^^^^^^^^^^
ERROR 04-21 14:16:49 [core.py:392]   File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/v1/engine/core.py", line 202, in step
ERROR 04-21 14:16:49 [core.py:392]     output = self.model_executor.execute_model(scheduler_output)
ERROR 04-21 14:16:49 [core.py:392]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 14:16:49 [core.py:392]   File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/v1/executor/multiproc_executor.py", line 146, in execute_model
ERROR 04-21 14:16:49 [core.py:392]     (output, ) = self.collective_rpc("execute_model",
ERROR 04-21 14:16:49 [core.py:392]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 14:16:49 [core.py:392]   File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/v1/executor/multiproc_executor.py", line 185, in collective_rpc
ERROR 04-21 14:16:49 [core.py:392]     raise RuntimeError(
ERROR 04-21 14:16:49 [core.py:392] RuntimeError: Worker failed with error 'scheduler_metadata must have shape (metadata_size)', please check the stack trace above for the root cause
ERROR 04-21 14:16:49 [async_llm.py:386] AsyncLLM output_handler failed.
ERROR 04-21 14:16:49 [async_llm.py:386] Traceback (most recent call last):
ERROR 04-21 14:16:49 [async_llm.py:386]   File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/v1/engine/async_llm.py", line 344, in output_handler
ERROR 04-21 14:16:49 [async_llm.py:386]     outputs = await engine_core.get_output_async()
ERROR 04-21 14:16:49 [async_llm.py:386]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 14:16:49 [async_llm.py:386]   File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/v1/engine/core_client.py", line 694, in get_output_async
ERROR 04-21 14:16:49 [async_llm.py:386]     raise self._format_exception(outputs) from None
ERROR 04-21 14:16:49 [async_llm.py:386] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [2910284]

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Labels: bug (Something isn't working)
Status: Done