Labels
bug (Something isn't working)
Description
Your current environment
The output of `python collect_env.py`
8xH100
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-ml-py==12.570.86
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pynvml==12.0.0
[pip3] pyzmq==26.3.0
[pip3] sentence-transformers==3.2.1
[pip3] torch==2.6.0
[pip3] torchaudio==2.6.0
[pip3] torchvision==0.21.0
[pip3] transformers==4.52.0.dev0
[pip3] transformers-stream-generator==0.0.5
[pip3] triton==3.2.0
[pip3] tritonclient==2.51.0
[pip3] vector-quantize-pytorch==1.21.2
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.4.5.8 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.2.1.3 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.5.147 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.6.1.9 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.3.1.170 pypi_0 pypi
[conda] nvidia-cusparselt-cu12 0.6.2 pypi_0 pypi
[conda] nvidia-ml-py 12.570.86 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.21.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.4.127 pypi_0 pypi
[conda] pynvml 12.0.0 pypi_0 pypi
[conda] pyzmq 26.3.0 pypi_0 pypi
[conda] sentence-transformers 3.2.1 pypi_0 pypi
[conda] torch 2.6.0 pypi_0 pypi
[conda] torchaudio 2.6.0 pypi_0 pypi
[conda] torchvision 0.21.0 pypi_0 pypi
[conda] transformers 4.52.0.dev0 pypi_0 pypi
[conda] transformers-stream-generator 0.0.5 pypi_0 pypi
[conda] triton 3.2.0 pypi_0 pypi
[conda] tritonclient 2.51.0 pypi_0 pypi
[conda] vector-quantize-pytorch 1.21.2 pypi_0 pypi
🐛 Describe the bug
I am using benchmark_serving.py to test against a Llama 4 BF16 checkpoint. As long as I make the input length greater than 10000 tokens, the vLLM server crashes (with FlashAttention 3, the default). However, if I switch back to FA2 by setting VLLM_FLASH_ATTN_VERSION=2, the vLLM server runs fine.
vLLM server cmd
VLLM_USE_MODELSCOPE=False SAFETENSORS_FAST_GPU=1 vllm serve \
meta-llama/Llama-4-Scout-17B-16E-Instruct \
--disable-log-requests -tp 8 \
--max-num-seqs 64 \
--no-enable-prefix-caching \
    --max-num-batched-tokens 80000 \
--max-model-len 30000
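For reference, the FA2 workaround mentioned above is just the same serve command with `VLLM_FLASH_ATTN_VERSION=2` prepended (a sketch; all other flags unchanged from the report):

```shell
# Work around the FA3 crash by forcing FlashAttention 2
VLLM_FLASH_ATTN_VERSION=2 \
VLLM_USE_MODELSCOPE=False SAFETENSORS_FAST_GPU=1 vllm serve \
    meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --disable-log-requests -tp 8 \
    --max-num-seqs 64 \
    --no-enable-prefix-caching \
    --max-num-batched-tokens 80000 \
    --max-model-len 30000
```

With this environment variable set, the same >10000-token benchmark runs complete without the crash.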
Benchmark cmd
python benchmarks/benchmark_serving.py \
--backend vllm \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct \
--dataset-name random \
--max-concurrency 64 \
--num-prompts 256 \
--random-input-len 10000 \
--random-output-len 1000
Error log
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/v1/executor/multiproc_executor.py", line 465, in worker_busy_loop
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     output = func(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]              ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return func(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/v1/worker/gpu_worker.py", line 263, in execute_model
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     output = self.model_runner.execute_model(scheduler_output)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return func(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/v1/worker/gpu_model_runner.py", line 1077, in execute_model
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     hidden_states = self.model(
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]                     ^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return self._call_impl(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return forward_call(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/model_executor/models/mllama4.py", line 777, in forward
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return self.language_model(input_ids, positions, intermediate_tensors,
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return self._call_impl(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return forward_call(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/model_executor/models/llama.py", line 541, in forward
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     model_output = self.model(input_ids, positions, intermediate_tensors,
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/compilation/decorators.py", line 245, in __call__
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     model_output = self.forward(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/model_executor/models/llama.py", line 341, in forward
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     def forward(
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return self._call_impl(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return forward_call(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return fn(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 822, in call_wrapped
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return self._wrapped_call(self, *args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 400, in __call__
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     raise e
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 387, in __call__
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return self._call_impl(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return forward_call(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "<eval_with_key>.98", line 638, in forward
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     submod_1 = self.submod_1(getitem, s0, getitem_1, getitem_2, getitem_3); getitem = getitem_1 = getitem_2 = submod_1 = None
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 822, in call_wrapped
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return self._wrapped_call(self, *args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 400, in __call__
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     raise e
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/fx/graph_module.py", line 387, in __call__
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return self._call_impl(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return forward_call(*args, **kwargs)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "<eval_with_key>.2", line 5, in forward
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(query_2, key_2, value, output_1, 'language_model.model.layers.0.self_attn.attn'); query_2 = key_2 = value = output_1 = unified_attention_with_output = None
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/_ops.py", line 1123, in __call__
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return self._op(*args, **(kwargs or {}))
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/attention/layer.py", line 415, in unified_attention_with_output
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     self.impl.forward(self,
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/v1/attention/backends/flash_attn.py", line 553, in forward
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     flash_attn_varlen_func(
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/vllm_flash_attn/flash_attn_interface.py", line 253, in flash_attn_varlen_func
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     out, softmax_lse, _, _ = torch.ops._vllm_fa3_C.fwd(
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]                              ^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] File "/home/zijingliu/.conda/envs/vllm/lib/python3.12/site-packages/torch/_ops.py", line 1123, in __call__
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]     return self._op(*args, **(kwargs or {}))
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470] RuntimeError: scheduler_metadata must have shape (metadata_size)
(VllmWorker rank=1 pid=2920557) ERROR 04-21 14:16:49 [multiproc_executor.py:470]
ERROR 04-21 14:16:49 [core.py:392] EngineCore encountered a fatal error.
ERROR 04-21 14:16:49 [core.py:392] Traceback (most recent call last):
ERROR 04-21 14:16:49 [core.py:392] File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/v1/engine/core.py", line 383, in run_engine_core
ERROR 04-21 14:16:49 [core.py:392] engine_core.run_busy_loop()
ERROR 04-21 14:16:49 [core.py:392] File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/v1/engine/core.py", line 407, in run_busy_loop
ERROR 04-21 14:16:49 [core.py:392] self._process_engine_step()
ERROR 04-21 14:16:49 [core.py:392] File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/v1/engine/core.py", line 436, in _process_engine_step
ERROR 04-21 14:16:49 [core.py:392] outputs = self.step_fn()
ERROR 04-21 14:16:49 [core.py:392] ^^^^^^^^^^^^^^
ERROR 04-21 14:16:49 [core.py:392] File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/v1/engine/core.py", line 202, in step
ERROR 04-21 14:16:49 [core.py:392] output = self.model_executor.execute_model(scheduler_output)
ERROR 04-21 14:16:49 [core.py:392] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 14:16:49 [core.py:392] File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/v1/executor/multiproc_executor.py", line 146, in execute_model
ERROR 04-21 14:16:49 [core.py:392] (output, ) = self.collective_rpc("execute_model",
ERROR 04-21 14:16:49 [core.py:392] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 14:16:49 [core.py:392] File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/v1/executor/multiproc_executor.py", line 185, in collective_rpc
ERROR 04-21 14:16:49 [core.py:392] raise RuntimeError(
ERROR 04-21 14:16:49 [core.py:392] RuntimeError: Worker failed with error 'scheduler_metadata must have shape (metadata_size)', please check the stack trace above for the root cause
ERROR 04-21 14:16:49 [async_llm.py:386] AsyncLLM output_handler failed.
ERROR 04-21 14:16:49 [async_llm.py:386] Traceback (most recent call last):
ERROR 04-21 14:16:49 [async_llm.py:386] File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/v1/engine/async_llm.py", line 344, in output_handler
ERROR 04-21 14:16:49 [async_llm.py:386] outputs = await engine_core.get_output_async()
ERROR 04-21 14:16:49 [async_llm.py:386] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-21 14:16:49 [async_llm.py:386] File "/data/users/zijingliu/gitrepos/liuzijing2014/vllm/vllm/v1/engine/core_client.py", line 694, in get_output_async
ERROR 04-21 14:16:49 [async_llm.py:386] raise self._format_exception(outputs) from None
ERROR 04-21 14:16:49 [async_llm.py:386] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [2910284]
Status: Done