Description
Your current environment
OS: Ubuntu 22.04.5 LTS (aarch64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 4.0.2
Libc version: glibc-2.35
Versions of relevant libraries:
[pip3] mypy==1.15.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] pyzmq==26.3.0
[pip3] torch==2.5.1
[pip3] torch-npu==2.5.1
[pip3] torchvision==0.20.1
[pip3] transformers==4.52.4
[conda] Could not collect
vLLM Version: 0.7.4
vLLM Ascend Version: 0.7.3.post2.dev2+gb69d41d (git sha: b69d41d)
🐛 Describe the bug
When running inference with the v0.7.3-dev model_runner.py, the AscendSampler is used, and it cannot handle an empty logits tensor, which chunked prefill requires.
Snippet of the failure code:
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/opt/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/root/vllm-ascend-vanilla/vllm-ascend/vllm_ascend/sample/sampler.py", line 61, in forward
[rank0]: logits = _apply_top_k_top_p_npu(logits, sampling_tensors.top_ps,
[rank0]: File "/root/vllm-ascend-vanilla/vllm-ascend/vllm_ascend/sample/sampler.py", line 126, in _apply_top_k_top_p_npu
[rank0]: top_p_mask[:, -1] = True
[rank0]: IndexError: index -1 is out of bounds for dimension 1 with size 0
In vllm_ascend/sample/sampler.py, when the logits tensor passed to _apply_top_k_top_p_npu is empty, the following code misbehaves:
cutoff = top_k_mask.sum(dim=-1).min()
probs_sort = logits_sort.softmax(dim=-1)[:, cutoff:]
probs_sum = probs_sort.cumsum(dim=-1)
top_p_mask = probs_sum > 1 - p.unsqueeze(dim=1)
top_p_mask[:, -1] = True
With empty logits, cutoff comes out extremely large, so the slice [:, cutoff:] leaves probs_sort with shape (0, 0). Indexing top_p_mask[:, -1] then raises the IndexError shown in the traceback.
Please handle empty logits tensors in this path so that chunked prefill works.
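One possible shape for a fix is an early return before any masking happens. The sketch below is not the vllm-ascend code: apply_top_p_guarded is a hypothetical name, it covers only the top-p step, and it assumes the empty input has zero elements (for example shape (0, vocab_size)). It runs on CPU with plain PyTorch and is meant only to illustrate the guard.

```python
import torch


def apply_top_p_guarded(logits: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    """Hypothetical top-p filter that tolerates an empty batch.

    Illustrative only; this is not the vllm-ascend implementation.
    """
    # Assumption: an "empty" logits tensor has zero elements, so there is
    # nothing to filter. Returning early avoids indexing column -1 later.
    if logits.numel() == 0:
        return logits

    # Sort ascending so the most likely token ends up in the last column.
    logits_sort, logits_idx = logits.sort(dim=-1, descending=False)
    probs_sum = logits_sort.softmax(dim=-1).cumsum(dim=-1)

    # Keep tokens once the cumulative mass of the tail exceeds 1 - p.
    top_p_mask = probs_sum > 1 - p.unsqueeze(dim=1)
    top_p_mask[:, -1] = True  # always keep the most likely token
    logits_sort = logits_sort.masked_fill(~top_p_mask, float("-inf"))

    # Undo the sort so each logit returns to its original vocabulary slot.
    return logits_sort.scatter(dim=-1, index=logits_idx, src=logits_sort)
```

With this guard, a call such as apply_top_p_guarded(torch.empty((0, 8)), torch.empty((0,))) returns the empty tensor unchanged, while non-empty batches go through the usual masking. Whether the real fix belongs inside _apply_top_k_top_p_npu or earlier in the sampler's forward is a design choice for the maintainers.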