
Commit 55211b0

[Bugfix] Fix chunked prefill for GGUF (#14666)
Signed-off-by: SzymonOzog <szymon.ozog@aleph-alpha.com>
1 parent: 5d043c1

File tree

1 file changed: +7 / -0 lines changed
  • vllm/model_executor/layers/quantization

vllm/model_executor/layers/quantization/gguf.py

Lines changed: 7 additions & 0 deletions
@@ -98,6 +98,13 @@ def get_quant_method(self, layer: torch.nn.Module,
 
 def _fuse_mul_mat(x: torch.Tensor, qweight: torch.Tensor,
                   qweight_type: int) -> torch.Tensor:
+    # HACK: when doing chunked prefill we don't generate output tokens
+    # so input to logits generator is empty which causes invalid parameter
+    if x.shape[0] == 0:
+        return torch.empty(x.shape[0],
+                           qweight.shape[0],
+                           dtype=x.dtype,
+                           device=x.device)
     # there is no need to call any kernel for fp16/bf16
     if qweight_type in UNQUANTIZED_TYPES:
         return x @ qweight.T
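
For context, a minimal standalone sketch of the behavior the guard introduces. The function and variable names below are illustrative, not vLLM's; only the shape logic mirrors the patch. When a chunked-prefill step produces no output tokens, the input handed to this fused matmul for logits computation has zero rows, and passing that empty input on to a GGUF quantized-matmul kernel triggers the invalid-parameter failure described in the commit, so the helper now returns an empty (0, out_features) tensor instead of invoking a kernel.

import torch

def fused_mul_mat_with_guard(x: torch.Tensor, qweight: torch.Tensor) -> torch.Tensor:
    # Guard mirroring the patch: with zero input rows there is nothing to
    # project, so skip the quantized kernel and return an empty result with
    # the expected output width, dtype and device.
    if x.shape[0] == 0:
        return torch.empty(0, qweight.shape[0], dtype=x.dtype, device=x.device)
    # Stand-in for the real quantized path (dequantize weight, then matmul).
    return x @ qweight.to(x.dtype).T

# Empty logits input, as produced by a prefill-only chunk: 0 tokens, hidden size 16.
x = torch.empty(0, 16, dtype=torch.float16)
qweight = torch.zeros(32, 16, dtype=torch.float16)  # stand-in for a GGUF-quantized weight
out = fused_mul_mat_with_guard(x, qweight)
assert out.shape == (0, 32)  # empty output, no kernel call needed

Without the guard, the zero-row input would reach the quantized kernel, which is what produced the invalid-parameter error this commit fixes.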
