
Conversation

@ceciliapeng2011 (Contributor) commented on Jan 4, 2026

Improve KVCache quantization, XAttention flexibility, and sparse attention performance.

Details:

  • Use float as the internal precision for KVCache quantization in the kvcache_update CM kernel, fixing accuracy issues in the QWen3-32B int8 model (see the sketch after this list).
  • Remove the restriction in the PA 2nd-token CM kernel that limited heads_num / kv_heads_num to <= 8, resolving a MiniCPM4 failure.
  • Fix a phi-3-mini-128k-instruct issue caused by head_size=96 not being divisible by 64 in the xattention_gemm_qk kernel.
  • Support k/v head_size 96 in the kvcache_update kernel.
  • Root-cause and fix an issue where the first token of MiniCPM4 was still wrong, even with xattn_threshold set to 100, for both i8 and fp16 KVCache.
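For context on the first bullet, here is a minimal, hypothetical C++ sketch of per-row asymmetric int8 KVCache quantization. It is not the actual kvcache_update CM kernel; the function name quantize_row_int8 and the data layout are illustrative assumptions. It only shows why scale_val and the zero point should be computed in float: doing that arithmetic in half precision can overflow or lose accuracy when a row's value range is large.

```cpp
// Hypothetical sketch (not the real CM kernel): per-row asymmetric int8
// quantization of a KV-cache slice, with the scale/zero-point math kept in float.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

void quantize_row_int8(const std::vector<float>& row,
                       std::vector<int8_t>& out,
                       float& scale_val, float& zp_val) {
    float min_v = *std::min_element(row.begin(), row.end());
    float max_v = *std::max_element(row.begin(), row.end());
    // Keep the intermediates in float: an fp16 (max - min) or zero point can
    // overflow or round badly for large activation ranges.
    scale_val = (max_v - min_v) / 255.0f;
    if (scale_val == 0.0f) scale_val = 1.0f;   // guard against a constant row
    zp_val = -min_v / scale_val - 128.0f;      // asymmetric zero point
    out.resize(row.size());
    for (size_t i = 0; i < row.size(); ++i) {
        float q = std::round(row[i] / scale_val + zp_val);
        out[i] = static_cast<int8_t>(std::clamp(q, -128.0f, 127.0f));
    }
}
```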

Tickets:

…ed with float precision to avoid an overflow in zp.
@ceciliapeng2011 requested review from a team as code owners on January 4, 2026 03:17
github-actions bot added the label category: GPU (OpenVINO GPU plugin) on Jan 4, 2026
@ceciliapeng2011 marked this pull request as draft on January 4, 2026 03:18
@ceciliapeng2011 changed the title from "fix QWen3-32B int8 model accuracy issue: scale_val should be calculat…" to "[GPU] some fixes and optimizations to CM PA and XAttention kernels" on Jan 4, 2026
@ceciliapeng2011 marked this pull request as ready for review on January 9, 2026 09:01