
Conversation

@ceciliapeng2011 (Contributor) commented on Jan 4, 2026

Improve KVCache quantization, XAttention flexibility, and sparse attention performance.

Details:

  • Use float as the internal precision for KVCache quantization in the kvcache_update CM kernel, fixing accuracy issues in the QWen3-32B int8 model (see the sketch after this list).
  • Remove the restriction in the PA 2nd-token CM kernel that limited heads_num / kv_heads_num to <= 8, resolving a MiniCPM4 failure.
  • Fix a phi-3-mini-128k-instruct issue caused by head_size=96 not being divisible by 64 in the xattention_gemm_qk kernel.
  • Support k/v head_size 96 in the kvcache_update kernel.
  • Root-cause and fix an issue where the first token of MiniCPM4 was still wrong, even with xattn_threshold set to 100, for both i8 and fp16 KVCache.
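For context on the first bullet, here is a minimal, hypothetical C++ sketch of per-row asymmetric int8 KVCache quantization. It is not the actual kvcache_update CM kernel; the function name quantize_row_int8 and the data layout are illustrative assumptions. It only shows why scale_val and the zero point should be computed in float: doing that arithmetic in half precision can overflow or lose accuracy when a row's value range is large.

```cpp
// Hypothetical sketch (not the real CM kernel): per-row asymmetric int8
// quantization of a KV-cache slice, with the scale/zero-point math kept in float.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

void quantize_row_int8(const std::vector<float>& row,
                       std::vector<int8_t>& out,
                       float& scale_val, float& zp_val) {
    float min_v = *std::min_element(row.begin(), row.end());
    float max_v = *std::max_element(row.begin(), row.end());
    // Keep the intermediates in float: an fp16 (max - min) or zero point can
    // overflow or round badly for large activation ranges.
    scale_val = (max_v - min_v) / 255.0f;
    if (scale_val == 0.0f) scale_val = 1.0f;   // guard against a constant row
    zp_val = -min_v / scale_val - 128.0f;      // asymmetric zero point
    out.resize(row.size());
    for (size_t i = 0; i < row.size(); ++i) {
        float q = std::round(row[i] / scale_val + zp_val);
        out[i] = static_cast<int8_t>(std::clamp(q, -128.0f, 127.0f));
    }
}
```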

Tickets:

…ed with float precision to avoid an overflow in zp.
@ceciliapeng2011 requested review from a team as code owners on January 4, 2026 03:17
github-actions bot added the label category: GPU (OpenVINO GPU plugin) on Jan 4, 2026
@ceciliapeng2011 marked this pull request as draft on January 4, 2026 03:18
@ceciliapeng2011 changed the title from "fix QWen3-32B int8 model accuracy issue: scale_val should be calculat…" to "[GPU] some fixes and optimizations to CM PA and XAttention kernels" on Jan 4, 2026
@ceciliapeng2011 marked this pull request as ready for review on January 9, 2026 09:01