
[KVCACHE] Improved schedule for prefill attention #17482

Merged
1 commit merged on Oct 28, 2024

Conversation

@krishnaraj36 (Contributor) commented on Oct 22, 2024

Improvements:
- Added transpose to K for better vectorization during matmul.
- Improved the load schedule.
- Improved performance by a bit more than 2x in most cases.

Llama-2 7B observation:

| kernel                  | baseline | optimized |
|-------------------------|----------|-----------|
| batch_prefill_ragged_kv | 15 ms    | 7.1 ms    |
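Transposing K changes the memory access pattern of the score matmul. The following NumPy sketch (illustrative shapes and names, not the PR's TIR schedule) shows the equivalence the kernel relies on: scores computed against a materialized K^T match scores computed against K directly, while K^T laid out as [head_dim, seq_len] lets one vector load fetch the same head-dim element of adjacent keys.

```python
import numpy as np

# Hedged sketch: why storing K transposed can help vectorized matmul.
# With K as [seq_len, head_dim], a kernel keeping several keys in flight
# gathers strided elements; with K^T as [head_dim, seq_len], the same
# elements of adjacent keys are contiguous and load as one vector.
rng = np.random.default_rng(0)
seq_len, head_dim = 16, 128
Q = rng.standard_normal((seq_len, head_dim), dtype=np.float32)
K = rng.standard_normal((seq_len, head_dim), dtype=np.float32)

# Baseline layout: scores via Q @ K^T, transpose taken on the fly.
S_ref = Q @ K.T

# Transposed layout: K^T materialized once as a contiguous buffer.
K_T = np.ascontiguousarray(K.T)
S_opt = Q @ K_T

assert np.allclose(S_ref, S_opt)
```

The numerical result is identical either way; only the layout seen by the inner loop changes.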

This PR fixes the correctness issue reported against PR #17446. The issue is caused by incorrect code generation during the unroll phase, so we removed the explicit unroll and observed little to no performance degradation.
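To make the workaround concrete, here is a schematic pure-Python illustration (hypothetical arrays and names, not the PR's TIR) of what removing the explicit unroll means: unrolling copies the loop body with the induction variable substituted, and each substituted index expression is then simplified independently, which is the stage where the miscompiled offsets appeared; keeping the rolled loop leaves a single index expression evaluated at run time.

```python
# Hedged sketch of loop unrolling as a source-level transformation.
def store_rolled(out, base, n):
    # One index expression, evaluated n times at run time.
    for i in range(n):
        out[(base + i) // 7] += 1

def store_unrolled(out, base):
    # n = 4 copies of the body with i substituted; each constant-folded
    # index expression is simplified independently, which is where the
    # incorrect offsets were produced in the generated OpenCL.
    out[(base + 0) // 7] += 1
    out[(base + 1) // 7] += 1
    out[(base + 2) // 7] += 1
    out[(base + 3) // 7] += 1

a = [0] * 8
b = [0] * 8
store_rolled(a, 5, 4)
store_unrolled(b, 5)
assert a == b  # semantically identical when the simplifier is correct
```

When the per-copy simplification is buggy, only the unrolled variant miscompiles, which is why dropping the unroll sidesteps the issue.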

We generated the OpenCL kernels by extracting the generated modules after setting num_qo_heads=28 in
https://github.qualcomm.com/gpgpu/apache-tvm/blob/85e15d494d5a42360859941cbc972c4f175c3b94/tests/python/relax/test_runtime_builtin_paged_attention_kv_cache_flashinfer.py#L36
Original PR Codegen

int cur_L_3 = ((((((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + LH_start) + 1) / 7) + (((((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + LH_start) + 1) % 7) >> 31)) + q_indptr[(b_idx_1 + q_indptr_elem_offset)]);
if (cur_L_3 < q_indptr[((b_idx_1 + q_indptr_elem_offset) + 1)]) {
    vstore4((convert_half4((O_local[3] / ((float4)(d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 1)], d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 1)], d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 1)], d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 1)]))))), 0, output + (((((cur_L_3 * 3584) + ((convert_int(get_group_id(1))) * 896)) + ((((((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + LH_start) + 1) % 7) + (7 & (((((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + LH_start) + 1) % 7) >> 31))) * 128)) + (((convert_int(get_local_id(0))) & 15) * 8)) + 4));
}
int cur_L_4 = ((((((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + LH_start) - 2147483637) / 7) - -306783377) + q_indptr[(b_idx_1 + q_indptr_elem_offset)]);
if (cur_L_4 < q_indptr[((b_idx_1 + q_indptr_elem_offset) + 1)]) {
    vstore4((convert_half4((O_local[4] / ((float4)(d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 2)], d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 2)], d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 2)], d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 2)]))))), 0, output + ((((cur_L_4 * 3584) + ((convert_int(get_group_id(1))) * 896)) + (((((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + LH_start) - 2147483637) % 7) * 128)) + (((convert_int(get_local_id(0))) & 15) * 8)));
}

In the O_store block we notice that large, incorrect pointer offsets were generated during subsequent stages of unrolling. This can be observed indirectly as zero elements in the output and compute instability.

Fusing the unroll loops so that they are unrolled together does not resolve this.

Oddly enough, the initial test case does not trigger the issue and works as intended:

int cur_L_3 = ((((((convert_int(get_local_id(0))) >> 4) + ((LH_start + 1) >> 2)) >> 1) + q_indptr[(b_idx_1 + q_indptr_elem_offset)]) + (convert_int(get_local_id(1))));
if (cur_L_3 < q_indptr[((b_idx_1 + q_indptr_elem_offset) + 1)]) {
    vstore4((convert_half4((O_local[3] / ((float4)(d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 1)], d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 1)], d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 1)], d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 1)]))))), 0, output + (((((cur_L_3 * 4096) + ((convert_int(get_group_id(1))) * 1024)) + (((((((convert_int(get_local_id(0))) >> 4) * 4) + (LH_start & 7)) + 1) & 7) * 128)) + (((convert_int(get_local_id(0))) & 15) * 8)) + 4));
}
int cur_L_4 = ((((((convert_int(get_local_id(0))) >> 4) + ((LH_start + 2) >> 2)) >> 1) + q_indptr[(b_idx_1 + q_indptr_elem_offset)]) + (convert_int(get_local_id(1))));
if (cur_L_4 < q_indptr[((b_idx_1 + q_indptr_elem_offset) + 1)]) {
    vstore4((convert_half4((O_local[4] / ((float4)(d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 2)], d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 2)], d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 2)], d_smem[((((convert_int(get_local_id(1))) * 8) + (((convert_int(get_local_id(0))) >> 4) * 4)) + 2)]))))), 0, output + ((((cur_L_4 * 4096) + ((convert_int(get_group_id(1))) * 1024)) + (((((((convert_int(get_local_id(0))) >> 4) * 4) + (LH_start & 7)) + 2) & 7) * 128)) + (((convert_int(get_local_id(0))) & 15) * 8)));
}

@krishnaraj36 (Contributor Author) commented:

@MasterJH5574 @tqchen
We have fixed the issue raised in PR #17466.
Could you please take a look at this PR?

@MasterJH5574 (Contributor) left a comment:

Thank you @krishnaraj36 so much for the fix!

@MasterJH5574 (Contributor) commented:

I have also observed the "large and incorrect" pointer offsets before, but I didn't get time to nail down the issue. Roughly, I remember they are generated by some floordiv simplification in src/tir/transforms/lower_intrin.cc.
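For context on why such a simplification pass exists at all: TIR's floordiv rounds toward negative infinity, while integer `/` in C and OpenCL truncates toward zero, so lower_intrin must rewrite floordiv into truncating-division expressions, and the large magic constants seen in the bad codegen come from that rewrite. A small, hedged Python illustration of the semantic gap (not the TVM lowering itself):

```python
# C/OpenCL-style truncating division vs. floor division.
def trunc_div(a, b):
    # Round toward zero, as `/` does for ints in C and OpenCL.
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q

def floor_div(a, b):
    # Python's // already rounds toward negative infinity, like TIR floordiv.
    return a // b

# The two agree when the result is non-negative...
assert trunc_div(7, 2) == floor_div(7, 2) == 3
# ...but disagree whenever the operands' signs differ.
assert trunc_div(-7, 2) == -3  # truncates toward zero
assert floor_div(-7, 2) == -4  # floors toward -inf
```

Any offset-shifting trick the lowering uses to bridge this gap must stay within int32 range, which is presumably where the overflowed constants crept in.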

@krishnaraj36 (Contributor Author) commented:

> Thank you @krishnaraj36 so much for the fix!

@MasterJH5574
There is only one change compared to the previously reverted commit: removing sch.unroll(xi).

@srkreddy1238 srkreddy1238 merged commit e3e27f5 into apache:main Oct 28, 2024
20 checks passed