
Support profiler pa and reduce kernel and remove p99 to use test_common #1787

Merged
ZhangLirong-amd merged 12 commits into main from zlr/pa_pa_fix
Jan 12, 2026
@ZhangLirong-amd ZhangLirong-amd commented Jan 8, 2026

Motivation

To better break down the PA kernel time and the reduce kernel time, we use the profiler to report each one separately:
python op_tests/test_pa_ps.py --profile

PERFORMANCE COMPARISON
========================================================================================================================
Sequence Lengths: 90002
------------------------------------------------------------------------------------------------------------------------
Metadata (get_pa_metadata_v1)            167.64us
------------------------------------------------------------------------------------------------------------------------
Method                                 Time (us)            PA Kernel               Reduce
------------------------------------------------------------------------------------------------------------------------
PA with Persistent Scheduling           1091.04us   1074.32us ( 98.5%)     15.87us (  1.5%)
PA without Persistent Scheduling        1303.82us   1303.78us (100.0%)      0.00us (  0.0%)
------------------------------------------------------------------------------------------------------------------------
Speedup                                       1.20x
========================================================================================================================
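The derived columns in the table above can be re-checked with a few lines of arithmetic. This is only a sketch: it assumes each percentage is the kernel's share of that method's total time and that the speedup is the ratio of the two totals, which is consistent with the printed figures but is not confirmed by the test script itself.

```python
# Figures copied from the comparison table above (microseconds).
ps_total, ps_pa, ps_reduce = 1091.04, 1074.32, 15.87
no_ps_total = 1303.82


def share(part_us, total_us):
    """Return part_us as a percentage of total_us."""
    return 100.0 * part_us / total_us


# Assumption: speedup = time without persistent scheduling / time with it.
speedup = no_ps_total / ps_total

print(f"PA kernel share: {share(ps_pa, ps_total):.1f}%")      # 98.5%
print(f"Reduce share:    {share(ps_reduce, ps_total):.1f}%")  # 1.5%
print(f"Speedup:         {speedup:.2f}x")                     # 1.20x
```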

@ZhangLirong-amd ZhangLirong-amd requested review from a team and Copilot January 8, 2026 01:21
Copilot AI left a comment

Pull request overview

This PR refactors the performance profiling functionality in the PA (Paged Attention) test suite. The changes replace a percentile-based benchmarking approach with PyTorch profiler-based kernel time breakdown, allowing for more detailed analysis of PA kernel and reduce kernel execution times.

Key changes:

  • Replaces benchmark_with_percentile with profile_kernel_breakdown to provide kernel-level timing breakdown
  • Renames --use_p99 flag to --profile for clarity
  • Updates performance comparison output to show kernel-specific timing ratios
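The kernel-level breakdown described above can be illustrated with a small aggregation sketch. It is not the PR's `profile_kernel_breakdown`: the function name `kernel_breakdown`, the kernel names, and the substring rules are all hypothetical, and in the real test the per-kernel durations would come from the PyTorch profiler rather than a hard-coded list.

```python
from collections import defaultdict


def kernel_breakdown(events, buckets):
    """Sum durations per bucket and compute each bucket's share of the total.

    events  : iterable of (kernel_name, duration_us) pairs
    buckets : dict mapping a label to a substring; an event joins the first
              label (in insertion order) whose substring occurs in its name,
              otherwise it is counted under "other".
    """
    totals = defaultdict(float)
    for name, duration_us in events:
        label = next((b for b, key in buckets.items() if key in name), "other")
        totals[label] += duration_us
    grand_total = sum(totals.values())
    return {
        label: (t, 100.0 * t / grand_total if grand_total else 0.0)
        for label, t in totals.items()
    }


# Hypothetical kernel names and durations, for illustration only.
events = [
    ("pa_persistent_kernel", 100.0),
    ("pa_persistent_kernel", 80.0),
    ("pa_reduce_kernel", 20.0),
]
# "Reduce" is listed first so a name containing both substrings counts as reduce.
buckets = {"Reduce": "reduce", "PA Kernel": "pa_"}

for label, (total_us, pct) in kernel_breakdown(events, buckets).items():
    print(f"{label:<10} {total_us:8.2f}us ({pct:5.1f}%)")
```

Matching on substrings keeps the sketch short, but it depends on the order of the rules whenever kernel names overlap, as the comment on `buckets` notes.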

valarLip previously approved these changes Jan 10, 2026
dbyoung18 previously approved these changes Jan 12, 2026
@dbyoung18 dbyoung18 left a comment

LGTM

@dbyoung18 dbyoung18 left a comment

LGTM

@ZhangLirong-amd ZhangLirong-amd merged commit 19724aa into main Jan 12, 2026
17 checks passed
@ZhangLirong-amd ZhangLirong-amd deleted the zlr/pa_pa_fix branch January 12, 2026 14:52
zhuyuhua-v pushed a commit that referenced this pull request Jan 14, 2026
…on (#1787)

* Support profiler pa and reduce kernel and remove p99 method to use test_common

* format code

* add more info for pa

* reformat

* add bs info

* use torch schedule to warm up

* remove some feature in csv

* refactor

* test(paps): enhance paps ut

Signed-off-by: Double Young <yang.yang2@amd.com>

* fix(paps): fix format

* fix(paps): fix format

* fix(paps): fix format

---------

Signed-off-by: Double Young <yang.yang2@amd.com>
Co-authored-by: Double Young <yang.yang2@amd.com>
4 participants