Optimize the performance of quick_allreduce #1816

Merged
valarLip merged 1 commit into main from yanbo/quick_allreduce on Jan 12, 2026

@yanboshao
Contributor

Motivation

In the kernel, the offset calculation repeatedly reads from local HBM, and these long-latency accesses slow the kernel down.

Technical Details

The address of each peer GPU's temporary communication buffer is stored in HBM. This address is constant and every thread reads it frequently, so it is now loaded into VGPRs once at the beginning of the kernel instead of being fetched from HBM on each use.
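The idea can be illustrated with a minimal host-side C++ sketch. It is not the kernel code from this PR: `PeerTable`, `write_reloading`, and `write_hoisted` are hypothetical names, and `volatile` is used only to force a real load on every table read so that the two variants differ visibly. On the GPU, the local variable in the second variant would correspond to an address held in a register (VGPR).

```cpp
#include <cstddef>

// Hypothetical stand-in for the table of peer temporary-buffer addresses
// that lives in HBM. `volatile` forces one real load per read of the table.
using PeerTable = float* volatile*;

// Before: the peer buffer address is re-read from the table for every element.
void write_reloading(PeerTable peer_bufs, int peer, const float* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        float* dst = peer_bufs[peer];  // one table load per iteration
        dst[i] = src[i];
    }
}

// After: the constant address is read once and reused from a local variable.
void write_hoisted(PeerTable peer_bufs, int peer, const float* src, std::size_t n) {
    float* const dst = peer_bufs[peer];  // single table load, hoisted out of the loop
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i];
    }
}
```

Both functions write the same data; they differ only in how often the address table is read, which is the cost the change targets.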

Test Plan

Test Result

Dtype: bfloat16
Cudagraph: on
Device: 8 x MI325

| Shape | Before optimization (us) | After optimization (us) | Latency reduction |
| --- | --- | --- | --- |
| (632, 5120) | 39.15 | 32.82 | 16.16% |
| (680, 5120) | 40.55 | 34.32 | 15.36% |
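The last column of the table appears to be the relative latency reduction, (before - after) / before. That reading is inferred from the numbers and is not stated in the PR. A small sketch with a hypothetical helper, `reduction_percent`, shows the arithmetic:

```cpp
// Relative latency reduction in percent, given latencies in microseconds.
// Assumed formula: (before - after) / before * 100.
constexpr double reduction_percent(double before_us, double after_us) {
    return (before_us - after_us) / before_us * 100.0;
}
```

Applied to the rounded latencies in the table, this gives roughly 16.2% and 15.4%, which matches the reported values to within rounding.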

Submission Checklist

@yanboshao yanboshao requested a review from a team January 12, 2026 07:50
Contributor

@TennyWang1223 TennyWang1223 left a comment

LGTM

@valarLip valarLip merged commit ae774a3 into main Jan 12, 2026
17 checks passed
@valarLip valarLip deleted the yanbo/quick_allreduce branch January 12, 2026 12:37

3 participants