
Layernorm optimizations #8
Merged
mawong-amd merged 1 commit into main on Mar 23, 2024
Conversation

mawong-amd commented Mar 22, 2024

This PR introduces performance optimizations for fused_add_rms_norm, which is used in various layernorms. Its primary features are packed operations for FP16 and batched ("vectorized") loads and stores, reflecting the fact that this kernel is memory-latency-bound. It also decreases the maximum block size to 256 so that more blocks can be scheduled simultaneously onto SIMDs, improving memory latency hiding.
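As a rough illustration of the data path (not the PR's actual kernel): the hidden dimension is treated as packed `__half2` lanes moved with 128-bit vectorized loads and stores. The `Half2Vec` type, the fixed width of 8 halves per access, and the kernel name are assumptions made for this sketch; `blockReduceSum` is sketched further below.

```cuda
#include <cuda_fp16.h>

// Illustrative 16-byte packed vector: 4 x __half2 = 8 halves per load/store.
struct __align__(16) Half2Vec {
  __half2 data[4];
};

// Block-wide sum via warp shuffles (sketched in the next snippet).
__device__ float blockReduceSum(float val);

// Sketch of a fused residual-add + RMSNorm kernel: one block per token.
// Assumes hidden_size is a multiple of 8 and pointers are 16-byte aligned.
__global__ void fused_add_rms_norm_f16(
    __half* __restrict__ input,        // [num_tokens, hidden_size], overwritten with the norm
    __half* __restrict__ residual,     // [num_tokens, hidden_size], overwritten with input+residual
    const __half* __restrict__ weight, // [hidden_size]
    float epsilon, int hidden_size) {
  const int vecs_per_token = hidden_size / 8;
  auto* in_v  = reinterpret_cast<Half2Vec*>(input)    + blockIdx.x * vecs_per_token;
  auto* res_v = reinterpret_cast<Half2Vec*>(residual) + blockIdx.x * vecs_per_token;
  const auto* w_v = reinterpret_cast<const Half2Vec*>(weight);

  // Pass 1: residual += input with packed FP16 adds; accumulate the sum of squares in FP32.
  float variance = 0.f;
  for (int i = threadIdx.x; i < vecs_per_token; i += blockDim.x) {
    Half2Vec x = in_v[i];   // single 128-bit load
    Half2Vec r = res_v[i];
#pragma unroll
    for (int j = 0; j < 4; ++j) {
      __half2 s = __hadd2(x.data[j], r.data[j]);
      float2 f  = __half22float2(s);
      variance += f.x * f.x + f.y * f.y;
      r.data[j] = s;
    }
    res_v[i] = r;           // single 128-bit store
  }
  variance = blockReduceSum(variance);
  __shared__ float s_inv_rms;
  if (threadIdx.x == 0) s_inv_rms = rsqrtf(variance / hidden_size + epsilon);
  __syncthreads();

  // Pass 2: input = (input + residual) * inv_rms * weight, again with packed math.
  const __half2 inv2 = __float2half2_rn(s_inv_rms);
  for (int i = threadIdx.x; i < vecs_per_token; i += blockDim.x) {
    Half2Vec r = res_v[i];
    Half2Vec w = w_v[i];
#pragma unroll
    for (int j = 0; j < 4; ++j) {
      r.data[j] = __hmul2(__hmul2(r.data[j], inv2), w.data[j]);
    }
    in_v[i] = r;
  }
}
```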

Other miscellaneous additions include a more general implementation of the Alford warp shuffle that handles ROCm's wavefront size of 64 in addition to CUDA's warp size of 32.
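A minimal sketch of what such a generalization can look like, assuming a compile-time `WARP_SIZE` that is 64 under HIP/ROCm and 32 under CUDA; the macro choice and function names here are illustrative, not the repository's exact code.

```cuda
// Wavefront/warp width chosen at compile time: 64 on AMD (ROCm/HIP), 32 on NVIDIA.
#ifdef __HIP_PLATFORM_AMD__
  #define WARP_SIZE 64
#else
  #define WARP_SIZE 32
#endif

// XOR ("butterfly") shuffle reduction across one warp/wavefront.
__inline__ __device__ float warpReduceSum(float val) {
#pragma unroll
  for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1) {
#ifdef __HIP_PLATFORM_AMD__
    val += __shfl_xor(val, offset);                  // HIP: whole wavefront participates
#else
    val += __shfl_xor_sync(0xffffffffu, val, offset);
#endif
  }
  return val;
}

// Block-wide sum: each warp reduces its partial, lane 0 of each warp stores it
// to shared memory, then every warp re-reduces the per-warp partials.
__inline__ __device__ float blockReduceSum(float val) {
  static __shared__ float shared[1024 / WARP_SIZE];
  const int lane = threadIdx.x % WARP_SIZE;
  const int wid  = threadIdx.x / WARP_SIZE;
  val = warpReduceSum(val);
  if (lane == 0) shared[wid] = val;
  __syncthreads();
  const int num_warps = (blockDim.x + WARP_SIZE - 1) / WARP_SIZE;
  val = (lane < num_warps) ? shared[lane] : 0.f;
  return warpReduceSum(val);
}
```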

mawong-amd self-assigned this Mar 22, 2024
mawong-amd changed the base branch from integration to integration_no_fp8 on Mar 22, 2024
mawong-amd changed the base branch from integration_no_fp8 to main on Mar 23, 2024
Bulk conversions (packed halves into half2, using vectors of half2);
block and warp reduce with AMD wavesize 64 (vs 32);
using smaller block sizes for improved block occupancy on CUs
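To illustrate the block-size point from the commit message, a hypothetical host-side launch: one block per token, thread count capped at 256 (per the PR description), reusing the 8-halves-per-access width assumed in the earlier sketch. The launcher name and cap heuristic are assumptions for illustration.

```cuda
#include <algorithm>
#include <cuda_fp16.h>

// Hypothetical host-side launcher: one block per token, at most 256 threads per
// block so several blocks can stay resident per CU/SM and hide memory latency.
void launch_fused_add_rms_norm_f16(
    __half* input, __half* residual, const __half* weight,
    float epsilon, int num_tokens, int hidden_size, cudaStream_t stream) {
  const int vecs_per_token = hidden_size / 8;    // 8 halves per 128-bit access
  dim3 grid(num_tokens);
  dim3 block(std::min(vecs_per_token, 256));     // cap block size at 256
  fused_add_rms_norm_f16<<<grid, block, 0, stream>>>(
      input, residual, weight, epsilon, hidden_size);
}
```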
mawong-amd merged commit 2d1baac into main on Mar 23, 2024
0 of 2 checks passed