Layernorm optimizations #8

mawong-amd · 2024-03-22T15:38:40Z

This PR introduces performance optimizations for fused_add_rms_norm, which is used in various layernorms. The main changes are packed operations for FP16 and batched ("vectorized") loads and stores, reflecting the fact that this kernel is memory-latency-bound. The PR also reduces the maximum block size to 256 so that more blocks can be scheduled simultaneously onto the SIMDs, which improves memory latency hiding.
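For readers unfamiliar with the operation, here is a minimal host-side C++ sketch of what a fused add + RMS norm computes, with the row processed in fixed-width packs to mirror the batched loads and stores described above. It is not the kernel from this PR. The semantics (residual updated in place, normalized output written back to the input buffer), the name `fused_add_rms_norm_ref`, the `Pack` struct, and the pack width `kPack` are all assumptions made for illustration, and it uses `float` rather than packed FP16.

```cpp
#include <cmath>
#include <cstddef>
#include <cstring>

// Hypothetical pack width: how many elements one "wide" access moves.
constexpr std::size_t kPack = 8;

// Hypothetical pack type standing in for a vector of packed halves.
struct Pack {
    float v[kPack];
};

// Assumed reference semantics (not taken from the PR):
//   residual <- input + residual
//   input    <- residual * rsqrt(mean(residual^2) + eps) * weight
// In this sketch, hidden must be a multiple of kPack.
void fused_add_rms_norm_ref(float* input, float* residual, const float* weight,
                            std::size_t hidden, float eps) {
    float sum_sq = 0.0f;

    // Pass 1: one wide load per operand, add, accumulate squares, one wide store.
    for (std::size_t i = 0; i < hidden; i += kPack) {
        Pack x, r;
        std::memcpy(&x, input + i, sizeof(Pack));
        std::memcpy(&r, residual + i, sizeof(Pack));
        for (std::size_t j = 0; j < kPack; ++j) {
            r.v[j] += x.v[j];
            sum_sq += r.v[j] * r.v[j];
        }
        std::memcpy(residual + i, &r, sizeof(Pack));
    }

    const float inv_rms =
        1.0f / std::sqrt(sum_sq / static_cast<float>(hidden) + eps);

    // Pass 2: scale the updated residual by inv_rms and the per-element weight.
    for (std::size_t i = 0; i < hidden; i += kPack) {
        Pack r, w, out;
        std::memcpy(&r, residual + i, sizeof(Pack));
        std::memcpy(&w, weight + i, sizeof(Pack));
        for (std::size_t j = 0; j < kPack; ++j) {
            out.v[j] = r.v[j] * inv_rms * w.v[j];
        }
        std::memcpy(input + i, &out, sizeof(Pack));
    }
}
```

On a GPU each pack would correspond to a single wide memory transaction per thread, so fewer, larger requests are in flight for the same amount of data. That is the motivation for batching in a latency-bound kernel.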

Other miscellaneous additions include a more general implementation of the warp shuffle that handles the ROCm-specific wave size of 64 instead of CUDA's 32.
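To illustrate why the wave size matters for a shuffle-based reduction, the following host-side C++ sketch emulates a butterfly (XOR-shuffle) sum across one wave, with the lane count as a template parameter. It is only a model of the idea, not the code from this PR. The function name `wave_reduce_sum` and the two constants are hypothetical, and a real kernel would use the device's shuffle intrinsics rather than arrays.

```cpp
#include <array>
#include <cstddef>

// Hypothetical constants for the two wave sizes mentioned above.
constexpr std::size_t kCudaWarpSize = 32;
constexpr std::size_t kRocmWaveSize = 64;

// Emulates a butterfly sum over one wave: at each step, every lane adds the
// value held by the lane whose index differs in exactly one bit (lane ^ mask).
// After log2(WaveSize) steps, every lane holds the sum of all lanes.
template <std::size_t WaveSize>
std::array<float, WaveSize> wave_reduce_sum(std::array<float, WaveSize> lanes) {
    static_assert(WaveSize > 0 && (WaveSize & (WaveSize - 1)) == 0,
                  "WaveSize must be a power of two");
    for (std::size_t mask = WaveSize / 2; mask > 0; mask >>= 1) {
        // Snapshot the partner values first, as a hardware shuffle would.
        std::array<float, WaveSize> shuffled{};
        for (std::size_t lane = 0; lane < WaveSize; ++lane) {
            shuffled[lane] = lanes[lane ^ mask];
        }
        for (std::size_t lane = 0; lane < WaveSize; ++lane) {
            lanes[lane] += shuffled[lane];
        }
    }
    return lanes;
}
```

Starting the mask at `WaveSize / 2` instead of a hard-coded 16 is the point of the sketch: with 64 lanes, a loop written for 32 would leave two independent half-wave sums instead of one full-wave sum.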

Bulk conversions (packed halves into half2, using vectors of half2); block and warp reductions with the AMD wave size of 64 (vs. 32); smaller block sizes for improved block occupancy on CUs.

mawong-amd requested review from dllehr-amd and gshtras March 22, 2024 15:38

mawong-amd self-assigned this Mar 22, 2024

mawong-amd changed the base branch from integration to integration_no_fp8 March 22, 2024 16:00

mawong-amd force-pushed the layernorm_fp16_opt branch from 9097d90 to 3d266b9 Compare March 23, 2024 04:09

mawong-amd changed the base branch from integration_no_fp8 to main March 23, 2024 07:40

mawong-amd force-pushed the layernorm_fp16_opt branch from 6aa4fd1 to ce2d8a1 Compare March 23, 2024 10:51

Layernorm optimizations:

612b8cf

mawong-amd force-pushed the layernorm_fp16_opt branch from ce2d8a1 to 612b8cf Compare March 23, 2024 10:57

mawong-amd merged commit 2d1baac into main Mar 23, 2024
0 of 2 checks passed