
mx: small speedup with dim0 cast #1980


Merged

merged 72 commits into main on Apr 1, 2025
Conversation

@vkuzo (Contributor) commented Mar 28, 2025

Summary:

Removes the unnecessary cast to bfloat16 in the MX dim0 casting code.
This is a 2.6% speedup on a 16k by 16k shape:
https://www.internalfb.com/phabricator/paste/view/P1769373804
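
To make the change concrete, here is a rough sketch of a dim0 MX cast (illustrative only, not the actual torchao kernel; the helper name and scaling details are my assumptions):

```python
# Hypothetical sketch of a dim0 MX cast; not the torchao implementation.
import torch

BLOCK_SIZE = 32
F8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def to_mx_dim0_sketch(x: torch.Tensor):
    # One power-of-two (e8m0-style) scale per 32 contiguous elements;
    # zero-amax and NaN edge cases are omitted in this sketch.
    x_blocks = x.reshape(-1, BLOCK_SIZE).float()
    amax = x_blocks.abs().amax(dim=1, keepdim=True)
    scale = torch.exp2(torch.floor(torch.log2(amax)) - 8)  # 8 = floor(log2(448))
    # Before this PR the scaled values took a detour through bfloat16,
    # roughly (x_blocks / scale).to(torch.bfloat16).to(torch.float8_e4m3fn);
    # casting straight to fp8 removes that extra rounding pass.
    data = (x_blocks / scale).clamp(-F8_MAX, F8_MAX).to(torch.float8_e4m3fn)
    return data.reshape(x.shape), scale

data, scale = to_mx_dim0_sketch(torch.randn(16, 64, dtype=torch.bfloat16))
```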

Note: this PR also includes a couple of cleanups around the e8m0 dtype and
NaN handling that I found while writing this change. I'm landing them here
rather than in a separate PR since they are all safe.
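
For context on the e8m0 cleanup: e8m0 is an 8-bit exponent-only format with no sign or mantissa bits, and its all-ones bit pattern 0xFF encodes NaN. A minimal illustration of that mapping, reading the exponent field straight out of fp32 bits (truncating, so not a rounding-correct conversion):

```python
import torch

def fp32_exponent_bits(x: torch.Tensor) -> torch.Tensor:
    # The 8 biased-exponent bits of fp32 are exactly an e8m0 value;
    # masking with 0xFF also discards the sign bit.
    return ((x.view(torch.int32) >> 23) & 0xFF).to(torch.uint8)

print(fp32_exponent_bits(torch.tensor([1.0, 2.0, float("nan")])))
# tensor([127, 128, 255], dtype=torch.uint8): NaN lands on 0xFF
```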

Test Plan:

```bash
(pytorch) [vasiliy@devgpu023.atn1 ~/local/ao (20250321_mx_dim1_triton_kernel)]$ python benchmarks/mx_formats/cast_bench.py --mode dim0_mx --M 16384 --K 16384
M 16384 K 16384 BLOCK_SIZE 32
GPU: NVIDIA B200
torch version: 2.8.0a0+git25309a1
triton version: 3.3.0
mode: dim0_mx
time_us 152.90741052631583
mem_bw_gbps 5321.488168553876

(pytorch) [vasiliy@devgpu023.atn1 ~/local/ao (20250321_mx_dim1_triton_kernel)]$ python benchmarks/mx_formats/cast_bench.py --mode dim0_mx --M 16384 --K 16384
M 16384 K 16384 BLOCK_SIZE 32
GPU: NVIDIA B200
torch version: 2.8.0a0+git25309a1
triton version: 3.3.0
mode: dim0_mx
time_us 149.03950980392162
mem_bw_gbps 5459.5924065404415
```
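
As a sanity check, the quoted 2.6% falls out of the two timings above, and the bandwidth lines match a traffic model of one bf16 read plus one fp8 write plus one e8m0 scale byte per 32 elements (my reading of what the benchmark counts, not something the script prints):

```python
# Reproduce mem_bw_gbps and the 2.6% from the logged timings.
M = K = 16384
bytes_moved = M * K * (2 + 1) + M * K // 32  # bf16 in, fp8 out, scales
for t_us in (152.90741052631583, 149.03950980392162):
    print(f"{bytes_moved / (t_us * 1e-6) / 1e9:.1f} GBps")
# -> 5321.5 and 5459.6, matching the two mem_bw_gbps values
print(f"{152.90741052631583 / 149.03950980392162 - 1:.1%}")  # -> 2.6%
```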


vkuzo added 30 commits March 21, 2025 06:59
vkuzo added 5 commits March 28, 2025 13:00
vkuzo added a commit that referenced this pull request Mar 28, 2025
vkuzo added 5 commits March 28, 2025 13:02
vkuzo added a commit that referenced this pull request Mar 28, 2025
vkuzo added 4 commits March 28, 2025 13:03
vkuzo added a commit that referenced this pull request Mar 28, 2025
vkuzo added 3 commits March 28, 2025 13:03
vkuzo added a commit that referenced this pull request Mar 28, 2025
vkuzo added a commit that referenced this pull request Apr 1, 2025
vkuzo added 2 commits April 1, 2025 09:40
vkuzo added a commit that referenced this pull request Apr 1, 2025
@vkuzo changed the base branch from gh/vkuzo/85/head to main on April 1, 2025 16:41
vkuzo added a commit that referenced this pull request Apr 1, 2025
@vkuzo merged commit aafc1ba into main on Apr 1, 2025
46 of 50 checks passed
Labels

CLA Signed (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed)
topic: performance (use this tag if the PR improves the performance of a feature)