
Conversation

@lessw2020 (Collaborator)

On Blackwell:


=== Numerical Accuracy Testing ===
Input shape: torch.Size([4, 32])
Output shape: torch.Size([4, 32])
Scales shape: torch.Size([4, 1])
Row 0: max abs value = 4.000
Row 1: max abs value = 3.500
Row 2: max abs value = 4.000
Row 3: max abs value = 3.500

=== Performance Testing ===

Size: 2048x2048
Average time: 0.040 ms
Throughput: 422.94 GB/s

Size: 4096x4096
Average time: 0.106 ms
Throughput: 634.14 GB/s

Size: 8192x8192
Average time: 0.360 ms
Throughput: 746.15 GB/s

=== All tests completed ===
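
For context: these GB/s figures are consistent with counting roughly 4 bytes of memory traffic per element (e.g. a bf16 input read plus the two uint8 quantized outputs, with scale bytes excluded). That accounting is an assumption about how the benchmark computes throughput, not something taken from the test script; a minimal sketch:

```python
# Hypothetical throughput accounting: assumes ~4 bytes of traffic per element
# (2-byte input read + two 1-byte quantized outputs; scale bytes ignored).
def est_throughput_gbps(n: int, avg_time_ms: float, bytes_per_elem: int = 4) -> float:
    total_bytes = n * n * bytes_per_elem
    return total_bytes / (avg_time_ms * 1e-3) / 1e9

print(est_throughput_gbps(8192, 0.360))  # ~745.7 GB/s, close to the 746.15 GB/s above
```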

@danielvegamyhre (Owner)

Confirmed the C++ extension builds and the tests run without issue on my B200 devgpu:

(ao) [danvm@devgpu031.atn1 ~/private-torchao/torchao/experimental/mxfp8_cpp (lessw2020/cpp_mxfp8_main)]$ python test_cpp_extension.py 
GPU: NVIDIA B200
Compute Capability: 10.0
=== Testing Basic MXFP8 Quantization ===

1. Row-wise quantization only:
Output (row-wise):
  Shape: torch.Size([512, 512])
  Dtype: torch.uint8
  Device: cuda:0
  Size in bytes: 262144
Scales (row-wise):
  Shape: torch.Size([512, 16])
  Dtype: torch.uint8
  Device: cuda:0
  Size in bytes: 8192

2. Column-wise quantization only:
Output (column-wise):
  Shape: torch.Size([512, 512])
  Dtype: torch.uint8
  Device: cuda:0
  Size in bytes: 262144
Scales (column-wise):
  Shape: torch.Size([16, 512])
  Dtype: torch.uint8
  Device: cuda:0
  Size in bytes: 8192

3. Both row-wise and column-wise quantization:
Output (row-wise):
  Shape: torch.Size([512, 512])
  Dtype: torch.uint8
  Device: cuda:0
  Size in bytes: 262144
Output (column-wise):
  Shape: torch.Size([512, 512])
  Dtype: torch.uint8
  Device: cuda:0
  Size in bytes: 262144
Scales (row-wise):
  Shape: torch.Size([512, 16])
  Dtype: torch.uint8
  Device: cuda:0
  Size in bytes: 8192
Scales (column-wise):
  Shape: torch.Size([16, 512])
  Dtype: torch.uint8
  Device: cuda:0
  Size in bytes: 8192
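
For reference, these shapes follow from 1x32-element blocks sharing one E8M0 (uint8) scale: a 512x512 input gives 512/32 = 16 scales per row in the row-wise pass and the transposed [16, 512] layout in the column-wise pass. Below is a minimal pure-PyTorch sketch of that scaling rule, based on the OCP MX spec rather than on this extension's kernel; the function name and rounding details are assumptions.

```python
# Hypothetical 1x32-block MXFP8 reference; not the extension's actual implementation.
import torch

BLOCK = 32
E4M3_EMAX = 8  # largest power-of-two exponent representable in float8_e4m3fn
F8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def mxfp8_rowwise_reference(x: torch.Tensor):
    rows, cols = x.shape
    blocks = x.float().reshape(rows, cols // BLOCK, BLOCK)
    amax = blocks.abs().amax(dim=-1).clamp_min(2.0 ** -127)   # per-block max magnitude
    scale_exp = torch.floor(torch.log2(amax)) - E4M3_EMAX     # shared power-of-two exponent
    scales_e8m0 = (scale_exp + 127).clamp(0, 255).to(torch.uint8)
    scaled = (blocks / torch.exp2(scale_exp).unsqueeze(-1)).clamp(-F8_MAX, F8_MAX)
    q = scaled.to(torch.float8_e4m3fn)
    return q.reshape(rows, cols).view(torch.uint8), scales_e8m0

x = torch.randn(512, 512)
q_row, s_row = mxfp8_rowwise_reference(x)                      # [512, 512], [512, 16]
q_col, s_col = mxfp8_rowwise_reference(x.t().contiguous())
s_col = s_col.t().contiguous()                                 # column-wise scales as [16, 512]
print(q_row.shape, s_row.shape, q_col.shape, s_col.shape)
```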

=== Testing Different Block Sizes ===

1x32 blocks:
Scales shape for 1x32 blocks: torch.Size([128, 4])

32x1 blocks:
Scales shape for 32x1 blocks: torch.Size([4, 128])
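
Those shapes are consistent with a 128x128 test input: 1x32 blocks leave one scale per 32 columns ([128, 4]), and 32x1 blocks are the transposed case ([4, 128]). Using the hypothetical reference sketch above:

```python
x128 = torch.randn(128, 128)
print(mxfp8_rowwise_reference(x128)[1].shape)                       # torch.Size([128, 4])
print(mxfp8_rowwise_reference(x128.t().contiguous())[1].t().shape)  # torch.Size([4, 128])
```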

=== Numerical Accuracy Testing ===
Input shape: torch.Size([4, 32])
Output shape: torch.Size([4, 32])
Scales shape: torch.Size([4, 1])
Row 0: max abs value = 4.000
Row 1: max abs value = 3.500
Row 2: max abs value = 4.000
Row 3: max abs value = 3.500
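
The accuracy case is the degenerate one: a 4x32 input holds exactly one 1x32 block per row, so there is a single E8M0 scale per row (hence scales shape [4, 1]) and each row's scale is driven by that row's max-abs value. With the hypothetical reference above:

```python
x_small = torch.randn(4, 32)
q_small, s_small = mxfp8_rowwise_reference(x_small)
print(s_small.shape)              # torch.Size([4, 1])
print(x_small.abs().amax(dim=1))  # per-row max-abs magnitudes
```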

=== Performance Testing ===

Size: 256x256
Average time: 0.021 ms
Throughput: 12.71 GB/s

Size: 512x512
Average time: 0.021 ms
Throughput: 50.84 GB/s

Size: 1024x1024
Average time: 0.021 ms
Throughput: 203.38 GB/s

Size: 2048x2048
Average time: 0.025 ms
Throughput: 677.47 GB/s

Size: 4096x4096
Average time: 0.088 ms
Throughput: 764.40 GB/s

Size: 8192x8192
Average time: 0.320 ms
Throughput: 838.93 GB/s

=== All tests completed ===

danielvegamyhre merged commit 295aeaa into main on Jul 3, 2025
0 of 15 checks passed
lessw2020 deleted the lessw2020/cpp_mxfp8_main branch on July 3, 2025 at 21:39