
Conversation

@lessw2020 (Collaborator)

On Blackwell:


=== Numerical Accuracy Testing ===
Input shape: torch.Size([4, 32])
Output shape: torch.Size([4, 32])
Scales shape: torch.Size([4, 1])
Row 0: max abs value = 4.000
Row 1: max abs value = 3.500
Row 2: max abs value = 4.000
Row 3: max abs value = 3.500

=== Performance Testing ===

Size: 2048x2048
Average time: 0.040 ms
Throughput: 422.94 GB/s

Size: 4096x4096
Average time: 0.106 ms
Throughput: 634.14 GB/s

Size: 8192x8192
Average time: 0.360 ms
Throughput: 746.15 GB/s

=== All tests completed ===
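
For context: these GB/s figures are consistent with counting roughly 4 bytes of memory traffic per element (e.g. a bf16 input read plus the two uint8 quantized outputs, with scale bytes excluded). That accounting is an assumption about how the benchmark computes throughput, not something taken from the test script; a minimal sketch:

```python
# Hypothetical throughput accounting: assumes ~4 bytes of traffic per element
# (2-byte input read + two 1-byte quantized outputs; scale bytes ignored).
def est_throughput_gbps(n: int, avg_time_ms: float, bytes_per_elem: int = 4) -> float:
    total_bytes = n * n * bytes_per_elem
    return total_bytes / (avg_time_ms * 1e-3) / 1e9

print(est_throughput_gbps(8192, 0.360))  # ~745.7 GB/s, close to the 746.15 GB/s above
```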

@danielvegamyhre (Owner)

Confirmed the C++ extension builds and the tests run without issue on my B200 devgpu:

(ao) [danvm@devgpu031.atn1 ~/private-torchao/torchao/experimental/mxfp8_cpp (lessw2020/cpp_mxfp8_main)]$ python test_cpp_extension.py 
GPU: NVIDIA B200
Compute Capability: 10.0
=== Testing Basic MXFP8 Quantization ===

1. Row-wise quantization only:
Output (row-wise):
  Shape: torch.Size([512, 512])
  Dtype: torch.uint8
  Device: cuda:0
  Size in bytes: 262144
Scales (row-wise):
  Shape: torch.Size([512, 16])
  Dtype: torch.uint8
  Device: cuda:0
  Size in bytes: 8192

2. Column-wise quantization only:
Output (column-wise):
  Shape: torch.Size([512, 512])
  Dtype: torch.uint8
  Device: cuda:0
  Size in bytes: 262144
Scales (column-wise):
  Shape: torch.Size([16, 512])
  Dtype: torch.uint8
  Device: cuda:0
  Size in bytes: 8192

3. Both row-wise and column-wise quantization:
Output (row-wise):
  Shape: torch.Size([512, 512])
  Dtype: torch.uint8
  Device: cuda:0
  Size in bytes: 262144
Output (column-wise):
  Shape: torch.Size([512, 512])
  Dtype: torch.uint8
  Device: cuda:0
  Size in bytes: 262144
Scales (row-wise):
  Shape: torch.Size([512, 16])
  Dtype: torch.uint8
  Device: cuda:0
  Size in bytes: 8192
Scales (column-wise):
  Shape: torch.Size([16, 512])
  Dtype: torch.uint8
  Device: cuda:0
  Size in bytes: 8192
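
For reference, these shapes follow from 1x32-element blocks sharing one E8M0 (uint8) scale: a 512x512 input gives 512/32 = 16 scales per row in the row-wise pass and the transposed [16, 512] layout in the column-wise pass. Below is a minimal pure-PyTorch sketch of that scaling rule, based on the OCP MX spec rather than on this extension's kernel; the function name and rounding details are assumptions.

```python
# Hypothetical 1x32-block MXFP8 reference; not the extension's actual implementation.
import torch

BLOCK = 32
E4M3_EMAX = 8  # largest power-of-two exponent representable in float8_e4m3fn
F8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def mxfp8_rowwise_reference(x: torch.Tensor):
    rows, cols = x.shape
    blocks = x.float().reshape(rows, cols // BLOCK, BLOCK)
    amax = blocks.abs().amax(dim=-1).clamp_min(2.0 ** -127)   # per-block max magnitude
    scale_exp = torch.floor(torch.log2(amax)) - E4M3_EMAX     # shared power-of-two exponent
    scales_e8m0 = (scale_exp + 127).clamp(0, 255).to(torch.uint8)
    scaled = (blocks / torch.exp2(scale_exp).unsqueeze(-1)).clamp(-F8_MAX, F8_MAX)
    q = scaled.to(torch.float8_e4m3fn)
    return q.reshape(rows, cols).view(torch.uint8), scales_e8m0

x = torch.randn(512, 512)
q_row, s_row = mxfp8_rowwise_reference(x)                      # [512, 512], [512, 16]
q_col, s_col = mxfp8_rowwise_reference(x.t().contiguous())
s_col = s_col.t().contiguous()                                 # column-wise scales as [16, 512]
print(q_row.shape, s_row.shape, q_col.shape, s_col.shape)
```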

=== Testing Different Block Sizes ===

1x32 blocks:
Scales shape for 1x32 blocks: torch.Size([128, 4])

32x1 blocks:
Scales shape for 32x1 blocks: torch.Size([4, 128])
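
Those shapes are consistent with a 128x128 test input: 1x32 blocks leave one scale per 32 columns ([128, 4]), and 32x1 blocks are the transposed case ([4, 128]). Using the hypothetical reference sketch above:

```python
x128 = torch.randn(128, 128)
print(mxfp8_rowwise_reference(x128)[1].shape)                       # torch.Size([128, 4])
print(mxfp8_rowwise_reference(x128.t().contiguous())[1].t().shape)  # torch.Size([4, 128])
```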

=== Numerical Accuracy Testing ===
Input shape: torch.Size([4, 32])
Output shape: torch.Size([4, 32])
Scales shape: torch.Size([4, 1])
Row 0: max abs value = 4.000
Row 1: max abs value = 3.500
Row 2: max abs value = 4.000
Row 3: max abs value = 3.500
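
The accuracy case is the degenerate one: a 4x32 input holds exactly one 1x32 block per row, so there is a single E8M0 scale per row (hence scales shape [4, 1]) and each row's scale is driven by that row's max-abs value. With the hypothetical reference above:

```python
x_small = torch.randn(4, 32)
q_small, s_small = mxfp8_rowwise_reference(x_small)
print(s_small.shape)              # torch.Size([4, 1])
print(x_small.abs().amax(dim=1))  # per-row max-abs magnitudes
```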

=== Performance Testing ===

Size: 256x256
Average time: 0.021 ms
Throughput: 12.71 GB/s

Size: 512x512
Average time: 0.021 ms
Throughput: 50.84 GB/s

Size: 1024x1024
Average time: 0.021 ms
Throughput: 203.38 GB/s

Size: 2048x2048
Average time: 0.025 ms
Throughput: 677.47 GB/s

Size: 4096x4096
Average time: 0.088 ms
Throughput: 764.40 GB/s

Size: 8192x8192
Average time: 0.320 ms
Throughput: 838.93 GB/s

=== All tests completed ===

danielvegamyhre merged commit 295aeaa into main on Jul 3, 2025
0 of 15 checks passed
lessw2020 deleted the lessw2020/cpp_mxfp8_main branch on July 3, 2025 at 21:39