Enable MXFP8 support in TE JAX integration #424

Micky774 · 2026-01-20T21:03:31Z

Description

Enables MXFP8 support in the TE JAX integration and significantly modifies tests to account for remaining support gaps

TODO:

Investigate grouped-GEMM test failures for MXFP8
Investigate newly exposed failures in test_layernorm_mlp_grad{_shardy} when using hipblaslt GEMM + bias + no scaling + bf16 dtype

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Adds new test cases to tests/jax/test_custom_call_compute.py
Corrects comparison bug in test_custom_call_compute::assert_bitwise_scaled_tensors
Removes dtype-based skip on MXFP8 GEMMs in test_custom_call_compute
Adds shape-based skips for MXFP8 GEMMs across several files
Added bias parameterization to test_dense_grad_fp8 to allow testing of MXFP8 (currently does not support bias)
Updated tests/jax/test_distributed_layernorm_mlp.py with new test shape to allow for MXFP8 usage
Added xfail to certain configs that fail with the new test case (needs a follow-up investigation)
Added explicit shape checks to transformer_engine/common/gemm/rocm_gemm.cu
Removed IS_NORM template parameter from cast_mxfp8_2D_kernel
Disabled scale_inv swizzling before GEMM on ROCm
Corrected scale_inv un-padding behavior in NormFwdPrimitive
Removed redundant un-padding in NormFwdPrimitive
Removed swizzling in GroupedGemmFFI
Skipped grouped GEMM MXFP8 tests due to outstanding failures (needs follow-up investigation)
Corrected bias bug in transformer_engine/jax/layernorm_mlp.py

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

…or gfx950 ci enablement

…ed with hipblaslt

…x-mxfp8

Micky774 · 2026-01-22T18:19:37Z

Note that the CI failure is unrelated.

wangye805

There are still many other places needs rocm specific guards

tests/jax/test_custom_call_compute.py

wangye805 · 2026-01-22T19:51:38Z

tests/jax/test_custom_call_compute.py

    (2048, 1024, 1024),
 ]
-
+TEST_SHAPES = [(64, 32, 64), (128, 64, 128), (128, 256, 256)]


Guard this by is_hip_extension?

tests/jax/test_custom_call_compute.py

transformer_engine/jax/cpp_extensions/normalization.py

transformer_engine/jax/csrc/extensions/gemm.cpp

wangye805 · 2026-01-22T22:31:25Z

transformer_engine/jax/csrc/extensions/gemm.cpp

  const int arch = cuda::sm_arch();

+  #ifndef __HIP_PLATFORM_AMD__
  if (arch < 100 && is_fp8_gemm) {


I guess we also need similar check to filter non-NT gemms for gfx942

Are there any other conditions we need to guard against? I'm not sure what our support looks like for gfx942 vs gfx950 here.

For gfx942, we only support NT layout, but for gfx950, we support others (NN, TN)

wangye805 · 2026-01-22T22:34:39Z

transformer_engine/jax/csrc/extensions/gemm.cpp


  size_t num_non_empty_gemms = lhs_list.size();

-  if (is_mxfp8_scaling) {


This need rocm specific guards

wangye805 · 2026-01-26T16:40:53Z

tests/jax/test_custom_call_compute.py

    @pytest_parametrize_wrapper("scaling_mode", supported_scaling_modes)
    @pytest_parametrize_wrapper("layout", ["NN"])
    def test_grouped_gemm_fp8(self, fwd_bwd_dtype, scaling_mode, input_shape, layout):
+        if scaling_mode.is_1d_block_scaling():


ROCM guards needed here as well

wangye805 · 2026-01-26T16:41:06Z

tests/jax/test_custom_call_compute.py

    )
    @pytest_parametrize_wrapper("scaling_mode", supported_scaling_modes)
    def test_grouped_dense_grad_fp8(self, fwd_bwd_dtype, scaling_mode, input_shape):
+        if scaling_mode.is_1d_block_scaling():


ROCm guards needed as well

wangye805 · 2026-01-26T16:42:38Z