Fix Float8Tensor quantize op kernel preference dispatch #2883
Conversation
          
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2883
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 4516f6e with merge base 2a53216.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
    
Force-pushed 6935cc8 to bacbe8c
Force-pushed 6cf26bd to 815a964
            
          
Resolved (outdated) review threads on:
test/quantization/quantize_/workflows/float8/test_float8_tensor.py
torchao/quantization/quantize_/workflows/float8/float8_tensor.py
torchao/quantization/quantize_/workflows/float8/float8_tensor.py
        
Force-pushed 815a964 to a78fc11
  
            
          
Resolved (outdated) review thread on torchao/quantization/quantize_/workflows/float8/float8_tensor.py
        
Force-pushed be69537 to 5f6ec32
  
```python
            kernel_choice = "fbgemm"
        elif weight_tensor.kernel_preference == KernelPreference.TRITON:
            # no triton gemm op is available, so we'll fallback to torch
            kernel_choice = "torch"
```
also weird, your kernel choice is doing double duty and is now a recipe. that recipe is also not very clear from your initial doc
do you mean kernel choice is used for both quantize and gemm?
what is the initial doc you are referring to?
Yeah exactly. The doc is just the code block and reading the kernel choice docstring
> kernel choice is used for both quantize and gemm

yeah, that's a decision we made before; according to Josh there is no need to have a kernel-level choice for now, just to keep things simple.
we did mention this in the KernelPreference doc I think?
IMO it should be:

- fbgemm - use all fbgemm kernels, error out if something is not supported
- torch - use all torch kernels, error out if something is not supported
- auto - torchao decides what to do

we should not use torch kernels in the fbgemm setting, as that is not honoring what the user asked for
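As a rough illustration of the strict policy proposed above, here is a minimal sketch (not the torchao implementation; the `resolve_quantize_backend` helper and the availability flag are made up for this example):

```python
from enum import Enum

class KernelPreference(Enum):
    FBGEMM = "fbgemm"  # only fbgemm/triton kernels; error out if something is unsupported
    TORCH = "torch"    # only torch ops; error out if something is unsupported
    AUTO = "auto"      # torchao decides what to do

def resolve_quantize_backend(pref: KernelPreference, fbgemm_available: bool) -> str:
    """Pick the backend for the quantize op under the strict 'honor the user's choice' policy."""
    if pref is KernelPreference.FBGEMM:
        if not fbgemm_available:
            # strict: never silently fall back to torch when the user asked for fbgemm
            raise RuntimeError("KernelPreference.FBGEMM requested but fbgemm_gpu_genai is not installed")
        return "fbgemm"
    if pref is KernelPreference.TORCH:
        return "torch"
    # AUTO: torchao decides; e.g. prefer fbgemm when it is available
    return "fbgemm" if fbgemm_available else "torch"
```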
Force-pushed 8ea051d to 74dd7dd
  
```python
    """Use triton quantize and quantized mm kernels (if available), requires fbgemm_gpu_genai library, if no triton kernel for the quantize op or mm kernel is available, we'll fallback to torch ops
    """
    TRITON = "triton"
```
this name isn't coherent with the rest of the enum. We already have an FBGEMM option which does not say anything about cutlass vs triton and therefore already includes these kernels. I think you have two options:

- have the fbgemm option pick the best kernel (cutlass vs triton) for the user. I prefer this one.
- make it clear that "FBGEMM" does not mean "FBGEMM", but really means "FBGEMM_CUTLASS", and also add "FBGEMM_TRITON". I don't really like this option.
 
OK thanks, yeah option 1 seems easiest for now, will update to that, unless there is a request to distinguish these in the future
Force-pushed 74dd7dd to 0b2ab3e
  
```python
    if (
        isinstance(granularity, PerTensor)
        and kernel_preference == KernelPreference.FBGEMM
    ):
```
Nit: let's xfail this
we are using unittest, seems like we can't do `return unittest.expectedFailure("...")`?

```
  File ".../ao/test/quantization/quantize_/workflows/float8/test_float8_tensor.py", line 92, in test_fp8_linear_variants
    return unittest.expectedFailure(
  File ".../python3.10/unittest/case.py", line 148, in expectedFailure
    test_item.__unittest_expecting_failure__ = True
AttributeError: 'str' object has no attribute '__unittest_expecting_failure__'
```

but let me know if there is an example of how to do expectedFailure conditionally instead of skipping the entire test
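For reference, `unittest.expectedFailure` is a decorator that must wrap the test callable itself, which is why returning it with a reason string from inside the test raises the `AttributeError` above. A minimal sketch of one way to apply it conditionally at definition time (the condition, helper, and test names here are illustrative, not the actual torchao test):

```python
import unittest

# illustrative flag; in the real test this would reflect whether the
# fbgemm per-tensor quantize op is expected to work
_FBGEMM_PER_TENSOR_WORKS = False

def expected_failure_if(condition):
    """Apply unittest.expectedFailure only when `condition` is true."""
    def decorator(test_func):
        return unittest.expectedFailure(test_func) if condition else test_func
    return decorator

class Float8TensorXfailExample(unittest.TestCase):
    @expected_failure_if(not _FBGEMM_PER_TENSOR_WORKS)
    def test_fp8_per_tensor_fbgemm(self):
        # stand-in for the real check; while the op is broken this fails,
        # and the decorator records it as an expected failure instead of an error
        self.assertTrue(_FBGEMM_PER_TENSOR_WORKS)

if __name__ == "__main__":
    unittest.main()
```

This only helps when the failing combination is known at definition time; for a parametrized test where it is only known inside the test body, `pytest.xfail(reason)` is the usual runtime alternative if the suite is run under pytest.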
          
Can you explain specifically what did not work, and why it works after this PR? It would also be good to have a test which fails before this PR and passes after this PR.
    
Force-pushed a1f4504 to a73fa51
  
@vkuzo sure, updated the PR summary and added a test for this one and the next PR as well
    
          
note that I don't think this is true, all values of
    
```python
        config = Float8DynamicActivationFloat8WeightConfig(granularity=granularity)
        self._test_moe_weight_reshape_ops(config)

    def test_expected_gpu_kernel_fbgemm(self):
```
I think this test should be together with the other tests we have which check the same thing for other settings of this config, currently in test_affine_quantized_float.py.  Can we add a TODO to unify?
yeah I think we can put everything here after we deprecate the AQT path in 9 months
Force-pushed a73fa51 to f685f8b
  
    
          
makes sense, it is user facing
    
Force-pushed f685f8b to 8bea88e
  
stack-info: PR: #2883, branch: jerryzh168/stack/59
Force-pushed 8bea88e to 4516f6e
  
    
Stacked PRs:
Fix Float8Tensor quantize op kernel preference dispatch
Summary:
Previously, if the user specifies kernel_preference == "fbgemm", we'll use torch ops like
`_choose_scale_float8` and `_quantize_affine_float8` to quantize the high precision Tensor into a float8 Tensor.
This PR makes sure we use fbgemm kernels when kernel_preference is "fbgemm", meaning:
`torch.ops.triton.quantize_fp8_row` for per row, and `torch.ops.fbgemm.quantize_fp8_per_tensor` for per tensor
(while `torch.ops.fbgemm.quantize_fp8_per_tensor` has some issues right now and we'll enable it later when it's fixed).
This doesn't have an impact on BC, meaning an old serialized model can still be loaded and run; the only change is that
fixing the kernel choice for the fbgemm kernel preference means users who requested the FBGEMM KernelPreference now actually run the fbgemm quantize op instead of the torch op.
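For illustration, a minimal sketch of the dispatch rule this summary describes (not the actual Float8Tensor code; the `expected_quantize_op` helper is made up, but the op names are the ones listed above):

```python
def expected_quantize_op(kernel_preference: str, per_row: bool) -> str:
    """Return which quantize kernel we expect to run for a given preference/granularity."""
    if kernel_preference == "fbgemm":
        if per_row:
            return "torch.ops.triton.quantize_fp8_row"
        # per-tensor fbgemm path; currently still pending an upstream fix
        return "torch.ops.fbgemm.quantize_fp8_per_tensor"
    # "torch" preference (and the old "fbgemm" behavior): torch reference ops
    return "_choose_scale_float8 + _quantize_affine_float8"

# e.g. the per-row fbgemm case exercised by test_expected_gpu_kernel_fbgemm
assert expected_quantize_op("fbgemm", per_row=True) == "torch.ops.triton.quantize_fp8_row"
```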
Test Plan:
python test/quantization/quantize_/workflows/float8/test_float8_tensor.py -k test_expected_gpu_kernel_fbgemm
Reviewers:
Subscribers:
Tasks:
Tags: