NVfp4 #2408

drisspg · 2025-06-18T20:33:13Z

Stacked PRs:

->NVfp4 #2408

Add NVFP4 Inference flow

Details:

I kept this separate from MX, but realistically we should probably merge the two. This adds basic support for block size 16 with e4m3 scales.
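For readers new to the format, here is a rough sketch of what "block size 16" scaling means. It is a pure-Python toy and not the code in this PR: the scale rule (block amax divided by the FP4 maximum of 6.0), the helper names, and the tie-breaking are all assumptions made for illustration, and the scales are left in full precision rather than cast to e4m3.

```python
# Toy block-wise FP4 (E2M1) quantization. Illustrative only.
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable magnitudes
FP4_E2M1_MAX = 6.0
BLOCK_SIZE = 16


def round_to_fp4(x: float) -> float:
    """Round x to the nearest FP4 (E2M1) value, saturating at +/-6.

    Ties resolve to the smaller magnitude because min() returns the first
    minimum; real hardware rounding may differ.
    """
    mag = min(abs(x), FP4_E2M1_MAX)
    nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - mag))
    return -nearest if x < 0 else nearest


def quantize_blockwise(values):
    """Return (per-block scales, FP4 values) for a flat list of floats."""
    assert len(values) % BLOCK_SIZE == 0, "length must be a multiple of 16"
    scales, quantized = [], []
    for start in range(0, len(values), BLOCK_SIZE):
        block = values[start:start + BLOCK_SIZE]
        amax = max(abs(v) for v in block)
        # Assumed rule: map the block's largest magnitude onto the FP4 maximum.
        scale = amax / FP4_E2M1_MAX if amax > 0 else 1.0
        scales.append(scale)
        quantized.extend(round_to_fp4(v / scale) for v in block)
    return scales, quantized


def dequantize_blockwise(scales, quantized):
    """Multiply every FP4 value by the scale of the block it belongs to."""
    return [q * scales[i // BLOCK_SIZE] for i, q in enumerate(quantized)]


data = [0.1 * i for i in range(-16, 16)]  # two blocks of 16 values
scales, q = quantize_blockwise(data)
recon = dequantize_blockwise(scales, q)
```

Each block of 16 values shares one scale, so the worst-case reconstruction error within a block is bounded by that block's scale times the largest half-gap in the FP4 grid.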

Double Quant Update

Ignore previous comments: the double quant is actually very similar to NF4, where you scale the fp32 scales prior to casting to e4m3 to try to reduce scale quantization error.

I have that implemented now in the NVFP4 code if a tensor_scale is given; I just need to figure out how to thread it to the cuBLAS param scale_in_d, or how we want to expose this. We currently don't expose the C matrix to the Python API, so we could use alpha as @gau-nernst pointed out to me; however, we don't expose alpha either 🙃. Also, if we wanted to use alpha we would need the value on the host, and the sync would likely rule out this option. I might keep this double quant on hold until we have the public API, since I am thinking about adding scale overloads to addmm. That said, I have read the cuBLAS docs many times and it feels as though passing it to scale result should work, since we don't set the d_mode and its default value should work.
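To make the double quant idea concrete, here is a small pure-Python sketch. It emulates a saturating e4m3 cast by hand and compares casting fp32 block scales directly against first dividing them by a per-tensor scale. The choice of per-tensor scale (largest block scale divided by 448) and all helper names are assumptions for illustration; this is not the code in the PR, which also clamps to a minimum value instead of flushing to zero.

```python
import math

F8E4M3_MAX = 448.0        # largest finite float8_e4m3fn value
E4M3_MIN_EXP = -6         # exponent of the smallest normal value
E4M3_MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Emulate a saturating, round-to-nearest-even cast of x >= 0 to e4m3."""
    if x <= 0.0:
        return 0.0
    x = min(x, F8E4M3_MAX)
    _, e = math.frexp(x)                    # x = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, E4M3_MIN_EXP)          # subnormals share the minimum exponent
    step = 2.0 ** (exp - E4M3_MANTISSA_BITS)
    return round(x / step) * step           # Python's round() is ties-to-even


def quantize_scales(block_scales, per_tensor_scale=None):
    """Return the fp32 scales recovered after an e4m3 round trip."""
    if per_tensor_scale is None:
        return [round_to_e4m3(s) for s in block_scales]
    return [round_to_e4m3(s / per_tensor_scale) * per_tensor_scale
            for s in block_scales]


def max_rel_error(reference, approx):
    """Largest relative error between reference scales and their round trip."""
    return max(abs(r - a) / r for r, a in zip(reference, approx))


block_scales = [1e-4, 3e-4, 5e-4]           # small fp32 block scales
# Hypothetical choice: stretch the largest block scale to the e4m3 maximum.
per_tensor_scale = max(block_scales) / F8E4M3_MAX

single = quantize_scales(block_scales)
double = quantize_scales(block_scales, per_tensor_scale)
print("single-level max relative error:", max_rel_error(block_scales, single))
print("double-quant max relative error:", max_rel_error(block_scales, double))
```

With these inputs the direct cast underflows every scale to zero, while the rescaled version keeps each scale within the normal e4m3 range, so its relative error stays below the 1/16 bound for a 3-bit mantissa.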

Early Perf

No double quant here

python /home/drisspg/meta/vllm/benchmarks/benchmark_throughput.py \
 --backend vllm \
 --model "data/nvfp4-Qwen3-8B" \
 --dataset-name sharegpt \
 --dataset-path data/ShareGPT_V3_unfiltered_cleaned_split.json \
 --num-prompts 1024 \
 --disable-log-stats \
 --gpu-memory-utilization=0.9 \
 --seed 42

Throughput: 43.23 requests/s, 18347.24 total tokens/s, 8840.47 output tokens/s
Total num prompt tokens:  225190
Total num output tokens:  209407

This is even worse than mxfp4; will profile later.

Micro Bench

Llama 70B MLP, no TP:

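| Model Configuration | Runtime (μs/iteration) | Speedup vs BF16 |
|:-------------------:|-----------------------:|----------------:|
| BF16                |                1353.09 |           1.00x |
| mxfp8               |                 766.76 |           1.76x |
| mxfp4               |                 638.00 |           2.12x |
| nvfp4               |                 540.41 |           2.50x |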
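The speedup column is just the BF16 runtime divided by each configuration's runtime. A quick check using only the numbers from the table:

```python
# Runtimes in microseconds per iteration, copied from the micro benchmark table.
runtimes_us = {"BF16": 1353.09, "mxfp8": 766.76, "mxfp4": 638.00, "nvfp4": 540.41}

baseline = runtimes_us["BF16"]
speedups = {name: baseline / t for name, t in runtimes_us.items()}

for name, s in speedups.items():
    print(f"{name}: {s:.2f}x")
```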

Diffusers

# Bf16 Compile
|           ckpt_id            |   batch_size |  fuse  |  compile  |  compile_vae  |  quantization  |  sparsify  |   model_memory |   inference_memory |   time |
|:----------------------------:|-------------:|:------:|:---------:|:-------------:|:--------------:|:----------:|---------------:|-------------------:|-------:|
| black-forest-labs/FLUX.1-dev |            1 | False  |   True    |     False     |      None      |   False    |         31.438 |             33.827 |  3.286 |

Errors

Annoyingly, we are getting an error due to the view as fp4x2 + packing https://fburl.com/cd92w431 because this is trying to be bitcast inside a Triton kernel. Not sure how this didn't show up until vllm with mxfp4.
^ similar to this: triton-lang/triton#6054 but make the same changes in _inductor/utils.py as we did for float8em0
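For context on the "fp4x2 + packing" part: two 4-bit codes share one byte, so the packed tensor is stored as bytes and reinterpreted when needed. A minimal pure-Python sketch of the packing step follows; the nibble order (even index in the low nibble) and the function names are assumptions for illustration and may not match what the PR or the kernels use.

```python
def pack_fp4x2(codes):
    """Pack 4-bit codes (0..15) into bytes, two codes per byte."""
    assert len(codes) % 2 == 0, "need an even number of codes"
    assert all(0 <= c <= 0xF for c in codes), "codes must fit in 4 bits"
    # Assumed layout: even-indexed code in the low nibble, odd-indexed in the high nibble.
    return bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))


def unpack_fp4x2(packed):
    """Inverse of pack_fp4x2: split each byte back into two 4-bit codes."""
    codes = []
    for byte in packed:
        codes.append(byte & 0xF)   # low nibble first
        codes.append(byte >> 4)    # then high nibble
    return codes


codes = [1, 2, 15, 0]
packed = pack_fp4x2(codes)
assert unpack_fp4x2(packed) == codes
```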

Numerics

pytorch-bot · 2025-06-18T20:33:17Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2408

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures

As of commit d85d39a with merge base 101c039 ():

NEW FAILURES - The following jobs have failed:

Run Regression Tests / test (CPU 2.7, linux.4xlarge, torch==2.7.0 --index-url https://download.pytorch.org/whl/cpu, cpu) / linux-job (gh)
test/quantization/pt2e/test_numeric_debugger.py::TestNumericDebuggerInfra::test_prepare_for_propagation_comparison
Run Regression Tests / test (CUDA 2.7, linux.g5.12xlarge.nvidia.gpu, torch==2.7.0, cuda, 12.6) / linux-job (gh)
test/quantization/pt2e/test_numeric_debugger.py::TestNumericDebuggerInfra::test_prepare_for_propagation_comparison
Run Regression Tests / test-nightly (CPU Nightly, linux.4xlarge, --pre torch --index-url https://download.pytorch.org/wh... / linux-job (gh)
RuntimeError: Command docker exec -t 669bec87c8a304d6313218af7fca8566a24b04d3a50828b7a168acc1effef16e /exec failed with exit code 1
Run Regression Tests / test-nightly (CUDA Nightly, linux.g5.12xlarge.nvidia.gpu, --pre torch --index-url https://downloa... / linux-job (gh)
RuntimeError: Command docker exec -t db150226c32ca760871208e019ac234aaa6f0499148bb3947e4d0525b64e570c /exec failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

stack-info: PR: #2408, branch: drisspg/stack/78

drisspg · 2025-06-20T02:40:17Z

@vkuzo Curious if you agree we should roll this into the existing mx tensor?

stack-info: PR: #2408, branch: drisspg/stack/78

vkuzo · 2025-06-21T02:51:11Z

torchao/prototype/mx_formats/mx_subclass.py

+        "scale": None,
+    }
+
+    quantized_weight = to_linear_activation_quantized(


can we just write the logic here instead of using to_linear_activation_quantized? I remember same feedback on the mxfp4 inference tensor.

So you can't just move the logic out here, the entirety of the forward behavior has to be "wrapped" by the subclass. Currently there are two ways to do that, without changing nn.modules.

Like above; this is subclass composition

The other is to copy the same behavior into the implementations of the ops,
e.g.

NVFP4's dispatch would need to copy:

ao/torchao/quantization/linear_activation_quantized_tensor.py

Lines 135 to 186 in 7192edf

@implements([torch.nn.functional.linear, aten.linear.default])
def _(func, types, args, kwargs):
    input_tensor, weight_tensor, bias = (
        args[0],
        args[1],
        args[2] if len(args) > 2 else None,
    )
    if isinstance(weight_tensor, LinearActivationQuantizedTensor):
        return weight_tensor._quantized_linear_op(input_tensor, weight_tensor, bias)

    raise NotImplementedError(
        "LinearActivationQuantizedTensor: No specialized dispatch found for linear op"
    )


@implements([aten.mm.default, aten.addmm.default])
def _(func, types, args, kwargs):
    if not args[0].is_floating_point():
        raise NotImplementedError(
            "LinearActivationQuantizedTensor: expecting a floating point input"
        )

    if func == aten.addmm.default:
        assert args[1].shape[-1] == args[2].shape[0], (
            f"need mat1 shape: {args[1].shape} final"
            f"dim to match mat2 shape: {args[2].shape} first dim "
        )
        input_tensor, weight_tensor, bias = (
            args[1],
            args[2],
            args[0],
        )
        input_quant_func = weight_tensor.input_quant_func
        original_weight_tensor = weight_tensor.original_weight_tensor
        qtensor = input_quant_func(input_tensor, **weight_tensor.quant_kwargs)
        return func(bias, qtensor, original_weight_tensor)
    else:
        # aten.mm.default
        assert args[0].shape[-1] == args[1].shape[0], (
            f"need mat1 shape: {args[0].shape} final dim"
            f"to match mat2 shape: {args[1].shape} first dim"
        )
        input_tensor, weight_tensor = (
            args[0],
            args[1],
        )
        input_quant_func = weight_tensor.input_quant_func
        original_weight_tensor = weight_tensor.original_weight_tensor
        qtensor = input_quant_func(input_tensor, **weight_tensor.quant_kwargs)
        return func(qtensor, original_weight_tensor)

Not the end of the world. But for some subclasses that serve multiple purposes (dynamic, weight-only, static, training) it can be a lot of switch statements in the ops, as opposed to having the base subclass plus some sugar.
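To illustrate the composition option from this reply without any PyTorch machinery, here is a toy sketch in plain Python. A wrapper object carries the weight plus the activation quantizer, and a single dispatch function applies the quantizer before the matmul. Every name here (`ActivationQuantizedWeight`, `dispatch_mm`, `round_to_step`) is hypothetical, and nested lists stand in for tensors; the real subclass hooks into op dispatch rather than an explicit function call.

```python
class ActivationQuantizedWeight:
    """Toy wrapper: stores a weight plus the function used to quantize inputs."""

    def __init__(self, original_weight, input_quant_func, quant_kwargs=None):
        self.original_weight = original_weight
        self.input_quant_func = input_quant_func
        self.quant_kwargs = quant_kwargs or {}


def matmul(a, b):
    """Plain list-of-lists matrix multiply standing in for aten.mm."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def dispatch_mm(input_tensor, weight):
    """Quantize the activation first when the weight is wrapped, then multiply."""
    if isinstance(weight, ActivationQuantizedWeight):
        qinput = weight.input_quant_func(input_tensor, **weight.quant_kwargs)
        return matmul(qinput, weight.original_weight)
    return matmul(input_tensor, weight)


def round_to_step(tensor, step=0.5):
    """Hypothetical activation quantizer: snap each element to a multiple of step."""
    return [[round(v / step) * step for v in row] for row in tensor]


weight = [[1.0, 0.0], [0.0, 2.0]]
wrapped = ActivationQuantizedWeight(weight, round_to_step, {"step": 0.5})
x = [[0.26, 1.1]]

plain_out = dispatch_mm(x, weight)      # no activation quantization
wrapped_out = dispatch_mm(x, wrapped)   # activation snapped to 0.5 steps first
print(plain_out, wrapped_out)
```

The inner weight type never needs to know about activation quantization here, which is the appeal of composition; the alternative is to repeat the `input_quant_func` handling inside each tensor type's own op implementations.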

vkuzo · 2025-06-21T02:56:16Z

torchao/prototype/mx_formats/nvfp4_tensor.py

+    M, K = a.shape[0], a.shape[1]
+    N = b.shape[1]
+
+    # Swizzle Dizzle


Found in: pytorch/ao#2408 Pull Request resolved: #156461 Approved by: https://github.com/vkuzo

stack-info: PR: #2408, branch: drisspg/stack/78

drisspg · 2025-06-21T07:11:00Z

torchao/prototype/mx_formats/nvfp4_tensor.py

+    if per_tensor_scale is None:
+        # We are doing single level scaling
+        block_scale_fp8 = torch.clamp(block_scale, min=E4M3_EPS, max=F8E4M3_MAX).to(
+            torch.float8_e4m3fn


The down-up-down pattern of casts is probably overkill.

drisspg added a commit that referenced this pull request Jun 18, 2025

WIP NVfp4

3948f5d

stack-info: PR: #2408, branch: drisspg/stack/78

drisspg force-pushed the drisspg/stack/78 branch from c58c5b0 to 3948f5d Compare June 18, 2025 20:33

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 18, 2025

drisspg added a commit that referenced this pull request Jun 18, 2025

WIP NVfp4

1025236

stack-info: PR: #2408, branch: drisspg/stack/78

drisspg force-pushed the drisspg/stack/78 branch from 3948f5d to 1025236 Compare June 18, 2025 21:05

drisspg added a commit that referenced this pull request Jun 18, 2025

WIP NVfp4

1c007a4

stack-info: PR: #2408, branch: drisspg/stack/78

drisspg force-pushed the drisspg/stack/78 branch from 1025236 to 1c007a4 Compare June 18, 2025 21:30

drisspg added mx topic: new feature Use this tag if this PR adds a new feature labels Jun 19, 2025

drisspg added a commit that referenced this pull request Jun 19, 2025

WIP NVfp4

a3d2874

stack-info: PR: #2408, branch: drisspg/stack/78

drisspg force-pushed the drisspg/stack/78 branch from 1c007a4 to a3d2874 Compare June 19, 2025 04:19

drisspg added a commit that referenced this pull request Jun 19, 2025

WIP NVfp4

034f892

stack-info: PR: #2408, branch: drisspg/stack/78

drisspg force-pushed the drisspg/stack/78 branch from a3d2874 to 034f892 Compare June 19, 2025 04:26

drisspg added a commit that referenced this pull request Jun 19, 2025

WIP NVfp4

92e0622

stack-info: PR: #2408, branch: drisspg/stack/78

drisspg force-pushed the drisspg/stack/78 branch from 034f892 to 92e0622 Compare June 19, 2025 04:27

drisspg added a commit that referenced this pull request Jun 19, 2025

WIP NVfp4

b2c45a1

stack-info: PR: #2408, branch: drisspg/stack/78

drisspg force-pushed the drisspg/stack/78 branch from 92e0622 to b2c45a1 Compare June 19, 2025 04:38

drisspg added a commit that referenced this pull request Jun 19, 2025

WIP NVfp4

7448f45

stack-info: PR: #2408, branch: drisspg/stack/78

drisspg force-pushed the drisspg/stack/78 branch from b2c45a1 to 7448f45 Compare June 19, 2025 04:56

drisspg added a commit that referenced this pull request Jun 19, 2025

WIP NVfp4

fad58b5

stack-info: PR: #2408, branch: drisspg/stack/78

drisspg force-pushed the drisspg/stack/78 branch from 7448f45 to fad58b5 Compare June 19, 2025 16:00

drisspg added a commit that referenced this pull request Jun 19, 2025

WIP NVfp4

b5a593d

stack-info: PR: #2408, branch: drisspg/stack/78

drisspg force-pushed the drisspg/stack/78 branch from fad58b5 to b5a593d Compare June 19, 2025 16:03

drisspg added a commit that referenced this pull request Jun 19, 2025

WIP NVfp4

2b4ba64

stack-info: PR: #2408, branch: drisspg/stack/78

drisspg force-pushed the drisspg/stack/78 branch from b5a593d to 2b4ba64 Compare June 19, 2025 16:31

drisspg added a commit that referenced this pull request Jun 19, 2025

WIP NVfp4

5d50579

stack-info: PR: #2408, branch: drisspg/stack/78

drisspg force-pushed the drisspg/stack/78 branch from 2b4ba64 to 5d50579 Compare June 19, 2025 23:00

drisspg added a commit that referenced this pull request Jun 19, 2025

WIP NVfp4

29fa9ef

stack-info: PR: #2408, branch: drisspg/stack/78

drisspg force-pushed the drisspg/stack/78 branch from 5d50579 to 29fa9ef Compare June 19, 2025 23:36

drisspg force-pushed the drisspg/stack/78 branch from bde7328 to b5e3a78 Compare June 20, 2025 02:39

drisspg changed the title ~~Add NVfp4 Inference Flow~~ NVfp4 Jun 20, 2025

drisspg requested a review from vkuzo June 20, 2025 02:39

drisspg mentioned this pull request Jun 20, 2025

Workaround for e4m2 dtype pytorch/pytorch#156461

Closed

drisspg added a commit that referenced this pull request Jun 20, 2025

NVfp4

8b5df79

stack-info: PR: #2408, branch: drisspg/stack/78

drisspg force-pushed the drisspg/stack/78 branch from b5e3a78 to 8b5df79 Compare June 20, 2025 21:20

drisspg added a commit that referenced this pull request Jun 20, 2025

NVfp4

79720b3

stack-info: PR: #2408, branch: drisspg/stack/78

drisspg force-pushed the drisspg/stack/78 branch from 8b5df79 to 79720b3 Compare June 20, 2025 22:04

drisspg added a commit that referenced this pull request Jun 20, 2025

NVfp4

f194e35

stack-info: PR: #2408, branch: drisspg/stack/78

drisspg force-pushed the drisspg/stack/78 branch from 79720b3 to f194e35 Compare June 20, 2025 23:56

drisspg added a commit that referenced this pull request Jun 21, 2025

NVfp4

b08a108

stack-info: PR: #2408, branch: drisspg/stack/78

drisspg force-pushed the drisspg/stack/78 branch from f194e35 to b08a108 Compare June 21, 2025 00:17

drisspg marked this pull request as draft June 21, 2025 00:46

vkuzo reviewed Jun 21, 2025

torchao/prototype/mx_formats/mx_subclass.py Outdated

vkuzo reviewed Jun 21, 2025

torchao/prototype/mx_formats/nvfp4_tensor.py

vkuzo reviewed Jun 21, 2025


pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Jun 21, 2025

Workaround for e4m2 dtype (#156461)

88b9c28

Found in: pytorch/ao#2408 Pull Request resolved: #156461 Approved by: https://github.com/vkuzo

drisspg added a commit that referenced this pull request Jun 21, 2025

NVfp4

ac989f2

stack-info: PR: #2408, branch: drisspg/stack/78

drisspg force-pushed the drisspg/stack/78 branch from b08a108 to ac989f2 Compare June 21, 2025 05:54

drisspg added a commit that referenced this pull request Jun 21, 2025

NVfp4

e9708eb

stack-info: PR: #2408, branch: drisspg/stack/78

drisspg force-pushed the drisspg/stack/78 branch from ac989f2 to e9708eb Compare June 21, 2025 06:22

drisspg added a commit that referenced this pull request Jun 21, 2025

NVfp4

b03916b

stack-info: PR: #2408, branch: drisspg/stack/78

drisspg force-pushed the drisspg/stack/78 branch 2 times, most recently from b03916b to d62d63b Compare June 21, 2025 06:25

drisspg added a commit that referenced this pull request Jun 21, 2025

NVfp4

d62d63b

stack-info: PR: #2408, branch: drisspg/stack/78

NVfp4

d85d39a

stack-info: PR: #2408, branch: drisspg/stack/78

drisspg force-pushed the drisspg/stack/78 branch from d62d63b to d85d39a Compare June 21, 2025 06:51


