Conversation

@kylesayrs (Contributor) commented on Sep 2, 2025

Purpose

  • Accelerate inference of Hadamard transforms using the Hadacore kernels provided by Meta.

Dense Model Latency (sec)

| Base   | Hadacore | GEMM   |
|--------|----------|--------|
| 0.4710 | 0.4948   | 1.3946 |

Quantized Model Latency (sec)

| Base (W4A16) | Hadacore | GEMM   |
|--------------|----------|--------|
| 0.4402       | 0.4489   | 1.2917 |

Changes

  • Adapted the Hadacore implementation to expose `void hadacore_transform(torch::Tensor& x)`
  • Support loading and applying transforms with `head_dim` (i.e., applying the transform to groups of activations); see the sketch below
  • Expand the stub for `qutlass_fp4_scheme`
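
To make the `head_dim` grouping concrete, here is a minimal CPU-side sketch of what the transform computes when applied to groups of activations. It is not the Hadacore CUDA kernel; `hadamard_matrix` and `reference_hadacore_transform` are hypothetical names used only for illustration, and the sketch assumes libtorch is available.

```cpp
// Minimal CPU-side sketch (not the Hadacore CUDA kernel): what an in-place
// Hadamard transform over groups of `head_dim` activations computes.
// `hadamard_matrix` and `reference_hadacore_transform` are hypothetical names
// used only for illustration.
#include <torch/torch.h>
#include <cmath>
#include <iostream>

// Sylvester construction of a normalized Hadamard matrix (n must be a power of 2).
torch::Tensor hadamard_matrix(int64_t n) {
  TORCH_CHECK(n > 0 && (n & (n - 1)) == 0, "n must be a power of two");
  auto h = torch::ones({1, 1});
  while (h.size(0) < n) {
    h = torch::cat({torch::cat({h, h}, 1), torch::cat({h, -h}, 1)}, 0);
  }
  return h / std::sqrt(static_cast<double>(n));  // orthogonal: H * H^T = I
}

// View x as (-1, head_dim) and rotate each group of activations in place.
void reference_hadacore_transform(torch::Tensor& x, int64_t head_dim) {
  auto groups = x.reshape({-1, head_dim});  // a view for contiguous inputs
  auto h = hadamard_matrix(head_dim).to(x.scalar_type());
  groups.copy_(torch::matmul(groups, h));   // writes back into x's storage
}

int main() {
  auto x = torch::randn({2, 8, 64});              // (batch, tokens, hidden)
  reference_hadacore_transform(x, /*head_dim=*/64);
  std::cout << x.sizes() << std::endl;            // shape is unchanged
  return 0;
}
```
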

Testing

  • Validated performance using latency benchmarks
  • Verified model sanity; more thorough evals to come

@AlpinDale (Contributor) commented

Am I misunderstanding or does this increase latency?

@kylesayrs (Contributor, Author) replied

@AlpinDale Yes, Hadamard transforms do increase latency slightly, but they also lead to significantly better accuracy recovery for lower-bit quantizations.

See QuIP# and SpinQuant
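
For intuition (a sketch of the standard argument, not text from this PR): because a normalized Hadamard matrix is orthogonal, a rotation pair can be inserted into a linear layer without changing its output, and quantization is then applied to the rotated weights and activations, whose outliers are spread more evenly across channels.

```latex
% H is a normalized Hadamard matrix, so H H^T = I (orthogonal).
% Inserting the rotation pair leaves the layer's output unchanged:
%   y = W x = (W H^T)(H x)
% Quantizing W H^T and H x instead of W and x preserves the full-precision
% function exactly, while the rotation spreads outlier channels and reduces
% low-bit quantization error.
y = W x = \left(W H^{\top}\right)\left(H x\right), \qquad H H^{\top} = I
```
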

@ProExpertProg (Collaborator) left a comment

A few minor comments, otherwise LGTM.

@kylesayrs kylesayrs force-pushed the kylesayrs/hadacore branch 3 times, most recently from 75aee7d to 37686d6 on September 6, 2025 20:39
@ProExpertProg ProExpertProg added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 6, 2025
@kylesayrs kylesayrs force-pushed the kylesayrs/hadacore branch 4 times, most recently from 219d1f5 to 0c5c3c0 on September 10, 2025 14:38
@ProExpertProg (Collaborator) left a comment

Just nits, LGTM.

@mgoin (Member) left a comment

@kylesayrs this kernel seems to add some decent bloat to the wheel.

Before: [2025-09-12T15:21:13Z] #31 0.650 Wheel dist/vllm-0.10.2rc3.dev22+ge776c4194.precompiled-cp38-abi3-linux_x86_64.whl is within the allowed size (396.15 MB).
After: [2025-09-11T15:04:08Z] #31 0.693 Wheel dist/vllm-0.10.2rc2.dev107+g2f4975fc8-cp38-abi3-linux_x86_64.whl is within the allowed size (416.34 MB).

Please take a look at reducing its impact. This is especially concerning since it seems only SM80 is being compiled in the CMake def. Is it compiling PTX to work on other arches too?

Comment on lines +736 to +764
Member: Do we need all of these configs? The kernel size doesn't seem trivial.

Contributor (Author): See below comments.

csrc/ops.h (outdated)
Member: Is this kernel able to be compiled on ROCm? If not, then it needs to be guarded to be CUDA-only.

Contributor (Author):

`cuda_archs_loose_intersection(HADACORE_ARCHS "8.0" "${CUDA_ARCHS}")`

Is this CMake compilation check not enough?

Collaborator: I think the answer is no, but not 100% sure.

Contributor (Author): None of the other CUDA-only kernels I see have any guards that I haven't included, AFAICT.
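
For context, a source-level guard could look like the following. This is an illustrative sketch only; the `USE_ROCM` macro name is an assumption about the build conventions, and this is not a claim about how the PR itself handles it.

```cpp
// Sketch of a source-level CUDA-only guard, complementary to the CMake arch
// check quoted above. USE_ROCM is an assumed macro name; hadacore_transform
// is the op discussed in this PR.
#include <torch/torch.h>

#ifndef USE_ROCM
// CUDA builds: the real kernel entry point, defined in the .cu file.
void hadacore_transform(torch::Tensor& x);
#else
// ROCm builds: a stub that fails loudly instead of producing a link error.
void hadacore_transform(torch::Tensor& x) {
  TORCH_CHECK(false, "hadacore_transform is only supported on CUDA builds");
}
#endif
```
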

@hmellor (Member) commented on Sep 12, 2025

The pre-commit failure is fixed if you merge from main.

@kylesayrs (Contributor, Author) commented

@mgoin Hadacore was written for sm80, sm89, and sm90 only. However, as I understand it, the CUDA guard in CMakeLists.txt only compiles it for sm80 (we should probably update this to compile for any arch >= 80).

I only see a difference of 72 MB -> 73 MB in vllm/_C.abi3.so when compiling Hadacore. Maybe I'm looking at the wrong compilation target?

@pytorch-bot pytorch-bot bot removed the ci/build label Sep 15, 2025
@mergify mergify bot added the ci/build label Sep 15, 2025
@pytorch-bot (bot) commented on Sep 15, 2025

No ciflow labels are configured for this repo.
For information on how to enable the CIFlow bot, see this wiki.

@kylesayrs kylesayrs requested a review from mgoin September 15, 2025 09:36
@kylesayrs kylesayrs marked this pull request as ready for review September 15, 2025 09:44
@mgoin mgoin merged commit a0b2670 into vllm-project:main Sep 15, 2025
81 of 82 checks passed
tlrmchlsmth pushed a commit to tlrmchlsmth/vllm that referenced this pull request Sep 15, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
Labels: ci/build, ready (ONLY add when PR is ready to merge/full CI is needed)

6 participants