Conversation

@kylesayrs (Contributor) commented on Sep 2, 2025

Purpose

  • Accelerate inference of Hadamard transforms using the Hadacore kernels provided by Meta.

Dense Model Latency (sec)

| Base   | Hadacore | GEMM   |
|--------|----------|--------|
| 0.4710 | 0.4948   | 1.3946 |

Quantized Model Latency (sec)

| Base (W4A16) | Hadacore | GEMM   |
|--------------|----------|--------|
| 0.4402       | 0.4489   | 1.2917 |

Changes

  • Adapted the Hadacore implementation to expose `void hadacore_transform(torch::Tensor& x)`
  • Support loading and applying transforms with `head_dim` (i.e., applying the transform to groups of activations); see the sketch below
  • Expand the stub for `qutlass_fp4_scheme`
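
To make the `head_dim` grouping concrete, here is a minimal CPU-side sketch of what the transform computes when applied to groups of activations. It is not the Hadacore CUDA kernel; `hadamard_matrix` and `reference_hadacore_transform` are hypothetical names used only for illustration, and the sketch assumes libtorch is available.

```cpp
// Minimal CPU-side sketch (not the Hadacore CUDA kernel): what an in-place
// Hadamard transform over groups of `head_dim` activations computes.
// `hadamard_matrix` and `reference_hadacore_transform` are hypothetical names
// used only for illustration.
#include <torch/torch.h>
#include <cmath>
#include <iostream>

// Sylvester construction of a normalized Hadamard matrix (n must be a power of 2).
torch::Tensor hadamard_matrix(int64_t n) {
  TORCH_CHECK(n > 0 && (n & (n - 1)) == 0, "n must be a power of two");
  auto h = torch::ones({1, 1});
  while (h.size(0) < n) {
    h = torch::cat({torch::cat({h, h}, 1), torch::cat({h, -h}, 1)}, 0);
  }
  return h / std::sqrt(static_cast<double>(n));  // orthogonal: H * H^T = I
}

// View x as (-1, head_dim) and rotate each group of activations in place.
void reference_hadacore_transform(torch::Tensor& x, int64_t head_dim) {
  auto groups = x.reshape({-1, head_dim});  // a view for contiguous inputs
  auto h = hadamard_matrix(head_dim).to(x.scalar_type());
  groups.copy_(torch::matmul(groups, h));   // writes back into x's storage
}

int main() {
  auto x = torch::randn({2, 8, 64});              // (batch, tokens, hidden)
  reference_hadacore_transform(x, /*head_dim=*/64);
  std::cout << x.sizes() << std::endl;            // shape is unchanged
  return 0;
}
```
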

Testing

  • Validated performance using latency benchmarks
  • Verified model sanity; more thorough evals to come

@AlpinDale (Contributor) commented

Am I misunderstanding or does this increase latency?

@kylesayrs (Contributor, Author) replied

@AlpinDale Yes, Hadamard transforms do increase latency slightly, but they also lead to significantly better accuracy recovery for lower-bit quantizations.

See QuIP# and SpinQuant
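
For intuition (a sketch of the standard argument, not text from this PR): because a normalized Hadamard matrix is orthogonal, a rotation pair can be inserted into a linear layer without changing its output, and quantization is then applied to the rotated weights and activations, whose outliers are spread more evenly across channels.

```latex
% H is a normalized Hadamard matrix, so H H^T = I (orthogonal).
% Inserting the rotation pair leaves the layer's output unchanged:
%   y = W x = (W H^T)(H x)
% Quantizing W H^T and H x instead of W and x preserves the full-precision
% function exactly, while the rotation spreads outlier channels and reduces
% low-bit quantization error.
y = W x = \left(W H^{\top}\right)\left(H x\right), \qquad H H^{\top} = I
```
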

@ProExpertProg (Collaborator) left a comment

A few minor comments, otherwise LGTM.

@kylesayrs kylesayrs force-pushed the kylesayrs/hadacore branch 3 times, most recently from 75aee7d to 37686d6 on September 6, 2025 20:39
@ProExpertProg ProExpertProg added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 6, 2025
@kylesayrs kylesayrs force-pushed the kylesayrs/hadacore branch 4 times, most recently from 219d1f5 to 0c5c3c0 on September 10, 2025 14:38
@ProExpertProg (Collaborator) left a comment

Just nits, LGTM.

@mgoin (Member) left a comment

@kylesayrs this kernel seems to add some decent bloat to the wheel.

Before: [2025-09-12T15:21:13Z] #31 0.650 Wheel dist/vllm-0.10.2rc3.dev22+ge776c4194.precompiled-cp38-abi3-linux_x86_64.whl is within the allowed size (396.15 MB).
After: [2025-09-11T15:04:08Z] #31 0.693 Wheel dist/vllm-0.10.2rc2.dev107+g2f4975fc8-cp38-abi3-linux_x86_64.whl is within the allowed size (416.34 MB).

Please take a look at reducing its impact. This is especially concerning since it seems only SM80 is being compiled in the CMake def. Is it compiling PTX to work on other arches too?

Comment on lines +736 to +764
Member: Do we need all of these configs? The kernel size doesn't seem trivial.

Contributor (Author): See below comments.

csrc/ops.h (outdated)
Member: Is this kernel able to be compiled on ROCm? If not, then it needs to be guarded to be CUDA-only.

Contributor (Author):

`cuda_archs_loose_intersection(HADACORE_ARCHS "8.0" "${CUDA_ARCHS}")`

Is this CMake compilation check not enough?

Collaborator: I think the answer is no, but not 100% sure.

Contributor (Author): None of the other CUDA-only kernels I see have any guards that I haven't included, AFAICT.
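
For context, a source-level guard could look like the following. This is an illustrative sketch only; the `USE_ROCM` macro name is an assumption about the build conventions, and this is not a claim about how the PR itself handles it.

```cpp
// Sketch of a source-level CUDA-only guard, complementary to the CMake arch
// check quoted above. USE_ROCM is an assumed macro name; hadacore_transform
// is the op discussed in this PR.
#include <torch/torch.h>

#ifndef USE_ROCM
// CUDA builds: the real kernel entry point, defined in the .cu file.
void hadacore_transform(torch::Tensor& x);
#else
// ROCm builds: a stub that fails loudly instead of producing a link error.
void hadacore_transform(torch::Tensor& x) {
  TORCH_CHECK(false, "hadacore_transform is only supported on CUDA builds");
}
#endif
```
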

@hmellor (Member) commented on Sep 12, 2025

The pre-commit failure is fixed if you merge from main.

@kylesayrs (Contributor, Author) commented

@mgoin Hadacore was written for sm80, sm89, and sm90 only. However, as I understand it, the CUDA guard in CMakeLists.txt only compiles it for sm80 (we should probably update this to compile for any arch >= 80).

I only see a difference of 72 MB -> 73 MB in vllm/_C.abi3.so when compiling Hadacore. Maybe I'm looking at the wrong compilation target?

@pytorch-bot pytorch-bot bot removed the ci/build label Sep 15, 2025
@mergify mergify bot added the ci/build label Sep 15, 2025
@pytorch-bot (bot) commented on Sep 15, 2025

No ciflow labels are configured for this repo.
For information on how to enable the CIFlow bot, see this wiki.

@kylesayrs kylesayrs requested a review from mgoin September 15, 2025 09:36
@kylesayrs kylesayrs marked this pull request as ready for review September 15, 2025 09:44
@mgoin mgoin merged commit a0b2670 into vllm-project:main Sep 15, 2025
81 of 82 checks passed
tlrmchlsmth pushed a commit to tlrmchlsmth/vllm that referenced this pull request Sep 15, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
Labels: ci/build, ready (ONLY add when PR is ready to merge/full CI is needed)

6 participants