[Transform] Deterministic Hadacore Transforms #24106
Conversation
Force-pushed from 652d578 to be1f377
Am I misunderstanding, or does this increase latency?
@AlpinDale Yes, Hadamard transforms do increase latency slightly, but they also lead to significantly better accuracy recovery for lower-bit quantizations.
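For context on why the rotation can be inserted without changing model outputs, here is a minimal sketch (it uses scipy's `hadamard` purely for illustration and is not the PR's fused kernel): for an orthonormal Hadamard matrix H, (xH)(HᵀW) = xW, so the transform folds into the weights while the rotated activations tend to have fewer outliers and quantize better at low bit widths.

```python
# Minimal sketch: a Hadamard rotation is output-preserving for a linear layer.
# Illustration only; the PR implements this as a fused CUDA kernel.
import torch
from scipy.linalg import hadamard  # power-of-two Hadamard matrix

d = 64                                                   # transform size (e.g. head_dim)
H = torch.tensor(hadamard(d), dtype=torch.float32) / d ** 0.5  # orthonormal Hadamard
x = torch.randn(8, d)                                    # activations
W = torch.randn(d, 32)                                   # weights

y_ref = x @ W                                            # original linear layer
y_rot = (x @ H) @ (H.T @ W)                              # rotate activations, fold H into W
assert torch.allclose(y_ref, y_rot, atol=1e-3)           # same output up to float error
```

The extra matmul (or the dedicated hadacore kernel here) is the source of the small latency cost mentioned above.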
Force-pushed from be1f377 to 600b3d2
ProExpertProg left a comment:
A few minor comments, otherwise LGTM.
...el_executor/layers/quantization/compressed_tensors/transform/schemes/linear_qutlass_nvfp4.py (outdated, resolved)
vllm/model_executor/layers/quantization/compressed_tensors/transform/module.py (outdated, resolved)
Force-pushed from 75aee7d to 37686d6
Force-pushed from 219d1f5 to 0c5c3c0
ProExpertProg left a comment:
Just nits, LGTM.
Force-pushed from 0c5c3c0 to 2f4975f
mgoin left a comment:
@kylesayrs this kernel seems to add some decent bloat to the wheel.
Before: [2025-09-12T15:21:13Z] #31 0.650 Wheel dist/vllm-0.10.2rc3.dev22+ge776c4194.precompiled-cp38-abi3-linux_x86_64.whl is within the allowed size (396.15 MB).
After: [2025-09-11T15:04:08Z] #31 0.693 Wheel dist/vllm-0.10.2rc2.dev107+g2f4975fc8-cp38-abi3-linux_x86_64.whl is within the allowed size (416.34 MB).
Please take a look at reducing its impact. This is especially concerning since it seems only SM80 is being compiled in the CMake definition. Is it compiling PTX so it works on other arches too?
Do we need all of these configs? The kernel size doesn't seem trivial.
See the comments below.
csrc/ops.h (outdated)
Is this kernel able to be compiled on ROCm? If not, then it needs to be guarded to be CUDA-only.
`cuda_archs_loose_intersection(HADACORE_ARCHS "8.0" "${CUDA_ARCHS}")`
Is this CMake compilation check not enough?
I think the answer is no, but I'm not 100% sure.
None of the other CUDA-only kernels I see have any guards that I haven't already included, AFAICT.
The pre-commit failure is fixed if you merge from main.
@mgoin Hadacore was written for sm80, 89, and 90 only. However, as I understand it, the CUDA guard in ... I only see a difference of ...
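For illustration of the guard question discussed above (this is not the PR's actual code), one common pattern is a runtime check at the Python binding layer in addition to the CMake arch gating. The op path `torch.ops._C.hadacore_transform` and the use of vLLM's `current_platform` helper are assumptions here:

```python
# Hedged sketch: runtime guard for a CUDA-only custom op at the Python layer.
# The op name/registration path is assumed, not taken from the PR.
import torch
from vllm.platforms import current_platform  # vLLM's platform abstraction

def hadacore_transform(x: torch.Tensor) -> torch.Tensor:
    """Apply the in-place Hadacore transform; fail loudly on non-CUDA builds."""
    if not current_platform.is_cuda():
        # The kernel is only compiled for CUDA (sm80/89/90).
        raise NotImplementedError("hadacore_transform requires a CUDA build")
    torch.ops._C.hadacore_transform(x)  # declared as void hadacore_transform(torch::Tensor& x)
    return x
```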
Force-pushed from 2f4975f to e33a801
Purpose
[Tables: Dense Model Latency (sec), Quantized Model Latency (sec)]
Changes
- `void hadacore_transform(torch::Tensor& x)`
- `head_dim` (i.e., apply the transform to groups of activations)
- `qutlass_fp4_scheme`
Testing
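To make the "apply the transform to groups of activations of size `head_dim`" item concrete, here is a hedged reference sketch in plain PyTorch. The PR implements this as a CUDA kernel; the function name and shapes below are illustrative assumptions, not the PR's API.

```python
# Hedged reference sketch: per-group Hadamard transform over the last dimension,
# with the hidden dimension split into chunks of size head_dim.
import torch
from scipy.linalg import hadamard

def grouped_hadamard(x: torch.Tensor, head_dim: int) -> torch.Tensor:
    """Apply an orthonormal head_dim-sized Hadamard transform to each group."""
    H = torch.tensor(hadamard(head_dim), dtype=x.dtype, device=x.device)
    H = H / head_dim ** 0.5                      # make the transform orthonormal
    orig_shape = x.shape                         # last dim must be divisible by head_dim
    x = x.reshape(-1, orig_shape[-1] // head_dim, head_dim)
    return (x @ H).reshape(orig_shape)

x = torch.randn(2, 16, 256)                      # (batch, seq, hidden)
out = grouped_hadamard(x, head_dim=64)
assert out.shape == x.shape
```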