[Kernel] Add Kernel Support for NVFP4 #12519
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these: 🚀
    
        
          
tests/kernels/test_nvfp4_gemm.py (Outdated)
Needs correction when m is < 128
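(Context for readers: the later fix commit mentions "rounded m / n", which suggests the scale-factor layout pads the row count up to a multiple of 128. A minimal sketch of that rounding; the helper name and the 128-row tile size are assumptions, not the PR's actual code:)

```python
def round_up(x: int, multiple: int) -> int:
    # Round x up to the nearest multiple, e.g. round_up(64, 128) == 128.
    return ((x + multiple - 1) // multiple) * multiple

# Hypothetical: with m < 128, the swizzled block-scale tensor still needs
# storage for round_up(m, 128) rows, so tests must not assume exactly m rows.
m = 64
padded_m = round_up(m, 128)  # 128
```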
Exciting!!!!
Force-pushed from c0445c0 to af8205f.
            
          
vllm/_custom_ops.py (Outdated)
- Please call this cutlass_scaled_fp4_mm for naming consistency
- Please update the argument names to be consistent with cutlass_scaled_mm wherever possible
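(A minimal sketch of what the suggested rename might look like in vllm/_custom_ops.py; the parameter names and order are assumptions modeled on the existing cutlass_scaled_mm wrapper, not the PR's final signature:)

```python
import torch

def cutlass_scaled_fp4_mm(a: torch.Tensor, b: torch.Tensor,
                          block_scale_a: torch.Tensor,
                          block_scale_b: torch.Tensor,
                          alpha: torch.Tensor,
                          out_dtype: torch.dtype) -> torch.Tensor:
    # Hypothetical wrapper mirroring cutlass_scaled_mm's convention:
    # operands first, then their scales, then the output dtype.
    ...
```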
        
          
vllm/_custom_ops.py (Outdated)
workspace_bytes is unused?
        
          
vllm/_custom_ops.py (Outdated)
Probably better to have this in the C++?
        
          
vllm/_custom_ops.py (Outdated)
- This should be called 
scaled_fp4_quant - This should be next to 
scaled_fp8_quantbelow - I think we should create 
output_sfin this function (rather than have it be an argument). This will make the integration code more consistent withscaled_fp8_quantcode better and more consistent with - args should be called (
inputandscaleto be consistent withscaled_fp8_quant 
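(Putting those four suggestions together, a rough sketch of the proposed shape; the 2-values-per-byte packing, the 16-element scale blocks, and the torch.ops._C call are assumptions based on the NVFP4 format, not the PR's final code:)

```python
import torch

def scaled_fp4_quant(input: torch.Tensor,
                     scale: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    m, n = input.shape
    # FP4 packs two values per byte, so the quantized output holds n // 2
    # bytes per row; one FP8 scale factor covers each 16-element block.
    output = torch.empty((m, n // 2), dtype=torch.uint8, device=input.device)
    output_sf = torch.empty((m, n // 16), dtype=torch.float8_e4m3fn,
                            device=input.device)
    torch.ops._C.scaled_fp4_quant(output, input, output_sf, scale)
    return output, output_sf
```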
        
          
csrc/torch_bindings.cpp (Outdated)
move this next to cutlass_scaled_mm
        
          
csrc/torch_bindings.cpp (Outdated)
move this next to scaled_fp8_quant
Nice PR! Left some comments on the integration code. I will leave it to others to review the kernel.
        
          
tests/kernels/test_nvfp4_quant.py (Outdated)
what does the sf postfix stand for?
block scaling factor
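(For readers new to the format: NVFP4 stores one FP8 (E4M3) scale factor per contiguous block of 16 FP4 values, plus a global FP32 scale. A rough sketch of how such per-block scales could be derived, ignoring the kernel's actual rounding and swizzled layout:)

```python
import torch

FP4_MAX = 6.0  # largest magnitude representable in E2M1 FP4

def per_block_scales(x: torch.Tensor, block: int = 16) -> torch.Tensor:
    # One scale per block: the block's absolute max divided by FP4_MAX,
    # so dividing the block by its scale keeps values in the FP4 range.
    m, n = x.shape
    return x.abs().view(m, n // block, block).amax(dim=-1) / FP4_MAX
```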
This commit adds GEMMs for the NVFP4 datatype and quantization kernels to convert to NVFP4.
Co-authored-by: kahmadian@nvidia.com
Co-authored-by: kaixih@nvidia.com
Signed-off-by: Pavani Majety <pmajety@nvidia.com>

Correct usage of scaled_fp4_quant to use rounded m / n
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Force-pushed from af8205f to fdcf219.
Hi, we have decided to extract the FP4 quantization part into a separate PR. This PR will be based on it and will focus only on the FP4 GEMM.
| " Tensor! b, Tensor! block_scale_a," | ||
| " Tensor! block_scale_b, Tensor! gscale," | ||
| " Tensor! workspace, int workspace_bytes) -> ()"); | ||
| ops.impl("cutlass_scaled_fp4_mm", torch::kCUDA, &cutlass_scaled_fp4_mm); | 
I think we are missing a header definition for these in csrc/ops.h. I'm getting this compiler error:
/opt/vllm/vllm-src/csrc/torch_bindings.cpp: In function ‘void TORCH_LIBRARY_init__C(torch::Library&)’:
/opt/vllm/vllm-src/csrc/torch_bindings.cpp:390:52: error: ‘cutlass_scaled_fp4_mm’ was not declared in this scope; did you mean ‘cutlass_scaled_mm’?
  390 |   ops.impl("cutlass_scaled_fp4_mm", torch::kCUDA, &cutlass_scaled_fp4_mm);
      |                                                    ^~~~~~~~~~~~~~~~~~~~~
      |                                                    cutlass_scaled_mm
/opt/vllm/vllm-src/csrc/torch_bindings.cpp:397:47: error: ‘scaled_fp4_quant’ was not declared in this scope; did you mean ‘static_scaled_fp8_quant’?
  397 |   ops.impl("scaled_fp4_quant", torch::kCUDA, &scaled_fp4_quant);
      |                                               ^~~~~~~~~~~~~~~~
      |                                               static_scaled_fp8_quant
    ChooseWithHeuristic,

    // CTA configs for M=128
    CtaShape128x128x64B,
How were these CTA configs and the following ClusterShape selected? Is there reasoning behind the selection to achieve the best performance for various M, N, K shapes? Curious whether only three CTA configs can already achieve the best performance on sm100?
This pull request has merge conflicts that must be resolved before it can be merged.