
[Bug]: NVFP4A16 spurious warning that GPU doesn't support Fp4 #27471


Description

@mratsim

Your current environment

The output of `python collect_env.py`:

```text
vLLM API server version 0.11.1rc2.dev200+g250fb1b8e.d20251021
```

🐛 Describe the bug

When loading an NVFP4A16-quantized model, we get a spurious warning:

```text
(Worker_TP0 pid=201) WARNING 10-24 09:50:14 [marlin_utils_fp4.py:137] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
```

This is on 2x RTX Pro 6000 GPUs, which do support FP4.
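For reference, here is a quick probe (a sketch, assuming native FP4 tensor-core support corresponds to compute capability 10.0 or newer, i.e. Blackwell-class GPUs) showing both cards report a capable architecture:

```python
import torch

# Assumption: native FP4 tensor-core support requires a Blackwell-class
# GPU, i.e. compute capability 10.0 or newer (RTX Pro 6000 is sm_120).
for idx in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(idx)
    name = torch.cuda.get_device_name(idx)
    print(f"GPU {idx}: {name} (sm_{major}{minor}), native FP4: {major >= 10}")
```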

That warning comes from `marlin_utils_fp4.py`:

```python
def prepare_fp4_layer_for_marlin(layer: torch.nn.Module) -> None:
    logger.warning_once(
        "Your GPU does not have native support for FP4 computation but "
        "FP4 quantization is being used. Weight-only FP4 compression will "
        "be used leveraging the Marlin kernel. This may degrade "
        "performance for compute-heavy workloads."
    )
```

I believe this warning was supposed to be removed when the emulation path for NVFP4A16 was removed in #18000. If so, I can submit a simple PR to remove it.
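Alternatively, if the maintainers would rather keep the warning for GPUs that genuinely lack native FP4, a minimal sketch (assuming the compute-capability-10.0 threshold above, and `init_logger` from `vllm.logger`) could gate it on the detected hardware instead:

```python
import torch

from vllm.logger import init_logger

logger = init_logger(__name__)


def prepare_fp4_layer_for_marlin(layer: torch.nn.Module) -> None:
    # Sketch: only warn when the GPU actually lacks native FP4 support,
    # assumed here to mean compute capability below 10.0 (pre-Blackwell).
    major, _minor = torch.cuda.get_device_capability()
    if major < 10:
        logger.warning_once(
            "Your GPU does not have native support for FP4 computation but "
            "FP4 quantization is being used. Weight-only FP4 compression "
            "will be used leveraging the Marlin kernel. This may degrade "
            "performance for compute-heavy workloads."
        )
    ...
```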

Otherwise, does this mean a new Marlin kernel is planned that avoids software dequantization of NVFP4?

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
