Description
Your current environment
The output of python collect_env.py (version line only):
vLLM API server version 0.11.1rc2.dev200+g250fb1b8e.d20251021
🐛 Describe the bug
When loading a NVFP4A16 quantized model we get a spurious warning
(Worker_TP0 pid=201) WARNING 10-24 09:50:14 [marlin_utils_fp4.py:137] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
This is using 2x RTX Pro 6000, which do support FP4.
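For context, a minimal repro sketch of how the model is loaded; the checkpoint name is a placeholder for any NVFP4A16 (weight-only FP4) model, not the exact one used here:

from vllm import LLM

# Hypothetical NVFP4A16 checkpoint; any weight-only FP4 model triggers the warning.
llm = LLM(
    model="some-org/some-model-NVFP4A16",
    tensor_parallel_size=2,  # 2x RTX Pro 6000, as in this report
)
# The spurious marlin_utils_fp4.py warning is emitted while the weights are loaded.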
That warning comes from vllm/vllm/model_executor/layers/quantization/utils/marlin_utils_fp4.py, lines 136 to 142 (commit 3567816):
def prepare_fp4_layer_for_marlin(layer: torch.nn.Module) -> None:
    logger.warning_once(
        "Your GPU does not have native support for FP4 computation but "
        "FP4 quantization is being used. Weight-only FP4 compression will "
        "be used leveraging the Marlin kernel. This may degrade "
        "performance for compute-heavy workloads."
    )
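For reference, a minimal sketch (not the actual vLLM logic) of gating the warning on the device's compute capability; the SM threshold is an assumption that native FP4 tensor-core support starts with Blackwell-class GPUs:

import torch

def has_native_fp4() -> bool:
    # Assumption: SM 10.x (Blackwell datacenter) and SM 12.x (e.g. RTX Pro 6000)
    # expose native FP4 compute; older architectures do not.
    major, _minor = torch.cuda.get_device_capability()
    return major >= 10

def maybe_warn_fp4_fallback(logger) -> None:
    # Only warn when FP4 actually has to fall back to the Marlin weight-only path.
    if not has_native_fp4():
        logger.warning(
            "GPU lacks native FP4 support; weight-only FP4 compression via "
            "the Marlin kernel will be used."
        )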
I believe that warning was supposed to be removed when the emulation path for NVFP4A16 was removed in #18000. If so, I can submit a simple PR to remove it.
Otherwise, does it mean a new Marlin kernel is planned that avoids software-dequantizing NVFP4?
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.