
Conversation

@leejnau (Contributor) commented Sep 25, 2025

Purpose

Fix a bug where the wrong dictionary key was used to look up the quantization layers to be ignored (non-quantized layers) in the Hugging Face config.json. In the legacy hf_quant_config.json file the key is "exclude_modules"; in the more modern in-place config.json the key is "ignore".
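
For reference, here is a rough sketch of where the two keys live in each file. This is an illustrative subset only; the field values and the "lm_head" entry are assumptions for the example, not taken from any specific checkpoint:

# Legacy hf_quant_config.json: settings are nested under "quantization",
# and non-quantized layers are listed under "exclude_modules".
legacy_hf_quant_config = {
    "quantization": {
        "quant_algo": "FP8",
        "kv_cache_quant_algo": "FP8",
        "exclude_modules": ["lm_head"],  # assumed example entry
    }
}

# Modern config.json: settings are nested under "quantization_config",
# and non-quantized layers are listed under "ignore".
modern_config = {
    "quantization_config": {
        "quant_algo": "FP8",
        "kv_cache_quant_algo": "FP8",
        "ignore": ["lm_head"],  # assumed example entry
    }
}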

Test Plan

Tested the following models with the prompt '<s> The capital of France is Paris. The capital of the United States is Washington, D.C.' (a minimal invocation sketch follows the list):

nvidia/Qwen3-30B-A3B-FP4
nvidia/Phi-4-reasoning-plus-FP4
nvidia/Llama-3.1-8B-Instruct-FP8
nvidia/Phi-4-multimodal-instruct-FP8
RedHatAI/phi-4-FP8-dynamic
RedHatAI/Apertus-8B-Instruct-2509-FP8-dynamic
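
The PR description does not include the exact command used; the following is a minimal offline-inference sketch of how such a smoke test could be run with vLLM. The model choice and sampling settings are placeholders, not taken from the PR:

from vllm import LLM, SamplingParams

# Any of the checkpoints listed above could be substituted here.
llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8")

prompt = "<s> The capital of France is Paris. The capital of the United States is Washington, D.C."
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate([prompt], params)
print(outputs[0].prompt + outputs[0].outputs[0].text)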

Test Result

All models loaded and ran successfully, producing reasonable output:

nvidia/Qwen3-30B-A3B-FP4 : <s> The capital of France is Paris. The capital of the United States is Washington, D.C. The capital of Brazil is Brasília. The capital of Canada is Ottawa. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Japan is Tokyo. The capital of South Korea is Seoul. The capital of the United Kingdom

nvidia/Phi-4-reasoning-plus-FP4 : <s> The capital of France is Paris. The capital of the United States is Washington, D.C. I. I. .... ........................................

nvidia/Llama-3.1-8B-Instruct-FP8 : <s> The capital of France is Paris. The capital of the United States is Washington, D.C. The capital of the United Kingdom is London. The capital of China is Beijing. The capital of Japan is Tokyo. The capital of India is New Delhi. The capital of Brazil is Brasília. The capital of Russia is Moscow. The capital of Canada

nvidia/Phi-4-multimodal-instruct-FP8 : <s> The capital of France is Paris. The capital of the United States is Washington, D.C. The capital of Japan is Tokyo. The capital of Australia is Canberra. The capital of Brazil is Brasília. The capital of India is New Delhi. The capital of Canada is Ottawa. The capital of Germany is Berlin. The capital of Italy is Rome.

RedHatAI/phi-4-FP8-dynamic : <s> The capital of France is Paris. The capital of the United States is Washington, D.C. The capital of Japan is Tokyo. The capital of Brazil is Brasilia. The capital of Australia is Canberra. The capital of Canada is Ottawa. The capital of India is New Delhi. The capital of China is Beijing. The capital of Russia is Moscow

RedHatAI/Apertus-8B-Instruct-2509-FP8-dynamic : <s> The capital of France is Paris. The capital of the United States is Washington, D.C. The capital of Canada is Ottawa. The capital of Australia is Canberra. The capital of Brazil is Brasília. The capital of Mexico is Mexico City. The capital of Germany is Berlin. The capital of the United Kingdom is London. The capital of Italy


Signed-off-by: Lee Nau <lnau@nvidia.com>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@gemini-code-assist (bot, Contributor) left a comment


Code Review

This pull request correctly fixes a bug in parsing quantization configurations by using the ignore key instead of exclude_modules for modern Hugging Face config.json files. The changes are applied to both ModelOptFp8Config and ModelOptNvFp4Config. My review focuses on improving the robustness of this configuration parsing. I've suggested falling back to the legacy key if the new key is not present to prevent silent failures and improve user experience.

  kv_cache_quant_method = config.get("kv_cache_quant_algo")
- exclude_modules = config.get("exclude_modules")
+ # "ignore" is the key in config.json
+ exclude_modules = config.get("ignore")
Contributor


Severity: high

For better robustness and to avoid silent failures, it's a good practice to support both the new key (ignore) and the legacy key (exclude_modules), with the new key taking precedence. This prevents issues if a user provides a modern config but mistakenly uses the old key.

Suggested change
- exclude_modules = config.get("ignore")
+ exclude_modules = config.get("ignore", config.get("exclude_modules"))

Contributor


Let's adopt this suggestion.

Contributor Author


Ok, are there conditions under which the old key would be used in the new file? Maybe it would be better to enforce only the new key in the new file?

@Edwardf0t1 (Contributor) commented Sep 26, 2025


We need to handle the fallback case for hf_quant_config.json. Is it not handled here?

Contributor Author


The fallback case is actually handled by the initial if-statement. The presence of the key "quantization" is the condition in that if-statement, and that key only exists in the legacy hf_quant_config.json file. For instance:
https://huggingface.co/nvidia/Qwen3-30B-A3B-FP4/blob/main/hf_quant_config.json#L6

The key for quantization in the config.json file is "quantization_config". For instance: https://huggingface.co/nvidia/Qwen3-30B-A3B-FP4/blob/main/config.json#L38

So the existing logic here is entirely based upon the differing key names in those two files (hf_quant_config.json and config.json).

I tried to indicate this with the comments I left in the code above the key checks.
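
A condensed sketch of the branching described above (the helper name is hypothetical and the real vLLM code handles additional fields; this only illustrates which key is read on which path):

def _get_exclude_modules(config: dict) -> list[str]:
    if "quantization" in config:
        # Legacy hf_quant_config.json: settings are nested under "quantization"
        # and the non-quantized layers use the "exclude_modules" key.
        return config["quantization"].get("exclude_modules", [])
    # Modern config.json path: the non-quantized layers use the "ignore" key.
    return config.get("ignore", [])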


exclude_modules = config.get("exclude_modules", [])
# "ignore" is the key in config.json
exclude_modules = config.get("ignore", [])
Contributor


Severity: high

To improve robustness and prevent silent configuration errors, consider supporting both the new ignore key and the legacy exclude_modules key. Prioritizing ignore while falling back to exclude_modules ensures that user configurations with the old key in a modern format are still handled correctly.

Suggested change
- exclude_modules = config.get("ignore", [])
+ exclude_modules = config.get("ignore", config.get("exclude_modules", []))

Contributor


Let's adopt this suggestion.

@pavanimajety (Collaborator) left a comment


LGTM, thanks!
If you have the logs for the results, please add them to the PR description.

  kv_cache_quant_method = config.get("kv_cache_quant_algo")
- exclude_modules = config.get("exclude_modules")
+ # "ignore" is the key in config.json
+ exclude_modules = config.get("ignore")
Contributor


Let's adopt this suggestion.


- exclude_modules = config.get("exclude_modules", [])
+ # "ignore" is the key in config.json
+ exclude_modules = config.get("ignore", [])
Contributor


Let's adopt this suggestion.

@DarkLight1337 requested review from Isotr0py and hmellor and removed the request for mgoin on September 26, 2025 07:23
Comment on lines +728 to 729
# "exclude_modules" is the key in the legacy hf_quant_config.json
exclude_modules = quant_config.get("exclude_modules", [])

@Isotr0py (Member) commented Sep 26, 2025


Suggested change
  # "exclude_modules" is the key in the legacy hf_quant_config.json
- exclude_modules = quant_config.get("exclude_modules", [])
+ exclude_modules = quant_config.get("ignore", config.get("exclude_modules", []))

I think we should also modify this line.

@cjackal (Contributor) commented Sep 29, 2025

This PR would fix most FP8 checkpoints including meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8.

@Isotr0py added the ready (ONLY add when PR is ready to merge/full CI is needed) label on Sep 29, 2025

@mgoin (Member) commented Sep 29, 2025

@leejnau @cjackal I don't understand why this would affect non-modelopt checkpoints? The meta-llama and RedHatAI fp8 checkpoints use https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_fp8.py

@mgoin merged commit d5ab285 into vllm-project:main on Sep 29, 2025
61 checks passed
pdasigi pushed a commit to pdasigi/vllm that referenced this pull request Oct 2, 2025
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
tomeras91 pushed a commit to tomeras91/vllm that referenced this pull request Oct 6, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
