Mark invariant normalizer in Gemma as non-persistent #19788
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Summary of Changes
Hello @yhtang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request provides a targeted fix for an issue in vLLM's Gemma model implementations where the model's invariant normalizer was being corrupted by the dummy weight initialization process. By marking this specific buffer as non-persistent, the change ensures that the model can correctly load and utilize real weights after an initial dummy load, resolving a problem that caused erroneous outputs in certain configurations like RLHF setups.
Highlights
- Gemma Model Initialization Fix: Addresses a critical issue where the Gemma model's `normalizer` buffer, an invariant component, was being inadvertently scrambled by the dummy weight loader. This led to incorrect model behavior (producing gibberish) even after real weights were loaded, particularly in RLHF setups.
- Marking Invariant Buffers as Non-Persistent: Explicitly sets `persistent=False` when registering the `normalizer` buffer in the `__init__` methods of `GemmaModel`, `Gemma2Model`, and `Gemma3Model`. This prevents the dummy weight loader from modifying this invariant buffer, ensuring correct model functionality. (A minimal sketch of the change is shown below.)
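A minimal sketch of the fix, using a simplified class skeleton (the real vLLM classes take more constructor arguments; the buffer value `hidden_size**0.5` is taken from the review discussion later in this thread):

```python
import torch
import torch.nn as nn


class GemmaModel(nn.Module):
    """Simplified skeleton for illustration; not the full vLLM class."""

    def __init__(self, config):
        super().__init__()
        # The normalizer is derived from the config, not learned, so it
        # should never be loaded from (or saved to) a checkpoint.
        # persistent=False keeps it out of state_dict(), which shields it
        # from the state_dict-driven dummy weight loader.
        self.register_buffer(
            "normalizer",
            torch.tensor(config.hidden_size**0.5),
            persistent=False,
        )
```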
Code Review
This pull request addresses an issue where the Gemma embedding normalizer, an invariant part of the model, was being incorrectly modified by the dummy weight loader. This led to models producing incorrect outputs after being loaded with dummy weights and then real weights in RLHF setups.
The fix involves setting `persistent=False` when registering the `normalizer` buffer in `GemmaModel`, `Gemma2Model`, and `Gemma3Model`. This is the correct approach because buffers with `persistent=False` are not included in the model's `state_dict()`. Since the dummy weight loader (as pointed out in the description) iterates over `state_dict()`, this change effectively shields the normalizer from being altered by it.
The change is consistent across all affected Gemma model files and aligns with the practice in other vLLM models for handling invariant buffers. The PR description clearly explains the problem, the rationale for the fix, and also thoughtfully proposes a more robust long-term solution for handling model invariants, which is commendable.
Overall, this is a well-targeted and correct quick fix for the described problem. The code changes are minimal and directly address the issue.
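To illustrate the mechanism the review describes, here is a self-contained sketch (plain PyTorch, not vLLM code) showing that a buffer registered with `persistent=False` disappears from `state_dict()` and is therefore untouched by any loader that randomizes every `state_dict()` entry in place:

```python
import torch
import torch.nn as nn


class Toy(nn.Module):
    def __init__(self, hidden_size: int = 2048):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(hidden_size))
        # Persistent buffer: appears in state_dict() and gets overwritten
        # by a loader that fills every state_dict entry.
        self.register_buffer("norm_persistent", torch.tensor(hidden_size**0.5))
        # Non-persistent buffer: excluded from state_dict(), so a
        # state_dict-driven dummy loader never touches it.
        self.register_buffer(
            "norm_safe", torch.tensor(hidden_size**0.5), persistent=False
        )


model = Toy()
print(sorted(model.state_dict().keys()))
# ['norm_persistent', 'weight'] -- 'norm_safe' is absent

# What a dummy loader effectively does: randomize every state_dict tensor
# in place (state_dict tensors share storage with the module's own).
for tensor in model.state_dict().values():
    tensor.uniform_(-1, 1)

print(model.norm_persistent)  # scrambled
print(model.norm_safe)        # still hidden_size**0.5 (~45.2548)
```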
Force-pushed ec21f58 → a99ccb1.
/gemini review
Code Review
This pull request addresses an issue where Gemma models produce incorrect outputs after being loaded with dummy weights. The root cause was identified as the embedding normalizer, a model invariant, being inadvertently modified by the dummy weight loader because it was registered as a persistent buffer. The fix correctly marks this normalizer as non-persistent (`persistent=False`) in `gemma.py`, `gemma2.py`, and `gemma3.py`. This change prevents the normalizer from being included in the model's `state_dict`, thereby protecting it from the dummy weight loading process. This approach aligns with how other models in vLLM handle similar invariant buffers.
A new test suite, `test_gemma.py`, is introduced to specifically verify this fix. It ensures that the normalizer retains its correct, calculated value even when the model is loaded with `load_format="dummy"`.
My review primarily focuses on the new test, ensuring its robustness and correctness. The core code changes are straightforward and well-justified.
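The new test itself is not reproduced in this thread. As a rough illustration only, a test of this kind could look like the following; the model id, the `apply_model` return shape, and the `model.normalizer` attribute path are assumptions for the sketch, not the PR's actual code:

```python
import pytest
from transformers import AutoConfig
from vllm import LLM


@pytest.mark.parametrize("model", ["google/gemma-2b"])  # assumed model id
def test_dummy_loaded_normalizer(model: str) -> None:
    # Load with randomized dummy weights instead of a real checkpoint.
    llm = LLM(model=model, load_format="dummy")

    # apply_model is assumed here to run the callback on the underlying
    # torch module in each worker and return the per-worker results; the
    # `model.normalizer` buffer path is likewise an assumption.
    values = llm.apply_model(lambda m: m.model.normalizer.item())

    expected = AutoConfig.from_pretrained(model).hidden_size**0.5
    for value in values:
        # Loose rtol because the buffer may be stored in bf16
        # (see the tolerance discussion below).
        assert value == pytest.approx(expected, rel=1e-3)
```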
The relative tolerance `rtol=1e-3` seems a bit loose for comparing the normalizer value, which is described as a model invariant and calculated as `hidden_size**0.5`.
Given that the normalizer is initialized from a direct calculation (`torch.tensor(self.config.hidden_size**0.5)`) and should not be modified by the dummy loader (due to `persistent=False`), the value retrieved from the model should be very close to the re-calculated value.
Using a tighter tolerance, for example `rtol=1e-5` or `rtol=1e-6` (typical for float32 comparisons), would make the test more robust in verifying that the normalizer is indeed untouched and retains its precise original value. If a loose tolerance like `1e-3` is necessary for the test to pass, it might mask subtle issues or indicate unexpected floating-point discrepancies.
Consider tightening the tolerance to ensure the test accurately reflects the invariant nature of the normalizer.
The assertion will fail with `rtol=1e-4`. This is due to the tensor using lower-precision floats such as bf16.
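This is easy to check numerically. A small sketch (assuming `hidden_size=2048`, as in gemma-2b) shows that bf16 rounding alone introduces a relative error of roughly 1e-4, which breaks `rtol=1e-4` but sits comfortably inside `rtol=1e-3`:

```python
import torch

hidden_size = 2048  # gemma-2b's hidden size, used here for illustration
exact = hidden_size**0.5  # 45.2548...

# bf16 keeps only 8 mantissa bits, so representable values near 45.25
# are spaced 0.125 apart; the exact value gets rounded to 45.25.
as_bf16 = torch.tensor(exact, dtype=torch.bfloat16)
rel_err = abs(as_bf16.item() - exact) / exact
print(f"bf16 value: {as_bf16.item()}, relative error: {rel_err:.2e}")
# relative error is about 1.1e-4: fails rtol=1e-4, passes rtol=1e-3
```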
Force-pushed a99ccb1 → c390936.
DarkLight1337 left a comment:
Thanks for fixing, please resolve the pre-commit errors
Force-pushed 592d470 → 520eda9.
Thanks for reviewing! I've fixed the format errors, but it's now complaining again about a file that is not in this PR.
This PR contains the following changes:
1. Port the Gemma3 SLIDING_WINDOW FusedSDPA feature from habana_main, plus a few extra fixes:
   - Sliding FusedSDPA kernel: a threshold variable enables or disables the optimized kernel, which brings performance/memory benefits for longer sequences. An environment variable is provided to control this per customer request.
   - Based on the threshold, choose a different prompt bucket: if the sequence is smaller than the threshold, use PROMPT_BUCKET_STEP; otherwise use SLICE_SIZE.
   - Added mark_step before the sliding FusedSDPA is run.
   - Misc fixes for bucket-related issues.
2. Upstream fixes: vllm-project#18732, vllm-project#21479, vllm-project#19788.
3. Optimized Gemma3RMSNorm with FusedRMSNorm. Dependent on #1647.

Run command:
VLLM_FUSEDSDPA_SLIDE_THLD=2048 VLLM_EXPONENTIAL_BUCKETING=false VLLM_PROMPT_BS_BUCKET_MAX=64 VLLM_PROMPT_SEQ_BUCKET_STEP=1024 VLLM_PROMPT_SEQ_BUCKET_MAX=20480 PT_HPU_SDPA_QKV_SLICE_MODE_FWD=1
Prevents the Gemma embedding normalizer, which is a model invariant, from being scrambled by the dummy weight loader.
This is a quick fix for the problem where a vLLM Gemma model keeps producing gibberish in an RLHF setup (similar to the RLHF example) if it was initially loaded with dummy weights, even after we send in the real weights. It is due to the dummy weight loader traversing `state_dict` rather than `named_parameters` and inadvertently randomizing the normalizer, which is a model invariant rather than a trainable weight.
A quick scan of the model registry indicates that other models register similar invariant buffers with `persistent=False`, thus not triggering this issue.
A long-term solution would be to provide an API for marking a buffer as invariant, e.g. as sketched below, and have `initialize_dummy_weights` respect that flag.
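The original snippet for this proposal did not survive in this thread. A hypothetical sketch of what such an API could look like; the `register_invariant_buffer` helper, the `_invariant_buffers` set, and this `initialize_dummy_weights` body are all illustrative, not vLLM's actual code:

```python
import torch
import torch.nn as nn


def register_invariant_buffer(module: nn.Module, name: str,
                              value: torch.Tensor) -> None:
    """Register a buffer and tag it as a model invariant (hypothetical API)."""
    module.register_buffer(name, value)
    if not hasattr(module, "_invariant_buffers"):
        module._invariant_buffers = set()
    module._invariant_buffers.add(name)


def initialize_dummy_weights(model: nn.Module,
                             low: float = -1e-3,
                             high: float = 1e-3) -> None:
    """Randomize all parameters and buffers except tagged invariants (sketch)."""
    for module in model.modules():
        invariants = getattr(module, "_invariant_buffers", set())
        for name, buf in module.named_buffers(recurse=False):
            if name in invariants:
                continue  # leave model invariants untouched
            if buf.is_floating_point():
                buf.uniform_(low, high)
        for param in module.parameters(recurse=False):
            if param.is_floating_point():
                param.data.uniform_(low, high)
```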