Conversation

@raulchen (Contributor) commented Jan 7, 2026

Summary

  • Fix gradient flow bug when reusing adapter slots after deletion
  • Rename update_adapter_config → init_lora_adapter and clear_adapter_config → clear_lora_adapter

Problem

When an adapter slot is cleared and reused, both lora_A and lora_B are zero, resulting in zero gradients during training.
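
To see why this stalls training: the LoRA contribution is x @ lora_A @ lora_B, so the gradient of each factor is proportional to the other, and with both factors at zero neither ever receives a non-zero update. Below is a minimal, self-contained JAX illustration of that fixed point; it is not the repository's code, and the shapes and toy objective are arbitrary.

```python
import jax
import jax.numpy as jnp

def loss_fn(params, x):
    # Toy objective: sum over the LoRA-adapted output x @ (W + A @ B).
    out = x @ (params["W"] + params["lora_A"] @ params["lora_B"])
    return jnp.sum(out)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (4, 16))
params = {
    "W": jnp.ones((16, 32)),
    # A cleared-and-reused slot: both LoRA factors are all zeros.
    "lora_A": jnp.zeros((16, 8)),
    "lora_B": jnp.zeros((8, 32)),
}

grads = jax.grad(loss_fn)(params, x)
# dL/dA is proportional to B and dL/dB is proportional to A, so both gradients
# are zero and no optimizer step can move the adapter off this fixed point.
print(float(jnp.abs(grads["lora_A"]).max()), float(jnp.abs(grads["lora_B"]).max()))  # 0.0 0.0

# With lora_A randomly re-initialized (and lora_B still zero), lora_B receives
# non-zero gradients on the very first step, so training can proceed.
params["lora_A"] = jax.nn.initializers.he_uniform()(key, (16, 8), jnp.float32)
grads = jax.grad(loss_fn)(params, x)
print(float(jnp.abs(grads["lora_B"]).max()) > 0)  # True
```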

Solution

init_lora_adapter now fully initializes the adapter (lora_A with he_uniform, lora_B with zeros) instead of assuming lora_A has valid random values.
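
A minimal sketch of what the full initialization looks like, assuming JAX; the function signature, shapes, and dtype handling below are illustrative rather than the PR's actual implementation:

```python
import jax
import jax.numpy as jnp

def init_lora_adapter(key, in_features, out_features, rank, dtype=jnp.float32):
    # lora_A gets he_uniform (Kaiming-uniform) random values so the adapter
    # produces non-zero gradients from the first training step; lora_B starts
    # at zero so the adapter contributes nothing until it has been trained.
    lora_A = jax.nn.initializers.he_uniform()(key, (in_features, rank), dtype)
    lora_B = jnp.zeros((rank, out_features), dtype=dtype)
    return lora_A, lora_B

# Reusing a slot: derive a fresh key from the model's rngs (or a config seed)
# and call the initializer again, rather than assuming lora_A is still random.
lora_A, lora_B = init_lora_adapter(jax.random.PRNGKey(0), 16, 32, rank=8)
```

Zero-initializing lora_B keeps the merged weights unchanged before training, while the random lora_A is what restores gradient flow on reuse.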

Test plan

  • Added test_adapter_reuse_initializes_lora_adapter to verify proper initialization on slot reuse (see the sketch after this list)
  • Ran existing LoRA training tests
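
A rough sketch of the scenario such a test exercises, using hypothetical stand-ins for clear_lora_adapter and init_lora_adapter that operate on plain arrays rather than the repository's actual model and test harness:

```python
import jax
import jax.numpy as jnp

def clear_lora_adapter(adapter):
    # Hypothetical stand-in: clearing a slot zeroes out both factors.
    return {name: jnp.zeros_like(value) for name, value in adapter.items()}

def init_lora_adapter(key, adapter):
    # Hypothetical stand-in for the fixed initializer: random lora_A, zero lora_B.
    a = adapter["lora_A"]
    return {
        "lora_A": jax.nn.initializers.he_uniform()(key, a.shape, a.dtype),
        "lora_B": jnp.zeros_like(adapter["lora_B"]),
    }

def test_adapter_reuse_initializes_lora_adapter():
    adapter = {"lora_A": jnp.zeros((16, 8)), "lora_B": jnp.zeros((8, 32))}
    adapter = clear_lora_adapter(adapter)                         # simulate slot deletion
    adapter = init_lora_adapter(jax.random.PRNGKey(0), adapter)   # reuse the slot
    # After reuse, lora_A must be non-zero so gradients can flow; lora_B stays zero.
    assert jnp.abs(adapter["lora_A"]).max() > 0
    assert bool(jnp.all(adapter["lora_B"] == 0))
```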


@gemini-code-assist (bot) left a comment

Code Review

This pull request effectively addresses a critical bug where LoRA adapter weights (lora_A) were not reinitialized upon reuse, causing training failures. The solution is robust: lora_A is now correctly reinitialized using he_uniform, and lora_B is explicitly zeroed out. The function update_adapter_config has been aptly renamed to reinitialize_adapter to better reflect its new, comprehensive functionality.

The necessary changes to support this fix, such as storing rngs on the model and updating the ModelForCausalLM protocol, have been implemented correctly. Additionally, the change to prevent loading rngs from checkpoints in load_safetensors is a valuable improvement for model state management.

The new test, test_adapter_reuse_reinitializes_lora_adapter, is well-constructed and thoroughly validates the fix by simulating the exact scenario of adapter reuse that was causing the issue.

Overall, this is a high-quality contribution that resolves a significant issue and improves the codebase's clarity and correctness. The changes are clean, well-tested, and follow best practices.

@pcmoritz added the tx label Jan 7, 2026
service_client, training_client = create_service_and_training_client(
    base_url=base_url,
    skip_verify=True,
)
Contributor (PR author) commented:


This also de-flakes the test, which was introduced in the last PR.

@pcmoritz (Collaborator) commented Jan 8, 2026

/gemini review


@gemini-code-assist (bot) left a comment

Code Review

This pull request effectively resolves a critical bug where reusing LoRA adapter slots would lead to zero gradients due to uninitialized weights. The core fix, which re-initializes lora_A with he_uniform within the newly renamed init_lora_adapter function, is well-implemented and correct. The accompanying changes, such as renaming functions for clarity and ensuring the model's RNGs are accessible, are logical and well-executed. The addition of a dedicated test case, test_adapter_reuse_initializes_lora_adapter, is excellent as it directly validates the fix. I've included a few minor suggestions to improve code consistency and readability. Overall, this is a high-quality contribution.

pcmoritz and others added 7 commits January 7, 2026 17:31
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@pcmoritz changed the title from "[tx] Fix lora_A not reinitialized after clearing lora adapter" to "[tx] Fix lora_A not reinitialized after clearing lora adapter and support seeds in LoraConfig" on Jan 8, 2026
@pcmoritz (Collaborator) commented Jan 8, 2026

/gemini review


@gemini-code-assist (bot) left a comment

Code Review

This pull request effectively addresses a critical bug where LoRA adapter weights were not re-initialized upon reuse, which could halt training. The solution of moving the weight initialization into the newly renamed init_lora_adapter function is correct and well-implemented. The renaming of functions improves code clarity, and the addition of a new test case, test_adapter_reuse_initializes_lora_adapter, is excellent for ensuring this bug does not regress. Furthermore, adding seed to LoraConfig for reproducibility is a great enhancement. I have a couple of minor suggestions to further improve code style and maintainability, but overall, this is a high-quality change.
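
For context on the seed enhancement mentioned above, here is a minimal illustrative sketch, assuming a dataclass-style config and JAX; the LoraConfig fields and the helper shown are assumptions for illustration, not the library's actual API:

```python
from dataclasses import dataclass

import jax
import jax.numpy as jnp

@dataclass
class LoraConfig:
    rank: int = 8
    alpha: float = 16.0
    seed: int | None = None  # optional seed for reproducible adapter init

def init_lora_A(config: LoraConfig, in_features: int, fallback_key: jax.Array) -> jax.Array:
    # If the config carries a seed, derive the init key from it so adapter
    # weights are reproducible across runs; otherwise use the model's own rngs.
    key = jax.random.PRNGKey(config.seed) if config.seed is not None else fallback_key
    return jax.nn.initializers.he_uniform()(key, (in_features, config.rank), jnp.float32)

# Two initializations with the same seeded config produce identical lora_A
# weights, regardless of the fallback key.
cfg = LoraConfig(seed=42)
a1 = init_lora_A(cfg, 16, jax.random.PRNGKey(0))
a2 = init_lora_A(cfg, 16, jax.random.PRNGKey(1))
assert bool(jnp.array_equal(a1, a2))
```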

pcmoritz and others added 2 commits January 7, 2026 18:43
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@pcmoritz merged commit 9796de6 into NovaSky-AI:main Jan 8, 2026
4 checks passed