Conversation

@raulchen (Contributor) commented Jan 7, 2026

Summary

  • Fix gradient flow bug when reusing adapter slots after deletion
  • Rename update_adapter_config → init_lora_adapter and clear_adapter_config → clear_lora_adapter

Problem

When an adapter slot is cleared and reused, both lora_A and lora_B are zero, resulting in zero gradients during training.
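
To see why this stalls training: the LoRA contribution is x @ lora_A @ lora_B, so the gradient of each factor is proportional to the other, and with both factors at zero neither ever receives a non-zero update. Below is a minimal, self-contained JAX illustration of that fixed point; it is not the repository's code, and the shapes and toy objective are arbitrary.

```python
import jax
import jax.numpy as jnp

def loss_fn(params, x):
    # Toy objective: sum over the LoRA-adapted output x @ (W + A @ B).
    out = x @ (params["W"] + params["lora_A"] @ params["lora_B"])
    return jnp.sum(out)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (4, 16))
params = {
    "W": jnp.ones((16, 32)),
    # A cleared-and-reused slot: both LoRA factors are all zeros.
    "lora_A": jnp.zeros((16, 8)),
    "lora_B": jnp.zeros((8, 32)),
}

grads = jax.grad(loss_fn)(params, x)
# dL/dA is proportional to B and dL/dB is proportional to A, so both gradients
# are zero and no optimizer step can move the adapter off this fixed point.
print(float(jnp.abs(grads["lora_A"]).max()), float(jnp.abs(grads["lora_B"]).max()))  # 0.0 0.0

# With lora_A randomly re-initialized (and lora_B still zero), lora_B receives
# non-zero gradients on the very first step, so training can proceed.
params["lora_A"] = jax.nn.initializers.he_uniform()(key, (16, 8), jnp.float32)
grads = jax.grad(loss_fn)(params, x)
print(float(jnp.abs(grads["lora_B"]).max()) > 0)  # True
```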

Solution

init_lora_adapter now fully initializes the adapter (lora_A with he_uniform, lora_B with zeros) instead of assuming lora_A has valid random values.
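
A minimal sketch of what the full initialization looks like, assuming JAX; the function signature, shapes, and dtype handling below are illustrative rather than the PR's actual implementation:

```python
import jax
import jax.numpy as jnp

def init_lora_adapter(key, in_features, out_features, rank, dtype=jnp.float32):
    # lora_A gets he_uniform (Kaiming-uniform) random values so the adapter
    # produces non-zero gradients from the first training step; lora_B starts
    # at zero so the adapter contributes nothing until it has been trained.
    lora_A = jax.nn.initializers.he_uniform()(key, (in_features, rank), dtype)
    lora_B = jnp.zeros((rank, out_features), dtype=dtype)
    return lora_A, lora_B

# Reusing a slot: derive a fresh key from the model's rngs (or a config seed)
# and call the initializer again, rather than assuming lora_A is still random.
lora_A, lora_B = init_lora_adapter(jax.random.PRNGKey(0), 16, 32, rank=8)
```

Zero-initializing lora_B keeps the merged weights unchanged before training, while the random lora_A is what restores gradient flow on reuse.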

Test plan

  • Added test_adapter_reuse_initializes_lora_adapter to verify proper initialization on slot reuse (see the sketch after this list)
  • Ran existing LoRA training tests
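
A rough sketch of the scenario such a test exercises, using hypothetical stand-ins for clear_lora_adapter and init_lora_adapter that operate on plain arrays rather than the repository's actual model and test harness:

```python
import jax
import jax.numpy as jnp

def clear_lora_adapter(adapter):
    # Hypothetical stand-in: clearing a slot zeroes out both factors.
    return {name: jnp.zeros_like(value) for name, value in adapter.items()}

def init_lora_adapter(key, adapter):
    # Hypothetical stand-in for the fixed initializer: random lora_A, zero lora_B.
    a = adapter["lora_A"]
    return {
        "lora_A": jax.nn.initializers.he_uniform()(key, a.shape, a.dtype),
        "lora_B": jnp.zeros_like(adapter["lora_B"]),
    }

def test_adapter_reuse_initializes_lora_adapter():
    adapter = {"lora_A": jnp.zeros((16, 8)), "lora_B": jnp.zeros((8, 32))}
    adapter = clear_lora_adapter(adapter)                         # simulate slot deletion
    adapter = init_lora_adapter(jax.random.PRNGKey(0), adapter)   # reuse the slot
    # After reuse, lora_A must be non-zero so gradients can flow; lora_B stays zero.
    assert jnp.abs(adapter["lora_A"]).max() > 0
    assert bool(jnp.all(adapter["lora_B"] == 0))
```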


@gemini-code-assist (bot) left a comment

Code Review

This pull request effectively addresses a critical bug where LoRA adapter weights (lora_A) were not reinitialized upon reuse, causing training failures. The solution is robust: lora_A is now correctly reinitialized using he_uniform, and lora_B is explicitly zeroed out. The function update_adapter_config has been aptly renamed to reinitialize_adapter to better reflect its new, comprehensive functionality.

The necessary changes to support this fix, such as storing rngs on the model and updating the ModelForCausalLM protocol, have been implemented correctly. Additionally, the change to prevent loading rngs from checkpoints in load_safetensors is a valuable improvement for model state management.

The new test, test_adapter_reuse_reinitializes_lora_adapter, is well-constructed and thoroughly validates the fix by simulating the exact scenario of adapter reuse that was causing the issue.

Overall, this is a high-quality contribution that resolves a significant issue and improves the codebase's clarity and correctness. The changes are clean, well-tested, and follow best practices.

@pcmoritz added the tx label Jan 7, 2026
service_client, training_client = create_service_and_training_client(
    base_url=base_url,
    skip_verify=True,
)
Contributor (PR author) commented:


This also de-flakes the test, which was introduced in the last PR.

@pcmoritz (Collaborator) commented Jan 8, 2026

/gemini review


@gemini-code-assist (bot) left a comment

Code Review

This pull request effectively resolves a critical bug where reusing LoRA adapter slots would lead to zero gradients due to uninitialized weights. The core fix, which re-initializes lora_A with he_uniform within the newly renamed init_lora_adapter function, is well-implemented and correct. The accompanying changes, such as renaming functions for clarity and ensuring the model's RNGs are accessible, are logical and well-executed. The addition of a dedicated test case, test_adapter_reuse_initializes_lora_adapter, is excellent as it directly validates the fix. I've included a few minor suggestions to improve code consistency and readability. Overall, this is a high-quality contribution.

pcmoritz and others added 7 commits January 7, 2026 17:31
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@pcmoritz changed the title from "[tx] Fix lora_A not reinitialized after clearing lora adapter" to "[tx] Fix lora_A not reinitialized after clearing lora adapter and support seeds in LoraConfig" on Jan 8, 2026
@pcmoritz (Collaborator) commented Jan 8, 2026

/gemini review


@gemini-code-assist (bot) left a comment

Code Review

This pull request effectively addresses a critical bug where LoRA adapter weights were not re-initialized upon reuse, which could halt training. The solution of moving the weight initialization into the newly renamed init_lora_adapter function is correct and well-implemented. The renaming of functions improves code clarity, and the addition of a new test case, test_adapter_reuse_initializes_lora_adapter, is excellent for ensuring this bug does not regress. Furthermore, adding seed to LoraConfig for reproducibility is a great enhancement. I have a couple of minor suggestions to further improve code style and maintainability, but overall, this is a high-quality change.
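
For context on the seed enhancement mentioned above, here is a minimal illustrative sketch, assuming a dataclass-style config and JAX; the LoraConfig fields and the helper shown are assumptions for illustration, not the library's actual API:

```python
from dataclasses import dataclass

import jax
import jax.numpy as jnp

@dataclass
class LoraConfig:
    rank: int = 8
    alpha: float = 16.0
    seed: int | None = None  # optional seed for reproducible adapter init

def init_lora_A(config: LoraConfig, in_features: int, fallback_key: jax.Array) -> jax.Array:
    # If the config carries a seed, derive the init key from it so adapter
    # weights are reproducible across runs; otherwise use the model's own rngs.
    key = jax.random.PRNGKey(config.seed) if config.seed is not None else fallback_key
    return jax.nn.initializers.he_uniform()(key, (in_features, config.rank), jnp.float32)

# Two initializations with the same seeded config produce identical lora_A
# weights, regardless of the fallback key.
cfg = LoraConfig(seed=42)
a1 = init_lora_A(cfg, 16, jax.random.PRNGKey(0))
a2 = init_lora_A(cfg, 16, jax.random.PRNGKey(1))
assert bool(jnp.array_equal(a1, a2))
```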

pcmoritz and others added 2 commits January 7, 2026 18:43
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@pcmoritz merged commit 9796de6 into NovaSky-AI:main Jan 8, 2026
4 checks passed