🐛 Describe the bug
Currently, in grpo/main, when both the trainer and the reference model (`ref_model`) are configured with checkpoints but without explicitly specifying `folder`, both default to `./checkpoint`.
Example configuration:
```yaml
trainer:
  checkpoint:
    enable: true
    initial_load_path: hf://${model}
    initial_load_in_hf: true
    last_save_in_hf: true
    interval: 500
    async_mode: "disabled"

ref_model:
  checkpoint:
    enable: true
    initial_load_path: hf://${model}
    initial_load_in_hf: true
```
Neither trainer nor ref_model specifies `folder`, and the default value is `./checkpoint`, so the above config is equivalent to:

```yaml
trainer.checkpoint.folder: ./checkpoint
ref_model.checkpoint.folder: ./checkpoint
```
If the trainer has already saved checkpoints under `./checkpoint`, Titan's loading logic will ignore `initial_load_path` and instead automatically pick up the latest local checkpoint (e.g., `./checkpoint/step-x`, where x is the largest step number it can find). As a result, the reference model unintentionally loads the trainer's checkpoint weights instead of its intended HF initialization.
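As a rough illustration, the resolution order described above behaves something like the sketch below (the function and the `step-*` folder layout are illustrative assumptions, not Titan's actual code):

```python
import os
import re

def resolve_checkpoint_source(folder: str, initial_load_path: str) -> str:
    """Illustrative sketch of the precedence described above: if the local folder
    contains any step-* checkpoints, the newest one wins and initial_load_path
    is ignored. This is not Titan's actual implementation."""
    if os.path.isdir(folder):
        steps = [
            int(m.group(1))
            for name in os.listdir(folder)
            if (m := re.fullmatch(r"step-(\d+)", name))
        ]
        if steps:
            # The latest local checkpoint shadows the HF initial load path.
            return os.path.join(folder, f"step-{max(steps)}")
    return initial_load_path

# With both models defaulting to ./checkpoint, once the trainer has written
# step-500, the ref_model resolves to ./checkpoint/step-500 instead of hf://<model>.
print(resolve_checkpoint_source("./checkpoint", "hf://<model>"))
```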
Proposed Fix
Hardcode the reference model's checkpoint folder to an empty string (`folder=""`) in reference_model.py so that it bypasses Titan's folder existence check and always loads from `initial_load_path`.
Alternative Plan
Alternatively, either make the behavior configurable or add a validation rule:
- Instead of hardcoding, make the behavior explicit and user-configurable:
  ```yaml
  ref_model:
    checkpoint:
      enable: true
      folder: ""        # empty folder = force load from initial_load_path
      initial_load_path: hf://${model}
      initial_load_in_hf: true
      load_step: -1     # optional: allow the user to load a specific checkpoint if desired
  ```
  That said, it introduces complexity, and I don't see this flexibility being much needed in practice.
- Or issue a warning when both trainer and ref_model share the same checkpoint folder (see the sketch below), though a warning can be easily ignored.
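For the warning variant, a minimal sketch of the check (the function name and call site are assumptions, not existing forge code):

```python
import logging

logger = logging.getLogger(__name__)

def validate_checkpoint_folders(trainer_folder: str, ref_model_folder: str) -> None:
    """Hypothetical validation: warn when the trainer and reference model resolve
    to the same checkpoint folder, since the ref model may then silently pick up
    the trainer's latest checkpoint instead of initial_load_path."""
    if trainer_folder == ref_model_folder:
        logger.warning(
            "trainer and ref_model share checkpoint folder %r; the reference model "
            "may load the trainer's latest checkpoint instead of initial_load_path.",
            trainer_folder,
        )

validate_checkpoint_folders("./checkpoint", "./checkpoint")
```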
Versions
No response