Skip to content

[BUG] RefModel incorrectly reuses trainer checkpoint due to shared default ./checkpoint folder #413

@DNXie

Description

@DNXie

🐛 Describe the bug

Currently, in grpo/main, when both the trainer and reference model (ref_model) are configured with checkpoints but without explicitly specifying folder, both default to ./checkpoint.

Example configuration:

trainer:
  checkpoint:
    enable: true
    initial_load_path: hf://${model}
    initial_load_in_hf: true
    last_save_in_hf: true
    interval: 500
    async_mode: "disabled"

ref_model:
  checkpoint:
    enable: true
    initial_load_path: hf://${model}
    initial_load_in_hf: true

Neither of trainer or ref_model specifies the folder. The default value is ./checkpoint. So the above config is equivalent to:

trainer.checkpoint.folder: ./checkpoint
ref_model.checkpoint.folder: ./checkpoint

If the trainer has saved checkpoints under ./checkpoint, Titan’s loading logic will ignore initial_load_path and instead automatically pick up the latest local checkpoint (e.g., ./checkpoint/step-x, where x is largest number it can find ).

As a result, the reference model unintentionally loads the trainer’s checkpoint weights instead of its own intended HF initialization.

Proposed Fix

Hardcode the reference model’s checkpoint folder to an empty string (folder="") in reference_model.py to ensure it bypasses Titan’s folder existence check and always loads from initial_load_path.

Alternative Plan

Alternatively, enforce a validation rule such that:

  • Instead of hardcoding, make the behavior explicit and user-configurable.
ref_model:
  checkpoint:
    enable: true
    folder: ""            # Empty folder = force load from initial_load_path
    initial_load_path: hf://${model}
    initial_load_in_hf: true
    load_step: -1         # Optional: allow user to load a specific checkpoint if desired

Although it introduced complexity and I don't see this flexibility is very needed in practice.

  • or issue a warning when both trainer and ref_model share the same checkpoint folder. But it can be easily ignored.

Versions

No response

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions