🐛 Describe the bug
Currently, in grpo/main, when both the trainer and the reference model (`ref_model`) are configured with checkpoints but without explicitly specifying `folder`, both default to `./checkpoint`.
Example configuration:
```yaml
trainer:
  checkpoint:
    enable: true
    initial_load_path: hf://${model}
    initial_load_in_hf: true
    last_save_in_hf: true
    interval: 500
    async_mode: "disabled"

ref_model:
  checkpoint:
    enable: true
    initial_load_path: hf://${model}
    initial_load_in_hf: true
```
Neither trainer nor ref_model specifies `folder`, and the default value is `./checkpoint`, so the above config is equivalent to:

```yaml
trainer.checkpoint.folder: ./checkpoint
ref_model.checkpoint.folder: ./checkpoint
```
If the trainer has already saved checkpoints under `./checkpoint`, Titan's loading logic will ignore `initial_load_path` and instead automatically pick up the latest local checkpoint (e.g., `./checkpoint/step-x`, where x is the largest step number it can find). As a result, the reference model unintentionally loads the trainer's checkpoint weights instead of its intended HF initialization.
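As a rough illustration, the resolution order described above behaves something like the sketch below (the function and the `step-*` folder layout are illustrative assumptions, not Titan's actual code):

```python
import os
import re

def resolve_checkpoint_source(folder: str, initial_load_path: str) -> str:
    """Illustrative sketch of the precedence described above: if the local folder
    contains any step-* checkpoints, the newest one wins and initial_load_path
    is ignored. This is not Titan's actual implementation."""
    if os.path.isdir(folder):
        steps = [
            int(m.group(1))
            for name in os.listdir(folder)
            if (m := re.fullmatch(r"step-(\d+)", name))
        ]
        if steps:
            # The latest local checkpoint shadows the HF initial load path.
            return os.path.join(folder, f"step-{max(steps)}")
    return initial_load_path

# With both models defaulting to ./checkpoint, once the trainer has written
# step-500, the ref_model resolves to ./checkpoint/step-500 instead of hf://<model>.
print(resolve_checkpoint_source("./checkpoint", "hf://<model>"))
```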
Proposed Fix
Hardcode the reference model's checkpoint folder to an empty string (`folder=""`) in reference_model.py so that it bypasses Titan's folder existence check and always loads from `initial_load_path`.
Alternative Plan
Alternatively, either make the behavior configurable or add a validation rule:
- Instead of hardcoding, make the behavior explicit and user-configurable:
  ```yaml
  ref_model:
    checkpoint:
      enable: true
      folder: ""        # empty folder = force load from initial_load_path
      initial_load_path: hf://${model}
      initial_load_in_hf: true
      load_step: -1     # optional: allow the user to load a specific checkpoint if desired
  ```
  That said, it introduces complexity, and I don't see this flexibility being much needed in practice.
- Or issue a warning when both trainer and ref_model share the same checkpoint folder (see the sketch below), though a warning can be easily ignored.
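For the warning variant, a minimal sketch of the check (the function name and call site are assumptions, not existing forge code):

```python
import logging

logger = logging.getLogger(__name__)

def validate_checkpoint_folders(trainer_folder: str, ref_model_folder: str) -> None:
    """Hypothetical validation: warn when the trainer and reference model resolve
    to the same checkpoint folder, since the ref model may then silently pick up
    the trainer's latest checkpoint instead of initial_load_path."""
    if trainer_folder == ref_model_folder:
        logger.warning(
            "trainer and ref_model share checkpoint folder %r; the reference model "
            "may load the trainer's latest checkpoint instead of initial_load_path.",
            trainer_folder,
        )

validate_checkpoint_folders("./checkpoint", "./checkpoint")
```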
Versions
No response