@DNXie DNXie commented Oct 14, 2025

Solving #413

After this change, the reference model's checkpoint will be loaded only from initial_load_path. Users can still change the reference model by modifying initial_load_path.

Tested on grpo/main. ref_model now always loads from initial_load_path, regardless of the trainer's checkpoint config.
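
The override described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `CheckpointConfig`, `JobConfig`, and `build_reference_config` are hypothetical stand-ins for the real job config classes, and the default folder name and the example `hf://` path are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CheckpointConfig:
    # Hypothetical stand-in for the checkpoint section of the job config.
    enable: bool = True
    folder: str = "checkpoint"
    initial_load_path: Optional[str] = None


@dataclass
class JobConfig:
    # Hypothetical stand-in for the job config; only the checkpoint section is modeled.
    checkpoint: CheckpointConfig = field(default_factory=CheckpointConfig)


def build_reference_config(engine_config: dict) -> JobConfig:
    """Build a reference-model config whose checkpoint folder is always empty."""
    checkpoint = CheckpointConfig(**engine_config.get("checkpoint", {}))
    config = JobConfig(checkpoint=checkpoint)
    # Clear the folder so that, as described in this PR, the reference model
    # falls back to loading from initial_load_path.
    config.checkpoint.folder = ""
    return config


ref_config = build_reference_config(
    {"checkpoint": {"folder": "trainer_ckpt", "initial_load_path": "hf://some/model"}}
)
```

In this sketch the user-supplied `folder` is discarded while `initial_load_path` is passed through unchanged, which mirrors the behavior the description claims for ref_model.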

@meta-cla meta-cla bot added the CLA Signed label Oct 14, 2025
@DNXie DNXie marked this pull request as ready for review October 14, 2025 22:58
self.engine = ForgeEngine(ForgeJobConfig(**engine_config))
engine_config = ForgeJobConfig(**engine_config)
engine_config.checkpoint.folder = (
"" # hardcode to empty to force load from initial_load_path

Contributor

Where does ref model get the initial weights after the change?

Member Author

Contributor

I see. So maybe add a validator on the ref model's config that only allows fields that differ per model. For example, also put enable: true and initial_load_in_hf: true in the override together with the folder.
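
One way the validator suggestion above might look is sketched below. It is only an illustration under stated assumptions: the function name `resolve_ref_checkpoint`, the set of per-model fields, and the choice to raise `ValueError` on unsupported keys are all hypothetical and are not part of this PR or of any existing API.

```python
# Assumption: initial_load_path is the only checkpoint field that differs per model.
PER_MODEL_FIELDS = {"initial_load_path"}

# Values applied to the reference model no matter what the user config says.
FORCED_OVERRIDES = {"enable": True, "initial_load_in_hf": True, "folder": ""}


def resolve_ref_checkpoint(user_checkpoint: dict) -> dict:
    """Validate a reference-model checkpoint section and apply forced overrides."""
    unknown = set(user_checkpoint) - PER_MODEL_FIELDS - set(FORCED_OVERRIDES)
    if unknown:
        raise ValueError(
            f"Unsupported reference-model checkpoint fields: {sorted(unknown)}"
        )
    if "initial_load_path" not in user_checkpoint:
        raise ValueError("initial_load_path is required for the reference model")
    # Forced overrides come last so they win over any user-provided values.
    return {**user_checkpoint, **FORCED_OVERRIDES}


resolved = resolve_ref_checkpoint(
    {"initial_load_path": "hf://some/model", "folder": "trainer_ckpt"}
)
```

Here a user-supplied `folder` is accepted but silently replaced by the empty string, while any field outside the two sets is rejected outright. Whether to reject or silently ignore such fields is a design choice the thread leaves open.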

Contributor

Can you add a TODO to make this configurable in the future? Sometimes you do want to update your reference model, just very infrequently.

Contributor

@JenniferWang that would have to be on the titan side, right? Also, I think this module will be reusable for any "forward pass" model in the near future, not just for the reference model. So I don't know if it's worth doing anything more than this now.

Contributor

Can this not just be accomplished by setting this in the config?

  checkpoint:
    enable: true
    initial_load_path: hf://${model}
    folder: "" 

Member Author

@allenwang28 Yes, it is the same. But I thought it would be a little confusing to users.
@pbontrager If users want to change the reference model, they can just update initial_load_path in the config. Does that resolve some of your concerns?

@DNXie DNXie merged commit 9afb769 into meta-pytorch:main Oct 16, 2025
9 checks passed