Logs metrics on all distributed processes when using DPO & FSDP #1160

Merged: 1 commit, Jan 15, 2024

AjayP13 (Contributor) commented Dec 31, 2023

Logs metrics on all distributed processes when using FSDP.

All processes need access to the metrics for features like "Early Stopping" and "Load Best Model At End" to work. Otherwise, you get a KeyError on workers #1 through #N.
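
To make the failure mode concrete, here is a minimal, self-contained sketch that simulates several ranks in a single process. It is not the TRL or Transformers code: `ToyTrainer`, `store_metrics`, `pick_best`, and the metric name are hypothetical stand-ins, and the only assumption is that every rank looks up the tracked metric while only the main process stored it.

```python
from collections import defaultdict


class ToyTrainer:
    """Hypothetical stand-in for a trainer running on one distributed rank."""

    def __init__(self, rank, guard_main_process):
        self.rank = rank
        self.guard_main_process = guard_main_process
        self.stored = defaultdict(list)

    @property
    def is_main_process(self):
        return self.rank == 0

    def store_metrics(self, metrics):
        # With the guard enabled, only rank 0 keeps the metrics.
        if self.guard_main_process and not self.is_main_process:
            return
        for key, value in metrics.items():
            self.stored[key].append(value)

    def evaluate(self):
        # Average whatever was stored on this rank (empty dict if nothing was).
        return {k: sum(v) / len(v) for k, v in self.stored.items()}

    def pick_best(self, metric_name):
        # Mimics a "best model" style lookup that every rank performs.
        return self.evaluate()[metric_name]


def ranks_that_fail(world_size, guard_main_process, metric_name="eval_toy_metric"):
    """Return the ranks on which the metric lookup raises KeyError."""
    failed = []
    for rank in range(world_size):
        trainer = ToyTrainer(rank, guard_main_process)
        trainer.store_metrics({metric_name: 0.75})
        try:
            trainer.pick_best(metric_name)
        except KeyError:
            failed.append(rank)
    return failed


print(ranks_that_fail(4, guard_main_process=True))   # [1, 2, 3]
print(ranks_that_fail(4, guard_main_process=False))  # []
```

In this toy setup, storing metrics on every rank is enough to make the lookup succeed everywhere, which mirrors the behavior described above.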

There is no need for this guard, as the logging callbacks themselves check whether they are on the main process before printing metrics to the terminal.
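
The point about callbacks guarding their own output can be sketched as follows. This is an illustrative toy, not the actual callback implementation: `ToyState` and `ToyPrintCallback` are hypothetical, and the `is_world_process_zero` attribute name is used here only for illustration of a per-process "am I the main process?" flag.

```python
class ToyState:
    """Hypothetical stand-in for per-process trainer state."""

    def __init__(self, rank):
        self.is_world_process_zero = rank == 0


class ToyPrintCallback:
    """Hypothetical logging callback that decides for itself whether to emit."""

    def __init__(self):
        self.printed = []

    def on_log(self, state, logs):
        # Every rank calls on_log, but only the main process records output.
        if state.is_world_process_zero:
            self.printed.append(dict(logs))


def run(world_size):
    """Call the callback once per simulated rank and collect what each emitted."""
    emitted = {}
    for rank in range(world_size):
        callback = ToyPrintCallback()
        callback.on_log(ToyState(rank), {"loss": 0.5})
        emitted[rank] = callback.printed
    return emitted


print(run(3))  # {0: [{'loss': 0.5}], 1: [], 2: []}
```

Because the callback filters by process, passing metrics to it from every rank does not produce duplicate terminal output in this sketch.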

Log metrics on all distributed processes
lvwerra (Member) commented Jan 4, 2024

cc @kashif

lvwerra added the 🏋 DPO label (Related to DPO) on Jan 4, 2024

HuggingFaceDocBuilderDev commented

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

younesbelkada (Contributor) left a comment

Does this affect classic DPO on multi-GPU + distributed without FSDP? If not, I think we can safely merge it. Would you be able to quickly try that out, @AjayP13? 🙏

AjayP13 (Contributor, Author) commented Jan 8, 2024

@younesbelkada Yes, I tested this in distributed mode with FSDP enabled and with FSDP disabled (DDP, I believe), and both seemed to work fine.

younesbelkada (Contributor) left a comment

Thanks a lot @AjayP13 !

younesbelkada requested a review from kashif on January 9, 2024 05:45

younesbelkada (Contributor) left a comment

cc @kashif wdyt? Should be all good, I would say, no?

kashif (Collaborator) commented Jan 15, 2024

yes all good!

younesbelkada (Contributor) commented

Thanks @kashif @AjayP13 !

younesbelkada merged commit 97b9fa2 into huggingface:main on Jan 15, 2024
9 checks passed
lapp0 pushed a commit to lapp0/trl that referenced this pull request May 10, 2024
Log metrics on all distributed processes
Labels
🏋 DPO Related to DPO
5 participants