-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Logs metrics on all distributed processes when using DPO & FSDP #1160
Conversation
Log metrics on all distributed processes
cc @kashif |
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this affect classic DPO on multi-GPU + distributed without FSDP? If not, I think we can safely merge it, would you be able to quickly try that out @AjayP13 ? 🙏
@younesbelkada Yes, I tested this distributed with FSDP enabled and distributed with FSDP disabled (DDP, I believe) and both seemed to work fine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @AjayP13 !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @kashif wdyt? should be all good I would say no?
yes all good! |
Log metrics on all distributed processes
Logs metrics on all distributed processes when using FSDP.
All processes need access to the metrics in order for things like "Early Stopping" / "Load Best Model At End" to work. You get a
KeyError
on workers #1 - #N, otherwise.There is no need for this guard as the logging callbacks themselves check if they are the main process before printing metrics to the terminal.