(1) Add SHARP interface to M-CORE, (2) use send/recv to send train loss to the first rank instead of b-cast #7793
Conversation
LGTM. Thanks!
LGTM, thank you!
@erhoo82 just remembered we want to get rid of the
This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.
This PR was closed because it has been inactive for 7 days since being marked as stale.
Somehow this was closed. Re-opened.
@athitten
Force-pushed from b23e471 to 419ad62.
Added some changes.
jenkins
@ericharper @athitten
@@ -53,6 +53,7 @@ def _training_strategy(self) -> NLPDDPStrategy:
    no_ddp_communication_hook=True,
    gradient_as_bucket_view=self.cfg.model.gradient_as_bucket_view,
    find_unused_parameters=False,
    sharp=cfg.model.get('sharp', False),
@erhoo82 probably a typo. It should be self.cfg.model.get, right?
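The reviewer's point can be illustrated in isolation: inside the strategy method, `cfg` is undefined, while `self.cfg` is the config attached to the object. The `.get(key, default)` lookup (supported by both plain dicts and OmegaConf's `DictConfig`) falls back to a default when the key is absent, so older configs without a `sharp` entry keep working. The sketch below uses plain dicts and a hypothetical `Strategy` class for self-containment; it is not the actual NeMo code.

```python
# Minimal sketch of the self.cfg.model.get('sharp', False) lookup pattern.
# `Strategy` and the dict-based config are illustrative stand-ins.

class Strategy:
    def __init__(self, cfg):
        self.cfg = cfg  # config lives on the instance, hence `self.cfg`

    def sharp_enabled(self):
        # Mirrors self.cfg.model.get('sharp', False): default to False
        # when the key is missing, so old configs are unaffected.
        return self.cfg["model"].get("sharp", False)

print(Strategy({"model": {"sharp": True}}).sharp_enabled())  # True
print(Strategy({"model": {}}).sharp_enabled())               # False (default)
```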
This breaks the CI run for this PR.
Yes, should be okay to merge once the CI passes. Also, there are some conflicts with the base branch.
Force-pushed from 70343d8 to 2065563.
Thanks!
jenkins
LGTM. Thanks!
jenkins
This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.
…st rank
Signed-off-by: Sangkug Lym <slym@nvidia.com>
Add a default SHARP setting to arg list
Signed-off-by: Sangkug Lym <slym@nvidia.com>
cleanup
Signed-off-by: Sangkug Lym <slym@nvidia.com>
Force-pushed from 0ef3363 to f4b5515.
Force-pushed from f4b5515 to c12cee7.
jenkins
…ss to the first rank instead of b-cast (NVIDIA#7793)
* (1) SHARP for DP proc group, (2) Use send/recv loss_mean logging at 1st rank
  Signed-off-by: Sangkug Lym <slym@nvidia.com>
  Add a default SHARP setting to arg list
  Signed-off-by: Sangkug Lym <slym@nvidia.com>
  cleanup
  Signed-off-by: Sangkug Lym <slym@nvidia.com>
* cleanup
  Signed-off-by: Sangkug Lym <slym@nvidia.com>
Signed-off-by: Sangkug Lym <slym@nvidia.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>

…ss to the first rank instead of b-cast (NVIDIA#7793)
* (1) SHARP for DP proc group, (2) Use send/recv loss_mean logging at 1st rank
  Signed-off-by: Sangkug Lym <slym@nvidia.com>
  Add a default SHARP setting to arg list
  Signed-off-by: Sangkug Lym <slym@nvidia.com>
  cleanup
  Signed-off-by: Sangkug Lym <slym@nvidia.com>
* cleanup
  Signed-off-by: Sangkug Lym <slym@nvidia.com>
Signed-off-by: Sangkug Lym <slym@nvidia.com>
Signed-off-by: Sasha Meister <ameister@nvidia.com>

…ss to the first rank instead of b-cast (NVIDIA#7793)
* (1) SHARP for DP proc group, (2) Use send/recv loss_mean logging at 1st rank
  Signed-off-by: Sangkug Lym <slym@nvidia.com>
  Add a default SHARP setting to arg list
  Signed-off-by: Sangkug Lym <slym@nvidia.com>
  cleanup
  Signed-off-by: Sangkug Lym <slym@nvidia.com>
* cleanup
  Signed-off-by: Sangkug Lym <slym@nvidia.com>
Signed-off-by: Sangkug Lym <slym@nvidia.com>
What does this PR do?
(1) Adds a SHARP interface to M-CORE in the communicator initialization.
(2) Uses send/recv to send the train loss to the first rank instead of a broadcast. This mitigates the communication overhead.
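The intuition behind change (2) can be sketched with a toy message-count model. This is an illustrative single-process simulation of the communication pattern, not the actual torch.distributed code in this PR: with a broadcast, the loss-producing rank's value is delivered to every rank even though only rank 0 logs it, whereas a point-to-point send moves exactly one message. All function names and the world size below are illustrative.

```python
# Toy model: count messages needed to get the mean train loss to rank 0
# for logging. Real collectives use tree/ring algorithms, but a broadcast
# still touches every rank, while send/recv touches exactly two.

def broadcast_messages(world_size: int) -> int:
    """Naive linear broadcast: the source sends to every other rank."""
    return world_size - 1

def send_recv_messages(src_rank: int) -> int:
    """Point-to-point: one message from src to rank 0, or none if src is 0."""
    return 0 if src_rank == 0 else 1

world_size = 8
src = world_size - 1  # e.g. the last pipeline stage computes the loss
print(broadcast_messages(world_size))  # 7 messages cross the network
print(send_recv_messages(src))         # 1 message crosses the network
```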
Changelog
Usage
# Add a code snippet demonstrating how to use this
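A hedged sketch of how the new flag could be enabled. The `model.sharp` key matches the `cfg.model.get('sharp', False)` lookup in this PR's diff; the surrounding config layout is an assumed, typical NeMo-style excerpt, not taken from this PR.

```yaml
# Hypothetical training-config excerpt; only the `sharp` key is added by this PR.
model:
  gradient_as_bucket_view: true
  sharp: true   # enable SHARP for the data-parallel process group
                # (defaults to false when the key is absent)
```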
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs to various areas.
Additional Information