Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(1) Add SHARP interface to M-CORE, (2) use send/recv to send train loss to the first rank instead of b-cast #7793

Merged
merged 2 commits into from
Jan 2, 2024

Conversation

erhoo82
Copy link
Collaborator

@erhoo82 erhoo82 commented Oct 24, 2023

What does this PR do ?

(1) Add SHARP interface to M-CORE in the communicator initialization.
(2) use send/recv to send train loss to the first rank instead of b-cast. This mitigates the communication overhead.

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@github-actions github-actions bot added the NLP label Oct 24, 2023
ericharper
ericharper previously approved these changes Oct 24, 2023
Copy link
Collaborator

@ericharper ericharper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

athitten
athitten previously approved these changes Oct 24, 2023
Copy link
Collaborator

@athitten athitten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

@athitten
Copy link
Collaborator

@erhoo82 just remembered we want to get rid of the torch broadcast in on_validation_epoch_end as well right ? here: torch.distributed.broadcast

Copy link
Contributor

github-actions bot commented Nov 8, 2023

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Nov 8, 2023
Copy link
Contributor

This PR was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this Nov 15, 2023
@erhoo82 erhoo82 reopened this Nov 30, 2023
@erhoo82
Copy link
Collaborator Author

erhoo82 commented Nov 30, 2023

Somehow this was closed. Re-opend.

@erhoo82
Copy link
Collaborator Author

erhoo82 commented Nov 30, 2023

@athitten
We need to keep the torch.distirbuted.broadcast in eval because this is the training termination condition for MLPerf.

@erhoo82
Copy link
Collaborator Author

erhoo82 commented Nov 30, 2023

Added some changes.

@github-actions github-actions bot removed the stale label Dec 1, 2023
@erhoo82
Copy link
Collaborator Author

erhoo82 commented Dec 5, 2023

jenkins

@erhoo82
Copy link
Collaborator Author

erhoo82 commented Dec 8, 2023

@ericharper @athitten
I think there is no issue with this PR right? Can we merge?

@@ -53,6 +53,7 @@ def _training_strategy(self) -> NLPDDPStrategy:
no_ddp_communication_hook=True,
gradient_as_bucket_view=self.cfg.model.gradient_as_bucket_view,
find_unused_parameters=False,
sharp=cfg.model.get('sharp', False),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@erhoo82 probably a typo. It should be self.cfg.model.get right ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks the running of this PR

@athitten
Copy link
Collaborator

athitten commented Dec 8, 2023

@ericharper @athitten I think there is no issue with this PR right? Can we merge?

Yes should be okay to merge once the CI passes. Also, there are some conflicts with the base branch.

@erhoo82 erhoo82 force-pushed the slym/mlperf_merge branch 3 times, most recently from 70343d8 to 2065563 Compare December 9, 2023 02:16
@erhoo82
Copy link
Collaborator Author

erhoo82 commented Dec 9, 2023

Thanks!
Fixed the typo and resolved the conflict.

@ericharper
Copy link
Collaborator

jenkins

ericharper
ericharper previously approved these changes Dec 10, 2023
Copy link
Collaborator

@ericharper ericharper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@ericharper
Copy link
Collaborator

jenkins

Copy link
Contributor

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Dec 30, 2023
…st rank

Signed-off-by: Sangkug Lym <slym@nvidia.com>

Add a default SHARP setting to arg list

Signed-off-by: Sangkug Lym <slym@nvidia.com>

cleanup

Signed-off-by: Sangkug Lym <slym@nvidia.com>
@erhoo82 erhoo82 force-pushed the slym/mlperf_merge branch 3 times, most recently from 0ef3363 to f4b5515 Compare January 2, 2024 01:18
Signed-off-by: Sangkug Lym <slym@nvidia.com>
@erhoo82 erhoo82 force-pushed the slym/mlperf_merge branch from f4b5515 to c12cee7 Compare January 2, 2024 01:31
@github-actions github-actions bot removed the stale label Jan 2, 2024
@erhoo82
Copy link
Collaborator Author

erhoo82 commented Jan 2, 2024

jenkins

@erhoo82 erhoo82 merged commit 7e1bf36 into main Jan 2, 2024
15 checks passed
@erhoo82 erhoo82 deleted the slym/mlperf_merge branch January 2, 2024 15:02
pzelasko pushed a commit to pzelasko/NeMo that referenced this pull request Jan 3, 2024
…ss to the first rank instead of b-cast (NVIDIA#7793)

* (1) SHARP for DP proc group, (2) Use send/recv loss_mean logging at 1st rank

Signed-off-by: Sangkug Lym <slym@nvidia.com>

Add a default SHARP setting to arg list

Signed-off-by: Sangkug Lym <slym@nvidia.com>

cleanup

Signed-off-by: Sangkug Lym <slym@nvidia.com>

* cleanup

Signed-off-by: Sangkug Lym <slym@nvidia.com>

---------

Signed-off-by: Sangkug Lym <slym@nvidia.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
@janEbert janEbert mentioned this pull request Jan 13, 2024
8 tasks
ssh-meister pushed a commit to ssh-meister/NeMo that referenced this pull request Feb 15, 2024
…ss to the first rank instead of b-cast (NVIDIA#7793)

* (1) SHARP for DP proc group, (2) Use send/recv loss_mean logging at 1st rank

Signed-off-by: Sangkug Lym <slym@nvidia.com>

Add a default SHARP setting to arg list

Signed-off-by: Sangkug Lym <slym@nvidia.com>

cleanup

Signed-off-by: Sangkug Lym <slym@nvidia.com>

* cleanup

Signed-off-by: Sangkug Lym <slym@nvidia.com>

---------

Signed-off-by: Sangkug Lym <slym@nvidia.com>
Signed-off-by: Sasha Meister <ameister@nvidia.com>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
…ss to the first rank instead of b-cast (NVIDIA#7793)

* (1) SHARP for DP proc group, (2) Use send/recv loss_mean logging at 1st rank

Signed-off-by: Sangkug Lym <slym@nvidia.com>

Add a default SHARP setting to arg list

Signed-off-by: Sangkug Lym <slym@nvidia.com>

cleanup

Signed-off-by: Sangkug Lym <slym@nvidia.com>

* cleanup

Signed-off-by: Sangkug Lym <slym@nvidia.com>

---------

Signed-off-by: Sangkug Lym <slym@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants