Skip workflow run if data is empty or the specified epoch_length is 0 #3690

Merged
7 commits merged into Project-MONAI:dev on Jan 21, 2022

Conversation

@Nic-Ma (Contributor) commented Jan 20, 2022

Fixes #3686.

Description

This PR enhances the workflow to skip the run if the data is empty or the specified epoch_length is 0.
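
Illustratively, the change amounts to an early return at the top of the workflow's run logic. The sketch below is a minimal stand-in, not the PR's actual implementation; the names `Workflow`, `data_loader`, and `epoch_length` only loosely mirror MONAI's engine API:

```python
import warnings


class Workflow:
    """Toy stand-in for a MONAI-style engine workflow (illustrative only)."""

    def __init__(self, data_loader, epoch_length=None):
        self.data_loader = data_loader
        # default epoch_length to the number of iterations in the loader
        self.epoch_length = len(data_loader) if epoch_length is None else epoch_length

    def run(self):
        # the guard this PR describes: skip the whole run when there is no work
        if self.epoch_length == 0:
            warnings.warn(
                "no data or epoch_length is 0, skipping the workflow run; "
                "in distributed training, skipping on only some ranks can hang "
                "collective ops such as all_gather."
            )
            return
        for _ in range(self.epoch_length):
            ...  # normal per-iteration logic would run here


# an empty loader now skips cleanly instead of erroring:
Workflow(data_loader=[]).run()
```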

Status

Ready

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

@Nic-Ma (Contributor, Author) commented Jan 20, 2022

/black

@Nic-Ma (Contributor, Author) commented Jan 20, 2022

/build

@Nic-Ma requested review from ericspod, rijobro, and wyli on January 20, 2022 at 16:27
@SachidanandAlle (Contributor) left a comment

LGTM. Should be good if we have a test case for multi-GPU as well.

@Nic-Ma (Contributor, Author) commented Jan 21, 2022

Hi @ericspod @SachidanandAlle,

Thanks for the review.
I spent much time today adding support for the multi-GPU training case. I tried many methods, but none can guarantee that our distributed communication logic works correctly with fewer ranks than configured; this is PyTorch/NCCL behavior, and we can't dynamically add or remove ranks. For example, some handlers or metrics need all ranks to run the same logic to all-gather the result, otherwise it will hang.
I have added unit tests for multi-GPU training and expanded the warning message for the hanging case.

Thanks.
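
For context on the hang described above, here is a minimal, self-contained sketch (not part of this PR) of why collectives deadlock when ranks diverge. It uses the gloo backend so it runs on CPU, and it blocks by design: rank 0 exits before the collective while rank 1 waits inside `all_gather`:

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # gloo backend so this runs on CPU; the same behavior applies to NCCL
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    if rank == 0:
        # pretend rank 0 got an empty dataset and skips its run:
        # it never reaches the collective below, so every other rank
        # blocks forever inside all_gather -- the hang the warning
        # message refers to.
        dist.destroy_process_group()
        return

    local = torch.tensor([float(rank)])
    gathered = [torch.zeros(1) for _ in range(world_size)]
    dist.all_gather(gathered, local)  # deadlocks: rank 0 never joins


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```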

@Nic-Ma (Contributor, Author) commented Jan 21, 2022

/black

@Nic-Ma (Contributor, Author) commented Jan 21, 2022

/build

@wyli merged commit e96dcca into Project-MONAI:dev on Jan 21, 2022
wyli pushed a commit that referenced this pull request on Jan 21, 2022: … is 0 (#3690)

* [DLMED] check 0 length

Signed-off-by: Nic Ma <nma@nvidia.com>

* [DLMED] add dist tests

Signed-off-by: Nic Ma <nma@nvidia.com>
Development

Successfully merging this pull request may close these issues:

  • Skip workflow run if no data provided
4 participants