Initial version for multinode auto_runner and ensembler #6272
Conversation
Hi @heyufan1995 , I have a question that confuses me quite a bit: what benefit do we expect from using multiple nodes to run the ensemble, given that the ensemble execution is basically a for-loop like the one below?
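The snippet the comment refers to was not captured here; the following is an illustrative, self-contained sketch of the file-by-file pattern being described (the models, files, and mean-fusion are hypothetical stand-ins, not the actual `AlgoEnsemble` code):

```python
import torch

# Hypothetical stand-ins: five "fold models" and three "files" as tensors.
models = [torch.nn.Conv3d(1, 2, 3, padding=1) for _ in range(5)]
files = [torch.randn(1, 1, 32, 32, 32) for _ in range(3)]

for image in files:  # outer loop: one input file at a time
    with torch.no_grad():
        # inner loop: every fold model infers on this file
        preds = [m(image).softmax(dim=1) for m in models]
    fused = torch.stack(preds).mean(dim=0)  # simple mean ensemble
    print(fused.shape)  # one fused prediction per file
```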
@mingxin-zheng The files are partitioned across all nodes and GPUs. So if you have 16 files and 2 nodes with 16 GPUs total, each GPU sequentially runs the 5 fold models on 1 file. I also think the for-loop should be inverted so that the models form the outer loop (see the sketch below), since loading model weights can be slow.
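A sketch of what that inversion could look like, with the file list partitioned across ranks so each GPU handles its own shard. `all_files`, `checkpoints`, `load_model`, and `load_image` are hypothetical placeholders; `partition_dataset` is MONAI's utility for splitting a list into shards:

```python
import torch
import torch.distributed as dist
from monai.data import partition_dataset

# Assumes the process group was initialized by the launcher (e.g. torchrun).
rank, world_size = dist.get_rank(), dist.get_world_size()

# With 16 files on 2 nodes x 8 GPUs each, every rank gets exactly one file.
my_files = partition_dataset(data=all_files, num_partitions=world_size)[rank]

preds = {f: [] for f in my_files}
for ckpt in checkpoints:          # outer loop: load each fold model only once
    model = load_model(ckpt)      # weight loading is the slow step
    with torch.no_grad():
        for f in my_files:        # inner loop: run it over this rank's files
            preds[f].append(model(load_image(f)))
# per-file predictions are fused after all models have run
```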
Hi @heyufan1995 , thanks for the explanation. Your proposal also makes sense. One concern I had was that memory would then need to hold (n_algos x n_files) predictions before the ensemble takes place, which is why I went one file at a time. We may need to be careful and do something when there are many files to infer. By the way,
Thanks, we can keep the current for-loop. It's safer.
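To put rough numbers on the memory concern above (illustrative sizes, not figures from this PR), holding all predictions at once scales as n_algos x n_files:

```python
# Back-of-envelope: 5 fold models x 100 files, each prediction a
# 2-channel 512^3 float32 volume (~1 GiB per prediction).
n_algos, n_files = 5, 100
bytes_per_pred = 2 * 512**3 * 4  # channels * voxels * sizeof(float32)
total_gib = n_algos * n_files * bytes_per_pred / 1024**3
print(f"~{total_gib:.0f} GiB held in memory")  # ~500 GiB
```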
/black
Thanks @heyufan1995 for the PR. The design looks good to me, as it minimizes the number of breaking changes. There are some format issues and test failures to fix. CC @wyli for visibility.
/integration-test
/build
Integration verified: https://github.com/Project-MONAI/MONAI/actions/runs/4699028053/jobs/8332016913. I'm merging this to unblock benchmarking tasks.
Fixes #6191
Fixes #6259
Description
Big changes to AutoRunner to enable multi-node training and a multi-node, multi-GPU ensembler. Multiple changes are included.
This passed some quick local testing; details still need fixing and tests still need to be written. I created this PR for an initial discussion of the design pattern. Slack me if there are any major concerns about the change.
@mingxin-zheng @wyli
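For context, a minimal single-node AutoRunner invocation looks roughly like this; the multi-node and ensembler options this PR adds are not shown, since their exact names aren't captured in this thread, and `task_input.yaml` is a placeholder for the data-source config:

```python
from monai.apps.auto3dseg import AutoRunner

# "task_input.yaml" points to the datalist/dataroot/modality config.
runner = AutoRunner(work_dir="./work_dir", input="./task_input.yaml")
runner.run()  # data analysis -> algo generation -> training -> ensembling
```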