
Initial version for multinode auto_runner and ensembler #6272

Merged: 25 commits into Project-MONAI:dev on Apr 14, 2023

Conversation

heyufan1995 (Member) commented Apr 3, 2023

Fixes #6191, fixes #6259.

Description

Big changes to AutoRunner to enable multi-node training and a multi-node, multi-GPU ensembler.
Main changes:

  1. Added set_device_info() to create a self.device_dict that defines device information (CUDA_VISIBLE_DEVICES, NUM_NODE, etc.) for all parts of AutoRunner, including the data analyzer, trainer, and ensembler. No global environment variable is set; all device info comes from self.device_dict. Corresponding changes were made to BundleGen.
  2. To enable multi-GPU/multi-node training for the ensembler (called from a subprocess), the ensembler needs to be separated from AutoRunner (so a subprocess can run it from AutoRunner). Created a new EnsembleRunner class (similar to BundleGen) and moved all ensemble-related functions from AutoRunner into it. Local multi-GPU ensembling passed.

Passed some quick local testing. Details still need to be fixed and tests written. Created this PR for an initial design-pattern discussion. Slack me if there is any major concern about the change.
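To illustrate item 1, here is a minimal sketch of how a set_device_info() method might collect device settings into a dict instead of mutating global environment variables. The function signature and key names (CUDA_VISIBLE_DEVICES, NUM_NODES) are illustrative assumptions, not the PR's exact implementation:

```python
# Hypothetical sketch: gather device info into a dict that subcomponents
# (data analyzer, trainer, ensembler) read from, rather than setting
# global environment variables. Key names are assumptions for illustration.
import os

def set_device_info(cuda_visible_devices=None, num_nodes=None):
    """Collect device settings into a dict for downstream components."""
    device_dict = {}
    # fall back to the current environment only when no value is given
    device_dict["CUDA_VISIBLE_DEVICES"] = (
        cuda_visible_devices
        if cuda_visible_devices is not None
        else os.environ.get("CUDA_VISIBLE_DEVICES", "0")
    )
    device_dict["NUM_NODES"] = (
        num_nodes if num_nodes is not None else int(os.environ.get("NUM_NODES", "1"))
    )
    return device_dict
```

Subcomponents launched as subprocesses can then receive this dict explicitly, which keeps per-component device configuration isolated.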
@mingxin-zheng @wyli

@heyufan1995 heyufan1995 self-assigned this Apr 3, 2023
mingxin-zheng (Contributor) commented Apr 4, 2023

Hi @heyufan1995 , I have a question that confuses me a bit. What benefit do we expect from using multi-node to run the ensemble, given that the ensemble execution is basically a nested for-loop like this:

for file in files:
    for algo in algos:
        pred = algo[some_key].predict(...)

heyufan1995 (Member, Author)

> What benefit do we expect from using multi-node to run the ensemble, given that the ensemble execution is basically a nested for-loop like this:
>
>     for file in files:
>         for algo in algos:
>             pred = algo[some_key].predict(...)

@mingxin-zheng The files are partitioned across all nodes and GPUs. So if you have 16 files, 2 nodes, and 16 GPUs, each GPU will sequentially run the 5 fold models on 1 file. I also think the for-loop should be changed to

for algo in algos:
    for file in files:
        pred = algo[some_key].predict(...)

since loading model weights can be slow.
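The partitioning described above can be sketched as round-robin sharding by global rank. The function name, the round-robin scheme, and the rank/world-size terminology are illustrative assumptions, not necessarily what the PR implements:

```python
# Illustrative sketch: distribute ensemble input files across ranks.
# "world_size" is the total GPU count across all nodes; "rank" is the
# global GPU index. Round-robin slicing assigns every world_size-th file
# to each rank, so all files are covered exactly once.
def partition_files(files, world_size, rank):
    """Return the subset of files this rank should process."""
    return files[rank::world_size]

files = [f"case_{i:02d}.nii.gz" for i in range(16)]
world_size = 16  # e.g. 2 nodes x 8 GPUs
shard = partition_files(files, world_size, rank=3)  # one file per rank here
```

With 16 files and 16 GPUs, each rank gets exactly one file and then runs all fold models on its shard, matching the description above.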

mingxin-zheng (Contributor) commented Apr 4, 2023

Hi @heyufan1995 , thanks for the explanation. Your proposal also makes sense. One concern is that memory would need to hold (n_algos x n_files) predictions before the ensemble takes place, which is why I went one file at a time. We may need to be careful and do something when there are many files to infer.

By the way, infer_instance.predict supports multiple files in one batch via the predict_files argument.

heyufan1995 (Member, Author)

> One concern is that memory would need to hold (n_algos x n_files) predictions before the ensemble takes place, which is why I went one file at a time. We may need to be careful and do something when there are many files to infer.
>
> By the way, infer_instance.predict supports multiple files in one batch via the predict_files argument.

Thanks, we can keep the current for loop. It's safer.
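The memory trade-off discussed above can be made concrete with a small sketch. The predict() and ensemble() callables here are stand-ins, not real MONAI APIs; the point is that the file-outer order bounds peak memory to n_algos predictions at a time:

```python
# Sketch of the file-outer loop order the discussion settles on.
# predict(algo, f) and ensemble(preds) are hypothetical stand-in
# callables used only to show the memory behavior.
def ensemble_file_outer(files, algos, predict, ensemble):
    results = {}
    for f in files:
        # at most n_algos predictions held in memory at once
        preds = [predict(algo, f) for algo in algos]
        # reduce immediately, so preds can be freed before the next file
        results[f] = ensemble(preds)
    return results
```

By contrast, the algo-outer order avoids reloading model weights per file but must retain all n_algos x n_files predictions before any ensembling can happen, which is the concern raised above.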

Signed-off-by: heyufan1995 <heyufan1995@gmail.com>
wyli (Contributor) commented Apr 11, 2023

/black

mingxin-zheng (Contributor)

Thanks @heyufan1995 for the PR. The design looks good to me since it minimizes the number of breaking changes. There are some format issues and test failures to fix. CC @wyli for visibility.

wyli added 3 commits April 14, 2023 05:33
Signed-off-by: Wenqi Li <wenqil@nvidia.com>
wyli (Contributor) commented Apr 14, 2023

/integration-test

wyli added 3 commits April 14, 2023 05:55
Signed-off-by: Wenqi Li <wenqil@nvidia.com>
wyli (Contributor) commented Apr 14, 2023

/build

wyli (Contributor) left a comment:

Integration verified: https://github.com/Project-MONAI/MONAI/actions/runs/4699028053/jobs/8332016913. I'm merging this to unblock benchmarking tasks.

@wyli wyli enabled auto-merge (squash) April 14, 2023 11:31
@wyli wyli merged commit 825b8db into Project-MONAI:dev Apr 14, 2023
Merging this pull request may close these issues: "autorunner support multinode multigpu devices"; "Enable Multi-node training".