
Initial version for multinode auto_runner and ensembler #6272

Merged: 25 commits into Project-MONAI:dev on Apr 14, 2023

Conversation

heyufan1995 (Member) commented Apr 3, 2023

Fixes #6191, fixes #6259.

Description

Big changes to AutoRunner to enable multi-node training and a multi-node, multi-GPU ensembler.
Main changes:

  1. Added set_device_info() to create a self.device_dict that defines device information (CUDA_VISIBLE_DEVICES, NUM_NODE, etc.) for all parts of AutoRunner, including the data analyzer, trainer, and ensembler. No global environment variable is set; all device info comes from self.device_dict. Corresponding changes were made to BundleGen.
  2. To enable multi-GPU/multi-node training for the ensembler (called from a subprocess), the ensembler needs to be separated from AutoRunner (so a subprocess can run it from AutoRunner). Created a new EnsembleRunner class (similar to BundleGen) and moved all ensemble-related functions from AutoRunner into it. Local multi-GPU ensembling passed.

Passed some quick local testing. Details still need to be fixed and tests written. Created this PR for an initial design-pattern discussion. Slack me if there is any major concern about the change.
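To illustrate item 1, here is a minimal sketch of how a set_device_info() method might collect device settings into a dict instead of mutating global environment variables. The function signature and key names (CUDA_VISIBLE_DEVICES, NUM_NODES) are illustrative assumptions, not the PR's exact implementation:

```python
# Hypothetical sketch: gather device info into a dict that subcomponents
# (data analyzer, trainer, ensembler) read from, rather than setting
# global environment variables. Key names are assumptions for illustration.
import os

def set_device_info(cuda_visible_devices=None, num_nodes=None):
    """Collect device settings into a dict for downstream components."""
    device_dict = {}
    # fall back to the current environment only when no value is given
    device_dict["CUDA_VISIBLE_DEVICES"] = (
        cuda_visible_devices
        if cuda_visible_devices is not None
        else os.environ.get("CUDA_VISIBLE_DEVICES", "0")
    )
    device_dict["NUM_NODES"] = (
        num_nodes if num_nodes is not None else int(os.environ.get("NUM_NODES", "1"))
    )
    return device_dict
```

Subcomponents launched as subprocesses can then receive this dict explicitly, which keeps per-component device configuration isolated.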
@mingxin-zheng @wyli

@heyufan1995 heyufan1995 self-assigned this Apr 3, 2023
mingxin-zheng (Contributor) commented Apr 4, 2023

Hi @heyufan1995 , I have a question that confuses me a bit. What benefit do we expect from using multi-node to run the ensemble, given that the ensemble execution is basically a nested for-loop like this:

for file in files:
    for algo in algos:
        pred = algo[some_key].predict(...)

heyufan1995 (Member, Author)

> What benefit do we expect from using multi-node to run the ensemble, given that the ensemble execution is basically a nested for-loop like this:
>
>     for file in files:
>         for algo in algos:
>             pred = algo[some_key].predict(...)

@mingxin-zheng The files are partitioned across all nodes and GPUs. So if you have 16 files, 2 nodes, and 16 GPUs, each GPU will sequentially run the 5 fold models on 1 file. I also think the for-loop should be changed to

for algo in algos:
    for file in files:
        pred = algo[some_key].predict(...)

since loading model weights can be slow.
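The partitioning described above can be sketched as round-robin sharding by global rank. The function name, the round-robin scheme, and the rank/world-size terminology are illustrative assumptions, not necessarily what the PR implements:

```python
# Illustrative sketch: distribute ensemble input files across ranks.
# "world_size" is the total GPU count across all nodes; "rank" is the
# global GPU index. Round-robin slicing assigns every world_size-th file
# to each rank, so all files are covered exactly once.
def partition_files(files, world_size, rank):
    """Return the subset of files this rank should process."""
    return files[rank::world_size]

files = [f"case_{i:02d}.nii.gz" for i in range(16)]
world_size = 16  # e.g. 2 nodes x 8 GPUs
shard = partition_files(files, world_size, rank=3)  # one file per rank here
```

With 16 files and 16 GPUs, each rank gets exactly one file and then runs all fold models on its shard, matching the description above.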

mingxin-zheng (Contributor) commented Apr 4, 2023

Hi @heyufan1995 , thanks for the explanation. Your proposal also makes sense. One concern is that memory would need to hold (n_algos x n_files) predictions before the ensemble takes place, which is why I went one file at a time. We may need to be careful and do something when there are many files to infer.

By the way, infer_instance.predict supports multiple files in one batch via the predict_files argument.

heyufan1995 (Member, Author)

> One concern is that memory would need to hold (n_algos x n_files) predictions before the ensemble takes place, which is why I went one file at a time. We may need to be careful and do something when there are many files to infer.
>
> By the way, infer_instance.predict supports multiple files in one batch via the predict_files argument.

Thanks, we can keep the current for loop. It's safer.
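The memory trade-off discussed above can be made concrete with a small sketch. The predict() and ensemble() callables here are stand-ins, not real MONAI APIs; the point is that the file-outer order bounds peak memory to n_algos predictions at a time:

```python
# Sketch of the file-outer loop order the discussion settles on.
# predict(algo, f) and ensemble(preds) are hypothetical stand-in
# callables used only to show the memory behavior.
def ensemble_file_outer(files, algos, predict, ensemble):
    results = {}
    for f in files:
        # at most n_algos predictions held in memory at once
        preds = [predict(algo, f) for algo in algos]
        # reduce immediately, so preds can be freed before the next file
        results[f] = ensemble(preds)
    return results
```

By contrast, the algo-outer order avoids reloading model weights per file but must retain all n_algos x n_files predictions before any ensembling can happen, which is the concern raised above.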

Signed-off-by: heyufan1995 <heyufan1995@gmail.com>
wyli (Contributor) commented Apr 11, 2023

/black

mingxin-zheng (Contributor)

Thanks @heyufan1995 for the PR. The design looks good to me since it minimizes the number of breaking changes. There are some format issues and test failures to fix. CC @wyli for visibility.

wyli added 3 commits April 14, 2023 05:33
Signed-off-by: Wenqi Li <wenqil@nvidia.com>
wyli (Contributor) commented Apr 14, 2023

/integration-test

wyli added 3 commits April 14, 2023 05:55
Signed-off-by: Wenqi Li <wenqil@nvidia.com>
wyli (Contributor) commented Apr 14, 2023

/build

wyli (Contributor) left a comment:

Integration verified: https://github.com/Project-MONAI/MONAI/actions/runs/4699028053/jobs/8332016913. I'm merging this to unblock benchmarking tasks.

@wyli wyli enabled auto-merge (squash) April 14, 2023 11:31
@wyli wyli merged commit 825b8db into Project-MONAI:dev Apr 14, 2023
Merging this pull request may close these issues: "autorunner support multinode multigpu devices"; "Enable Multi-node training".