[Feature] Add multi machine dist_train (open-mmlab#114)

* support multi nodes * update training doc * fix lints * remove fixed seed
humu789 · Mar 27, 2022 · 3403238 · 3403238
1 parent 2d42fbf
commit 3403238
Show file tree

Hide file tree

Showing 7 changed files with 175 additions and 13 deletions.
diff --git a/docs/en/train.md b/docs/en/train.md
@@ -1,4 +1,4 @@
-# Train a model with our algorithms
+# Train different type algorithms
 
 Currently our algorithms support [mmclassification](https://mmclassification.readthedocs.io/en/latest/), [mmdetection ](https://mmdetection.readthedocs.io/en/latest/)and [mmsegmentation](https://mmsegmentation.readthedocs.io/en/latest/). **Before running our algorithms, you may need to prepare the datasets according to the instructions in the  corresponding  document.**
 
@@ -56,3 +56,99 @@ python tools/${task}/train_${task}.py ${CONFIG_FILE} --cfg-options algorithm.dis
 ```
 
 - `TEACHER_CHECKPOINT_PATH`: Path of `teacher_checkpoint`. `teacher_checkpoint` represents **checkpoint of teacher model**, used to specify different checkpoints for distillation.
+
+
+# Train with different devices
+
+**Note**: The default learning rate in config files is for 8 GPUs. If using different number GPUs, the total batch size will change in proportion, you have to scale the learning rate following `new_lr = old_lr * new_ngpus / old_ngpus`. We recommend to use `tools/xxx/dist_train.sh` even with 1 gpu, since some methods do not support non-distributed training.
+
+### Training with CPU
+
+```shell
+export CUDA_VISIBLE_DEVICES=-1
+python tools/train.py ${CONFIG_FILE}
+```
+
+**Note**: We do not recommend users to use CPU for training because it is too slow and some algorithms are using `SyncBN` which is based on distributed training. We support this feature to allow users to debug on machines without GPU for convenience.
+
+### Train with single/multiple GPUs
+
+```shell
+sh tools/dist_train.sh ${CONFIG_FILE} ${GPUS} --work_dir ${YOUR_WORK_DIR} [optional arguments]
+```
+
+
+**Note**: During training, checkpoints and logs are saved in the same folder structure as the config file under `work_dirs/`. Custom work directory is not recommended since evaluation scripts infer work directories from the config file name. If you want to save your weights somewhere else, please use symlink, for example:
+
+```shell
+ln -s ${YOUR_WORK_DIRS} ${MMRAZOR}/work_dirs
+```
+
+Alternatively, if you run MMRazor on a cluster managed with [slurm](https://slurm.schedmd.com/):
+
+```shell
+GPUS_PER_NODE=${GPUS_PER_NODE} GPUS=${GPUS} SRUN_ARGS=${SRUN_ARGS} sh tools/xxx/slurm_train_xxx.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${YOUR_WORK_DIR} [optional arguments]
+```
+
+### Train with multiple machines
+
+If you launch with multiple machines simply connected with ethernet, you can simply run following commands:
+
+On the first machine:
+
+```shell
+NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/xxx/dist_train.sh $CONFIG $GPUS
+```
+
+On the second machine:
+
+```shell
+NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/xxx/dist_train.sh $CONFIG $GPUS
+```
+
+Usually it is slow if you do not have high speed networking like InfiniBand.
+
+If you launch with slurm, the command is the same as that on single machine described above, but you need refer to [slurm_train.sh](https://github.com/open-mmlab/mmselfsup/blob/master/tools/slurm_train.sh) to set appropriate parameters and environment variables.
+
+### Launch multiple jobs on a single machine
+
+If you launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs, you need to specify different ports (29500 by default) for each job to avoid communication conflict.
+
+If you use `dist_train.sh` to launch training jobs:
+
+```shell
+CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 sh tools/xxx/dist_train.sh ${CONFIG_FILE} 4 --work_dir tmp_work_dir_1
+CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 sh tools/xxx/dist_train.sh ${CONFIG_FILE} 4 --work_dir tmp_work_dir_2
+```
+
+If you use launch training jobs with slurm, you have two options to set different communication ports:
+
+Option 1:
+
+In `config1.py`:
+
+```python
+dist_params = dict(backend='nccl', port=29500)
+```
+
+In `config2.py`:
+
+```python
+dist_params = dict(backend='nccl', port=29501)
+```
+
+Then you can launch two jobs with config1.py and config2.py.
+
+```shell
+CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 sh tools/xxx/slurm_train_xxx.sh ${PARTITION} ${JOB_NAME} config1.py tmp_work_dir_1
+CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 sh tools/xxx/slurm_train_xxx.sh ${PARTITION} ${JOB_NAME} config2.py tmp_work_dir_2
+```
+
+Option 2:
+
+You can set different communication ports without the need to modify the configuration file, but have to set the `cfg-options` to overwrite the default port in configuration file.
+
+```shell
+CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 sh tools/xxx/slurm_train_xxx.sh ${PARTITION} ${JOB_NAME} config1.py tmp_work_dir_1 --cfg-options dist_params.port=29500
+CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 sh tools/xxx/slurm_train_xxx.sh ${PARTITION} ${JOB_NAME} config2.py tmp_work_dir_2 --cfg-options dist_params.port=29501
+```
diff --git a/tools/mmcls/dist_test.sh b/tools/mmcls/dist_test.sh
@@ -3,8 +3,20 @@
 CONFIG=$1
 CHECKPOINT=$2
 GPUS=$3
+NNODES=${NNODES:-1}
+NODE_RANK=${NODE_RANK:-0}
 PORT=${PORT:-29500}
+MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
 
 PYTHONPATH="$(dirname $0)/../..":$PYTHONPATH \
-python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
-    $(dirname "$0")/test_mmcls.py $CONFIG $CHECKPOINT --launcher pytorch ${@:4}
+python -m torch.distributed.launch \
+    --nnodes=$NNODES \
+    --node_rank=$NODE_RANK \
+    --master_addr=$MASTER_ADDR \
+    --nproc_per_node=$GPUS \
+    --master_port=$PORT \
+    $(dirname "$0")/mmcls/test_mmcls.py \
+    $CONFIG \
+    $CHECKPOINT \
+    --launcher pytorch \
+    ${@:4}
diff --git a/tools/mmcls/dist_train.sh b/tools/mmcls/dist_train.sh
@@ -2,8 +2,18 @@
 
 CONFIG=$1
 GPUS=$2
+NNODES=${NNODES:-1}
+NODE_RANK=${NODE_RANK:-0}
 PORT=${PORT:-29500}
+MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
 
 PYTHONPATH="$(dirname $0)/../..":$PYTHONPATH \
-python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
-    $(dirname "$0")/train_mmcls.py $CONFIG --launcher pytorch ${@:3}
+python -m torch.distributed.launch \
+    --nnodes=$NNODES \
+    --node_rank=$NODE_RANK \
+    --master_addr=$MASTER_ADDR \
+    --nproc_per_node=$GPUS \
+    --master_port=$PORT \
+    $(dirname "$0")/mmcls/train_mmcls.py \
+    $CONFIG \
+    --launcher pytorch ${@:3}
diff --git a/tools/mmdet/dist_test.sh b/tools/mmdet/dist_test.sh
@@ -3,8 +3,20 @@
 CONFIG=$1
 CHECKPOINT=$2
 GPUS=$3
+NNODES=${NNODES:-1}
+NODE_RANK=${NODE_RANK:-0}
 PORT=${PORT:-29500}
+MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
 
 PYTHONPATH="$(dirname $0)/../..":$PYTHONPATH \
-python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
-    $(dirname "$0")/test_mmdet.py $CONFIG $CHECKPOINT --launcher pytorch ${@:4}
+python -m torch.distributed.launch \
+    --nnodes=$NNODES \
+    --node_rank=$NODE_RANK \
+    --master_addr=$MASTER_ADDR \
+    --nproc_per_node=$GPUS \
+    --master_port=$PORT \
+    $(dirname "$0")/mmdet/test_mmdet.py \
+    $CONFIG \
+    $CHECKPOINT \
+    --launcher pytorch \
+    ${@:4}
diff --git a/tools/mmdet/dist_train.sh b/tools/mmdet/dist_train.sh
@@ -2,8 +2,18 @@
 
 CONFIG=$1
 GPUS=$2
+NNODES=${NNODES:-1}
+NODE_RANK=${NODE_RANK:-0}
 PORT=${PORT:-29500}
+MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
 
 PYTHONPATH="$(dirname $0)/../..":$PYTHONPATH \
-python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
-    $(dirname "$0")/train_mmdet.py $CONFIG --launcher pytorch ${@:3}
+python -m torch.distributed.launch \
+    --nnodes=$NNODES \
+    --node_rank=$NODE_RANK \
+    --master_addr=$MASTER_ADDR \
+    --nproc_per_node=$GPUS \
+    --master_port=$PORT \
+    $(dirname "$0")/mmdet/train_mmdet.py \
+    $CONFIG \
+    --launcher pytorch ${@:3}
diff --git a/tools/mmseg/dist_test.sh b/tools/mmseg/dist_test.sh
@@ -3,8 +3,20 @@
 CONFIG=$1
 CHECKPOINT=$2
 GPUS=$3
+NNODES=${NNODES:-1}
+NODE_RANK=${NODE_RANK:-0}
 PORT=${PORT:-29500}
+MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
 
 PYTHONPATH="$(dirname $0)/../..":$PYTHONPATH \
-python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
-    $(dirname "$0")/test_mmseg.py $CONFIG $CHECKPOINT --launcher pytorch ${@:4}
+python -m torch.distributed.launch \
+    --nnodes=$NNODES \
+    --node_rank=$NODE_RANK \
+    --master_addr=$MASTER_ADDR \
+    --nproc_per_node=$GPUS \
+    --master_port=$PORT \
+    $(dirname "$0")/mmseg/test_mmseg.py \
+    $CONFIG \
+    $CHECKPOINT \
+    --launcher pytorch \
+    ${@:4}
diff --git a/tools/mmseg/dist_train.sh b/tools/mmseg/dist_train.sh
@@ -2,8 +2,18 @@
 
 CONFIG=$1
 GPUS=$2
+NNODES=${NNODES:-1}
+NODE_RANK=${NODE_RANK:-0}
 PORT=${PORT:-29500}
+MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
 
 PYTHONPATH="$(dirname $0)/../..":$PYTHONPATH \
-python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
-    $(dirname "$0")/train_mmseg.py $CONFIG --launcher pytorch ${@:3}
+python -m torch.distributed.launch \
+    --nnodes=$NNODES \
+    --node_rank=$NODE_RANK \
+    --master_addr=$MASTER_ADDR \
+    --nproc_per_node=$GPUS \
+    --master_port=$PORT \
+    $(dirname "$0")/mmseg/train_mmseg.py \
+    $CONFIG \
+    --launcher pytorch ${@:3}