[Feature] Add multi machine dist_train. (open-mmlab#1383)

* Add training startup documentation * fix * fix * fix * fix * fix * fix * fix * fix * fix
AetrexTechnology · Mar 18, 2022 · 415b20f
1 parent 786a9f7
commit 415b20f

Showing 5 changed files with 205 additions and 54 deletions.
diff --git a/docs/en/train.md b/docs/en/train.md
@@ -17,12 +17,14 @@ Equivalently, you may also use 8 GPUs and 1 imgs/gpu since all models using cros
 
 To trade speed for GPU memory, you may pass `--cfg-options model.backbone.with_cp=True` to enable checkpointing in the backbone.
 
-### Train with a single GPU
+### Train on a single machine
+
+#### Train with a single GPU
 
 official support:
 
 ```shell
-./tools/dist_train.sh ${CONFIG_FILE} 1 [optional arguments]
+sh tools/dist_train.sh ${CONFIG_FILE} 1 [optional arguments]
 ```
 
 experimental support (Convert SyncBN to BN):
@@ -33,7 +35,7 @@ python tools/train.py ${CONFIG_FILE} [optional arguments]
 
 If you want to specify the working directory in the command, you can add an argument `--work-dir ${YOUR_WORK_DIR}`.
 
-### Train with CPU
+#### Train with CPU
 
 The process of training on the CPU is consistent with single GPU training. We just need to disable GPUs before the training process.
 
@@ -47,10 +49,10 @@ And then run the script [above](#train-with-a-single-gpu).
 The process of training on the CPU is consistent with single GPU training. We just need to disable GPUs before the training process.
 ```
 
-### Train with multiple GPUs
+#### Train with multiple GPUs
 
 ```shell
-./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]
+sh tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]
 ```
 
 Optional arguments are:
@@ -59,47 +61,109 @@ Optional arguments are:
 - `--work-dir ${WORK_DIR}`: Override the working directory specified in the config file.
 - `--resume-from ${CHECKPOINT_FILE}`: Resume from a previous checkpoint file (to continue the training process).
 - `--load-from ${CHECKPOINT_FILE}`: Load weights from a checkpoint file (to start finetuning for another task).
+- `--deterministic`: Switch on deterministic mode, which slows down training but makes the results reproducible.
 
 Difference between `resume-from` and `load-from`:
 
 - `resume-from` loads both the model weights and the optimizer state, including the iteration number.
 - `load-from` loads only the model weights and starts training from iteration 0.
 
+An example:
+
+```shell
+# Checkpoints and logs are saved in WORK_DIR=work_dirs/pspnet_r50-d8_512x512_80k_ade20k/
+# If the work directory is not set, it is generated automatically.
+sh tools/dist_train.sh configs/pspnet/pspnet_r50-d8_512x512_80k_ade20k.py 8 --work-dir work_dirs/pspnet_r50-d8_512x512_80k_ade20k/ --deterministic
+```
+
+**Note**: During training, checkpoints and logs are saved under `work_dirs/` in the same folder structure as the config file. A custom work directory is not recommended, since evaluation scripts infer the work directory from the config file name. If you want to save your weights somewhere else, please use a symlink, for example:
+
+```shell
+ln -s ${YOUR_WORK_DIRS} ${MMSEG}/work_dirs
+```
+
+#### Launch multiple jobs on a single machine
+
+If you launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs, you need to specify a different port (29500 by default) for each job to avoid communication conflicts. Otherwise, training fails with the error `RuntimeError: Address already in use`.
+
+If you use `dist_train.sh` to launch training jobs, you can set the port in the command with the environment variable `PORT`:
+
+```shell
+CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 sh tools/dist_train.sh ${CONFIG_FILE} 4
+CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 sh tools/dist_train.sh ${CONFIG_FILE} 4
+```
+
 ### Train with multiple machines
 
-If you run MMSegmentation on a cluster managed with [slurm](https://slurm.schedmd.com/), you can use the script `slurm_train.sh`. (This script also supports single machine training.)
+If you train on multiple machines connected only by Ethernet, you can run the following commands:
+
+On the first machine:
 
 ```shell
-[GPUS=${GPUS}] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} --work-dir ${WORK_DIR}
+NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_train.sh $CONFIG $GPUS
+```
+
+On the second machine:
+
+```shell
+NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_train.sh $CONFIG $GPUS
+```
+
+Training is usually slow without high-speed networking such as InfiniBand.
+
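The two commands above differ only in `NODE_RANK`. As a minimal sketch, assuming every machine uses the same master address, port, and GPU count, a small Python helper could generate one command per machine. The name `node_commands` and the sample address are hypothetical and not part of MMSegmentation:

```python
import shlex


def node_commands(config, gpus_per_node, master_addr, master_port, nnodes=2):
    """Return one shell command per machine, in NODE_RANK order."""
    commands = []
    for node_rank in range(nnodes):
        # Same environment variables as in the documented commands.
        env = {
            "NNODES": nnodes,
            "NODE_RANK": node_rank,
            "PORT": master_port,
            "MASTER_ADDR": master_addr,
        }
        prefix = " ".join(
            f"{key}={shlex.quote(str(value))}" for key, value in env.items()
        )
        commands.append(
            f"{prefix} sh tools/dist_train.sh {shlex.quote(config)} {gpus_per_node}"
        )
    return commands


if __name__ == "__main__":
    # Hypothetical values: replace the address, port, and config with your own.
    for command in node_commands(
        "configs/pspnet/pspnet_r50-d8_512x512_80k_ade20k.py", 8, "10.0.0.1", 29500
    ):
        print(command)
```

Each printed line is meant to be run on the machine whose rank it names; the helper only formats strings and does not start any process.
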
+### Manage jobs with Slurm
+
+Slurm is a good job scheduling system for computing clusters. On a cluster managed by Slurm, you can use `slurm_train.sh` to spawn training jobs. It supports both single-node and multi-node training.
+
+Train with multiple machines:
+
+```shell
+[GPUS=${GPUS}] sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} --work-dir ${WORK_DIR}
 ```
 
 Here is an example of using 16 GPUs to train PSPNet on the dev partition.
 
 ```shell
-GPUS=16 ./tools/slurm_train.sh dev pspr50 configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py /nfs/xxxx/psp_r50_512x1024_40ki_cityscapes
+GPUS=16 sh tools/slurm_train.sh dev pspr50 configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py work_dirs/pspnet_r50-d8_512x1024_40k_cityscapes/
 ```
 
-You can check [slurm_train.sh](../tools/slurm_train.sh) for full arguments and environment variables.
+When using `slurm_train.sh` to start multiple jobs on one node, each job needs a different port. There are three ways to set it:
 
-If you have just multiple machines connected with ethernet, you can refer to
-PyTorch [launch utility](https://pytorch.org/docs/stable/distributed.html#launch-utility).
-Usually it is slow if you do not have high speed networking like InfiniBand.
+Option 1:
 
-### Launch multiple jobs on a single machine
+In `config1.py`:
 
-If you launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs,
-you need to specify different ports (29500 by default) for each job to avoid communication conflict. Otherwise, there will be error message saying `RuntimeError: Address already in use`.
+```python
+dist_params = dict(backend='nccl', port=29500)
+```
 
-If you use `dist_train.sh` to launch training jobs, you can set the port in commands with environment variable `PORT`.
+In `config2.py`:
+
+```python
+dist_params = dict(backend='nccl', port=29501)
+```
+
+Then you can launch two jobs with `config1.py` and `config2.py`:
+
+```shell
+CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py tmp_work_dir_1
+CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py tmp_work_dir_2
+```
+
+Option 2:
+
+You can set different communication ports without modifying the configuration files by passing `--cfg-options` to override the default port:
 
 ```shell
-CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_train.sh ${CONFIG_FILE} 4
-CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh ${CONFIG_FILE} 4
+CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py tmp_work_dir_1 --cfg-options dist_params.port=29500
+CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py tmp_work_dir_2 --cfg-options dist_params.port=29501
 ```
 
-If you use `slurm_train.sh` to launch training jobs, you can set the port in commands with environment variable `MASTER_PORT`.
+Option 3:
+
+You can set the port in the command with the environment variable `MASTER_PORT`:
 
 ```shell
-MASTER_PORT=29500 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE}
-MASTER_PORT=29501 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE}
+CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 MASTER_PORT=29500 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py tmp_work_dir_1
+CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 MASTER_PORT=29501 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py tmp_work_dir_2
 ```
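All of the port settings above exist to avoid `RuntimeError: Address already in use`. As a rough sketch, assuming the jobs run on the local machine and that a plain TCP bind is a good enough probe, the following Python helpers pick one distinct free port per job before launching. The names `port_is_free` and `pick_ports` are hypothetical and not part of MMSegmentation:

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def pick_ports(num_jobs, start=29500, host="127.0.0.1"):
    """Pick one distinct, currently free port per job, scanning upward from start."""
    ports = []
    port = start
    while len(ports) < num_jobs:
        if port > 65535:
            raise RuntimeError("no free port found")
        if port_is_free(port, host):
            ports.append(port)
        port += 1
    return ports


if __name__ == "__main__":
    # Two jobs on one machine, as in the examples above.
    for job_index, port in enumerate(pick_ports(2)):
        print(f"job {job_index}: PORT={port}")
```

The probe is only a snapshot: another process can still claim a port between the check and the launch, so the error remains possible, just less likely.
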
diff --git a/docs/zh_cn/train.md b/docs/zh_cn/train.md
@@ -15,15 +15,17 @@ evaluation = dict(interval=4000)  # 每4000 iterations 评估一次模型的性
 
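 We can trade training speed for GPU memory. When the model or batch size is large, you can pass `--cfg-options model.backbone.with_cp=True` to enable `with_cp` and save GPU memory at the cost of slower training, because with `with_cp` backpropagation (BP) proceeds layer by layer and does not keep all gradients.
 
-### Train with a single GPU
+### Train on a single machine
+
+#### Train with a single GPU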
 
 ```shell
-python tools/train.py ${配置文件} [optional arguments]
+python tools/train.py ${CONFIG_FILE} [optional arguments]
 ```
 
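 If you want to specify the working directory in the command, you can add the argument `--work-dir ${工作路径}`.
 
-### Train with CPU
+#### Train with CPU
 
 Training on a CPU follows the same process as training with a single GPU; we only need to disable the GPUs before training starts.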
 
@@ -37,10 +39,10 @@ export CUDA_VISIBLE_DEVICES=-1
 We do not recommend training on a CPU because it is too slow. We support this feature so that users can debug on machines without a GPU.
 ```
 
-### Train with multiple GPUs
+#### Train with multiple GPUs
 
 ```shell
-./tools/dist_train.sh ${配置文件} ${GPU 个数} [optional arguments]
+sh tools/dist_train.sh ${CONFIG_FILE} ${GPUS} [optional arguments]
 ```
 
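 Optional arguments are:
@@ -49,48 +51,109 @@ export CUDA_VISIBLE_DEVICES=-1
 - `--work-dir ${工作路径}`: Override the working directory specified in the config file.
 - `--resume-from ${检查点文件}`: Resume from a previous checkpoint file (to continue the training process).
 - `--load-from ${检查点文件}`: Load weights from a checkpoint file (to fine-tune for another task).
+- `--deterministic`: Enabling this mode slows down training but makes the results easier to reproduce.
 
 Difference between `resume-from` and `load-from`:
 
 - `resume-from` loads the model weights and the optimizer state, including the iteration number.
 - `load-from` loads only the model weights and starts training from iteration 0.
 
-### Train with multiple machines
+An example: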
+
+```shell
+# Checkpoints and logs are saved in WORK_DIR=work_dirs/pspnet_r50-d8_512x512_80k_ade20k/
+# If the work directory is not set, it is generated automatically.
+sh tools/dist_train.sh configs/pspnet/pspnet_r50-d8_512x512_80k_ade20k.py 8 --work-dir work_dirs/pspnet_r50-d8_512x512_80k_ade20k/ --deterministic
+```
+
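+**Note**: During training, checkpoints and logs are saved under `work_dirs/` in the same folder structure as the config file. A custom work directory is not recommended, since evaluation scripts infer the work directory from the config file name. If you want to save your weights somewhere else, please use a symlink, for example: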
+
+```shell
+ln -s ${YOUR_WORK_DIRS} ${MMSEG}/work_dirs
+```
+
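+#### Launch multiple jobs on a single machine
+
+If you launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs, you need to specify a different port (29500 by default) for each job to avoid communication conflicts. Otherwise, training fails with the error `RuntimeError: Address already in use`.
 
-If you run MMSegmentation on a cluster managed with [slurm](https://slurm.schedmd.com/),
-you can use the script `slurm_train.sh` (this script also supports single-machine training).
+If you use `dist_train.sh` to launch a training job, you can set the port on the command line with the environment variable `PORT`: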
 
 ```shell
-[GPUS=${GPU 数量}] ./tools/slurm_train.sh ${分区} ${任务名称} ${配置文件} --work-dir ${工作路径}
+CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 sh tools/dist_train.sh ${CONFIG_FILE} 4
+CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 sh tools/dist_train.sh ${CONFIG_FILE} 4
 ```
 
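-Here is an example of using 16 GPUs to train PSPNet on the dev partition.
+### Train with multiple machines
+
+If you want to train on multiple machines connected by Ethernet, you can run the following commands:
+
+On the first machine: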
+
+```shell
+NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_train.sh $CONFIG $GPUS
+```
+
+On the second machine:
 
 ```shell
-GPUS=16 ./tools/slurm_train.sh dev pspr50 configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py /nfs/xxxx/psp_r50_512x1024_40ki_cityscapes
+NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_train.sh $CONFIG $GPUS
 ```
 
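-You can check [slurm_train.sh](../tools/slurm_train.sh) for the full list of arguments and environment variables.
+However, training will be very slow if the machines are not connected by a high-speed network.
 
-If your machines are already connected with Ethernet, you can refer to the PyTorch
-[launch utility](https://pytorch.org/docs/stable/distributed.html#launch-utility).
-Multi-machine training is usually slow without high-speed networking such as InfiniBand.
+### Manage jobs with Slurm
 
-### Launch multiple jobs on a single machine
+Slurm is a good job scheduling system for computing clusters. On a cluster managed by Slurm, you can use `slurm_train.sh` to launch training jobs. It supports both single-node and multi-node training.
 
-If you launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs, you need to specify a different port (29500 by default) for each job to avoid communication conflicts.
-Otherwise, training fails with the error `RuntimeError: Address already in use`.
+Train with multiple machines:
 
-If you use `dist_train.sh` to launch a training job, you can set the port on the command line with the environment variable `PORT`.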
+```shell
+[GPUS=${GPUS}] sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} --work-dir ${WORK_DIR}
+```
+
+Here is an example of using 16 GPUs to train PSPNet on the dev partition:
 
 ```shell
-CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_train.sh ${配置文件} 4
-CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh ${配置文件} 4
+GPUS=16 sh tools/slurm_train.sh dev pspr50 configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py work_dirs/pspnet_r50-d8_512x1024_40k_cityscapes/
 ```
 
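-If you use `slurm_train.sh` to launch training jobs, you can set the port on the command line with the environment variable `MASTER_PORT`.
+When using `slurm_train.sh` to start multiple jobs on one node, each job needs a different port. There are three ways to set it:
+
+Option 1:
+
+In `config1.py`: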
+
+```python
+dist_params = dict(backend='nccl', port=29500)
+```
+
+In `config2.py`:
+
+```python
+dist_params = dict(backend='nccl', port=29501)
+```
+
+Then you can launch two jobs with `config1.py` and `config2.py`:
+
+```shell
+CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py tmp_work_dir_1
+CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py tmp_work_dir_2
+```
+
+Option 2:
+
+You can set different communication ports without modifying the configuration files by passing `--cfg-options` to override the default port:
+
+```shell
+CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py tmp_work_dir_1 --cfg-options dist_params.port=29500
+CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py tmp_work_dir_2 --cfg-options dist_params.port=29501
+```
+
+Option 3:
+
+You can set the port in the command with the environment variable `MASTER_PORT`:
 
 ```shell
-MASTER_PORT=29500 ./tools/slurm_train.sh ${分区} ${任务名称} ${配置文件}
-MASTER_PORT=29501 ./tools/slurm_train.sh ${分区} ${任务名称} ${配置文件}
+CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 MASTER_PORT=29500 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py tmp_work_dir_1
+CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 MASTER_PORT=29501 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py tmp_work_dir_2
 ```
diff --git a/mmseg/core/evaluation/class_names.py b/mmseg/core/evaluation/class_names.py
@@ -120,10 +120,12 @@ def isaid_classes():
         'Soccer_ball_field', 'plane', 'Harbor'
     ]
 
+
 def stare_classes():
     """stare class names for external use."""
     return ['background', 'vessel']
 
+
 def cityscapes_palette():
     """Cityscapes palette for external use."""
     return [[128, 64, 128], [244, 35, 232], [70, 70, 70], [102, 102, 156],
@@ -257,10 +259,12 @@ def isaid_palette():
             [0, 0, 191], [0, 0, 255], [0, 191, 127], [0, 127, 191],
             [0, 127, 255], [0, 100, 155]]
 
+
 def stare_palette():
     """STARE palette for external use."""
     return [[120, 120, 120], [6, 230, 230]]
 
+
 dataset_aliases = {
     'cityscapes': ['cityscapes'],
     'ade': ['ade', 'ade20k'],
@@ -274,7 +278,7 @@ def stare_palette():
         'coco_stuff164k'
     ],
     'isaid': ['isaid', 'iSAID'],
-    'stare':['stare', 'STARE']
+    'stare': ['stare', 'STARE']
 }
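For context on how `dataset_aliases` pairs with the `*_classes()` and `*_palette()` functions, here is a minimal sketch of an alias lookup. It copies only the entries visible in this diff, and `resolve_dataset` is a hypothetical helper written for illustration, not a function from `class_names.py`:

```python
# Subset of the mapping shown in the diff above; other datasets are omitted here.
dataset_aliases = {
    'cityscapes': ['cityscapes'],
    'ade': ['ade', 'ade20k'],
    'isaid': ['isaid', 'iSAID'],
    'stare': ['stare', 'STARE'],
}


def stare_classes():
    """stare class names for external use."""
    return ['background', 'vessel']


def stare_palette():
    """STARE palette for external use."""
    return [[120, 120, 120], [6, 230, 230]]


def resolve_dataset(alias):
    """Hypothetical helper: map an alias such as 'STARE' to its canonical key."""
    alias_to_name = {
        a: name for name, aliases in dataset_aliases.items() for a in aliases
    }
    try:
        return alias_to_name[alias]
    except KeyError:
        raise ValueError(f'Unrecognized dataset: {alias}') from None


if __name__ == '__main__':
    name = resolve_dataset('STARE')
    print(name, stare_classes(), stare_palette())
```

Inverting the mapping once keeps the lookup case-sensitive and exact, so only the spellings listed in `dataset_aliases` are accepted.
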
 
 

diff --git a/tools/dist_test.sh b/tools/dist_test.sh
@@ -1,9 +1,20 @@
-#!/usr/bin/env bash
-
 CONFIG=$1
 CHECKPOINT=$2
 GPUS=$3
+NNODES=${NNODES:-1}
+NODE_RANK=${NODE_RANK:-0}
 PORT=${PORT:-29500}
+MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
+
 PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
-python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
-    $(dirname "$0")/test.py $CONFIG $CHECKPOINT --launcher pytorch ${@:4}
+python -m torch.distributed.launch \
+    --nnodes=$NNODES \
+    --node_rank=$NODE_RANK \
+    --master_addr=$MASTER_ADDR \
+    --nproc_per_node=$GPUS \
+    --master_port=$PORT \
+    $(dirname "$0")/test.py \
+    $CONFIG \
+    $CHECKPOINT \
+    --launcher pytorch \
+    ${@:4}
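The updated script fills in single-machine defaults with `${VAR:-default}`, which falls back when the variable is unset or empty. The following Python sketch mirrors that resolution and the resulting launcher arguments so the defaults can be inspected without starting a job. The function names and the `script_dir` parameter are assumptions made for illustration; only the variable names, defaults, and flags come from the diff:

```python
import os


def env_default(name, default, environ=None):
    """Mimic ${NAME:-default}: use default when NAME is unset or empty."""
    environ = os.environ if environ is None else environ
    value = environ.get(name, "")
    return value if value else default


def build_launch_argv(config, checkpoint, gpus, extra_args=(),
                      environ=None, script_dir="tools"):
    """Assemble the argument list in the same order as the updated script."""
    nnodes = env_default("NNODES", "1", environ)
    node_rank = env_default("NODE_RANK", "0", environ)
    port = env_default("PORT", "29500", environ)
    master_addr = env_default("MASTER_ADDR", "127.0.0.1", environ)
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--nproc_per_node={gpus}",
        f"--master_port={port}",
        f"{script_dir}/test.py",
        config,
        checkpoint,
        "--launcher", "pytorch",
        *extra_args,  # corresponds to ${@:4} in the script
    ]


if __name__ == "__main__":
    # Hypothetical config and checkpoint names, single-machine defaults.
    print(" ".join(build_launch_argv("config.py", "checkpoint.pth", 8, environ={})))
```

With no variables set, the arguments describe one node with rank 0 on `127.0.0.1:29500`, so existing single-machine invocations should behave as before.
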