[Feature] Add multi machine dist_train. #1383

Merged on Mar 18, 2022 (10 commits).
88 changes: 78 additions & 10 deletions docs/en/train.md
@@ -22,7 +22,7 @@ To trade speed with GPU memory, you may pass in `--cfg-options model.backbone.wi
official support:

```shell
-./tools/dist_train.sh ${CONFIG_FILE} 1 [optional arguments]
+sh tools/dist_train.sh ${CONFIG_FILE} 1 [optional arguments]
```

experimental support (Convert SyncBN to BN):
@@ -50,7 +50,7 @@ The process of training on the CPU is consistent with single GPU training. We ju
### Train with multiple GPUs

```shell
-./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]
+sh tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]
```

Optional arguments are:
@@ -59,24 +59,67 @@ Optional arguments are:
- `--work-dir ${WORK_DIR}`: Override the working directory specified in the config file.
- `--resume-from ${CHECKPOINT_FILE}`: Resume from a previous checkpoint file (to continue the training process).
- `--load-from ${CHECKPOINT_FILE}`: Load weights from a checkpoint file (to start finetuning for another task).
- `--deterministic`: Switch on "deterministic" mode, which slows down training but makes the results reproducible.

Difference between `resume-from` and `load-from`:

- `resume-from` loads both the model weights and optimizer state, including the iteration number.
- `load-from` loads only the model weights and starts training from iteration 0.
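
For instance, one might resume an interrupted run or finetune from an existing checkpoint like this (the checkpoint paths below are only illustrative):

```shell
# Resume: restores model weights, optimizer state and the iteration count.
sh tools/dist_train.sh configs/pspnet/pspnet_r50-d8_512x512_80k_ade20k.py 8 \
    --resume-from work_dirs/pspnet_r50-d8_512x512_80k_ade20k/latest.pth

# Finetune: loads only the model weights and starts again from iteration 0.
sh tools/dist_train.sh configs/pspnet/pspnet_r50-d8_512x512_80k_ade20k.py 8 \
    --load-from checkpoints/pspnet_r50-d8_512x512_80k_ade20k.pth
```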

An example:

```shell
# checkpoints and logs saved in WORK_DIR=work_dirs/pspnet_r50-d8_512x512_80k_ade20k/
# If work_dir is not set, it will be generated automatically.
sh tools/dist_train.sh configs/pspnet/pspnet_r50-d8_512x512_80k_ade20k.py 8 --work-dir work_dirs/pspnet_r50-d8_512x512_80k_ade20k/ --deterministic
```

**Note**: During training, checkpoints and logs are saved in the same folder structure as the config file under `work_dirs/`. A custom work directory is not recommended, since evaluation scripts infer work directories from the config file name. If you want to save your weights somewhere else, please use a symlink, for example:

```shell
ln -s ${YOUR_WORK_DIRS} ${MMSEG}/work_dirs
```

Alternatively, if you run MMSegmentation on a cluster managed with [slurm](https://slurm.schedmd.com/):

```shell
GPUS_PER_NODE=${GPUS_PER_NODE} GPUS=${GPUS} SRUN_ARGS=${SRUN_ARGS} sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${YOUR_WORK_DIR} [optional arguments]
```

An example:

```shell
GPUS_PER_NODE=8 GPUS=8 sh tools/slurm_train.sh dev pspr50 configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py work_dirs/pspnet_r50-d8_512x1024_40k_cityscapes/
```

### Train with multiple machines

If you run MMSegmentation on a cluster managed with [slurm](https://slurm.schedmd.com/), you can use the script `slurm_train.sh`. (This script also supports single-machine training.)

If you launch on multiple machines simply connected with Ethernet, you can run the following commands:

On the first machine:

```shell
NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_train.sh $CONFIG $GPUS
```

On the second machine:

```shell
-[GPUS=${GPUS}] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} --work-dir ${WORK_DIR}
+NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_train.sh $CONFIG $GPUS
```

Multi-machine training is usually slow if you do not have high-speed networking such as InfiniBand.
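
For a concrete sketch, assume two machines with 8 GPUs each and that the first machine is reachable at the (hypothetical) address 10.1.1.10; the two launches might then look like:

```shell
# First machine (rank 0), which also hosts the rendezvous address and port:
NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=10.1.1.10 \
    sh tools/dist_train.sh configs/pspnet/pspnet_r50-d8_512x512_80k_ade20k.py 8

# Second machine (rank 1), pointing at the same address and port:
NNODES=2 NODE_RANK=1 PORT=29500 MASTER_ADDR=10.1.1.10 \
    sh tools/dist_train.sh configs/pspnet/pspnet_r50-d8_512x512_80k_ade20k.py 8
```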

If you launch with slurm, the command is the same as for single-machine training described above, but you need to refer to [slurm_train.sh](https://github.com/open-mmlab/mmsegmentation/blob/master/tools/slurm_train.sh) to set the appropriate parameters and environment variables. (This script also supports single-machine training.)

```shell
[GPUS=${GPUS}] sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} --work-dir ${WORK_DIR}
```

Here is an example of using 16 GPUs to train PSPNet on the dev partition.

```shell
-GPUS=16 ./tools/slurm_train.sh dev pspr50 configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py /nfs/xxxx/psp_r50_512x1024_40ki_cityscapes
+GPUS=16 sh tools/slurm_train.sh dev pspr50 configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py work_dirs/pspnet_r50-d8_512x1024_40k_cityscapes/
```

You can check [slurm_train.sh](../tools/slurm_train.sh) for full arguments and environment variables.
@@ -93,13 +136,38 @@ you need to specify different ports (29500 by default) for each job to avoid com
If you use `dist_train.sh` to launch training jobs, you can set the port in commands with environment variable `PORT`.

```shell
-CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_train.sh ${CONFIG_FILE} 4
-CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh ${CONFIG_FILE} 4
+CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 sh tools/dist_train.sh ${CONFIG_FILE} 4
+CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 sh tools/dist_train.sh ${CONFIG_FILE} 4
```

If you use `slurm_train.sh` to launch training jobs, you can set the port in commands with environment variable `MASTER_PORT`.
If you use `slurm_train.sh` to launch training jobs, you can set the port in the command with the environment variable `MASTER_PORT`. You have two options for setting different communication ports:

Option 1:

In `config1.py`:

```python
dist_params = dict(backend='nccl', port=29500)
```

In `config2.py`:

```python
dist_params = dict(backend='nccl', port=29501)
```

Then you can launch two jobs with config1.py and config2.py.

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE}
CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE}
```

Option 2:

You can set different communication ports without the need to modify the configuration file.

```shell
-MASTER_PORT=29500 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE}
-MASTER_PORT=29501 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE}
+MASTER_PORT=29500 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE}
+MASTER_PORT=29501 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE}
```
6 changes: 5 additions & 1 deletion mmseg/core/evaluation/class_names.py
@@ -120,10 +120,12 @@ def isaid_classes():
'Soccer_ball_field', 'plane', 'Harbor'
]


def stare_classes():
"""stare class names for external use."""
return ['background', 'vessel']


def cityscapes_palette():
"""Cityscapes palette for external use."""
return [[128, 64, 128], [244, 35, 232], [70, 70, 70], [102, 102, 156],
@@ -257,10 +259,12 @@ def isaid_palette():
[0, 0, 191], [0, 0, 255], [0, 191, 127], [0, 127, 191],
[0, 127, 255], [0, 100, 155]]


def stare_palette():
"""STARE palette for external use."""
return [[120, 120, 120], [6, 230, 230]]


dataset_aliases = {
'cityscapes': ['cityscapes'],
'ade': ['ade', 'ade20k'],
@@ -274,7 +278,7 @@ def stare_palette():
'coco_stuff164k'
],
'isaid': ['isaid', 'iSAID'],
-'stare':['stare', 'STARE']
+'stare': ['stare', 'STARE']
}
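
With the new `'stare'` alias registered above, the existing lookup helpers in `class_names.py` should resolve the STARE class names and palette by either spelling. A quick sanity check might look like this (a sketch assuming `get_classes` and `get_palette` are exported from `mmseg.core.evaluation` and mmseg is installed):

```shell
# Illustrative check only: resolve the STARE class names and palette via the alias.
python -c "from mmseg.core.evaluation import get_classes, get_palette
print(get_classes('STARE'))   # expected: ['background', 'vessel']
print(get_palette('stare'))   # expected: [[120, 120, 120], [6, 230, 230]]"
```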

