Tutorial 4: Train and test in MMagic

In this section, we introduce how to train and test models in MMagic. It provides the following guides:

  • Prerequisite
  • Test a model in MMagic
  • Train a model in MMagic

Prerequisite

Users need to prepare datasets before training and testing models in MMagic.

Test a model in MMagic

Test with a single GPU

You can use the following command to test a pre-trained model with a single GPU.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE}

For example,

python tools/test.py configs/example_config.py work_dirs/example_exp/example_model_20200202.pth

Test with multiple GPUs

MMagic supports testing with multiple GPUs, which can greatly reduce testing time. You can use the following command to test a pre-trained model with multiple GPUs.

./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM}

For example,

./tools/dist_test.sh configs/example_config.py work_dirs/example_exp/example_model_20200202.pth 8

Test with Slurm

If you run MMagic on a cluster managed by Slurm, you can use the script slurm_test.sh. (This script also supports single-machine testing.)

[GPUS=${GPUS}] ./tools/slurm_test.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${CHECKPOINT_FILE}

Here is an example of using 8 GPUs to test an example model on the 'dev' partition with the job name 'test'.

GPUS=8 ./tools/slurm_test.sh dev test configs/example_config.py work_dirs/example_exp/example_model_20200202.pth

You can check slurm_test.sh for full arguments and environment variables.
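For instance, here is a minimal sketch of combining several of these variables; GPUS_PER_NODE and CPUS_PER_TASK are assumptions, so verify the exact variable names in slurm_test.sh:

# illustrative values; check slurm_test.sh for the variables it actually reads
GPUS=16 GPUS_PER_NODE=8 CPUS_PER_TASK=4 ./tools/slurm_test.sh dev test configs/example_config.py work_dirs/example_exp/example_model_20200202.pth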

Test with specific metrics

MMagic provides various evaluation metrics, e.g., MS-SSIM, SWD, IS, FID, Precision&Recall, PPL, Equivariance, TransFID, TransIS, etc. We provide unified evaluation scripts in tools/test.py for all models. If you want to evaluate your model with specific metrics, you can add them to your config file like this:

# at the end of configs/styleganv2/stylegan2_c2_ffhq_256_b4x8_800k.py
metrics = [
    dict(
        type='FrechetInceptionDistance',
        prefix='FID-Full-50k',
        fake_nums=50000,
        inception_style='StyleGAN',
        sample_model='ema'),
    dict(type='PrecisionAndRecall', fake_nums=50000, prefix='PR-50K'),
    dict(type='PerceptualPathLength', fake_nums=50000, prefix='ppl-w')
]

As shown above, metrics consists of multiple metric dictionaries. Each metric contains a type field indicating its category. fake_nums denotes the number of images generated by the model. Some metrics output a dictionary of results; you can also set prefix to specify the prefix of the result keys. If you set the prefix of FID to FID-Full-50k, an example of the output may be

FID-Full-50k/fid: 3.6561  FID-Full-50k/mean: 0.4263  FID-Full-50k/cov: 3.2298

Then you can test the model with the command below:

bash tools/dist_test.sh ${CONFIG_FILE} ${CKPT_FILE} ${GPU_NUM}

If you are in a Slurm environment, please switch to tools/slurm_test.sh and use the following command:

sh tools/slurm_test.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${CKPT_FILE}

Train a model in MMagic

MMagic supports multiple ways of training:

  1. Train with a single GPU
  2. Train with multiple GPUs
  3. Train with multiple nodes
  4. Train with Slurm

Specifically, all outputs (log files and checkpoints) will be saved to the working directory, which is specified by work_dir in the config file.

Train with a single GPU

CUDA_VISIBLE_DEVICES=0 python tools/train.py configs/example_config.py --work-dir work_dirs/example

Train with multiple nodes

To launch distributed training on multiple machines that can reach each other via IP, run the following commands:

On the first machine:

NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR tools/dist_train.sh $CONFIG $GPUS

On the second machine:

NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR tools/dist_train.sh $CONFIG $GPUS
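
As a concrete sketch of the two commands above, assume each machine has 8 GPUs and the first machine is reachable at 10.1.1.1 on port 29500 (both values are illustrative placeholders):

# on the first machine (rank 0); address and port are placeholders
NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=10.1.1.1 ./tools/dist_train.sh configs/example_config.py 8

# on the second machine (rank 1)
NNODES=2 NODE_RANK=1 PORT=29500 MASTER_ADDR=10.1.1.1 ./tools/dist_train.sh configs/example_config.py 8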

To speed up network communication, high-speed network hardware such as InfiniBand is recommended. Please refer to the PyTorch docs for more information.

Train with multiple GPUs

./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]
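
For example, to train with the example config on 8 GPUs (the GPU count here is illustrative):

./tools/dist_train.sh configs/example_config.py 8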

Train with Slurm

If you run MMagic on a cluster managed by Slurm, you can use the script slurm_train.sh. (This script also supports single-machine training.)

[GPUS=${GPUS}] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR}

Here is an example of using 8 GPUs to train an inpainting model on the dev partition with the job name 'train'.

GPUS=8 ./tools/slurm_train.sh dev train configs/inpainting/gl_places.py /nfs/xxxx/gl_places_256

You can check slurm_train.sh for full arguments and environment variables.

Optional arguments

  • --amp: This argument enables automatic mixed-precision training (see the example below).
  • --resume: This argument automatically resumes training from the latest checkpoint if a previous run was interrupted.
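
For example, both options can be appended to the single-GPU training command shown earlier (the config and work directory are illustrative):

# enable mixed-precision training and resume from the latest checkpoint if one exists
CUDA_VISIBLE_DEVICES=0 python tools/train.py configs/example_config.py --work-dir work_dirs/example --amp --resume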

Train with specific evaluation metrics

Benefiting from MMEngine's Runner, we can evaluate models during training in a simple way, as shown below.

# define metrics
metrics = [
    dict(
        type='FrechetInceptionDistance',
        prefix='FID-Full-50k',
        fake_nums=50000,
        inception_style='StyleGAN')
]

# define dataloader
val_dataloader = dict(
    batch_size=128,
    num_workers=8,
    dataset=dict(
        type='BasicImageDataset',
        data_root='data/celeba-cropped/',
        pipeline=[
            dict(type='LoadImageFromFile', key='img'),
            dict(type='Resize', scale=(64, 64)),
            dict(type='PackInputs')
        ]),
    sampler=dict(type='DefaultSampler', shuffle=False),
    persistent_workers=True)

# define val interval
train_cfg = dict(by_epoch=False, val_begin=1, val_interval=10000)

# define val loop and evaluator
val_cfg = dict(type='MultiValLoop')
val_evaluator = dict(type='Evaluator', metrics=metrics)

You can set val_begin and val_interval to adjust when validation begins and how often it runs.

For details of metrics, refer to the metrics guide.