# Tutorial 4: Train and test in MMagic

In this section, we introduce how to test and train models in MMagic.
Users need to prepare a dataset first to enable training and testing models in MMagic.
## Test a model in MMagic

You can use the following command to test a pre-trained model with a single GPU:
```shell
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE}
```
For example,
```shell
python tools/test.py configs/example_config.py work_dirs/example_exp/example_model_20200202.pth
```
MMagic supports testing with multiple GPUs, which can greatly speed up model testing. You can use the following command to test a pre-trained model with multiple GPUs:
```shell
./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM}
```
For example,
```shell
./tools/dist_test.sh configs/example_config.py work_dirs/example_exp/example_model_20200202.pth 8
```
If you run MMagic on a cluster managed with Slurm, you can use the script `slurm_test.sh`. (This script also supports single-machine testing.)
```shell
[GPUS=${GPUS}] ./tools/slurm_test.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${CHECKPOINT_FILE}
```
Here is an example of using 8 GPUs to test an example model on the 'dev' partition with the job name 'test'.
```shell
GPUS=8 ./tools/slurm_test.sh dev test configs/example_config.py work_dirs/example_exp/example_model_20200202.pth
```
You can check `slurm_test.sh` for full arguments and environment variables.
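For instance, scripts in this family commonly also honor `GPUS_PER_NODE` and `CPUS_PER_TASK` (an assumption based on the usual OpenMMLab Slurm launch scripts; verify against `tools/slurm_test.sh` itself):

```shell
# GPUS_PER_NODE and CPUS_PER_TASK are assumed variables; confirm they exist in
# tools/slurm_test.sh before relying on them.
GPUS=16 GPUS_PER_NODE=8 CPUS_PER_TASK=4 ./tools/slurm_test.sh dev test configs/example_config.py work_dirs/example_exp/example_model_20200202.pth
```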
## Test with specific metrics

MMagic provides various evaluation metrics, e.g., MS-SSIM, SWD, IS, FID, Precision&Recall, PPL, Equivariance, TransFID, and TransIS.
We provide a unified evaluation script, `tools/test.py`, for all models.
If you want to evaluate your model with specific metrics, you can add them to your config file like this:
```python
# at the end of the configs/styleganv2/stylegan2_c2_ffhq_256_b4x8_800k.py
metrics = [
    dict(
        type='FrechetInceptionDistance',
        prefix='FID-Full-50k',
        fake_nums=50000,
        inception_style='StyleGAN',
        sample_model='ema'),
    dict(type='PrecisionAndRecall', fake_nums=50000, prefix='PR-50K'),
    dict(type='PerceptualPathLength', fake_nums=50000, prefix='ppl-w')
]
```
As above, `metrics` consists of multiple metric dictionaries. Each metric contains a `type` field indicating its category, and `fake_nums` denotes the number of images generated by the model. Some metrics output a dictionary of results; you can also set `prefix` to specify a prefix for the result keys.
If you set the prefix of FID as `FID-Full-50k`, an example of the output may be:

```
FID-Full-50k/fid: 3.6561 FID-Full-50k/mean: 0.4263 FID-Full-50k/cov: 3.2298
```
Then you can test the model with the command below:

```shell
bash tools/dist_test.sh ${CONFIG_FILE} ${CKPT_FILE} ${GPU_NUM}
```
If you are in a Slurm environment, switch to `tools/slurm_test.sh` instead:

```shell
sh tools/slurm_test.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${CKPT_FILE}
```
## Train a model in MMagic

MMagic supports multiple ways of training:

- Training with a single GPU
- Training with multiple GPUs on a single machine
- Training with multiple machines
- Training with Slurm
Specifically, all outputs (log files and checkpoints) will be saved to the working directory, which is specified by `work_dir` in the config file.
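For example, a config may pin the working directory explicitly (a one-line sketch; the path is illustrative):

```python
# Illustrative path; all logs and checkpoints for this run are saved here.
work_dir = './work_dirs/example_exp'
```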
To train a model with a single GPU, run:

```shell
CUDA_VISIBLE_DEVICES=0 python tools/train.py configs/example_config.py --work-dir work_dirs/example
```
To launch distributed training on multiple machines that can reach each other via IP, run the following commands:
On the first machine:

```shell
NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR tools/dist_train.sh $CONFIG $GPUS
```

On the second machine:

```shell
NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR tools/dist_train.sh $CONFIG $GPUS
```
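A concrete launch on two 8-GPU machines might look like this (the master address, port, config path, and GPU count are hypothetical placeholders):

```shell
# On the first machine (rank 0); 10.1.1.1 and port 29500 are made-up examples.
NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=10.1.1.1 tools/dist_train.sh configs/example_config.py 8

# On the second machine (rank 1), using the same master address and port.
NNODES=2 NODE_RANK=1 PORT=29500 MASTER_ADDR=10.1.1.1 tools/dist_train.sh configs/example_config.py 8
```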
To speed up network communication, high-speed network hardware, such as InfiniBand, is recommended. Please refer to the PyTorch docs for more information.
To train with multiple GPUs on a single machine, run:

```shell
./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]
```
If you run MMagic on a cluster managed with Slurm, you can use the script `slurm_train.sh`. (This script also supports single-machine training.)
```shell
[GPUS=${GPUS}] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR}
```
Here is an example of using 8 GPUs to train an inpainting model on the dev partition.
```shell
GPUS=8 ./tools/slurm_train.sh dev gl_places configs/inpainting/gl_places.py /nfs/xxxx/gl_places_256
```
You can check `slurm_train.sh` for full arguments and environment variables.
Optional arguments are:

- `--amp`: This argument is used for automatic mixed-precision (AMP) training.
- `--resume`: This argument is used to automatically resume from the latest checkpoint if training is aborted.
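For example, both flags can be combined with the single-GPU command from above (the config path and work dir are the same illustrative ones used earlier):

```shell
# Train with mixed precision and resume automatically from the latest
# checkpoint in the work dir if the previous run was interrupted.
python tools/train.py configs/example_config.py --work-dir work_dirs/example --amp --resume
```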
## Evaluation during training

Benefiting from `mmengine`'s `Runner`, we can evaluate a model during training in a simple way, as shown below.
```python
# define metrics
metrics = [
    dict(
        type='FrechetInceptionDistance',
        prefix='FID-Full-50k',
        fake_nums=50000,
        inception_style='StyleGAN')
]

# define dataloader
val_dataloader = dict(
    batch_size=128,
    num_workers=8,
    dataset=dict(
        type='BasicImageDataset',
        data_root='data/celeba-cropped/',
        pipeline=[
            dict(type='LoadImageFromFile', key='img'),
            dict(type='Resize', scale=(64, 64)),
            dict(type='PackInputs')
        ]),
    sampler=dict(type='DefaultSampler', shuffle=False),
    persistent_workers=True)

# define val interval
train_cfg = dict(by_epoch=False, val_begin=1, val_interval=10000)

# define val loop and evaluator
val_cfg = dict(type='MultiValLoop')
val_evaluator = dict(type='Evaluator', metrics=metrics)
```
You can set `val_begin` and `val_interval` to adjust when validation begins and how often it runs. Since `by_epoch=False` above, both values are counted in iterations.
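For example, to start validating later in training and validate more frequently (the values below are illustrative, not recommendations):

```python
# Begin validation after 50,000 iterations, then validate every 5,000 iterations.
train_cfg = dict(by_epoch=False, val_begin=50000, val_interval=5000)
```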
For details of the metrics, refer to the metrics guide.