This is a tutorial on how to train deep learning models on the CIFAR10 dataset on the Cori-GPU platform using PyTorch.
First, request one or more GPUs with the following commands. See this page for further details.
module load cgpu
salloc -C gpu -N 1 -t 60 -c 10 -G 1 -A m3691
Then run the following commands to kick off training.
module load pytorch/v1.5.0-gpu
srun python main.py
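For reference, here is a minimal sketch of the kind of CIFAR10 training loop a script like main.py runs. It is an illustration under assumptions: the ResNet18 model choice, batch size, and optimizer settings below are placeholders, not the repository's exact configuration.

```python
# Minimal CIFAR10 training sketch (illustrative; not the repo's exact main.py).
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

device = 'cuda' if torch.cuda.is_available() else 'cpu'

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128,
                                          shuffle=True, num_workers=2)

# Any model adapted to 10 classes works here; ResNet18 is an assumption.
model = torchvision.models.resnet18(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)

for epoch in range(2):  # a couple of epochs just to smoke-test the setup
    model.train()
    running_loss = 0.0
    for inputs, targets in trainloader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f'epoch {epoch}: loss {running_loss / len(trainloader):.3f}')
```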
Run the following command to submit a batch job.
sbatch train_cgpu.sh
The dashboard on my.nersc.gov sometimes fails to display jobs running on the GPU cluster correctly, so a more reliable way to view job status is to run `jobstats` in the terminal. When the job starts running, its status changes from `PENDING` to `RUNNING`.
In batch mode, the output is redirected to `<job_id>.out` in your working directory by default.
Run the following command to train continuously on NERSC:

python -u train_nersc.py --name cifar --interval 60 > cifar.log &

`--interval` is the number of minutes between two status checks for re-launching the job; `-u` forces unbuffered output so the log file is updated immediately.
To quickly test that the script works, set the time limit in train_cgpu.sh to 3 minutes and run:

python train_nersc.py --interval 1

You can build your own script based on this one; a minimal sketch of such a re-launch wrapper follows.
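This sketch is a rough illustration of how such a wrapper could work, not the actual contents of train_nersc.py; the job-name matching and `squeue` parsing below are assumptions.

```python
# Hypothetical sketch of a relaunch wrapper (not the actual train_nersc.py):
# periodically check whether a Slurm job with the given name is still alive,
# and resubmit the batch script when it is gone.
import argparse
import getpass
import subprocess
import time

parser = argparse.ArgumentParser()
parser.add_argument('--name', default='cifar', help='Slurm job name to watch')
parser.add_argument('--interval', type=int, default=60,
                    help='minutes between two status checks')
args = parser.parse_args()

def job_alive(name):
    # squeue prints one line per matching job; empty output means no job.
    out = subprocess.check_output(
        ['squeue', '-u', getpass.getuser(), '--name', name, '--noheader'])
    return bool(out.strip())

while True:
    if not job_alive(args.name):
        print('job not found, resubmitting train_cgpu.sh', flush=True)
        subprocess.run(['sbatch', '--job-name', args.name, 'train_cgpu.sh'])
    time.sleep(args.interval * 60)
```

A wrapper like this only makes sense if the training script checkpoints and resumes automatically, as the `--resume` logic described further below does.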
Prerequisites:

- Python 3.6+
- PyTorch 1.0+
Accuracy on CIFAR10:

| Model | Acc. |
| --- | --- |
| VGG16 | 92.64% |
| ResNet18 | 93.02% |
| ResNet50 | 93.62% |
| ResNet101 | 93.75% |
| RegNetX_200MF | 94.24% |
| RegNetY_400MF | 94.29% |
| MobileNetV2 | 94.43% |
| ResNeXt29(32x4d) | 94.73% |
| ResNeXt29(2x64d) | 94.82% |
| DenseNet121 | 95.04% |
| PreActResNet18 | 95.11% |
| DPN92 | 95.16% |
I manually change the `lr` during training (see the sketch after this list):

- `0.1` for epoch [0, 150)
- `0.01` for epoch [150, 250)
- `0.001` for epoch [250, 350)
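In plain PyTorch, a piecewise-constant schedule like this can be written as a small helper that rewrites the optimizer's learning rate at epoch boundaries. This is a generic sketch, not the exact code in main.py:

```python
import torch

# A dummy parameter and optimizer so the sketch is self-contained.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.SGD(params, lr=0.1, momentum=0.9)

def adjust_lr(optimizer, epoch):
    # Piecewise-constant schedule matching the milestones above.
    if epoch < 150:
        lr = 0.1
    elif epoch < 250:
        lr = 0.01
    else:
        lr = 0.001
    for group in optimizer.param_groups:
        group['lr'] = lr

for epoch in range(350):
    adjust_lr(optimizer, epoch)
    # ... run one epoch of training here ...
```

The built-in `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150, 250], gamma=0.1)` expresses the same schedule.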
Resume the training with `python main.py --resume --lr=0.01`.
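For reference, the save/restore round trip behind a `--resume` flag typically looks like the following; the checkpoint path and dictionary keys here are assumptions, not necessarily what main.py uses:

```python
import torch
import torchvision

model = torchvision.models.resnet18(num_classes=10)

# Save a checkpoint (normally done inside the training loop, e.g. when
# the validation accuracy improves).
torch.save({'model': model.state_dict(), 'epoch': 149}, 'ckpt.pth')

# Resume: restore the weights and continue from the next epoch with the
# learning rate passed on the command line.
state = torch.load('ckpt.pth')
model.load_state_dict(state['model'])
start_epoch = state['epoch'] + 1
```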