This is a tutorial on how to train deep learning models on the CIFAR10 dataset on the Cori-GPU platform using PyTorch.
First, request one or more GPUs with the following commands. See this page for further details.
module load cgpu
salloc -C gpu -N 1 -t 60 -c 10 -G 1 -A m3691
Then run the following commands to kick off training.
module load pytorch/v1.5.0-gpu
srun python main.py
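For reference, here is a minimal sketch of the kind of CIFAR10 training loop a script like main.py runs. It is an illustration under assumptions: the ResNet18 model choice, batch size, and optimizer settings below are placeholders, not the repository's exact configuration.

```python
# Minimal CIFAR10 training sketch (illustrative; not the repo's exact main.py).
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

device = 'cuda' if torch.cuda.is_available() else 'cpu'

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128,
                                          shuffle=True, num_workers=2)

# Any model adapted to 10 classes works here; ResNet18 is an assumption.
model = torchvision.models.resnet18(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)

for epoch in range(2):  # a couple of epochs just to smoke-test the setup
    model.train()
    running_loss = 0.0
    for inputs, targets in trainloader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f'epoch {epoch}: loss {running_loss / len(trainloader):.3f}')
```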
Run the following command to submit a batch job.
sbatch train_cgpu.sh
The dashboard on my.nersc.gov sometimes fails to display jobs running on the GPU cluster correctly, so a more reliable way to view job status is to run `jobstats` in the terminal. When the job starts running, its status changes from `PENDING` to `RUNNING`.
In batch mode, the output is redirected to `<job_id>.out` in your working directory by default.
Run the following command to train continuously on NERSC:

python -u train_nersc.py --name cifar --interval 60 > cifar.log &

`--interval` is the number of minutes between two status checks for re-launching the job; `-u` forces unbuffered output so the log file is updated immediately.
To quickly test that the script works, set the time limit in train_cgpu.sh to 3 minutes and run:

python train_nersc.py --interval 1

You can build your own script based on this one; a minimal sketch of such a re-launch wrapper follows.
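This sketch is a rough illustration of how such a wrapper could work, not the actual contents of train_nersc.py; the job-name matching and `squeue` parsing below are assumptions.

```python
# Hypothetical sketch of a relaunch wrapper (not the actual train_nersc.py):
# periodically check whether a Slurm job with the given name is still alive,
# and resubmit the batch script when it is gone.
import argparse
import getpass
import subprocess
import time

parser = argparse.ArgumentParser()
parser.add_argument('--name', default='cifar', help='Slurm job name to watch')
parser.add_argument('--interval', type=int, default=60,
                    help='minutes between two status checks')
args = parser.parse_args()

def job_alive(name):
    # squeue prints one line per matching job; empty output means no job.
    out = subprocess.check_output(
        ['squeue', '-u', getpass.getuser(), '--name', name, '--noheader'])
    return bool(out.strip())

while True:
    if not job_alive(args.name):
        print('job not found, resubmitting train_cgpu.sh', flush=True)
        subprocess.run(['sbatch', '--job-name', args.name, 'train_cgpu.sh'])
    time.sleep(args.interval * 60)
```

A wrapper like this only makes sense if the training script checkpoints and resumes automatically, as the `--resume` logic described further below does.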
Prerequisites:

- Python 3.6+
- PyTorch 1.0+
Accuracy on CIFAR10:

| Model | Acc. |
| --- | --- |
| VGG16 | 92.64% |
| ResNet18 | 93.02% |
| ResNet50 | 93.62% |
| ResNet101 | 93.75% |
| RegNetX_200MF | 94.24% |
| RegNetY_400MF | 94.29% |
| MobileNetV2 | 94.43% |
| ResNeXt29(32x4d) | 94.73% |
| ResNeXt29(2x64d) | 94.82% |
| DenseNet121 | 95.04% |
| PreActResNet18 | 95.11% |
| DPN92 | 95.16% |
I manually change the `lr` during training (see the sketch after this list):

- `0.1` for epoch [0, 150)
- `0.01` for epoch [150, 250)
- `0.001` for epoch [250, 350)
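In plain PyTorch, a piecewise-constant schedule like this can be written as a small helper that rewrites the optimizer's learning rate at epoch boundaries. This is a generic sketch, not the exact code in main.py:

```python
import torch

# A dummy parameter and optimizer so the sketch is self-contained.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.SGD(params, lr=0.1, momentum=0.9)

def adjust_lr(optimizer, epoch):
    # Piecewise-constant schedule matching the milestones above.
    if epoch < 150:
        lr = 0.1
    elif epoch < 250:
        lr = 0.01
    else:
        lr = 0.001
    for group in optimizer.param_groups:
        group['lr'] = lr

for epoch in range(350):
    adjust_lr(optimizer, epoch)
    # ... run one epoch of training here ...
```

The built-in `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150, 250], gamma=0.1)` expresses the same schedule.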
Resume the training with `python main.py --resume --lr=0.01`.
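For reference, the save/restore round trip behind a `--resume` flag typically looks like the following; the checkpoint path and dictionary keys here are assumptions, not necessarily what main.py uses:

```python
import torch
import torchvision

model = torchvision.models.resnet18(num_classes=10)

# Save a checkpoint (normally done inside the training loop, e.g. when
# the validation accuracy improves).
torch.save({'model': model.state_dict(), 'epoch': 149}, 'ckpt.pth')

# Resume: restore the weights and continue from the next epoch with the
# learning rate passed on the command line.
state = torch.load('ckpt.pth')
model.load_state_dict(state['model'])
start_epoch = state['epoch'] + 1
```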