This repository applies different optimization techniques to two models: ResNeXt and SimCLS. The techniques are data loading (varying the number of data-loader workers), mixed precision, and checkpointing intermediate buffers.
The repository contains two folders: `resnext` contains all files related to the ResNeXt model, and `SimCLS` contains the files for SimCLS.
The ResNeXt model code in `models` is modified from kuangliu's implementation. Experiments are performed on an A100 GPU on Google Cloud Platform. Each model is trained for 200 epochs using Adam with learning rate 0.01 and batch size 128 unless otherwise specified. The dataset used is CIFAR-10.
NVIDIA Nsight Systems is used to profile the training process.

```
nsys profile -t cuda,nvtx --cuda-memory-usage true
```

is used to trace CUDA, NVTX, and memory usage.
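The `nvtx` trace picks up custom NVTX ranges as named regions on the Nsight Systems timeline. A minimal sketch of how a training loop can be annotated (the range names are illustrative, and `model`, `criterion`, `optimizer`, and `train_loader` are assumed to be defined as in resnext.py):

```python
import torch

# Mark phases of each iteration so `nsys profile -t cuda,nvtx`
# shows them as named regions on the timeline.
for inputs, targets in train_loader:
    torch.cuda.nvtx.range_push("data_to_gpu")
    inputs, targets = inputs.cuda(), targets.cuda()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("forward")
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    optimizer.zero_grad()
    loss.backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer_step")
    optimizer.step()
    torch.cuda.nvtx.range_pop()
```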
The commands used to run the training for ResNeXt can be found here. Command to profile the baseline model:

```
nsys profile -o profile/resnextBaselineProfile --cuda-memory-usage true -t cuda,nvtx --force-overwrite true python3 resnext.py
```
resnext.py is the main file used in the experiments. By default, running it without any additional arguments trains the baseline model (i.e. 2 workers, single precision, no checkpointing). After training finishes, it stores 2 files: one is the state dict of the trained model, and the other records the loss, training accuracy, and test accuracy per epoch.
There are several command-line arguments that can be specified when training; an example invocation follows the list.
- `--num_worker`: number of workers for the data loader, default 2
- `--half_precision`: whether to use half precision
- `--num_epoch`: number of training epochs, default 200
- `--batch_size`: training batch size, default 128
- `--lr`: learning rate, default 0.01
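For example, a hypothetical run with 4 workers and half precision (assuming `--half_precision` is a boolean flag) would be:

```
python3 resnext.py --num_worker 4 --half_precision --num_epoch 200 --batch_size 128 --lr 0.01
```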
resnext_checkpoint.py is the checkpointed version of the same code. It has the same command-line arguments as resnext.py.
The environment uses Python 3. To set it up:

```
conda create --name env --file spec-file.txt
pip3 install -r requirements.txt
```
- `compare_mt` -> https://github.com/neulab/compare-mt
- `main.py` -> training the scorer model
- `model.py` -> models
- `data_utils.py` -> dataloader
- `utils.py` -> utility functions
- `preprocess.py` -> data preprocessing
- `gen_candidate.py` -> generate candidate summaries for training
- `finetune_model.py` -> finetune your own generative model
- `evaluate_model.py` -> evaluate the model with a trained scorer
The following directories should be created for our experiments:

- `./cache` -> storing model checkpoints
Note that the dataset in this repo, clean_covid.csv, is just a sample containing 10,000 records; if you want to access the full data, please refer to the following link.
To generate candidates, please run:

```
python gen_candidate.py --generator_name {args.generator_name} --dataset_name {args.dataset_name} --dataset_percent {args.dataset_percent} --num_cands {args.num_cands}
```
- `generator_name`: the path to the previously finetuned generator. In our case we use a T5-small model finetuned on the CORD dataset.
- `dataset_name`: the path to the dataset (must be a csv file; the column name for the source document should be `abstract`, and the column name for the reference summary should be `title`).
- `dataset_percent`: percent of the data used for generation; for testing you can use a small percent of the dataset to debug. Defaults to 100.
- `num_cands`: number of candidates you want to generate.
Generated candidates are stored in the folder 'candidates/{args.generator_name}_{args.num_cands}'.
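Under the hood, candidate generation samples several summaries per source document from the finetuned generator. A minimal sketch using HuggingFace transformers with diverse beam search (the checkpoint path and decoding settings are assumptions; gen_candidate.py may decode differently):

```python
import pandas as pd
from transformers import T5ForConditionalGeneration, T5Tokenizer

num_cands = 8
tokenizer = T5Tokenizer.from_pretrained("t5-small")
# Hypothetical path to the T5-small generator finetuned on CORD.
model = T5ForConditionalGeneration.from_pretrained("finetuned_t5_small").cuda()

df = pd.read_csv("clean_covid.csv")
for abstract in df["abstract"]:
    inputs = tokenizer(abstract, truncation=True, max_length=512,
                       return_tensors="pt").to("cuda")
    # Diverse beam search returns num_cands distinct candidate summaries.
    outs = model.generate(
        **inputs,
        num_beams=num_cands,
        num_beam_groups=num_cands,
        diversity_penalty=1.0,
        num_return_sequences=num_cands,
        max_length=64,
    )
    candidates = tokenizer.batch_decode(outs, skip_special_tokens=True)
```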
For data preprocessing, please run:

```
python preprocess.py --src_dir [path of the raw data] --tgt_dir [output path] --split [train/val/test] --cand_num [number of candidate summaries]
```

`src_dir` is the candidate folder: 'candidates/{args.generator_name}_{args.num_cands}'. The preprocessing procedure will store the processed data as separate json files in `tgt_dir`.
You may specify the hyper-parameters in main.py.
To train the scorer:

```
python main.py --cuda --gpuid [list of gpuid] -l
```

To start from an existing checkpoint:

```
python main.py --cuda --gpuid [list of gpuid] -l --model_pt [model path]
```

The model path should be a subdirectory in the `./cache` directory, e.g. cnndm/model.pt (it shouldn't contain the prefix `./cache/`).
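For example, to continue from the checkpoint above (GPU id 0 is an arbitrary choice):

```
python main.py --cuda --gpuid 0 -l --model_pt cnndm/model.pt
```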
To evaluate the model with a trained scorer, run:

```
python evaluate_model.py --generator_name {args.generator_name} --dataset_name {args.dataset_name} --scorer_path cache/22-12-17-0/scorer.bin --dataset_percent 10
```
| | Baseline (2 workers) | 0 workers | 1 worker | 4 workers | 8 workers |
|---|---|---|---|---|---|
| Load time | 1.6 ms | 63 ms | 1.9 ms | 1.6 ms | 1.6 ms |
| Step time | 104 ms | 104 ms | 105 ms | 104 ms | 104 ms |
| Epoch time | 44.5 s | 67.7 s | 44.6 s | 44.7 s | 44.9 s |
With 1 worker, loading is about 30 times faster than with 0 workers. There is only a small improvement when increasing from 1 to 2 workers, and no improvement beyond 2 workers.
The loss and accuracy are about the same across settings, as data loading should only affect training time. Only the experiment with four workers performs slightly better, which might be due to a lucky random seed.
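The worker count maps directly to PyTorch's `DataLoader`. A minimal sketch of the baseline CIFAR-10 loader (the transform is simplified; batch size 128 and 2 workers match the baseline):

```python
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True,
    transform=transforms.ToTensor(),
)
# With num_workers > 0, batches are prepared in background processes,
# so loading overlaps with the GPU compute of the previous step.
train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)
```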
| | Baseline | Half precision |
|---|---|---|
| Step time | 104 ms | 78 ms |
| Epoch time | 44.5 s | 33.7 s |
| Overall time | 8934 s | 6744 s |
Training with half precision is much faster, saving about a quarter of the overall time. With loss scaling, half precision also performs slightly better than the baseline.
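The half-precision path corresponds to automatic mixed precision with loss scaling in PyTorch; a minimal sketch, assuming the same `model`, `criterion`, `optimizer`, and `train_loader` as before:

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid fp16 underflow

for inputs, targets in train_loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():  # run the forward pass in fp16 where safe
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    scaler.scale(loss).backward()    # backward on the scaled loss
    scaler.step(optimizer)           # unscales gradients, then steps
    scaler.update()                  # adjusts the scale factor for the next step
```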
| | Baseline | Checkpoint | Baseline batch 512 | Checkpoint batch 896 |
|---|---|---|---|---|
| Optimizer time | 28 ms | 34 ms | 200 ms | 560 ms |
| Step time | 104 ms | 152 ms | 380 ms | 948 ms |
| Epoch time | 44.5 s | 63.5 s | 40 s | 56.5 s |
| Overall time | 8934 s | 12703 s | 8097 s | 11302 s |
| Memory | 7.82 GB | 5.11 GB | 33.79 GB | 33.93 GB |
Looking at the results, checkpointing adds considerable computation overhead: training time is about 1.5 times the baseline, while memory use is about ⅔ of it. The saved memory allows training with a larger batch size. Scaling memory consumption up to about 34 GB, the baseline can run with batch size 512 and the checkpointed version with 896. In both cases the larger batch runs faster per epoch than the smaller one, but overall checkpointing still runs slower than the baseline.
With the same batch size, checkpointing does not affect loss or accuracy. Running with a larger batch improves performance significantly. This might be because the original learning rate is too high: increasing the batch size has a similar effect to decreasing the learning rate, which results in better performance.
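Checkpointing trades compute for memory: intermediate activations are freed during the forward pass and recomputed during backward. A minimal sketch with `torch.utils.checkpoint` (how resnext_checkpoint.py actually segments the network is an assumption here):

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedNet(torch.nn.Module):
    """Wraps a stack of blocks; activations inside each checkpointed
    stage are discarded after forward and recomputed in backward."""

    def __init__(self, stages, classifier):
        super().__init__()
        self.stages = torch.nn.ModuleList(stages)
        self.classifier = classifier

    def forward(self, x):
        for stage in self.stages:
            # use_reentrant=False is the recommended modern variant
            x = checkpoint(stage, x, use_reentrant=False)
        return self.classifier(x.flatten(1))
```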
Changing the number of workers didn't change the training loss, but adding checkpointing or switching to half precision decreased it. This is largely because of the random state and because the model is underfitting.
| | Baseline | 0 workers | 1 worker | 4 workers | 8 workers | Half precision | Checkpointing |
|---|---|---|---|---|---|---|---|
| Data loading time | 2.23 | 2.79 | 2.34 | 2.27 | 2.25 | 2.41 | 2.34 |
| Optimizer time (ms) | 33 | 33 | 33 | 33 | 33 | 25 | 42 |
| Training loss (ranking loss) | 0.28 | 0.28 | 0.28 | 0.28 | 0.28 | 0.2 | 0.2 |
| Total training time (s) | 6748.42 | 6848.11 | 6736.93 | 6724.27 | 6728.31 | 5067.59 | 9619.41 |
| | Before SimCLS | 0 workers | 1 worker | 2 workers | 4 workers | 8 workers | Half precision | Checkpointing |
|---|---|---|---|---|---|---|---|---|
| ROUGE-1 | 0.4267 | 0.4245 | 0.4245 | 0.4245 | 0.4245 | 0.4245 | 0.4283 | 0.4281 |
| ROUGE-2 | 0.2223 | 0.2095 | 0.2095 | 0.2095 | 0.2095 | 0.2095 | 0.2093 | 0.2126 |
| ROUGE-L | 0.3659 | 0.3573 | 0.3573 | 0.3573 | 0.3573 | 0.3573 | 0.3603 | 0.3608 |