This is a simple script to quickly benchmark a multiple-GPU system on an NLP task.
Specifically, it fine-tunes an English BERT-Large language model on three GLUE tasks:
- QQP
- MNLI
- QNLI
Simply run ai_nlp_benchmark.sh
. It will automatically download the BERT model and the GLUE datasets.
Then it will sequentially fine-tune the model on each dataset and save the resulting models and their evaluation results to corresponding sub-directories
The code uses 4 GPUs by default, one can change it in the accelerate_config.yaml
file (num_processes
).
The default per-device batch size is 32, can be decreased (in the ai_nlp_benchmark.sh
script) if it is too large for the devices under evaluation.
Every task should take about 1-3 hours.
The evaluation results can be found in the corresponding sub-directories (for example, qnli_results/all_results.json
).
The scores should not be significantly different from the following:
- QQP: accuracy 0.91, F1 0.88
- MNLI: accuracy 0.87
- QNLI: accuracy 0.92