This document reports the results of OneFlow BERT Pretrain benchmark tests run on Aug 9, 2020.
All tests were performed on 4 GPU servers, each with 8x Tesla V100-SXM2-16GB. The main hardware and software configuration of each server is:
- Tesla V100-SXM2-16GB x 8
- InfiniBand 100 Gb/sec (4X EDR), Mellanox Technologies MT27700 Family
- 48 CPU(s), Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
- Memory: 384 GB
- Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-116-generic x86_64)
- CUDA Version: 10.2, Driver Version: 440.33.01
- OneFlow: v0.1.8, fix_infer_out_logical_blob_desc@17a2bdc9b
- OneFlow-Benchmark: master@892f87e6
nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 CPU Affinity
GPU0 X NV1 NV1 NV2 NV2 SYS SYS SYS NODE 0-11,24-35
GPU1 NV1 X NV2 NV1 SYS NV2 SYS SYS NODE 0-11,24-35
GPU2 NV1 NV2 X NV2 SYS SYS NV1 SYS PIX 0-11,24-35
GPU3 NV2 NV1 NV2 X SYS SYS SYS NV1 PIX 0-11,24-35
GPU4 NV2 SYS SYS SYS X NV1 NV1 NV2 SYS 12-23,36-47
GPU5 SYS NV2 SYS SYS NV1 X NV2 NV1 SYS 12-23,36-47
GPU6 SYS SYS NV1 SYS NV1 NV2 X NV2 SYS 12-23,36-47
GPU7 SYS SYS SYS NV1 NV2 NV1 NV2 X SYS 12-23,36-47
mlx5_0 NODE NODE PIX PIX SYS SYS SYS SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Four groups of tests were performed with different batch sizes per device: 32, 64, and 96 for BERT base, and 4 for BERT large.
Each group includes 6 tests with different numbers of devices: 1, 2, 4, 8, 16, 32.
Throughput (samples/sec) and GPU memory usage were logged and recorded.
The data type of all tests is float32, and XLA is not applied.
Please clone or download the `BERT` folder from the OneFlow-Benchmark repository.
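For reference, fetching it might look like the following; the `LanguageModeling/BERT` path is an assumption about the repository layout and may need adjusting:

```bash
# clone the benchmark repository and copy the BERT folder next to the run scripts
# (the LanguageModeling/BERT path is assumed; adjust it to the actual repository layout)
git clone https://github.com/Oneflow-Inc/OneFlow-Benchmark.git
cp -r OneFlow-Benchmark/LanguageModeling/BERT ./BERT
```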
We created two bash scripts alongside the `BERT` folder for this test:

`local_run.sh` - launches OneFlow training on the local node with the specified number of nodes and GPUs per node:
#!/bin/bash
# local_run.sh
NUM_NODES=$1          # total number of training nodes
GPU_NUM_PER_NODE=$2   # number of GPUs used on each node
BENCH_ROOT_DIR=BERT
DATA_ROOT=/path/to/ofrecord   # path to the OFRecord dataset

# start from a clean log directory
rm -rf ./log
mkdir ./log

# batch size per device: 32, 64 or 96 for BERT base
#BSZ_PER_DEVICE=32
#BSZ_PER_DEVICE=64
BSZ_PER_DEVICE=96
python3 ./$BENCH_ROOT_DIR/run_pretraining.py \
--gpu_num_per_node=$GPU_NUM_PER_NODE \
--num_nodes=$NUM_NODES \
--node_ips='10.11.0.2','10.11.0.3','10.11.0.4','10.11.0.5' \
--learning_rate=1e-4 \
--batch_size_per_device=$BSZ_PER_DEVICE \
--iter_num=200 \
--loss_print_every_n_iter=20 \
--seq_length=128 \
--max_predictions_per_seq=20 \
--num_hidden_layers=12 \
--num_attention_heads=12 \
--max_position_embeddings=512 \
--type_vocab_size=2 \
--vocab_size=30522 \
--attention_probs_dropout_prob=0.1 \
--hidden_dropout_prob=0.1 \
--hidden_size_per_head=64 \
--data_dir=$DATA_ROOT \
--data_part_num=32 \
--log_dir=./log \
--model_save_every_n_iter=10000 \
--save_last_snapshot=False \
--model_save_dir=./snapshots
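For a quick single-node check, `local_run.sh` can also be run directly on one of the servers (assuming that machine is the first entry in `--node_ips` and `DATA_ROOT` points at a valid OFRecord dataset):

```bash
# run locally on 1 node with 8 GPUs; batch size per device is set inside the script
./local_run.sh 1 8
```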
`launch_all.sh` - launches OneFlow on all remote nodes with the specified number of nodes and GPUs per node:
#!/bin/bash
# launch_all.sh
NUM_NODES=$1
GPU_NUM_PER_NODE=$2
LOCAL_RUN=local_run.sh
BENCH_ROOT_DIR=BERT
##############################################
#0 prepare the host list for training
#comment unused hosts with `#`
#or use first arg to limit the hosts number
declare -a host_list=("10.11.0.2" "10.11.0.3" "10.11.0.4" "10.11.0.5")
if [ -n "$1" ]
then
host_num=$1
else
host_num=${#host_list[@]}
fi
if [ ${host_num} -gt ${#host_list[@]} ]
then
host_num=${#host_list[@]}
fi
hosts=("${host_list[@]:0:${host_num}}")
echo "Working on hosts:${hosts[@]}"
##############################################
#1 prepare oneflow_temp folder on each host
for host in "${hosts[@]}"
do
ssh $USER@$host "mkdir -p ~/oneflow_temp"
done
##############################################
#2 copy files to each host and start work
for host in "${hosts[@]}"
do
echo "start training on ${host}"
ssh $USER@$host 'rm -rf ~/oneflow_temp/*'
scp -r ./$BENCH_ROOT_DIR ./$LOCAL_RUN $USER@$host:~/oneflow_temp
ssh $USER@$host "cd ~/oneflow_temp; nohup ./$LOCAL_RUN $NUM_NODES $GPU_NUM_PER_NODE 1>oneflow.log 2>&1 </dev/null &"
done
Note: Please make sure all servers can log in to each other automatically with SSH keys (passwordless SSH).
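One way to set this up, assuming the same `$USER` account exists on every server:

```bash
# generate a key pair on the launching node (skip if ~/.ssh/id_rsa already exists)
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa

# install the public key on every host in host_list so ssh/scp work without a password
for host in 10.11.0.2 10.11.0.3 10.11.0.4 10.11.0.5; do
    ssh-copy-id $USER@$host
done
```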
# test on 1 node with 4 gpus
./launch_all.sh 1 4
# test on 4 nodes with 8 gpus per node
./launch_all.sh 4 8
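While a test is running, progress can be monitored from the launching machine, for example by tailing the redirected stdout on the first node (10.11.0.2 in the scripts above):

```bash
# follow the training log on the first node
ssh $USER@10.11.0.2 "tail -f ~/oneflow_temp/oneflow.log"
```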
Throughput (samples/s) as well as loss information can be found in the `oneflow_temp` folder in the first node's home directory. There are two files:

- `oneflow.log` - redirected stdout
- `log/summary.csv` - the same information in CSV format

Take `oneflow.log` for instance; here is an example:
step: 19, total_loss: 11.078, mlm_loss: 10.407, nsp_loss: 0.671, throughput: 52.257
step: 39, total_loss: 10.884, mlm_loss: 10.190, nsp_loss: 0.694, throughput: 142.735
step: 59, total_loss: 10.592, mlm_loss: 9.915, nsp_loss: 0.677, throughput: 142.636
step: 79, total_loss: 10.335, mlm_loss: 9.659, nsp_loss: 0.676, throughput: 142.391
step: 99, total_loss: 10.157, mlm_loss: 9.479, nsp_loss: 0.678, throughput: 142.565
step: 119, total_loss: 10.046, mlm_loss: 9.361, nsp_loss: 0.686, throughput: 142.397
step: 139, total_loss: 9.915, mlm_loss: 9.237, nsp_loss: 0.678, throughput: 142.298
step: 159, total_loss: 9.851, mlm_loss: 9.168, nsp_loss: 0.683, throughput: 142.383
step: 179, total_loss: 9.784, mlm_loss: 9.104, nsp_loss: 0.680, throughput: 142.270
step: 199, total_loss: 9.640, mlm_loss: 8.960, nsp_loss: 0.680, throughput: 142.579
Normally, the first throughput value (e.g. 52.257) is discarded because the start time of the first batch is not accurate. We average the remaining throughput values and report that as the throughput of the test.
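This averaging can be done directly from the log; a minimal sketch, assuming the log line format shown above:

```bash
# average the throughput values in oneflow.log, skipping the first (warm-up) value
grep -o "throughput: [0-9.]*" oneflow.log | awk 'NR > 1 { sum += $2; n++ } END { print sum / n }'
```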
All test logs can be found here.
BERT Base Pretrain, batch size per device=32, dtype=float32, without XLA
node num | gpu num/node | gpu num | bsz/gpu | GPU Memory Usage (MiB) | Throughput (samples/s) | Speedup |
---|---|---|---|---|---|---|
1 | 1 | 1 | 32 | 6207 | 140.034 | 1 |
1 | 2 | 2 | 32 | 7081 | 254.304 | 1.82 |
1 | 4 | 4 | 32 | 7255 | 506.989 | 3.62 |
1 | 8 | 8 | 32 | 7323 | 1010.446 | 7.22 |
2 | 8 | 16 | 32 | 7145 | 1571.088 | 11.22 |
4 | 8 | 32 | 32 | 7185 | 3136.797 | 22.40 |
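Here, Speedup is the throughput relative to the single-GPU run in the same group; e.g. for 32 GPUs at batch size 32 per device: 3136.797 / 140.034 ≈ 22.40. The same definition applies to the tables below.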
BERT Base Pretrain, batch size per device=64, dtype=float32, without XLA
node num | gpu num/node | gpu num | bsz/gpu | GPU Memory Usage (MiB) | Throughput (samples/s) | Speedup |
---|---|---|---|---|---|---|
1 | 1 | 1 | 64 | 9989 | 145.148 | 1 |
1 | 2 | 2 | 64 | 10947 | 277.880 | 1.91 |
1 | 4 | 4 | 64 | 10955 | 552.843 | 3.81 |
1 | 8 | 8 | 64 | 11029 | 1103.102 | 7.60 |
2 | 8 | 16 | 64 | 10957 | 2023.743 | 13.94 |
4 | 8 | 32 | 64 | 10981 | 3947.739 | 27.20 |
BERT Base Pretrain, batch size per device=96, dtype=float32, without XLA
node num | gpu num/node | gpu num | bsz/gpu | GPU Memory Usage (MiB) | Throughput (samples/s) | Speedup |
---|---|---|---|---|---|---|
1 | 1 | 1 | 96 | 13771 | 145.095 | 1 |
1 | 2 | 2 | 96 | 14757 | 282.984 | 1.95 |
1 | 4 | 4 | 96 | 14851 | 559.011 | 3.85 |
1 | 8 | 8 | 96 | 14815 | 1121.632 | 7.73 |
2 | 8 | 16 | 96 | 14815 | 2132.490 | 14.70 |
4 | 8 | 32 | 96 | 14687 | 4140.439 | 28.54 |
BERT large was tested in the same environment. Some arguments in `local_run.sh` need to be modified to match the BERT large pretraining configuration:
#!/bin/bash
# local_run.sh for BERT large
NUM_NODES=$1
GPU_NUM_PER_NODE=$2
BENCH_ROOT_DIR=BERT
DATA_ROOT=/path/to/ofrecord

rm -rf ./log
mkdir ./log

BSZ_PER_DEVICE=4
python3 ./$BENCH_ROOT_DIR/run_pretraining.py \
--gpu_num_per_node=$GPU_NUM_PER_NODE \
--num_nodes=$NUM_NODES \
--node_ips='10.11.0.2','10.11.0.3','10.11.0.4','10.11.0.5' \
--learning_rate=1e-4 \
--batch_size_per_device=$BSZ_PER_DEVICE \
--iter_num=200 \
--loss_print_every_n_iter=20 \
--seq_length=512 \
--max_predictions_per_seq=80 \
--num_hidden_layers=24 \
--num_attention_heads=16 \
--max_position_embeddings=512 \
--type_vocab_size=2 \
--vocab_size=30522 \
--attention_probs_dropout_prob=0.1 \
--hidden_dropout_prob=0.1 \
--hidden_size_per_head=64 \
--data_dir=$DATA_ROOT \
--data_part_num=32 \
--log_dir=./log \
--model_save_every_n_iter=10000 \
--save_last_snapshot=False \
--model_save_dir=./snapshots
Here are the results: BERT Large Pretrain, batch size per device=4, dtype=float32, without XLA
node num | gpu num/node | gpu num | bsz/gpu | GPU Memory Usage (MiB) | Throughput (samples/s) | Speedup |
---|---|---|---|---|---|---|
1 | 1 | 1 | 4 | 12087 | 8.839 | 1 |
1 | 2 | 2 | 4 | 14593 | 16.405 | 1.86 |
1 | 4 | 4 | 4 | 14713 | 33.158 | 3.75 |
1 | 8 | 8 | 4 | 14765 | 64.519 | 7.30 |
2 | 8 | 16 | 4 | 14661 | 74.224 | 8.40 |
4 | 8 | 32 | 4 | 14673 | 143.232 | 16.21 |
1 | 1 | 1 | 6 | 15779 | 9.180 | 1.04 |

Note: the last row is an additional single-GPU run with batch size 6 per device, included for reference; its speedup is relative to the single-GPU batch-size-4 run (9.180 / 8.839 ≈ 1.04).