  • Ubuntu 18.04 LTS
  • CUDA 10.2
  • cuDNN 7.6.5
  • NCCL 2.7.6
  • Python 3.6
  • g++ 7.5
  • torch==1.7.0
  • torchvision==0.8.1
  • mxnet-cu102==1.7.0 (we use this build)
  • gluoncv==0.8.0

We recommend using the Amazon Deep Learning AMI, which comes with most of the above libraries pre-installed.

Compile BytePS

Before compiling BytePS, specify the NCCL home directory. Otherwise, the PyTorch extension might fail to compile.

cd byteps
export BYTEPS_NCCL_HOME=/usr/local/cuda
python3 install --user

You can check the install by running

python3 -c "import byteps.mxnet" 
python3 -c "import byteps.torch"

If neither command prints an error message, the installation succeeded.
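The same check can be scripted so that a failed import reports its error instead of stopping at the first failure. This is a minimal sketch; the helper name `check_imports` is hypothetical and not part of BytePS.

```python
import importlib


def check_imports(module_names):
    """Try to import each module; map its name to None on success or an error string."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None  # None means the import succeeded
        except Exception as exc:  # a broken extension may raise more than ImportError
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


if __name__ == "__main__":
    for name, error in check_imports(["byteps.mxnet", "byteps.torch"]).items():
        print(f"{name}: {'ok' if error is None else error}")
```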


All experiments are conducted on Amazon EC2 P3.16xlarge instances, each equipped with eight 16 GB V100 GPUs and 25 Gbps Ethernet.

Please list the IP addresses of your hosts in the worker-hosts and server-hosts files. For distributed training, each file should contain multiple lines, one IP per line. Since PS workers and servers are co-located on each node, the two host files can be the same. If you want to launch twice as many servers, copy the IP list twice in server-hosts.
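As an illustration, the two host files could be generated with a short script. This is a sketch that assumes one IP per line; the helper names and the example addresses are placeholders, not part of BytePS.

```python
def host_file_lines(ips, copies=1):
    """Return the host-file lines: the whole IP list repeated `copies` times."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return list(ips) * copies


def write_host_file(path, ips, copies=1):
    """Write one IP per line to `path`."""
    with open(path, "w") as f:
        f.write("\n".join(host_file_lines(ips, copies)) + "\n")


if __name__ == "__main__":
    ips = ["10.0.0.1", "10.0.0.2"]  # placeholder addresses
    # Twice as many servers as workers: the IP list appears twice in server-hosts.
    print("\n".join(host_file_lines(ips, copies=2)))
```

Calling `write_host_file("worker-hosts", ips)` and `write_host_file("server-hosts", ips, copies=2)` would then produce the two files.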

You may also need a pem file so that you can ssh to the other nodes without a password.


We store the dataset in ImageRecord format. Please refer to GluonCV's documentation for more details.

We evaluate two representative CNN models: ResNet50_v2 and VGG-16.


The training script comes from GluonCV, with a small modification to support gradient compression.

To train, run

cd example/mxnet
./ baseline 0.2

For ResNet50, we train with 8 Amazon EC2 P3.16xlarge instances. For VGG16, we train with 4 Amazon EC2 P3.16xlarge instances.


The base LR is the learning rate for a single node. We follow the Linear Scaling Rule proposed in "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour". For example, if the base LR is 0.2, then with 8 Amazon EC2 P3.16xlarge instances the actual learning rate is 0.2 * 8 = 1.6. We change only the learning rate and use the default values for all other hyper-parameters.
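The arithmetic of the Linear Scaling Rule can be written as a one-line helper. This is only a sketch of the rule as described above; the function name is hypothetical and is not taken from the training script.

```python
def linear_scaled_lr(base_lr, num_nodes):
    """Scale the single-node learning rate linearly with the number of nodes."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return base_lr * num_nodes


if __name__ == "__main__":
    # Base LR 0.2 on 8 nodes, as in the example above.
    print(linear_scaled_lr(0.2, 8))  # 1.6
```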


We use a smaller learning rate for 1-bit, top-k and random-k. The total batch size is 4k.

Algorithm                       Base LR
NAG (FP32)                      0.2
NAG (FP16)                      0.2
Scaled 1-bit with EF            0.1
Top-k (k=0.1%) with EF          0.1
Random-k (k=1/32) with EF       0.1
Linear Dithering (5 bits)       0.2
Natural Dithering (3 bits)      0.2
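For readers unfamiliar with the "with EF" entries, the following is a generic sketch of top-k sparsification with error feedback in pure Python. It illustrates the general technique only and is not the BytePS implementation; the function name, the list-based representation, and the rounding of k are assumptions made for the example.

```python
def topk_with_error_feedback(grad, residual, ratio):
    """One compression step.

    Adds the residual carried over from earlier steps to the gradient, keeps the
    k largest-magnitude entries, and stores everything else as the new residual.
    Returns (sparse, new_residual), where sparse is a list of (index, value).
    """
    corrected = [g + r for g, r in zip(grad, residual)]
    k = max(1, int(len(corrected) * ratio))  # assumption: keep at least one entry
    by_magnitude = sorted(range(len(corrected)),
                          key=lambda i: abs(corrected[i]), reverse=True)
    kept = set(by_magnitude[:k])
    sparse = [(i, corrected[i]) for i in sorted(kept)]
    # Entries that were sent are cleared; the rest is fed back in the next step.
    new_residual = [0.0 if i in kept else v for i, v in enumerate(corrected)]
    return sparse, new_residual


if __name__ == "__main__":
    residual = [0.0] * 4
    sparse, residual = topk_with_error_feedback([0.5, -2.0, 0.1, 1.0], residual, 0.25)
    print(sparse, residual)
```

Because the dropped entries accumulate in the residual, a coordinate that is small in every single step can still be transmitted once its accumulated value becomes large enough.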


We use a larger learning rate for 1-bit, top-k and random-k. The total batch size is 1k.

Algorithm                       Base LR
NAG (FP32)                      0.01
NAG (FP16)                      0.01
Scaled 1-bit with EF            0.015
Top-k (k=0.1%) with EF          0.015
Random-k (k=1/32) with EF       0.015
Linear Dithering (5 bits)       0.01
Natural Dithering (3 bits)      0.01


The code comes from NVIDIA DeepLearningExamples, with a small modification to accommodate gradient compression.


We use mixed precision, via the apex package, to accelerate pretraining. To install apex, run

cd apex 
pip3 install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./ 

We provide a highly optimized version of the LANS optimizer written in CUDA.

The preprocessed dataset is partitioned into 1536 shards.

For BERT-base, we train with 4 Amazon EC2 P3.16xlarge instances.


Following LAMB, we use the Square Root Scaling Rule.
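In its usual form, the Square Root Scaling Rule multiplies a reference learning rate by the square root of the batch-size ratio. The sketch below shows that general formula only; the function name and the reference values in the example are hypothetical and are not the values used in our experiments.

```python
import math


def sqrt_scaled_lr(ref_lr, ref_batch_size, batch_size):
    """Scale a reference learning rate by sqrt(batch_size / ref_batch_size)."""
    if ref_batch_size <= 0 or batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return ref_lr * math.sqrt(batch_size / ref_batch_size)


if __name__ == "__main__":
    # Hypothetical reference point: quadrupling the batch size doubles the LR.
    print(sqrt_scaled_lr(0.001, 512, 2048))
```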

For 1-bit, we use a larger learning rate in phase 1 and a smaller learning rate in phase 2. For top-k, we use a smaller learning rate in both phases. We change only the learning rate and use the default values for all other hyper-parameters.

Total batch size is 2k.

Algorithm                           LR (phase 1)    LR (phase 2)
LANS                                0.00125         0.00125
CLAN (Scaled 1-bit with EF)         0.0014865       0.001051
CLAN (Top-k (k=0.1%) with EF)       0.001051        0.001051
CLAN (Linear Dithering (7 bits))    0.00125         0.00125



We evaluate on SQuAD v1.1. We do not change any hyper-parameters.

To finetune SQuAD, run



By default, we use a batch size of 32 and train for 3 epochs. The MRPC dataset is tiny, so we train on it for 5 epochs instead. In some cases the loss becomes NaN; when that happens, we finetune in FP32 instead of FP16.
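The FP16-to-FP32 fallback described above can be expressed as a small wrapper. This is a sketch under stated assumptions: `run_finetune` is a hypothetical callback that launches one finetuning run at the given precision and returns its final loss; it is not a function provided by the finetuning scripts.

```python
import math


def finetune_with_fallback(run_finetune):
    """Finetune in FP16 first; rerun in FP32 if the FP16 loss is NaN.

    `run_finetune(precision)` is a hypothetical callback returning the final loss.
    Returns (precision_used, loss).
    """
    loss = run_finetune("fp16")
    if math.isnan(loss):
        # FP16 diverged, so repeat the run in full precision.
        return "fp32", run_finetune("fp32")
    return "fp16", loss


if __name__ == "__main__":
    # Stand-in callback that simulates an FP16 run ending in NaN.
    fake = lambda precision: float("nan") if precision == "fp16" else 0.42
    print(finetune_with_fallback(fake))
```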

To finetune GLUE, run
