
Some benchmarking results #2

Open
markmp opened this issue Sep 13, 2018 · 1 comment

markmp commented Sep 13, 2018

For anyone who wants to sanity check, these are the results I was able to achieve with the TensorFlow example in this repo. All of these runs used the DL AMI (Version 13, Ubuntu/Conda) on p3 instances, with the default params (90 epochs, batch size 256, etc.).

1x8 (8 GPUs on a single p3.16xlarge node): training took 6 hrs 12 minutes, 75.421% (top-1), 92.660% (top-5)

2x8 (16 GPUs across 2 p3.16xlarge nodes): training took 3 hrs 16 minutes, 74.868% (top-1), 92.398% (top-5)

8x8 (64 GPUs across 8 p3.16xlarge nodes): training took 54 minutes, 75.404% (top-1), 92.65% (top-5)

And this was the training throughput I was able to achieve. Each machine had its own copy of the data on a 256 GB gp2 EBS volume (I did not use the BeeGFS filesystem here; one test used a ramdisk, which didn't make much of a difference).

1x1 GPU: 740 img/sec
1x2 GPU: 1481.3 img/sec
1x8 GPU: 5000 img/sec
1x8 GPU (ramdisk): 5100-5200 img/sec (ramdisk seems 1-2% faster, but barely noticeable)
2x8 GPU (16 GPUs total): 9860 img/sec (83% efficiency)
4x8 GPU (32 GPUs total): 21300 img/sec (with the bind_to slot option, which AWS said might be faster)
8x8 GPU (64 GPUs total): ~37000 img/sec (with the bind_to slot option; 79% efficient)

These efficiencies were pretty good, but the 64-GPU test definitely did not get the 90% reported in the blog post; I got closer to 79%. Curious if there are any other params/settings that you think would help.
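
In case it helps with sanity checking, the 83% / 79% figures above work out to measured throughput divided by (number of GPUs × the 740 img/sec single-GPU baseline); a quick shell sketch, assuming that definition (the multi-node throughputs are approximate):

# Assumed definition: efficiency = multi-GPU throughput / (num GPUs * 740 img/sec single-GPU baseline)
echo "2x8: $(echo "scale=1; 100 * 9860 / (16 * 740)" | bc) %"   # prints 83.2 %, i.e. the 83% above
echo "8x8: $(echo "scale=1; 100 * 37000 / (64 * 740)" | bc) %"  # prints 78.1 %, roughly the ~79% above (~37000 is approximate)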

My MPI run command was as follows:
mpirun -np 64 --hostfile hosts -mca plm_rsh_no_tree_spawn 1 \
    --bind-to socket --map-by slot \
    -x NCCL_DEBUG=INFO -x NCCL_MIN_NRINGS=4 -x LD_LIBRARY_PATH -x PATH -x NCCL_SOCKET_IFNAME=ens3 \
    -mca orte_base_help_aggregate 0 -mca pml ob1 -mca btl_tcp_if_exclude lo,docker0 -mca btl ^openib \
    /home/ubuntu/examples/horovod/cnn/wrapenv.sh python aws_tf_hvd_cnn.py \
    --batch_size=256 --num_epochs=90 --fp16 --data_dir /home/ubuntu/imagenet/train-resized \
    --model resnet50 --log_dir results_2x8gpu_test1 --display_every 100 --save_interval=3600

rahul003 (Collaborator) commented Oct 16, 2018

We found that using CUDA 9.2 is what gave the results a significant boost. Since the official TF binaries do not use CUDA 9.2, we are keeping the DLAMI at CUDA 9.0 until the official binaries are updated.

For now, to get the latest speeds on the DLAMI, we encourage you to use these additional flags to take advantage of some improvements we made to Horovod: -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 -x HOROVOD_FUSION_THRESHOLD=16777216
The first flag uses NCCL for the reduction within a node and MPI for the reduction across nodes, instead of NCCL everywhere. We recommend using HOROVOD_HIERARCHICAL_ALLREDUCE for all models. It requires Horovod >= 0.14.1, which is available in the newer versions of the DLAMI.
The second flag sets the fusion threshold to 16 MB for ResNet50; this might need to be tuned for a given model.
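
For example, the only change relative to the mpirun command above would be adding those two -x exports; a sketch (the elided flags stay as in your original command):

mpirun -np 64 --hostfile hosts ... \
    -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 \
    -x HOROVOD_FUSION_THRESHOLD=16777216 \
    ... \
    /home/ubuntu/examples/horovod/cnn/wrapenv.sh python aws_tf_hvd_cnn.py --batch_size=256 --num_epochs=90 --fp16 ...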

piyushghai added a commit to piyushghai/deep-learning-models that referenced this issue Dec 23, 2020