For anyone who wants to sanity check, these are the results I was able to achieve with the TensorFlow example in this repo. All of these were run using the DL AMI (Version 13, Ubuntu/Conda) on p3 instances, with the default params (90 epochs, 256 batch size, etc.).
1x8 (8 GPUs on a single p3.16xlarge node): training took 6 hrs 12 minutes, 75.421% (top-1), 92.660% (top-5)
2x8 (16 GPUs across 2 p3.16xlarge nodes): training took 3 hrs 16 minutes, 74.868% (top-1), 92.398% (top-5)
8x8 (64 GPUs total): training took 54 minutes, 75.404% (top-1), 92.65% (top-5)
And this was the training throughput I was able to achieve. Each machine had its own copy of the data on a 256 GB gp2 EBS volume (I did not use the BeeGFS filesystem here; one test did use a ramdisk, which did not make much of a difference):
1x1 GPU: 740 img/sec
1x2 GPU: 1481.3 img/sec
1x8 GPU: 5000 img/sec
1x8 GPU (ramdisk): 5100-5200 img/sec (1-2% faster, but barely noticeable)
2x8 GPU (16 GPUs total): 9860 img/sec (83% efficiency)
4x8 GPU (32 GPUs total): 21300 img/sec (with the bind_to slot option, which AWS said might be faster)
8x8 GPU (64 GPUs total): ~37000 img/sec (with the bind_to slot option; 79% efficiency)
These efficiencies were pretty good, but the 64 GPU test definitely did not get the 90% reported in the blog post; I got closer to 79%. Curious if there were any other params/settings that you think would help.
My MPI run command was as follows:
mpirun -np 64 --hostfile hosts -mca plm_rsh_no_tree_spawn 1 --bind-to socket --map-by slot -x NCCL_DEBUG=INFO -x NCCL_MIN_NRINGS=4 -x LD_LIBRARY_PATH -x PATH -x NCCL_SOCKET_IFNAME=ens3 -mca orte_base_help_aggregate 0 -mca pml ob1 -mca btl_tcp_if_exclude lo,docker0 -mca btl ^openib /home/ubuntu/examples/horovod/cnn/wrapenv.sh python aws_tf_hvd_cnn.py --batch_size=256 --num_epochs=90 --fp16 --data_dir /home/ubuntu/imagenet/train-resized --model resnet50 --log_dir results_2x8gpu_test1 --display_every 100 --save_interval=3600
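(For reference, assuming the efficiency figures above are simply measured throughput divided by ideal linear scaling from the 740 img/sec single-GPU baseline, they work out to 9860 / (16 × 740) ≈ 83% for 16 GPUs and ~37000 / (64 × 740) ≈ 78% for 64 GPUs, i.e. the ~79% quoted above, versus the 90% scaling efficiency reported in the blog post.)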
We found that using CUDA 9.2 is what gave the results a significant boost. Since the official TF binaries do not use CUDA 9.2, we are leaving the DLAMI on CUDA 9.0 until the official binaries get updated.
For now, to get the best speeds on the DLAMI, we encourage you to use these additional flags to make use of some improvements we made to Horovod: -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 -x HOROVOD_FUSION_THRESHOLD=16777216
The first flag uses NCCL for the reduction within a node and MPI for the reduction across nodes, instead of NCCL everywhere. We recommend using HOROVOD_HIERARCHICAL_ALLREDUCE for all models. It requires Horovod >= 0.14.1, which is available in the newer versions of the DLAMI.
The second flag sets the fusion threshold to 16 MB for ResNet-50. This might need to be tuned for a given model.
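For example, taking the mpirun command from the first post and adding the two flags would look like this (a sketch, not an officially tested command; everything else is unchanged from that post):

# same command as above, with the two Horovod environment variables exported via -x
mpirun -np 64 --hostfile hosts -mca plm_rsh_no_tree_spawn 1 --bind-to socket --map-by slot \
    -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 -x HOROVOD_FUSION_THRESHOLD=16777216 \
    -x NCCL_DEBUG=INFO -x NCCL_MIN_NRINGS=4 -x LD_LIBRARY_PATH -x PATH -x NCCL_SOCKET_IFNAME=ens3 \
    -mca orte_base_help_aggregate 0 -mca pml ob1 -mca btl_tcp_if_exclude lo,docker0 -mca btl ^openib \
    /home/ubuntu/examples/horovod/cnn/wrapenv.sh python aws_tf_hvd_cnn.py --batch_size=256 --num_epochs=90 --fp16 \
    --data_dir /home/ubuntu/imagenet/train-resized --model resnet50 --log_dir results_2x8gpu_test1 \
    --display_every 100 --save_interval=3600

Since mpirun does not automatically forward environment variables to remote ranks, the -x flag is what makes the Horovod settings visible to the processes on every node.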
piyushghai added a commit to piyushghai/deep-learning-models that referenced this issue on Dec 23, 2020.