
Some benchmarking results #2

Open
markmp opened this issue Sep 13, 2018 · 1 comment

markmp commented Sep 13, 2018

For anyone who wants to sanity check, these are the results I was able to achieve with the TensorFlow example in this repo. All of these runs used the DL AMI (Version 13, Ubuntu/Conda) on p3 instances, with the default params (90 epochs, batch size 256, etc.).

1x8 (8 GPUs on a single p3.16xlarge node): training took 6 hrs 12 minutes, 75.421% (top-1), 92.660% (top-5)

2x8 (16 GPUs across 2 p3.16xlarge nodes): training took 3 hrs 16 minutes, 74.868% (top-1), 92.398% (top-5)

8x8 (64 GPUs across 8 p3.16xlarge nodes): training took 54 minutes, 75.404% (top-1), 92.65% (top-5)

And this was the training throughput I was able to achieve. Each machine had its own copy of the data on a 256 GB gp2 EBS volume (I did not use the BeeGFS filesystem here; one test used a ramdisk, which didn't make much of a difference).

1x1 GPU: 740 img/sec
1x2 GPU: 1481.3 img/sec
1x8 GPU: 5000 img/sec
1x8 GPU (ramdisk): 5100-5200 img/sec (ramdisk seems 1-2% faster, but barely noticeable)
2x8 GPU (16 GPUs total): 9860 img/sec (83% efficiency)
4x8 GPU (32 GPUs total): 21300 img/sec (with the bind_to slot option, which AWS said might be faster)
8x8 GPU (64 GPUs total): ~37000 img/sec (with the bind_to slot option; 79% efficient)

These efficiencies were pretty good, but the 64-GPU test definitely did not get the 90% reported in the blog post; I got closer to 79%. Curious if there are any other params/settings that you think would help.
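
In case it helps with sanity checking, the 83% / 79% figures above work out to measured throughput divided by (number of GPUs × the 740 img/sec single-GPU baseline); a quick shell sketch, assuming that definition (the multi-node throughputs are approximate):

# Assumed definition: efficiency = multi-GPU throughput / (num GPUs * 740 img/sec single-GPU baseline)
echo "2x8: $(echo "scale=1; 100 * 9860 / (16 * 740)" | bc) %"   # prints 83.2 %, i.e. the 83% above
echo "8x8: $(echo "scale=1; 100 * 37000 / (64 * 740)" | bc) %"  # prints 78.1 %, roughly the ~79% above (~37000 is approximate)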

My MPI run command was as follows:
mpirun -np 64 --hostfile hosts -mca plm_rsh_no_tree_spawn 1 \
    --bind-to socket --map-by slot \
    -x NCCL_DEBUG=INFO -x NCCL_MIN_NRINGS=4 -x LD_LIBRARY_PATH -x PATH -x NCCL_SOCKET_IFNAME=ens3 \
    -mca orte_base_help_aggregate 0 -mca pml ob1 -mca btl_tcp_if_exclude lo,docker0 -mca btl ^openib \
    /home/ubuntu/examples/horovod/cnn/wrapenv.sh python aws_tf_hvd_cnn.py \
    --batch_size=256 --num_epochs=90 --fp16 --data_dir /home/ubuntu/imagenet/train-resized \
    --model resnet50 --log_dir results_2x8gpu_test1 --display_every 100 --save_interval=3600

rahul003 (Collaborator) commented Oct 16, 2018

We found that using CUDA 9.2 is what gave the results a significant boost. Since the official TF binaries do not use CUDA 9.2, we are keeping the DLAMI at CUDA 9.0 until the official binaries are updated.

For now, to get the latest speeds on the DLAMI, we encourage you to use these additional flags to take advantage of some improvements we made to Horovod: -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 -x HOROVOD_FUSION_THRESHOLD=16777216
The first flag uses NCCL for the reduction within a node and MPI for the reduction across nodes, instead of NCCL everywhere. We recommend using HOROVOD_HIERARCHICAL_ALLREDUCE for all models. It requires Horovod >= 0.14.1, which is available in the newer versions of the DLAMI.
The second flag sets the fusion threshold to 16 MB for ResNet50; this might need to be tuned for a given model.
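
For example, the only change relative to the mpirun command above would be adding those two -x exports; a sketch (the elided flags stay as in your original command):

mpirun -np 64 --hostfile hosts ... \
    -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 \
    -x HOROVOD_FUSION_THRESHOLD=16777216 \
    ... \
    /home/ubuntu/examples/horovod/cnn/wrapenv.sh python aws_tf_hvd_cnn.py --batch_size=256 --num_epochs=90 --fp16 ...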

piyushghai added a commit to piyushghai/deep-learning-models that referenced this issue Dec 23, 2020