Add examples of running MXNet with Horovod #14286
Conversation
for epoch in range(num_epoch):
    train_data.reset()
    for nbatch, batch in enumerate(train_data, start=1):
        data = gluon.utils.split_and_load(batch.data[0], ctx_list=[context],
You may want to sync with this PR.
Updated.
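For readers following along, `gluon.utils.split_and_load` partitions a batch across a list of contexts, one slice per device. A toy pure-Python sketch of that partitioning logic (`split_batch` is a hypothetical stand-in, not MXNet's implementation):

```python
def split_batch(batch, num_slices):
    """Split a batch (here, a plain list of samples) into num_slices even
    chunks, mimicking how gluon.utils.split_and_load slices a batch
    across its ctx_list before copying each slice to its device."""
    if len(batch) % num_slices != 0:
        raise ValueError("batch size must be divisible by the number of slices")
    step = len(batch) // num_slices
    return [batch[i * step:(i + 1) * step] for i in range(num_slices)]

# With a single context, as in the snippet above, the whole batch is one slice:
print(split_batch([1, 2, 3, 4], 1))  # [[1, 2, 3, 4]]
# With two contexts, each device would receive half the batch:
print(split_batch([1, 2, 3, 4], 2))  # [[1, 2], [3, 4]]
```

In the Horovod setup each process drives a single device, which is why the example passes `ctx_list=[context]` with just one entry.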
@@ -0,0 +1,456 @@
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
copyright
do you just copy the example from Horovod here?
Thanks. Updated.
@@ -0,0 +1,142 @@
# Step 0: import required packages
copyright
Add copyright
# Example

Here we provide the building blocks to train a model using MXNet with Horovod.
The full examples are in [MINST](mxnet_mnist.py) and [ImageNet](mxnet_imagenet_resnet50.py).
Also MINST should be changed to MNIST on line 84.
Added one for gluon and one for module in separate files.
@mxnet-label-bot add [pr-awaiting-review]
1. Run `hvd.init()`.

2. Pin a server GPU to the context using `context = mx.gpu(hvd.local_rank())`.
Because the CPU is more widely used and easy to access.
Could we make a general example/readme for both CPU and GPU?
@pengzhao-intel I know CPU is more widely used for inference. But is that true for training? CPU is much much slower than GPU in training.
Understand your points :)
Since this is an example, I think we can focus on usability and portability; performance can be a secondary factor. Users can set up the environment and do simple debugging/testing of their algorithm on a local CPU. After everything is fine, they can distribute training across more GPUs or other devices.
Agree. We should also mention CPU here.
Updated with CPU mention
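As background for the pinning discussion above: Horovod gives every process a global `rank` and a per-host `local_rank`, and it is the local rank that indexes the GPU, so processes on the same host never share a device. A toy enumeration of one common host-major layout (hypothetical helper; the actual rank assignment depends on the MPI launcher):

```python
def ranks(num_hosts, procs_per_host):
    """Enumerate (rank, local_rank) pairs for a host-major process layout:
    rank is global across all processes, local_rank restarts on each host."""
    pairs = []
    for host in range(num_hosts):
        for local_rank in range(procs_per_host):
            rank = host * procs_per_host + local_rank
            pairs.append((rank, local_rank))
    return pairs

# Two hosts with two processes each: global ranks 0-3, but local ranks
# only 0-1, so mx.gpu(hvd.local_rank()) stays within each host's GPUs.
print(ranks(2, 2))  # [(0, 0), (1, 1), (2, 0), (3, 1)]
```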
acc_top1 = mx.metric.Accuracy()
acc_top5 = mx.metric.TopKAccuracy(5)
for _, batch in enumerate(val_data):
    data, label = batch_fn(batch, [context])
Please update this example the same way in horovod/horovod#872
updated.
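For context on the metrics in the snippet above: `mx.metric.Accuracy` counts arg-max hits, and `mx.metric.TopKAccuracy(5)` counts a sample as correct when the true label is among the five highest-scoring classes. A minimal pure-Python sketch of the top-k computation (illustrative only, not MXNet's implementation):

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores.

    scores: list of per-class score lists, one row per sample.
    labels: list of true class indices, one per sample.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # indices of the k largest scores in this row
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)

scores = [[0.1, 0.7, 0.2],   # arg-max predicts class 1
          [0.5, 0.3, 0.2]]   # arg-max predicts class 0
labels = [1, 2]
print(top_k_accuracy(scores, labels, 1))  # 0.5 (only the first sample is right)
print(top_k_accuracy(scores, labels, 3))  # 1.0 (label 2 is within the top 3)
```

`acc_top1` in the example corresponds to k=1 and `acc_top5` to k=5.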
@pengzhao-intel @eric-haibin-lin @yuxihu @ctcyang Addressed your comments. Please help to review again. Thanks!
@wuxun-zhang could you try to run the example following this tutorial?
@pengzhao-intel I have already run the example mxnet_imagenet_resnet50.py in the horovod repo, and I think these two examples are almost the same. I can retry this example on a multi-CPU platform.
@wuxun-zhang thanks. I believe the example can run smoothly, but I think we should check whether the doc is easy for a newbie to reproduce.
Minor comments. LGTM overall.
If you're installing Horovod on a server with GPUs, read the [Horovod on GPU](https://github.com/horovod/horovod/blob/master/docs/gpus.md) page.
If you want to use Docker, read the [Horovod in Docker](https://github.com/horovod/horovod/blob/master/docs/docker.md) page.

## Install Open MPI
Shall we just say install MPI?
updated
# Install
## Install MXNet
```bash
$ pip install mxnet
```
shall we mention that 1.4.0 mkldnn packages do not work with horovod 0.16.0?
The MXNet pip package does not contain MKLDNN by default in 1.4.0. I think it is okay here.
I meant to mention it in the Install MXNet section. Here we just use mxnet package as an example. Users may choose their own packages.
updated.
## What's New?
Compared with the standard distributed training script in MXNet, which uses a parameter server to
distribute and aggregate parameters, Horovod uses the ring allreduce algorithm to communicate parameters
I might change this to "ring allreduce and tree-based allreduce algorithm", because Horovod will use the tree-based MPI allreduce algorithm if you set HIERARCHICAL_ALLREDUCE=1.
thanks for the review. updated.
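To make the parameter-server vs. allreduce contrast concrete, here is a single-process simulation of ring allreduce: each of N workers splits its gradient into N chunks, runs N-1 reduce-scatter steps and N-1 allgather steps around a ring, and ends up holding the full sum with no central server. This is an illustrative sketch, not Horovod's implementation (which, as noted above, can also use a tree-based allreduce):

```python
def ring_allreduce(worker_grads):
    """Simulate ring allreduce on N equal-length gradient vectors.

    Each worker's buffer ends up holding the element-wise sum of all
    inputs after n-1 reduce-scatter steps and n-1 allgather steps,
    with no central parameter server involved.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    assert length % n == 0, "gradient length must divide into n chunks"
    chunk = length // n
    bufs = [list(g) for g in worker_grads]  # copy so inputs stay intact

    def span(c):  # element indices belonging to chunk c
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1 (reduce-scatter): at each step every worker i forwards one
    # partially reduced chunk to neighbour i+1. Afterwards worker i holds
    # the complete sum of chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n
            dst = (i + 1) % n
            for j in span(c):
                bufs[dst][j] += bufs[i][j]

    # Phase 2 (allgather): the fully reduced chunks circulate around the
    # ring until every worker has every chunk.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n
            dst = (i + 1) % n
            for j in span(c):
                bufs[dst][j] = bufs[i][j]
    return bufs

# Three workers, gradient length 3 (one chunk per worker):
print(ring_allreduce([[1, 1, 1], [2, 2, 2], [3, 3, 3]]))
# -> [[6, 6, 6], [6, 6, 6], [6, 6, 6]]
```

Each worker sends and receives only one chunk per step, which is why the bandwidth cost per worker stays roughly constant as N grows, in contrast to a parameter server that must aggregate traffic from all workers.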
@wuxun-zhang Any issue with running the example on CPU following this document? Thanks
LGTM.
I built MXNet from this commit by using GCC 5.3.1-6. When I built Horovod from source using
@wuxun-zhang when you build MXNet from source, did you enable MKLDNN? The Horovod 0.16.0 release does not work with an MKLDNN-enabled libmxnet.so. Our fix went in after the release.
@yuxihu Thanks for the reminder. I have re-installed MXNet without MKLDNN by using the command
@wuxun-zhang When building from source, I think you need to run
@apeforest I think @wuxun-zhang wanted to test with the Horovod PyPI package. @wuxun-zhang The undefined symbol is not related to MPI. Can you try with the latest MXNet? The one you were using was from January.
@apeforest There were no problems building Horovod from source. I just want to verify that the Horovod PyPI package can also work well. @yuxihu I have tried the latest MXNet with this commit. When I
@wuxun-zhang I don't have any problem running it on my MacBook.
My environment:
@apeforest ready to merge this one?
@wuxun-zhang Do you still have problems running the example on CPU following this guide?
@apeforest @yuxihu I tried the latest MXNet repo and ran Horovod (using the PyPI package) successfully on CPU. Many thanks for your help. LGTM
@eric-haibin-lin Could you please help to review or merge this PR if there are no other concerns?
LGTM
```bash
$ pip install mxnet
```
**Note**: There is a [known issue](https://github.com/horovod/horovod/issues/884) when running Horovod with MXNet on a Linux system with GCC version 5.X and above. We recommend that users build MXNet from source following this [guide](https://mxnet.incubator.apache.org/install/build_from_source.html) as a workaround for now. Also, the mxnet-mkl package in the 1.4.0 release does not support Horovod.
so currently pip install doesn't work for this use case? Is this glibc incompatibility?
No. It's not glibc incompatibility, but rather the std::function signature change between GCC 4 and GCC 5. In the MXNet-Horovod integration, we pass a std::function as a callback from Horovod to MXNet. When Horovod and MXNet are built with different GCC versions, a segmentation fault will occur.
How are the pips built? For which GCC version? Does pip have this issue currently?
The MXNet pip package is built with GCC 4. If a user builds Horovod on CentOS 7 / Ubuntu 14.04, there will be no issue.
These steps don't currently work. I would suggest changing this to the easiest path currently available: 1. build MXNet with GCC 5, followed by `pip install horovod`, OR 2. `pip install mxnet`, followed by building Horovod with GCC 4. I feel 1 is easier for users. When we fix this bug, we can update the documentation.
Which platform are you installing?
The following steps in the README work for me on macOS, Amazon Linux, and CentOS 7 (all GCC 4):
pip install mxnet
pip install horovod
nvm! i think i misunderstood earlier.
hvd.init()

# Set context to current process
context = mx.cpu(hvd.local_rank()) if args.no_cuda else mx.gpu(hvd.local_rank())
I think args is not defined yet. Maybe context.num_gpus()?
This is just a code skeleton to showcase the usage. The args is defined in the real example.
Are the examples tested?
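Picking up the reviewer's suggestion, the skeleton could avoid the undefined `args` by probing the GPU count, i.e. `mx.gpu(hvd.local_rank()) if mx.context.num_gpus() > 0 else mx.cpu()`. A toy pure-Python stand-in for that selection logic (the device labels are illustrative strings, not MXNet contexts):

```python
def pick_context(local_rank, num_gpus):
    """Choose a device for this process: its own GPU when GPUs are
    available, otherwise the CPU. A stand-in for
    mx.gpu(hvd.local_rank()) if mx.context.num_gpus() > 0 else mx.cpu().
    """
    if num_gpus > 0:
        # Typical deployments launch one process per GPU; the modulo
        # only guards against accidental oversubscription.
        return "gpu(%d)" % (local_rank % num_gpus)
    return "cpu(0)"

print(pick_context(2, 4))  # 'gpu(2)' on a 4-GPU host
print(pick_context(2, 0))  # 'cpu(0)' when no GPU is present
```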
* Add examples for MXNet with Horovod
* update readme
* update examples
* update README
* update mnist_module example
* Update README
* update README
* update README
* update README
Description
Added an MNIST example and an ImageNet example to show how to run MXNet with Horovod. A README page is also added.
Changes