Add examples of running MXNet with Horovod #14286
Conversation
for epoch in range(num_epoch):
    train_data.reset()
    for nbatch, batch in enumerate(train_data, start=1):
        data = gluon.utils.split_and_load(batch.data[0], ctx_list=[context],
You may want to sync with this PR.
Updated.
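For readers following along, `gluon.utils.split_and_load` partitions a batch across a list of contexts, one slice per device. A toy pure-Python sketch of that partitioning logic (`split_batch` is a hypothetical stand-in, not MXNet's implementation):

```python
def split_batch(batch, num_slices):
    """Split a batch (here, a plain list of samples) into num_slices even
    chunks, mimicking how gluon.utils.split_and_load slices a batch
    across its ctx_list before copying each slice to its device."""
    if len(batch) % num_slices != 0:
        raise ValueError("batch size must be divisible by the number of slices")
    step = len(batch) // num_slices
    return [batch[i * step:(i + 1) * step] for i in range(num_slices)]

# With a single context, as in the snippet above, the whole batch is one slice:
print(split_batch([1, 2, 3, 4], 1))  # [[1, 2, 3, 4]]
# With two contexts, each device would receive half the batch:
print(split_batch([1, 2, 3, 4], 2))  # [[1, 2], [3, 4]]
```

In the Horovod setup each process drives a single device, which is why the example passes `ctx_list=[context]` with just one entry.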
@@ -0,0 +1,456 @@
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
copyright
do you just copy the example from Horovod here?
Thanks. Updated.
@@ -0,0 +1,142 @@
# Step 0: import required packages
copyright
Add copyright
# Example

Here we provide the building blocks to train a model using MXNet with Horovod.
The full examples are in [MINST](mxnet_mnist.py) and [ImageNet](mxnet_imagenet_resnet50.py).
Also MINST should be changed to MNIST on line 84.
Added one for gluon and one for module in separate files.
@mxnet-label-bot add [pr-awaiting-review]
1. Run `hvd.init()`.

2. Pin a server GPU to the context using `context = mx.gpu(hvd.local_rank())`.
Because the CPU is more widely used and easy to access.
Could we make a general example/readme for both CPU and GPU?
@pengzhao-intel I know CPU is more widely used for inference. But is that true for training? CPU is much much slower than GPU in training.
Understand your points :)
Since this is an example, I think we can focus on usability and portability; performance can be a secondary factor. Users can set up the environment and do simple debugging/testing of their algorithm on a local CPU. After everything is fine, they can distribute training across more GPUs or other devices.
Agree. We should also mention CPU here.
Updated with CPU mention
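As background for the pinning discussion above: Horovod gives every process a global `rank` and a per-host `local_rank`, and it is the local rank that indexes the GPU, so processes on the same host never share a device. A toy enumeration of one common host-major layout (hypothetical helper; the actual rank assignment depends on the MPI launcher):

```python
def ranks(num_hosts, procs_per_host):
    """Enumerate (rank, local_rank) pairs for a host-major process layout:
    rank is global across all processes, local_rank restarts on each host."""
    pairs = []
    for host in range(num_hosts):
        for local_rank in range(procs_per_host):
            rank = host * procs_per_host + local_rank
            pairs.append((rank, local_rank))
    return pairs

# Two hosts with two processes each: global ranks 0-3, but local ranks
# only 0-1, so mx.gpu(hvd.local_rank()) stays within each host's GPUs.
print(ranks(2, 2))  # [(0, 0), (1, 1), (2, 0), (3, 1)]
```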
acc_top1 = mx.metric.Accuracy()
acc_top5 = mx.metric.TopKAccuracy(5)
for _, batch in enumerate(val_data):
    data, label = batch_fn(batch, [context])
Please update this example the same way in horovod/horovod#872
updated.
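For context on the metrics in the snippet above: `mx.metric.Accuracy` counts arg-max hits, and `mx.metric.TopKAccuracy(5)` counts a sample as correct when the true label is among the five highest-scoring classes. A minimal pure-Python sketch of the top-k computation (illustrative only, not MXNet's implementation):

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores.

    scores: list of per-class score lists, one row per sample.
    labels: list of true class indices, one per sample.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # indices of the k largest scores in this row
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)

scores = [[0.1, 0.7, 0.2],   # arg-max predicts class 1
          [0.5, 0.3, 0.2]]   # arg-max predicts class 0
labels = [1, 2]
print(top_k_accuracy(scores, labels, 1))  # 0.5 (only the first sample is right)
print(top_k_accuracy(scores, labels, 3))  # 1.0 (label 2 is within the top 3)
```

`acc_top1` in the example corresponds to k=1 and `acc_top5` to k=5.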
@pengzhao-intel @eric-haibin-lin @yuxihu @ctcyang Addressed your comments. Please help to review again. Thanks!
@wuxun-zhang could you try to run the example following this tutorial?
@pengzhao-intel I have already run the example mxnet_imagenet_resnet50.py in the horovod repo, and I think these two examples are almost the same. I can retry this example on a multi-CPU platform.
@wuxun-zhang thanks. I believe the example can run smoothly, but I think we should check whether the doc is easy for a newbie to reproduce.
Minor comments. LGTM overall.
If you're installing Horovod on a server with GPUs, read the [Horovod on GPU](https://github.com/horovod/horovod/blob/master/docs/gpus.md) page.
If you want to use Docker, read the [Horovod in Docker](https://github.com/horovod/horovod/blob/master/docs/docker.md) page.

## Install Open MPI
Shall we just say install MPI?
updated
# Install
## Install MXNet
```bash
$ pip install mxnet
```
shall we mention that 1.4.0 mkldnn packages do not work with horovod 0.16.0?
The MXNet pip package does not contain MKLDNN by default in 1.4.0. I think it is okay here.
I meant to mention it in the Install MXNet section. Here we just use mxnet package as an example. Users may choose their own packages.
updated.
## What's New?
Compared with the standard distributed training script in MXNet, which uses a parameter server to
distribute and aggregate parameters, Horovod uses the ring allreduce algorithm to communicate parameters
I might change this to "ring allreduce and tree-based allreduce algorithm", because Horovod will use the tree-based MPI allreduce algorithm if you set HIERARCHICAL_ALLREDUCE=1.
thanks for the review. updated.
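To make the parameter-server vs. allreduce contrast concrete, here is a single-process simulation of ring allreduce: each of N workers splits its gradient into N chunks, runs N-1 reduce-scatter steps and N-1 allgather steps around a ring, and ends up holding the full sum with no central server. This is an illustrative sketch, not Horovod's implementation (which, as noted above, can also use a tree-based allreduce):

```python
def ring_allreduce(worker_grads):
    """Simulate ring allreduce on N equal-length gradient vectors.

    Each worker's buffer ends up holding the element-wise sum of all
    inputs after n-1 reduce-scatter steps and n-1 allgather steps,
    with no central parameter server involved.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    assert length % n == 0, "gradient length must divide into n chunks"
    chunk = length // n
    bufs = [list(g) for g in worker_grads]  # copy so inputs stay intact

    def span(c):  # element indices belonging to chunk c
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1 (reduce-scatter): at each step every worker i forwards one
    # partially reduced chunk to neighbour i+1. Afterwards worker i holds
    # the complete sum of chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n
            dst = (i + 1) % n
            for j in span(c):
                bufs[dst][j] += bufs[i][j]

    # Phase 2 (allgather): the fully reduced chunks circulate around the
    # ring until every worker has every chunk.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n
            dst = (i + 1) % n
            for j in span(c):
                bufs[dst][j] = bufs[i][j]
    return bufs

# Three workers, gradient length 3 (one chunk per worker):
print(ring_allreduce([[1, 1, 1], [2, 2, 2], [3, 3, 3]]))
# -> [[6, 6, 6], [6, 6, 6], [6, 6, 6]]
```

Each worker sends and receives only one chunk per step, which is why the bandwidth cost per worker stays roughly constant as N grows, in contrast to a parameter server that must aggregate traffic from all workers.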
@wuxun-zhang Any issue with running the example on CPU following this document? Thanks
LGTM.
I built MXNet from this commit by using GCC 5.3.1-6. When I built Horovod from source using
@wuxun-zhang when you build MXNet from source, did you enable MKLDNN? The Horovod 0.16.0 release does not work with an MKLDNN-enabled libmxnet.so. Our fix went in after the release.
@yuxihu Thanks for the reminder. I have re-installed MXNet without MKLDNN by using the command
@wuxun-zhang When building from source, I think you need to run
@apeforest I think @wuxun-zhang wanted to test with the Horovod PyPI package. @wuxun-zhang The undefined symbol is not related to MPI. Can you try with the latest MXNet? The one you were using was from January.
@apeforest There were no problems building Horovod from source. I just want to verify that the Horovod PyPI package can also work well. @yuxihu I have tried the latest MXNet with this commit. When I
@wuxun-zhang I don't have any problem running it on my MacBook.
My environment:
@apeforest ready to merge this one?
@wuxun-zhang Do you still have problems running the example on CPU following this guide?
@apeforest @yuxihu I tried the latest MXNet repo and ran Horovod (using the PyPI package) successfully on CPU. Many thanks for your help. LGTM
@eric-haibin-lin Could you please help to review or merge this PR if there are no other concerns?
LGTM
```bash
$ pip install mxnet
```
**Note**: There is a [known issue](https://github.com/horovod/horovod/issues/884) when running Horovod with MXNet on a Linux system with GCC version 5.X and above. We recommend that users build MXNet from source following this [guide](https://mxnet.incubator.apache.org/install/build_from_source.html) as a workaround for now. Also, the mxnet-mkl package in the 1.4.0 release does not support Horovod.
so currently pip install doesn't work for this use case? Is this glibc incompatibility?
No. It's not glibc incompatibility, but rather the std::function signature change between GCC 4 and GCC 5. In the MXNet-Horovod integration, we pass a std::function as a callback from Horovod to MXNet. When Horovod and MXNet are built with different GCC versions, a segmentation fault will occur.
How are the pips built? For which GCC version? Does pip have this issue currently?
The MXNet pip package is built with GCC 4. If a user builds Horovod on CentOS 7 / Ubuntu 14.04, there will be no issue.
These steps don't currently work. I would suggest changing this to the easiest path currently available: 1. build MXNet with GCC 5, followed by `pip install horovod`, OR 2. `pip install mxnet`, followed by building Horovod with GCC 4. I feel 1 is easier for users. When we fix this bug, we can update the documentation.
Which platform are you installing?
The following steps in the README work for me on macOS, Amazon Linux, and CentOS 7 (all GCC 4):
pip install mxnet
pip install horovod
nvm! i think i misunderstood earlier.
hvd.init()

# Set context to current process
context = mx.cpu(hvd.local_rank()) if args.no_cuda else mx.gpu(hvd.local_rank())
I think args is not defined yet. Maybe context.num_gpus()?
This is just a code skeleton to showcase the usage. The args is defined in the real example.
Are the examples tested?
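Picking up the reviewer's suggestion, the skeleton could avoid the undefined `args` by probing the GPU count, i.e. `mx.gpu(hvd.local_rank()) if mx.context.num_gpus() > 0 else mx.cpu()`. A toy pure-Python stand-in for that selection logic (the device labels are illustrative strings, not MXNet contexts):

```python
def pick_context(local_rank, num_gpus):
    """Choose a device for this process: its own GPU when GPUs are
    available, otherwise the CPU. A stand-in for
    mx.gpu(hvd.local_rank()) if mx.context.num_gpus() > 0 else mx.cpu().
    """
    if num_gpus > 0:
        # Typical deployments launch one process per GPU; the modulo
        # only guards against accidental oversubscription.
        return "gpu(%d)" % (local_rank % num_gpus)
    return "cpu(0)"

print(pick_context(2, 4))  # 'gpu(2)' on a 4-GPU host
print(pick_context(2, 0))  # 'cpu(0)' when no GPU is present
```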
* Add examples for MXNet with Horovod
* update readme
* update examples
* update README
* update mnist_module example
* Update README
* update README
* update README
* update README
Description
Added an MNIST example and an ImageNet example to show how to run MXNet with Horovod. A README page is also added.
Changes