This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Add examples of running MXNet with Horovod #14286

Merged
merged 12 commits into from
Mar 22, 2019


apeforest
Contributor

Description

Added an MNIST and an ImageNet example to show how to run MXNet with Horovod. A README page is also added.

Changes

  • README
  • mxnet_mnist.py
  • mxnet_imagenet.py

@apeforest
Contributor Author

@yuxihu @rahul003 @ctcyang Please help to review.

for epoch in range(num_epoch):
    train_data.reset()
    for nbatch, batch in enumerate(train_data, start=1):
        data = gluon.utils.split_and_load(batch.data[0], ctx_list=[context],
Member

You may want to sync with this PR.

Contributor Author

Updated.

@@ -0,0 +1,456 @@
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
Member

copyright

Member

do you just copy the example from Horovod here?

Contributor Author

Thanks. Updated.

@@ -0,0 +1,142 @@
# Step 0: import required packages
Member

copyright

Contributor Author

Add copyright

# Example

Here we provide the building blocks to train a model using MXNet with Horovod.
The full examples are in [MINST](mxnet_mnist.py) and [ImageNet](mxnet_imagenet_resnet50.py).
Member

For the MNIST example, can we have one for gluon and one for module? You may want to use the code I prepared for the meetup: gluon and module

Contributor

Also MINST should be changed to MNIST on line 84.

Contributor Author

Added one for gluon and one for module in separate files.

@anirudhacharya
Member

@mxnet-label-bot add [pr-awaiting-review]

@marcoabreu marcoabreu added the pr-awaiting-review (PR is waiting for code review) label Mar 1, 2019

1. Run `hvd.init()`.

2. Pin a server GPU to the context using `context = mx.gpu(hvd.local_rank())`.
Contributor

Because the CPU is more widely used and easy to access.
Could we make a general example/readme for both CPU and GPU?

Member

@pengzhao-intel I know CPU is more widely used for inference. But is that true for training? CPU is much much slower than GPU in training.

Contributor

I understand your points :)

Since this is an example, I think we can focus on usability and portability, with performance as a secondary factor. Users can set up the environment and do simple debugging and testing of their algorithm on a local CPU. Once everything works, they can distribute training across more GPUs or other devices.

Member

Agree. We should also mention CPU here.

Contributor Author

Updated with CPU mention

acc_top1 = mx.metric.Accuracy()
acc_top5 = mx.metric.TopKAccuracy(5)
for _, batch in enumerate(val_data):
    data, label = batch_fn(batch, [context])
Member

Please update this example the same way as in horovod/horovod#872

Contributor Author

updated.

@apeforest
Contributor Author

@pengzhao-intel @eric-haibin-lin @yuxihu @ctcyang Addressed your comments. Please help to review again. Thanks!

@pengzhao-intel
Contributor

@wuxun-zhang could you try running the example by following this tutorial?

@wuxun-zhang
Contributor

@pengzhao-intel I have already run the example mxnet_imagenet_resnet50.py in the horovod repo, and I think these two examples are almost the same. I can retry this example on a multi-CPU platform.

@pengzhao-intel
Contributor

@wuxun-zhang thanks. I believe the example runs smoothly, but I think we should check whether the doc is easy for a newcomer to reproduce.

Member

@yuxihu yuxihu left a comment

Minor comments. LGTM overall.

If you're installing Horovod on a server with GPUs, read the [Horovod on GPU](https://github.com/horovod/horovod/blob/master/docs/gpus.md) page.
If you want to use Docker, read the [Horovod in Docker](https://github.com/horovod/horovod/blob/master/docs/docker.md) page.

## Install Open MPI
Member

Shall we just say install MPI?

Contributor Author

updated

# Install
## Install MXNet
```bash
$ pip install mxnet
```
Member

shall we mention that 1.4.0 mkldnn packages do not work with horovod 0.16.0?

Contributor Author

MXNet pip package does not contain mkldnn by default in 1.4.0. I think it is okay here.

Member

I meant to mention it in the Install MXNet section. Here we just use mxnet package as an example. Users may choose their own packages.

Contributor Author

updated.


## What's New?
Compared with the standard distributed training script in MXNet which uses parameter server to
distribute and aggregate parameters, Horovod uses ring allreduce algorithm to communicate parameters
Contributor

I might change this to "ring allreduce and tree-based allreduce algorithm", because Horovod will use the tree-based MPI allreduce algorithm if you set HOROVOD_HIERARCHICAL_ALLREDUCE=1.

Contributor Author

thanks for the review. updated.
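
For readers unfamiliar with the ring allreduce named in the README excerpt above, here is a minimal pure-Python simulation of the idea. It is only a sketch: the function name `ring_allreduce` and the in-memory "workers" are illustrative assumptions, it is not how Horovod or MPI implement the operation, and it assumes every vector has the same length, divisible by the number of workers.

```python
from typing import List


def ring_allreduce(worker_data: List[List[float]]) -> List[List[float]]:
    """Simulate a sum allreduce over workers arranged in a ring.

    worker_data[r] is the vector held by worker r (hypothetical setup).
    Returns the vector each worker holds after the operation.
    """
    n = len(worker_data)
    length = len(worker_data[0])
    assert length % n == 0, "sketch assumes length divisible by worker count"
    chunk = length // n
    bufs = [list(v) for v in worker_data]  # copies; inputs stay untouched

    def span(c: int) -> slice:
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1 (reduce-scatter): in step s, worker r sends chunk (r - s) % n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [(r, (r - s) % n) for r in range(n)]
        payloads = [bufs[r][span(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            bufs[dst][span(c)] = [a + b for a, b in zip(bufs[dst][span(c)], payload)]
    # Now worker r holds the fully reduced chunk (r + 1) % n.

    # Phase 2 (allgather): in step s, worker r forwards chunk (r + 1 - s) % n
    # to its right neighbour, which overwrites its own copy.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n) for r in range(n)]
        payloads = [bufs[r][span(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            bufs[(r + 1) % n][span(c)] = payload
    return bufs
```

Each worker only ever talks to its right neighbour and sends one chunk per step, which is why the bandwidth per worker stays roughly constant as workers are added, unlike a single parameter server that receives every gradient.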

@apeforest
Contributor Author

@wuxun-zhang Any issues running the example on CPU following this document? Thanks

Member

@yuxihu yuxihu left a comment

LGTM.

@wuxun-zhang
Contributor

I built MXNet from this commit using GCC 5.3.1-6. When I build Horovod from source using `pip install --no-cache-dir -v [horovod_repo_dir]`, there are no issues. However, when I install Horovod directly with `pip install horovod`, I get an error like `OSError: /home/wuxunzha/anaconda3/envs/conda3_official_horovod_fp32_training/lib/python3.6/site-packages/horovod/mxnet/mpi_lib.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZTINSt6thread6_StateE`. I saw there are some existing issues like 656. I also checked `LD_LIBRARY_PATH`, but I still get the same error.

@wuxun-zhang
Contributor

@apeforest

@yuxihu
Member

yuxihu commented Mar 15, 2019

@wuxun-zhang when you built MXNet from source, did you enable MKLDNN? The Horovod 0.16.0 release does not work with an MKLDNN-enabled libmxnet.so. Our fix went in after the release.

@wuxun-zhang
Contributor

@yuxihu Thanks for the reminder. I have re-installed MXNet without MKLDNN using the command `make USE_OPENCV=1 USE_MKLDNN=0 USE_BLAS=openblas -j`, and installed Horovod with the default command `pip install horovod`. When I run `python ~/github/incubator-mxnet/example/distributed_training-horovod/resnet50_imagenet.py --no-cuda`, there is still an error: `OSError: /home/wuxunzha/anaconda3/lib/python3.6/site-packages/horovod/mxnet/mpi_lib.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZNK5mxnet7NDArray7ReshapeERKNS_6TShapeE`. I don't know if it is actually related to MPI.
Note: MXNet was built with GCC 5.3.1-6.

@apeforest
Contributor Author

apeforest commented Mar 15, 2019

@wuxun-zhang When building from source, I think you need to run `pip install --no-cache-dir .` in the horovod repo. `pip install horovod` installs the PyPI release.

@yuxihu
Member

yuxihu commented Mar 15, 2019

@apeforest I think @wuxun-zhang wanted to test with the Horovod PyPI package.

@wuxun-zhang The undefined symbol is not related to MPI. Can you try with the latest MXNet? The one you were using was from January.

@wuxun-zhang
Contributor

@apeforest There is no problem when building Horovod from source. I just want to verify that the Horovod PyPI package also works.

@yuxihu I have tried the latest MXNet with this commit. When I run `import horovod.mxnet as hvd`, I still get the undefined symbol error. Did you run this example successfully on CPU? If so, can you tell me your build command for MXNet without mkldnn? Thanks in advance.

@apeforest
Contributor Author

apeforest commented Mar 18, 2019

@wuxun-zhang I don't have any problem running it on my MacBook.
Repro:

cp make/osx.mk config.mk
make -j8
pip install horovod
python example/distributed_training-horovod/resnet50_imagenet.py --no-cuda

My environment:

----------Python Info----------
Version      : 3.7.2
Compiler     : Clang 10.0.0 (clang-1000.11.45.5)
Build        : ('default', 'Feb 12 2019 08:16:38')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 19.0.3
Directory    : /Users/lnyuan/.virtualenvs/mxnet/lib/python3.7/site-packages/pip
----------MXNet Info-----------
Version      : 1.5.0
Directory    : /Users/lnyuan/work/mxnet/python/mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Darwin-17.7.0-x86_64-i386-64bit
system       : Darwin
node         : 88e9fe759c49.ant.amazon.com
release      : 17.7.0
version      : Darwin Kernel Version 17.7.0: Thu Dec 20 21:47:19 PST 2018; root:xnu-4570.71.22~1/RELEASE_X86_64
----------Hardware Info----------
machine      : x86_64
processor    : i386
b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz'
b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
b'machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID SMAP RDSEED ADX IPT SGX FPU_CSDS MPX CLFSOPT'
b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0025 sec, LOAD: 1.1295 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0231 sec, LOAD: 0.3325 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0312 sec, LOAD: 0.4660 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0109 sec, LOAD: 0.3440 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0192 sec, LOAD: 0.8956 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0214 sec, LOAD: 0.0652 sec.

@yuxihu
Member

yuxihu commented Mar 20, 2019

@apeforest ready to merge this one?

@apeforest
Contributor Author

@wuxun-zhang Do you still have a problem running the example on CPU following this guide?

@wuxun-zhang
Contributor

@apeforest @yuxihu I tried the latest MXNet repo and ran Horovod (using the PyPI package) successfully on CPU. Many thanks for your help.

LGTM

@apeforest
Contributor Author

@eric-haibin-lin Could you please help review or merge this PR if there are no other concerns?

Contributor

@pengzhao-intel pengzhao-intel left a comment

LGTM

```bash
$ pip install mxnet
```
**Note**: There is a [known issue](https://github.com/horovod/horovod/issues/884) when running Horovod with MXNet on a Linux system with GCC version 5.X and above. We recommend that users build MXNet from source following this [guide](https://mxnet.incubator.apache.org/install/build_from_source.html) as a workaround for now. Also, the mxnet-mkl package in the 1.4.0 release does not support Horovod.
Member

So currently pip install doesn't work for this use case? Is this a glibc incompatibility?

Contributor Author

No. It's not because of a glibc incompatibility but due to the std::function signature change between GCC4 and GCC5. In the MXNet-Horovod integration, we pass a std::function as a callback from Horovod to MXNet. When Horovod and MXNet are built with different GCC versions, a segmentation fault will occur.
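
As a rough illustration of the rule described in this comment, the sketch below checks whether two compiler version strings fall on the same side of the GCC 4 / GCC 5 split. The helper names `gcc_major` and `same_abi_family` are hypothetical, and the check is a simplification of the issue discussed here rather than a complete ABI compatibility test.

```python
import re


def gcc_major(version: str) -> int:
    """Extract the major version from text such as '5.3.1-6' or 'gcc (GCC) 4.8.5'."""
    match = re.search(r"(\d+)\.\d+", version)
    if match is None:
        raise ValueError(f"no version number found in {version!r}")
    return int(match.group(1))


def same_abi_family(mxnet_gcc: str, horovod_gcc: str) -> bool:
    """True when both builds used GCC 4.x, or both used GCC 5 or newer."""
    return (gcc_major(mxnet_gcc) >= 5) == (gcc_major(horovod_gcc) >= 5)
```

For example, passing the version that built the MXNet binary and the output of `gcc --version` on the machine building Horovod would flag the mismatch before the import fails or crashes at runtime.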

Member

How are the pips built? For which gcc version? Does pip have this issue currently?

Contributor Author

MXNet pip is built with gcc4. If the user builds Horovod on centos7/ubuntu14.0, there will be no issue.

Member

These steps don't currently work. I would suggest changing this to the easiest path currently available: 1. build mxnet with gcc 5 followed by pip install horovod OR 2. pip install mxnet followed by building horovod with gcc4. I feel 1 is easier for users. When we fix this bug, we can modify the documentation.

Contributor Author

Which platform are you installing on?
The following steps in the README work for me on macOS, Amazon Linux, and CentOS 7 (all gcc4):

pip install mxnet
pip install horovod

Member

nvm! i think i misunderstood earlier.

@anirudh2290 anirudh2290 merged commit 056fce4 into apache:master Mar 22, 2019
hvd.init()

# Set context to current process
context = mx.cpu(hvd.local_rank()) if args.no_cuda else mx.gpu(hvd.local_rank())
Contributor

I think args is not defined yet. Maybe context.num_gpus()?

Contributor Author

This is just a code skeleton to showcase the usage. `args` is defined in the real example.
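
To make the device choice in the skeleton concrete without depending on `args`, the selection logic could be sketched as a small standalone function. The name `pick_device` and its parameters are hypothetical and not part of the PR. In the real example the returned pair would map to `mx.cpu()` or `mx.gpu(local_rank)`, with `local_rank` coming from `hvd.local_rank()` and the GPU count from something like `mx.context.num_gpus()`, which is an assumption based on the suggestion above.

```python
from typing import Tuple


def pick_device(no_cuda: bool, local_rank: int, num_gpus: int) -> Tuple[str, int]:
    """Choose a (device_type, device_id) pair for one worker process.

    no_cuda    -- user asked for CPU training (mirrors the --no-cuda flag)
    local_rank -- index of this process on its host
    num_gpus   -- number of GPUs visible on this host
    """
    if no_cuda or num_gpus == 0:
        # Fall back to CPU when requested or when no GPU is available.
        return ("cpu", 0)
    if local_rank >= num_gpus:
        raise ValueError(
            f"local rank {local_rank} has no GPU to pin to ({num_gpus} available)"
        )
    # One process per GPU: pin this worker to the GPU matching its local rank.
    return ("gpu", local_rank)
```

Falling back to CPU when no GPU is visible lets the same script run on a laptop for debugging and on a GPU cluster for real training, which matches the portability point raised earlier in this review.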

Contributor

@larroy larroy left a comment

Are the examples tested?

vdantu pushed a commit to vdantu/incubator-mxnet that referenced this pull request Mar 31, 2019
* Add examples for MXNet with Horovod

* update readme

* update examples

* update README

* update mnist_module example

* Update README

* update README

* update README

* update README
ZhennanQin pushed a commit to ZhennanQin/incubator-mxnet that referenced this pull request Apr 3, 2019
nswamy pushed a commit that referenced this pull request Apr 5, 2019
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
@apeforest apeforest deleted the example/horovod branch August 23, 2019 17:09
Labels
pr-awaiting-review PR is waiting for code review
10 participants