[TUTORIAL] Add multiple GPUs training tutorial #15158
Conversation
docs/tutorials/gluon/multi_gpu.md
Outdated
```python
import mxnet as mx

a = mx.nd.array([1, 2, 3], ctx=mx.gpu(0))
b = mx.nd.array([5, 6, 7], ctx=mx.gpu(1))
```
The tutorial nightly test has changed to use a P3.2xlarge with 1 GPU, so this may fail.
Hm, this whole tutorial is about how to do multi-GPU training. If this is the case, I guess I will have to remove it from nightly tests.
Yes, maybe comment it out and add it to the whitelist: /~https://github.com/apache/incubator-mxnet/blob/master/tests/tutorials/test_sanity_tutorials.py#L27
Good idea, added to the whitelist
I don't think it's a good idea to not test it. I'll suggest changes to make it testable and carry the same information
Thanks for your contributions @Ishitori.
Thanks @Ishitori. Some rewording is required in a few places.
docs/tutorials/gluon/multi_gpu.md
Outdated
## Prerequisites

- Two or more GPUs
- Cuda 9 or higher
Same comments as @vishaalkapoor in the Float16 tutorial.
CUDA and CuDNN
docs/tutorials/gluon/multi_gpu.md
Outdated
```python
c = a + b.as_in_context(a.context)
```

Using this example we have learnt that we can perform operations with NDArrays only if they are stored on the same GPU. So, how can we split the data between GPUs, but use the same model for training? We will answer this question in the next session.
Using this example -> Using this example,
session -> section
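For context, a minimal, runnable sketch of the cross-device copy the quoted line performs; the context fallback is my addition so it also runs on machines with fewer than two GPUs:

```python
import mxnet as mx

# Pick two contexts; fall back to CPU so the sketch runs anywhere.
n_gpu = mx.context.num_gpus()
ctx_a = mx.gpu(0) if n_gpu >= 1 else mx.cpu()
ctx_b = mx.gpu(1) if n_gpu >= 2 else mx.cpu()

a = mx.nd.array([1, 2, 3], ctx=ctx_a)
b = mx.nd.array([5, 6, 7], ctx=ctx_b)

# Operands must live on the same device, so copy b into a's context first.
c = a + b.as_in_context(a.context)
print(c)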
docs/tutorials/gluon/multi_gpu.md
Outdated
## Storing the network on multiple GPUs

When you create a network using [Blocks](https://mxnet.incubator.apache.org/api/python/gluon/gluon.html#mxnet.gluon.Block) the parameters of blocks are also stored in a form of NDArray. When you initialize your network, you have to specify which context you are going to use for the underlying NDArrays. The feature of the [initialize method](https://mxnet.incubator.apache.org/api/python/gluon/gluon.html#mxnet.gluon.Block.initialize) is that it can accept the list of contexts, meaning that you can provide more than one context to store underlying parameters. In the example below we create the LeNet network and initialize it to be stored on GPU(0) and GPU(1) simultaneously. Each GPU will receive its own copy of the parameters:
In the example below -> In the example below,
stored in a form of NDArray -> stored in NDArrays
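As a sketch of what that paragraph describes (the exact LeNet layers here are illustrative, not the tutorial's verbatim code, and the CPU fallback is my addition):

```python
import mxnet as mx
from mxnet import init
from mxnet.gluon import nn

# Use all available GPUs, or the CPU if none are present.
n_gpu = mx.context.num_gpus()
context = [mx.gpu(i) for i in range(n_gpu)] if n_gpu > 0 else [mx.cpu()]

# An illustrative LeNet-style network.
net = nn.Sequential()
net.add(nn.Conv2D(channels=6, kernel_size=5, activation='relu'),
        nn.MaxPool2D(pool_size=2, strides=2),
        nn.Conv2D(channels=16, kernel_size=3, activation='relu'),
        nn.MaxPool2D(pool_size=2, strides=2),
        nn.Flatten(),
        nn.Dense(120, activation='relu'),
        nn.Dense(84, activation='relu'),
        nn.Dense(10))

# Passing a list of contexts stores a copy of the parameters on each device.
net.initialize(init=init.Xavier(), ctx=context)
```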
docs/tutorials/gluon/multi_gpu.md
Outdated
To do multiple GPU training with a given batch of the data, we divide the examples in the batch into number of portions equal to the number of GPUs we use and distribute one to each GPU. Then, each GPU will individually calculate the local gradient of the model parameters based on the batch subset it was assigned and the model parameters it maintains. Next, we sum together the local gradients on the GPUs to get the current batch stochastic gradient. After that, each GPU uses this batch stochastic gradient to update the complete set of model parameters that it maintains. Figure below depicts the batch stochastic gradient calculation using data parallelism and two GPUs.

![data-parallel](https://www.d2l.ai/_images/data-parallel.svg)
Can you move this and other image dependencies to the web-data repo?
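For readers skimming the thread, a minimal sketch of the batch split the figure depicts, using `gluon.utils.split_and_load` (CPU fallback added so it runs without GPUs):

```python
import mxnet as mx
from mxnet import gluon

n_gpu = mx.context.num_gpus()
context = [mx.gpu(i) for i in range(n_gpu)] if n_gpu > 0 else [mx.cpu()]

# A batch of 8 fake images; split_and_load gives each device its own slice.
batch = mx.nd.random.uniform(shape=(8, 1, 28, 28))
splits = gluon.utils.split_and_load(batch, ctx_list=context)
for s in splits:
    print(s.shape, s.context)
```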
docs/tutorials/gluon/multi_gpu.md
Outdated
# Multiple GPUs training with Gluon API

In this tutorial we will walk through how one can train deep learning neural networks on multiple GPUs within a single machine. This tutorial focuses on data parallelism oppose to model parallelism. The latter is not supported by Apache MXNet out of the box, and one have to manually route the data among different devices to achieve model parallelism. Check out [model parallelism tutorial](https://mxnet.incubator.apache.org/versions/master/faq/model_parallel_lstm.html) to learn more about it.
oppose -> as opposed
Can you give a quick explanation of 'data parallelism' or link to a good explanation?
one have to -> one has to
docs/tutorials/gluon/multi_gpu.md
Outdated
As we mentioned above, the gradients for each data split are calculated independently and then later summed together. We haven't mentioned yet where exactly this aggregation happens.

Apache MXNet uses [KVStore](https://mxnet.incubator.apache.org/versions/master/api/scala/kvstore.html) - a virtual place for data sharing between different devices, including machines and GPUs. The KVStore is responsible for storing and, by default, aggregating the gradients of the model. The physical location of the KVStore is defined when we create a [trainer](https://mxnet.incubator.apache.org/versions/master/api/python/gluon/gluon.html#mxnet.gluon.Trainer) and by default is set to `device`, which mean it will aggregate gradients and update weights on GPUs. The actual data is distributed in round-robin fashion among available GPUs per block. This statement means two things, which are important to know from practical perspective.
trainer -> Trainer
docs/tutorials/gluon/multi_gpu.md
Outdated
The first thing is there is an additional memory allocation happens on GPUs that is not directly related to your data and your model to store auxiliary information for GPUs sync-up. Depending on the complexity of your model, the amount of required memory can be significant, and you may even experience CUDA out of memory exceptions. If that is the case, and you cannot decrease batch size anymore, you may want to consider switching `KVStore` storage to RAM by setting `kvstore` argument to `local` during instantiation of the `Trainer`. That most probably will decrease the wall-clock performance time of your model, because the gradients and parameters would need to be copied to RAM and back.
happens on GPUs -> that happens on GPUs
That most probably will-> Often this decreases
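A minimal sketch of the `kvstore` switch being discussed (the tiny `Dense` model is a placeholder of my choosing):

```python
import mxnet as mx
from mxnet import gluon
from mxnet.gluon import nn

net = nn.Dense(10)              # placeholder model
net.initialize(ctx=[mx.cpu()])  # use a list of mx.gpu(i) on a multi-GPU host

# kvstore='device' (the default) aggregates gradients on the GPUs;
# kvstore='local' aggregates in CPU RAM, trading speed for GPU memory.
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.1}, kvstore='local')
```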
docs/tutorials/gluon/multi_gpu.md
Outdated
The second thing is that since that auxiliary information distributed among GPUs in round-robin fashion on per block level, `KVStore` may use more memory on some GPUs and less on others. For example, if your model has a very big embedding layer, you may see that your first GPU uses 90% of your memory while others use only 50%. That affects how much data you actually can load in a single batch, because the data between devices is split evenly. If that is the case, again, and you have to keep or increase your batch size, you, again, may want to switch to the `local` mode.
that auxiliary information distributed among GPUs -> the auxiliary information is distributed among GPUs
remove ', again,' both times
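If it helps, one way to observe the per-GPU imbalance described above is `mx.context.gpu_memory_info` (assuming a CUDA build of MXNet recent enough to provide it):

```python
import mxnet as mx

# Print the fraction of memory used on each GPU; an embedding-heavy model
# may show one device far fuller than the others.
for i in range(mx.context.num_gpus()):
    free, total = mx.context.gpu_memory_info(i)
    print('GPU %d: %.1f%% used' % (i, 100.0 * (1.0 - free / total)))
```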
docs/tutorials/gluon/multi_gpu.md
Outdated
## Conclusion

With Apache MXNet training using multiple GPUs doesn't need a lot of extra code. To do the multiple GPUs training one needs to initialize a model on all GPUs, split the batches of data into separate splits where each is stored on a different GPU and run the model separately on every split. The synchronization of gradients and parameters between GPUs is done automatically by Apache MXNet.
one -> you
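To make the conclusion concrete, a compressed end-to-end sketch of the three steps it lists (synthetic data, a placeholder model, and a CPU fallback are my assumptions so the snippet runs anywhere):

```python
import mxnet as mx
from mxnet import autograd, gluon
from mxnet.gluon import nn

n_gpu = mx.context.num_gpus()
context = [mx.gpu(i) for i in range(n_gpu)] if n_gpu > 0 else [mx.cpu()]

net = nn.Dense(10)                      # placeholder model
net.initialize(ctx=context)             # 1. initialize on all devices
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

data = mx.nd.random.uniform(shape=(8, 20))     # synthetic batch
label = mx.nd.array([0, 1, 2, 3, 4, 5, 6, 7])
data_s = gluon.utils.split_and_load(data, ctx_list=context)    # 2. split batch
label_s = gluon.utils.split_and_load(label, ctx_list=context)

with autograd.record():                 # 3. run the model on every split
    losses = [loss_fn(net(x), y) for x, y in zip(data_s, label_s)]
for loss in losses:
    loss.backward()
trainer.step(data.shape[0])             # gradient sync happens automatically
```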
docs/tutorials/gluon/multi_gpu.md
Outdated
## Recommended Next Steps

* Check out our two video tutorial on improving your code performance. In the [first video](https://www.youtube.com/watch?v=n8tN6pRZBdE) we explain how to visualize the performance, and in the [second video](https://www.youtube.com/watch?v=Cqo7FPftNyo) we show how to optimize it
optimize it -> optimize it.
Fixed everything mentioned above.
docs/tutorials/gluon/multi_gpu.md
Outdated
## Multiple GPUs classification of MNIST images

In the first step, we are going to load the MNIST images, switch the format of data from `height x width x channel` to `channel x height x width` and normalize the data
In the first step, we are going to load the MNIST imagesa and use [ToTensor](https://mxnet.apache.org/api/python/gluon/data.html#mxnet.gluon.data.vision.transforms.ToTensor) to convert the format of the data from `height x width x channel` to `channel x height x width` and divide it by 255.
imagesa -> images
Fixed
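The corrected sentence corresponds to roughly this loading code (a sketch; the batch size and shuffling are my choices, not the tutorial's):

```python
import mxnet as mx
from mxnet import gluon
from mxnet.gluon.data.vision import datasets, transforms

# ToTensor converts HWC uint8 images to CHW float32 scaled to [0, 1].
train_data = datasets.MNIST(train=True).transform_first(transforms.ToTensor())
train_loader = gluon.data.DataLoader(train_data, batch_size=128, shuffle=True)

x, y = next(iter(train_loader))
print(x.shape)  # (128, 1, 28, 28)
```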
LGTM
docs/tutorials/gluon/multi_gpu.md
Outdated
```python
import mxnet as mx

a = mx.nd.array([1, 2, 3], ctx=mx.gpu(0))
```
use `context[0]` and `context[1]`
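Put together with the fallback suggested later in the thread, the testable version of this snippet could look like:

```python
import mxnet as mx

# Fall back gracefully when fewer than two GPUs are available.
n_gpu = mx.context.num_gpus()
context = ([mx.gpu(0), mx.gpu(1)] if n_gpu >= 2
           else [mx.gpu(), mx.gpu()] if n_gpu == 1
           else [mx.cpu(), mx.cpu()])

a = mx.nd.array([1, 2, 3], ctx=context[0])
b = mx.nd.array([5, 6, 7], ctx=context[1])
```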
docs/tutorials/gluon/multi_gpu.md
Outdated
```python
from mxnet import init
from mxnet.gluon import nn

context = [mx.gpu(0), mx.gpu(1)]
```
Can you make this:
```python
n_gpu = mx.context.num_gpus()
context = [mx.gpu(0), mx.gpu(1)] if n_gpu >= 2 else [mx.gpu(), mx.gpu()] if n_gpu == 1 else [mx.cpu(), mx.cpu()]
```
Added the tutorial back to tests by applying @ThomasDelteil's trick.
The first thing is there is an additional memory allocation that happens on GPUs that is not directly related to your data and your model to store auxiliary information for GPUs sync-up. Depending on the complexity of your model, the amount of required memory can be significant, and you may even experience CUDA out of memory exceptions. If that is the case, and you cannot decrease batch size anymore, you may want to consider switching `KVStore` storage to RAM by setting `kvstore` argument to `local` during instantiation of the `Trainer`. Often this decreases the wall-clock performance time of your model, because the gradients and parameters would need to be copied to RAM and back.

The second thing is that since the auxiliary information is distributed among GPUs in round-robin fashion on per block level, `KVStore` may use more memory on some GPUs and less on others. For example, if your model has a very big embedding layer, you may see that your first GPU uses 90% of your memory while others use only 50%. That affects how much data you actually can load in a single batch, because the data between devices is split evenly. If that is the case and you have to keep or increase your batch size, you may want to switch to the `local` mode.
Just a question: should we also mention the `dist_device_sync` mode of kvstore, used for distributed training with updates on GPUs?
According to the docs, `dist_device_sync` makes sense only for distributed training, when there is more than one host. With multi-GPU training on a single host, which is covered in this tutorial, only the `local` and `device` modes make sense.
@Ishitori there seems to be a warning that caused the nightly test to fail.
Fixed here #15248
* Add multiple GPUs training tutorial
* Add download source button
* Add tutorial to the test suite
* Remove from nightly build (no CI multigpu machines)
* Add extension to whitelisted multigpu tutorial
* Force build
* Force update
* Code review fixes
* Force build
* Typo fix and force build
* Add tutorial back to tests
* Add tutorial to the index
* Force build
Description
Add a new tutorial about multi-GPU training using the Gluon API.