Conversation
@mxnet-label-bot add [pr-awaiting-review]
@ptrendx can you share a benchmark on SGD performance when
This PR is part of upstreaming improvements to MXNet that are available in NVIDIA's NGC 18.11 MXNet container. I will use results from that container to show the impact once all the other improvements are in place. The benchmark shown is ResNet v1.5 training on single V100 32GB in DGX1-V, batch size 32.
Thanks for the contribution @ptrendx!
@@ -98,6 +99,9 @@ def dict_equ(a, b):

@with_seed()
Is it not tested with test_trainer?
Are you asking why I am not changing test_trainer as well, since it should fail with the MXNET_UPDATE_ON_KVSTORE=0 option set? Since you made a PR to fix that test, I did not change it. The MXNET_UPDATE_ON_KVSTORE=0 option is not set in CI (although the logic for the aggregated SGD itself is tested by the SGD test).
@ptrendx Can you please rebase this PR?
In the PR description you said that test_sgd covers the new code paths. But in _update_impl there is an if statement on aggregate:

```python
if aggregate:
    ...
else:
    ...
```

Can you explain how the else block is covered by the current test_sgd code?
for weight, grad in zip(weights, grads):
    assert(isinstance(weight, NDArray))
    assert(isinstance(grad, NDArray))
    aggregate = (aggregate and
@anirudhacharya As you can see, aggregate is set to True at the beginning and changes to False when encountering a non-default storage type, so testing with both dense and sparse data tests both branches of the code.
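For anyone reading along, here is a minimal sketch of the flag logic being discussed (my own illustration, not the actual diff; the helper name is made up), which is why one dense case plus one sparse case in test_sgd touches both branches:

```python
from mxnet.ndarray import NDArray

def uses_aggregate_path(weights, grads):
    # Mirrors the loop above: `aggregate` starts out True and is cleared
    # as soon as any weight or gradient has a non-default storage type
    # (e.g. 'row_sparse'), which routes the update to the per-tensor path.
    aggregate = True
    for weight, grad in zip(weights, grads):
        assert isinstance(weight, NDArray)
        assert isinstance(grad, NDArray)
        aggregate = (aggregate and
                     weight.stype == 'default' and
                     grad.stype == 'default')
    return aggregate
```

Dense-only inputs keep the aggregated branch; any sparse input exercises the else branch.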
thanks!
template<typename DType, typename MPDType>
struct MultiSGDKernelParam {
  static const int N = 60;
@anirudhacharya This is the reason for 60: I pass this struct as a kernel parameter, which has a limit of 4 kB.
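A rough back-of-the-envelope check of why N = 60 stays under that limit (the field sizes here are assumptions for illustration, not the exact struct layout):

```python
# Assumed per-entry footprint: five 8-byte device pointers (weight, grad,
# momentum, fp32 master weight, output) plus three 4-byte scalars
# (size, lr, wd). The real MultiSGDKernelParam layout may differ slightly.
per_entry_bytes = 5 * 8 + 3 * 4
print(60 * per_entry_bytes)  # 3120 bytes, comfortably below the 4096-byte limit
```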
Is there anything else needed for this PR?
I think the code LGTM; some minor doc fixes are needed.
@@ -105,6 +110,7 @@ def __init__(self, rescale_grad=1., param_idx2name=None, wd=0.,
        self._index_update_count = {}
        self.clip_gradient = clip_gradient
        self.multi_precision = multi_precision
        self.aggregate_num = 0
Please add this to the parameter list in the class doc.
It is not really a parameter though - it is up to the optimizer (not the user) to override this value if they support aggregated execution.
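For illustration only (a hypothetical subclass, not code from this PR), this is the kind of opt-in an optimizer author would do:

```python
import os
from mxnet.optimizer import SGD

class MyAggregatingSGD(SGD):
    """Hypothetical optimizer that opts in to aggregated updates."""
    def __init__(self, **kwargs):
        super(MyAggregatingSGD, self).__init__(**kwargs)
        # The optimizer itself (not the user) overrides aggregate_num,
        # mirroring how SGD reads MXNET_OPTIMIZER_AGGREGATION_SIZE below.
        self.aggregate_num = int(os.getenv('MXNET_OPTIMIZER_AGGREGATION_SIZE', '4'))
```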
@@ -502,6 +545,7 @@ def __init__(self, momentum=0.0, lazy_update=True, **kwargs):
        super(SGD, self).__init__(**kwargs)
        self.momentum = momentum
        self.lazy_update = lazy_update
        self.aggregate_num = int(os.getenv('MXNET_OPTIMIZER_AGGREGATION_SIZE', "4"))
In line 510 can you add a section on aggregate updates, and in line 524 can you also point to these two methods - multi_sgd_mom_update and multi_mp_sgd_update - as optimizer update rules.
Will do.
I wrote the section on aggregate updates, but I'm not sure about pointing to the new methods in line 524 - they use the same algorithm as the sgd_update and sgd_mom_update functions, so pointing to those functions for the details of the algorithm seems sufficient.
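For reference, both the single-tensor and the 'multi' variants apply the same momentum SGD rule; a sketch of the standard form (see the sgd_mom_update operator docs for the authoritative notation), with learning rate η, momentum γ, and weight decay wd:

```latex
\begin{aligned}
g_t &= \text{rescale\_grad} \cdot \nabla J(W_{t-1}) + wd \cdot W_{t-1} \\
v_t &= \gamma \, v_{t-1} - \eta \, g_t \\
W_t &= W_{t-1} + v_t
\end{aligned}
```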
I still think the 'multi' update methods should show up in the SGD doc description. But I am okay with the code owner/merger making a call on this.
I don't think it's necessary to point to these two methods, since the algorithm is the same.
LGTM. @eric-haibin-lin, open question brought up by @anirudh2290: in your opinion, should the 'multi' update methods show up in the SGD doc description?
Looks good, pending some suggestions for documentation. Awesome work!
* Aggregate SGD
* Make OpWrapperGenerator understand Tuple<float>
* Trigger
* Add NNVM Tuple to cpp-package op.h
* Trigger
* Fix pylint aggregate SGD
* Update info about new ENV vars and modifying 2 tests that require update_on_kvstore to be true
* Fix
* Aggregate SGD support for Gluon trainer
* Added text to doc about aggregate update in SGD optimizer
* Docs changes from review
This reverts commit 0a45e1a.
This reverts commit fabc318.
Missing type information for some parameters, e.g. from https://mxnet.incubator.apache.org/api/python/symbol/symbol.html#mxnet.symbol.multi_mp_sgd_mom_update
This should be And
Description
Currently MXNet optimizers are invoked one weight at a time. This leads to a lot of synchronization overhead, as updates (especially for convolutions and batchnorm) tend to be small, but each one needs to be synchronized on.
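A usage sketch (my own illustration, not part of the diff) of how the aggregated path is enabled end to end, assuming the two environment variables described in the Changes below:

```python
import os

# Both variables are read when the optimizer / trainer are created,
# so set them before importing and building anything.
os.environ['MXNET_UPDATE_ON_KVSTORE'] = '0'           # do updates outside the kvstore
os.environ['MXNET_OPTIMIZER_AGGREGATION_SIZE'] = '4'  # max weights bundled per kernel launch

import mxnet as mx
from mxnet import gluon

net = gluon.nn.Dense(10)
net.initialize()
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.1, 'momentum': 0.9})
# With updates off the kvstore, trainer.step() would let the SGD optimizer
# bundle several weight updates into a single aggregated kernel launch.
```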
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
* Ability to set the update_on_kvstore value via the environment variable MXNET_UPDATE_ON_KVSTORE (default is 1, which is consistent with the current behavior).
* When update_on_kvstore is False, in the case of the SGD optimizer it attempts to bundle updates of multiple weights together and launches a single kernel to perform them all, reducing the number of kernel calls and synchronizations.
Comments