Conversation
Thanks for the contribution!! Please see detailed comments in code.
src/operator/optimizer_op.cu
Outdated
@@ -28,6 +28,14 @@
namespace mxnet {
namespace op {

NNVM_REGISTER_OP(signsgd_update)
.set_attr<FCompute>("FCompute<gpu>", SignSGDUpdate<gpu>);
// .set_attr<FComputeEx>("FComputeEx<gpu>", SignSGDUpdateEx<gpu>);
Could you remove unused lines?
Done.
src/operator/optimizer_op.cc
Outdated
return std::vector<uint32_t>{2};
})
.set_attr<FCompute>("FCompute<cpu>", SignumUpdate<cpu>)
// .set_attr<FComputeEx>("FComputeEx<cpu>", SGDMomUpdateEx<cpu>)
Please remove the unused lines (also the ones on lines 42 and 65).
Done.
python/mxnet/optimizer.py
Outdated
@@ -57,6 +58,10 @@ class Optimizer(object):
The weight decay (or L2 regularization) coefficient. Modifies objective
by adding a penalty for having large weights.

wd_lh: float, optional
I don't see a change in the Optimizer class constructor. Why is this changed?
I added that to the constructor at some point, because wd_lh is more generally applicable to other algorithms too (in particular, Adam).
Removed that line.
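For context, a minimal NumPy sketch (purely illustrative, not the MXNet implementation) of what a decoupled weight decay term like wd_lh looks like when applied to Adam, in the spirit of Loshchilov and Hutter: the decay acts directly on the weights instead of being folded into the gradient.

import numpy as np

def adam_step_decoupled_wd(weight, grad, m, v, t, lr=1e-3, beta1=0.9,
                           beta2=0.999, eps=1e-8, wd_lh=1e-2):
    """One Adam step with decoupled weight decay (illustration only)."""
    m = beta1 * m + (1 - beta1) * grad            # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias corrections
    v_hat = v / (1 - beta2 ** t)
    # decoupled decay: shrink the weights directly, not through the gradient
    weight = (1 - lr * wd_lh) * weight - lr * m_hat / (np.sqrt(v_hat) + eps)
    return weight, m, v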
src/operator/optimizer_op.cc
Outdated

** Sparse matrix not supported for this optimizer yet.

If weight and momentum are both of ``row_sparse`` storage type,
I'd rather remove lines 81-87 since sparse update is not supported anyway.
Done.
src/operator/optimizer_op.cc
Outdated

Where the parameter ``momentum`` is the decay rate of momentum estimates at each epoch.

** Sparse matrix not supported for this optimizer yet.
Not sure if a sentence starting with ** renders well in the API doc. What about adding a "note" section like rint does?
/~https://github.com/apache/incubator-mxnet/blob/ae70769c8e35cc178bf7dd9dba35386c13394394/src/operator/tensor/elemwise_unary_op_basic.cc#L432-L434
Also, the term "sparse ndarray" is preferred over "sparse matrix" :)
Done.
src/operator/optimizer_op.cc
Outdated

NNVM_REGISTER_OP(signsgd_update)
// MXNET_ADD_SPARSE_OP_ALIAS(signsgd_update)
.describe(R"code(Update function for SignSGDoptimizer.
nit: SignSGD optimizer
Done. Also added a math description block similar to the other optimizers.
src/operator/optimizer_op.cc
Outdated
weight = weight - learning_rate * sign(gradient)

** Sparse matrix not supported for this optimizer yet.
Same comment about documentation rendering and FInferStorageType in signum_update.
Done.
python/mxnet/optimizer.py
Outdated

@register
class Signum(Optimizer):
"""The SGD optimizer with momentum and weight decay.
The one-line summary should also mention that it only takes the sign. Otherwise readers don't know it until they see line 547.
Added details to the doc accordingly.
float lr;
float wd;
float rescale_grad;
float clip_gradient;
If the clip_gradient param has no effect on either SignSGD or Signum, can we just remove this param from signsgd_update and signum_update? That would also simplify the C++ kernels.
It has an effect on Signum, because it leads to a different result depending on whether we use the gradient or the clipped gradient when calculating the momentum.
Ah, I see. Thanks for the explanation!
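To illustrate the point with a small NumPy sketch (following the documented pseudocode, not the C++ kernel): the clipped gradient is what feeds the momentum buffer, so clipping can flip the sign of the state and hence of the update.

import numpy as np

def signum_state(grads, momentum=0.9, clip_gradient=None):
    """Accumulate the Signum momentum buffer over a sequence of gradients."""
    state = np.zeros_like(grads[0])
    for g in grads:
        if clip_gradient is not None:
            g = np.clip(g, -clip_gradient, clip_gradient)  # clip before accumulating
        state = momentum * state + (1 - momentum) * g
    return state

grads = [np.array([100.0]), np.array([-2.0]), np.array([-2.0])]
print(np.sign(signum_state(grads)))                      # +1: the outlier dominates
print(np.sign(signum_state(grads, clip_gradient=1.0)))   # -1: clipping flips the sign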
python/mxnet/optimizer.py
Outdated
momentum : float, optional
The momentum value.
wd_lh : float, optitional
The amount of decoupled weight decay regularization.
Let's also add a reference/link to the original paper
Added a temporary link to the PDF hosted on Jeremy's site. Will update to an arXiv or published version when they are ready.
Force-pushed from 30d980c to 955c7f0.
There are new conflicts now. Do you mind resolving them again?
Force-pushed from 955c7f0 to dc6fb2d.
Done fixing the conflicts.
@lx75249 could you help review the code for cpp-package?
@eric-haibin-lin LGTM
    signum_update(weight, grad, state, out=weight,
                  lr=lr, wd=wd, **kwargs)
else:
    signsgd_update(weight, grad, out=weight,
what's this?
Well, signsgd takes the sign of the stochastic gradient.
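In other words, roughly this rule (a NumPy illustration of the documented update, not the actual kernel):

import numpy as np

def signsgd_step(weight, grad, lr=0.01):
    # documented rule: weight = weight - learning_rate * sign(gradient)
    return weight - lr * np.sign(grad)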

rescaled_grad = rescale_grad * clip(grad, clip_gradient) + wd * weight
state = momentum * state + (1-momentum)*rescaled_grad
weight = (1 - lr * wd_lh) * weight - lr * sign(state)
what's wd_lh? Is it from the original paper?
It is an alternative weight decay. See the descriptions.
Since wd_lh is new, I suggest putting a reference link to the original paper by Loshchilov and Frank Hutter in the documentation.
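For reference, the pseudocode above translates to roughly the following NumPy sketch (illustrative only; the real implementation is the C++ kernel behind signum_update):

import numpy as np

def signum_step(weight, grad, state, lr=0.01, momentum=0.9, wd=0.0,
                wd_lh=0.0, rescale_grad=1.0, clip_gradient=None):
    """One Signum step following the documented pseudocode."""
    g = grad if clip_gradient is None else np.clip(grad, -clip_gradient, clip_gradient)
    rescaled_grad = rescale_grad * g + wd * weight             # coupled weight decay (wd)
    state = momentum * state + (1 - momentum) * rescaled_grad  # momentum buffer
    weight = (1 - lr * wd_lh) * weight - lr * np.sign(state)   # decoupled decay (wd_lh)
    return weight, state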
kwargs['wd_lh'] = self.wd_lh

if state is not None:
    signum_update(weight, grad, state, out=weight,
Call these signum_momentum_update and signum_update to be consistent with the others.
RE: naming.
- Signum means SIGN momentUM, so the semantics of the momentum is already in the name.
- SignSGD is the special case of Signum without momentum, and the name has been used before.
Unless we change the names in our paper, let's keep them the way they are.
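To make the special case concrete (a quick NumPy check based on the pseudocode, not the kernels): with momentum = 0 the state is just the rescaled gradient, so the Signum direction reduces to the SignSGD rule.

import numpy as np

grad = np.array([0.3, -1.2, 0.0])
state = np.zeros_like(grad)
momentum = 0.0

state = momentum * state + (1 - momentum) * grad       # state == grad when momentum is 0
print(np.array_equal(np.sign(state), np.sign(grad)))   # True: same direction as SignSGD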
One final comment. Otherwise LGTM. Thanks for the contribution!

rescaled_grad = rescale_grad * clip(grad, clip_gradient) + wd * weight
state = momentum * state + (1-momentum)*rescaled_grad
weight = (1 - lr * wd_lh) * weight - lr * sign(state)
Since wd_lh is new, I suggest putting a reference link to the original paper by Loshchilov and Frank Hutter in the documentation.
Added the reference to the documentation as suggested. Thanks guys for reviewing the PR!
Thanks
* the c++ version of signum and signsgd optimizer
* optimizer signum, tested working with mac on cpu using mnist
* unit test for signum
* fix lint and incorporate haibin's code review
* rerun jenkins
* adding link to the Loshchilov and Hutter paper to the documentation
Description
Added the C++ implementation of the Signum optimizer.
Bernstein, Wang, Azizzadenesheli and Anandkumar (2017) "The Signum optimiser: a theory of momentum in quantised stochastic optimisation"
Link to pdf
Also included is an option, 'wd_lh', to apply the alternative (decoupled) weight decay regularization due to Loshchilov and Hutter.
"Fixing Weight Decay Regularization in Adam"
Link to arxiv
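Based on the diff above, Python usage should look roughly like this (the values are illustrative and the exact signature lives in python/mxnet/optimizer.py):

import mxnet as mx

# Optimizer added in this PR: 'wd_lh' is the decoupled (Loshchilov & Hutter)
# weight decay, 'wd' the usual coupled one.
opt = mx.optimizer.Signum(learning_rate=0.01, momentum=0.9, wd=0.0, wd_lh=1e-4)

# For example, with a Gluon trainer (assuming the class is registered under 'signum'):
# trainer = mx.gluon.Trainer(net.collect_params(), opt)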
Checklist
Essentials
- Passed code style checking (make lint)
Changes
Comments