[MXNet-1340][Fit API]Update train stats #14494

roywei · 2019-03-21T18:26:53Z

Description

In the previous Fit-API design doc, training statistics were stored in a dictionary, with some values stored as lists, such as learning rates and training accuracy over epochs. Users had to understand the underlying data structure in order to access the statistics.
This PR improves how event handlers access train stats, and reduces empty method calls in event handlers to improve efficiency.
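
As a rough illustration of the problem described above (the key names and values are hypothetical and not taken from the design doc), reading the latest value out of a dictionary of lists requires knowing both the key and the list layout:

```python
# Hypothetical layout of the old-style statistics:
# a dict whose values are per-epoch lists.
train_stats = {
    "learning_rate": [0.1, 0.05, 0.025],
    "train_accuracy": [0.71, 0.83, 0.88],
}


def latest(stats, key):
    """Return the most recent value recorded for `key`."""
    return stats[key][-1]


last_accuracy = latest(train_stats, "train_accuracy")
```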

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with MXNET-1340 created
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Feature1, tests, (and when applicable, API doc)
Feature2, tests, (and when applicable, API doc)

Comments

If this change is a backward incompatible change, why must this change be made.
Interesting edge cases to note here

python/mxnet/gluon/estimator/estimator.py

piyushghai · 2019-03-21T20:01:18Z

python/mxnet/gluon/trainer.py

+        if isinstance(self._optimizer, opt.Optimizer):
+            return self._optimizer
+        else:
+            raise UserWarning("Optimizer has not been initialized yet")


What if the user sets a custom optimizer here?

A custom optimizer should still inherit from the base Optimizer class. Gluon trainer does the check: https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/trainer.py#L123
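
A minimal sketch of why the `isinstance` check above still accepts custom optimizers. The classes here are stand-ins written for illustration (assumptions, not the real `mxnet.optimizer.Optimizer` or `gluon.Trainer`); only the check and the error message mirror the diff.

```python
class Optimizer:
    """Stand-in for the base optimizer class (assumption, not MXNet's)."""


class MyCustomOptimizer(Optimizer):
    """A user-defined optimizer that inherits from the base class."""


class Trainer:
    """Minimal stand-in for a trainer that exposes its optimizer."""

    def __init__(self, optimizer=None):
        self._optimizer = optimizer

    @property
    def optimizer(self):
        # Any subclass of Optimizer passes this check, custom or built-in.
        if isinstance(self._optimizer, Optimizer):
            return self._optimizer
        raise UserWarning("Optimizer has not been initialized yet")


custom = MyCustomOptimizer()
trainer = Trainer(custom)
```

With this layout, `trainer.optimizer` returns the custom instance, while a trainer constructed without an optimizer raises `UserWarning` on access.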

piyushghai · 2019-03-21T20:05:05Z

python/mxnet/gluon/estimator/estimator.py

+            self.trainer = gluon.Trainer(self.net.collect_params(),
+                                           'sgd', {'learning_rate': 0.001})
+        elif not isinstance(trainer, gluon.Trainer):
+            raise ValueError("Trainer must be a Gluon Trainer instance, refer to gluon.trainer")


Since you say refer to gluon.trainer, you could probably also add a url to gluon.trainer docs : http://mxnet.incubator.apache.org/versions/master/api/python/gluon/gluon.html#trainer

piyushghai · 2019-03-21T20:12:00Z

python/mxnet/gluon/estimator/estimator.py

-                    self.train_stats['step'] = i
+                self.train_history.batch_idx = i
+                # record trained samples v.s. total samples if using Gluon DataLoader
+                if isinstance(train_data, gluon.data.DataLoader):


You might need to rebase this line with the fit-api branch again.
https://github.com/apache/incubator-mxnet/blob/fit-api/python/mxnet/gluon/estimator/estimator.py#L245

pinaraws · 2019-03-25T01:19:59Z

@mxnet-label-bot add[pr-awaiting-review, Gluon]

roywei · 2019-03-27T22:50:52Z

python/mxnet/gluon/estimator/estimator.py

        for handler in event_handlers:
+            handler.estimator = self


@nswamy This avoids asking the user to pass the estimator during event handler construction, reference: #14462 (comment)

I am wondering how the user of handler will know that an estimator will be initialized here? Also can you have a setter and getter for the estimator in Handler and not call handler.setEstimator(e) if handler.getEstimator() is not None.

@nswamy When the user calls est.fit(xxx, event_handlers=XXX), this already associates the event handlers with an estimator instance. I'm just helping the user pass this estimator so they don't need to do so during event handler construction.
The getter and setter are already implemented through the property interface. handler.estimator = self actually invokes the setter of the estimator property.
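
A minimal sketch of the property-based getter and setter described in this reply. The class bodies are simplified stand-ins (assumptions), not the code from this PR; only the `handler.estimator = self` assignment mirrors the diff.

```python
class EventHandler:
    """Simplified handler that stores a reference to its estimator."""

    def __init__(self):
        self._estimator = None

    @property
    def estimator(self):
        # Getter: handler.estimator
        return self._estimator

    @estimator.setter
    def estimator(self, estimator):
        # Setter: handler.estimator = some_estimator
        self._estimator = estimator


class Estimator:
    """Simplified estimator that attaches itself to each handler in fit()."""

    def fit(self, event_handlers):
        for handler in event_handlers:
            handler.estimator = self  # invokes the property setter


handler = EventHandler()
est = Estimator()
est.fit([handler])
```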

nswamy · 2019-04-01T23:51:34Z

python/mxnet/gluon/estimator/estimator.py

            handler.train_end()
+
+    def categorize_handlers(self, event_handlers):


nit: categorize_handlers -> _categorize_handlers. Don't think we need this exposed to users.

nswamy · 2019-04-01T23:56:31Z

python/mxnet/gluon/estimator/estimator.py

+        batch_end = []
+        epoch_end = []
+        train_end = []
+        base_handler = EventHandler()


if you are checking class methods, do you need to create an instance here?
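
One way to do this check without instantiating the base class is to compare the method found on the handler's type with the one defined on `EventHandler`. This is a sketch under assumptions: the method names and the `overrides` helper are hypothetical, and the PR's actual implementation may differ.

```python
class EventHandler:
    """Base handler with empty hooks (hypothetical subset of methods)."""

    def batch_end(self):
        pass

    def epoch_end(self):
        pass

    def train_end(self):
        pass


class LoggingHandler(EventHandler):
    """Example handler that only overrides epoch_end."""

    def epoch_end(self):
        print("epoch finished")


def overrides(handler, name):
    """True if the handler's class replaces the base implementation of `name`."""
    return getattr(type(handler), name) is not getattr(EventHandler, name)


def categorize_handlers(event_handlers):
    """Group handlers by hook so that empty base methods are never called."""
    hooks = ("batch_end", "epoch_end", "train_end")
    return {
        hook: [h for h in event_handlers if overrides(h, hook)]
        for hook in hooks
    }


groups = categorize_handlers([LoggingHandler(), EventHandler()])
```

Here only the `LoggingHandler` instance lands in the `epoch_end` group, and the plain `EventHandler` is not scheduled for any hook.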

nswamy · 2019-04-02T00:01:53Z

python/mxnet/gluon/estimator/estimator.py

                 metrics=None,
                 initializer=None,
-                 trainers=None,
+                 trainer=None,
                 context=None):

        self.net = net


do you want to set self._estimator = None?

This is the estimator class, only event handlers should have self._estimator?

nswamy · 2019-04-02T00:03:23Z

python/mxnet/gluon/estimator/estimator.py

        for handler in event_handlers:
+            handler.estimator = self


I am wondering how the user of handler will know that an estimator will be initialized here? Also can you have a setter and getter for the estimator in Handler and not call handler.setEstimator(e) if handler.getEstimator() is not None.

roywei · 2019-04-02T16:11:08Z

@nswamy I have addressed the comments, could you take another look? thanks!

piyushghai · 2019-04-02T21:22:33Z

@roywei Can you look at the CI failures ?

roywei · 2019-04-02T21:32:11Z

@piyushghai It's all due to the R package failure, so I think it's OK. We will rebase before merging to master, and hopefully the R test will pass.

* add train history
* update history
* update test
* avoid calling empty methods
* remove train history object
* fix pylint
* add unit test
* fix test
* update categorize handlers

* [MXNet-1334][Fit API]base class for estimator and eventhandler (#14346)
  * base class for estimator and eventhandler
  * add license
  * add event handlers
  * fix pylint
  * improve arg check
  * fix pylint
  * add unit tests
* Fixed issue where the estimator was printing beyond the dataset size … (#14464)
  * Fixed issue where the estimator was printing beyond the dataset size for the last batch
  * Added comments
  * Nudge to CI
* [MXNet-1349][Fit API]Add validation support and unit tests for fit() API (#14442)
  * added estimator unittests
  * add more tests for estimator
  * added validation logic
  * added error handlers, unittests
  * improve val stats
  * fix pylint
  * fix pylint
  * update unit test
  * fix tests
  * fix tests
  * updated metrics, val logic
  * trigger ci
  * trigger ci
  * update metric, batch_fn error handler
  * update context logic, add default metric
* [MXNet-1340][Fit API]Update train stats (#14494)
  * add train history
  * update history
  * update test
  * avoid calling empty methods
  * remove train history object
  * fix pylint
  * add unit test
  * fix test
  * update categorize handlers
* [MXNet-1375][Fit API]Added RNN integration test for fit() API (#14547)
  * Added RNN integration test for fit() API
  * Addressed review comments: change in JenkinFile, tmp directory, ctx with condense if/else, renamed imports
  * CPU test doesn't require nvidiadocker container
  * Modified the structure by removing the redundant code
* [MXNet-1343][Fit API]Add CNN integration test for fit() API (#14405)
  * added cnn intg tests for fit api
  * updated cnn intg tests
  * added functions for nightly test
  * updated runtime_function
  * updated intg tests
  * updated init, datapath, refs
  * added validation data
  * update cpu test
  * refactor code
  * updated context
* [MXNET-1344, 1346][FIT API] Retrieve Batch size and Logging verbose support for Gluon fit() API (#14587)
  * Retrieve Batch size and Logging verbose support for Gluon fit() API
  * NIT changes
  * Addressed review comments: shifted the batch size code to a separate method, sentence correction
  * Modified unittest
  * removed redundant parameter
  * Resolve CI test failure
  * only support DataLoader for now, future PRs will include DataIter to DataLoader converter
  * Get the number of samples from shape attribute instead of length due to low space complexity
  * Simplified batch size retrieval code
  * removed batch_size parameter from fit() method and fixed the tests
  * Verbose exception handling
  * Assigning constant to a verbose
  * Modified exception message
  * Resolved undefined class reference
  * Addressed review comments: Modified verbose level names, docs, variable names
  * Update estimator.py
* move estimator to contrib (#14633)
* move to gluon contrib (#14635)
* [Fit API] improve event handlers (#14685)
  * improve event handlers
  * update tests
  * passing weakref of estimator
  * fix unit test
  * fix test
  * fix pylint
  * fix test
  * fix pylint
  * move default metric logic
  * combine nightly tests
* [MXNET-1396][Fit-API] Update default handler logic (#14765)
  * move to nightly for binaries
  * update default handler
  * fix pylint
  * trigger ci
  * trigger ci
* [Fit API] update estimator (#14849)
  * address comments
  * add comment
  * check available context
  * fix bug
  * change cpu check
* [Fit-API] Adress PR comments (#14885)
  * address comments
  * update checkpoint
  * test symbol save
  * address comments
  * add resume
  * update doc and resume checkpoint
  * update docs
  * trigger ci
  * trigger ci

add train history

d71eba9

roywei requested review from eric-haibin-lin and szha as code owners March 21, 2019 18:26

roywei changed the base branch from master to fit-api March 21, 2019 18:27

nswamy reviewed Mar 21, 2019

piyushghai reviewed Mar 21, 2019

update history

a8c2c7f

marcoabreu added the Gluon and pr-awaiting-review labels Mar 25, 2019

roywei added 5 commits March 25, 2019 14:29

update test

29c68b4

merge with fit-api

ad4041b

avoid calling empty methods

b8ec43c

remove train history object

904cdc7

fix pylint

c6fe873

roywei commented Mar 27, 2019

roywei added 2 commits March 27, 2019 21:00

add unit test

6e690ca

fix test

55b102e

roywei changed the title ~~[MXNet-1340][Fit API]Adding train history class~~ [MXNet-1340][Fit API]Update train stats Apr 1, 2019

nswamy reviewed Apr 2, 2019

update categorize handlers

feac6e3

nswamy merged commit ed7f6e5 into apache:fit-api Apr 3, 2019
