Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[Fit-API] Adress PR comments #14885

Merged
merged 9 commits into from
May 14, 2019
Merged

Conversation

roywei
Copy link
Member

@roywei roywei commented May 6, 2019

Description

To address PR comments in #14629

  1. Check train_data and val_data only supports dataloader, give proper error msg
  2. exposed batch_axis and set default 0
  3. updated initialization logic, so fine-tuning use case (net partially initialized) will not get forced re-init.
  4. explained auto mode in checkpoint and early stopping, added docstring, warning and logging message when auto mode is used and np.greater/np.less is determined in auto mode.
  5. log learning rate every epoch
  6. update logic to save checkpoint based on epoch_period(default 1) and batch_period (default None)
  7. update checkpoint handler to save the following: symbol (1 time after first batch, if net is hybridizable), params and states per epcoh or batch, only keep a maximum number of checkpoints(default 5)
  8. added feature to resume training from checkpoints

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@roywei roywei requested a review from szha as a code owner May 6, 2019 00:49
@szha szha requested a review from eric-haibin-lin May 6, 2019 00:55
@anirudhacharya
Copy link
Member

@mxnet-label-bot add [pr-awaiting-review]

@marcoabreu marcoabreu added the pr-awaiting-review PR is waiting for code review label May 6, 2019
@roywei
Copy link
Member Author

roywei commented May 9, 2019

@eric-haibin-lin addressed comments, added option to resume from checkpoint. Will load params and trainer states at train begin if checkpoint files detected. Will train the remain of the epochs.

@eric-haibin-lin
Copy link
Member

LGTM. My comments are addressed.

@eric-haibin-lin eric-haibin-lin merged commit 588730c into apache:fit-api May 14, 2019
roywei added a commit to roywei/incubator-mxnet that referenced this pull request May 15, 2019
* address comments

* update checkpoint

* test symbol save

* address comments

* add resume

* update doc and resume checkpoint

* update docs

* trigger ci

* trigger ci
roywei added a commit to roywei/incubator-mxnet that referenced this pull request May 15, 2019
* address comments

* update checkpoint

* test symbol save

* address comments

* add resume

* update doc and resume checkpoint

* update docs

* trigger ci

* trigger ci
eric-haibin-lin pushed a commit that referenced this pull request May 17, 2019
* address comments

* update checkpoint

* test symbol save

* address comments

* add resume

* update doc and resume checkpoint

* update docs

* trigger ci

* trigger ci
szha pushed a commit that referenced this pull request May 18, 2019
* [MXNet-1334][Fit API]base class for estimator and eventhandler (#14346)

* base class for estimator and eventhandler

* add license

* add event handlers

* fix pylint

* improve arg check

* fix pylint

* add unit tests

* Fixed issue where the estimator was printing beyond the dataset size … (#14464)

* Fixed issue where the estimator was printing beyond the dataset size for the last batch

* Added comments

* Nudge to CI

* [MXNet-1349][Fit API]Add validation support and unit tests for fit() API (#14442)

* added estimator unittests

* add more tests for estimator

* added validation logic

* added error handlers, unittests

* improve val stats

* fix pylint

* fix pylint

* update unit test

* fix tests

* fix tests

* updated metrics, val logic

* trigger ci

* trigger ci

* update metric, batch_fn error handler

* update context logic, add default metric

* [MXNet-1340][Fit API]Update train stats (#14494)

* add train history

* update history

* update test

* avoid calling empty methods

* remove train history object

* fix pylint

* add unit test

* fix test

* update categorize handlers

* [MXNet-1375][Fit API]Added RNN integration test for fit() API (#14547)

* Added RNN integration test for fit() API

* Addressed review comments: change in JenkinFile, tmp directory, ctx with condense if/else, renamed imports

* CPU test doesn't require nvidiadocker container

* Modified the structure by removing the redundant code

* [MXNet-1343][Fit API]Add CNN integration test for fit() API (#14405)

* added cnn intg tests for fit api

* updated cnn intg tests

* added functions for nightly test

* updated runtime_function

* updated intg tests

* updated init, datapath, refs

* added validation data

* update cpu test

* refactor code

* updated context

* [MXNET-1344, 1346][FIT API] Retrieve Batch size and Logging verbose support for Gluon fit() API (#14587)

* Retrieve Batch size and Logging verbose support for Gluon fit() API

* NIT changes

* Addressed review comments: shifted the batch size code to a separate method, sentence correction

* Modified unittest

* removed redundant parameter

* Resolve CI test failure

* only support DataLoader for now, future PRs will include DataIter to DataLoader converter

* Get the number of samples from shape attribute instead of length due to low space complexity

* Simplified batch size retrieval code

* removed batch_size parameter from fit() method and fixed the tests

* Verbose exception handling

* Assigning constant to a verbose

* Modified exception message

* Resolved undefined class reference

* Addressed review comments: Modified verbose level names, docs, variable names

* Update estimator.py

* move estimator to contrib (#14633)

* move to gluon contrib (#14635)

* [Fit API] improve event handlers (#14685)

* improve event handlers

* update tests

* passing weakref of estimator

* fix unit test

* fix test

* fix pylint

* fix test

* fix pylint

* move default metric logic

* combine nightly tests

* [MXNET-1396][Fit-API] Update default handler logic (#14765)

* move to nightly for binaries

* update default handler

* fix pylint

* trigger ci

* trigger ci

* [Fit API] update estimator (#14849)

* address comments

* add comment

* check available context

* fix bug

* change cpu check

* [Fit-API] Adress PR comments (#14885)

* address comments

* update checkpoint

* test symbol save

* address comments

* add resume

* update doc and resume checkpoint

* update docs

* trigger ci

* trigger ci
szha pushed a commit that referenced this pull request May 20, 2019
* address comments

* update checkpoint

* test symbol save

* address comments

* add resume

* update doc and resume checkpoint

* update docs

* trigger ci

* trigger ci
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* [MXNet-1334][Fit API]base class for estimator and eventhandler (apache#14346)

* base class for estimator and eventhandler

* add license

* add event handlers

* fix pylint

* improve arg check

* fix pylint

* add unit tests

* Fixed issue where the estimator was printing beyond the dataset size … (apache#14464)

* Fixed issue where the estimator was printing beyond the dataset size for the last batch

* Added comments

* Nudge to CI

* [MXNet-1349][Fit API]Add validation support and unit tests for fit() API (apache#14442)

* added estimator unittests

* add more tests for estimator

* added validation logic

* added error handlers, unittests

* improve val stats

* fix pylint

* fix pylint

* update unit test

* fix tests

* fix tests

* updated metrics, val logic

* trigger ci

* trigger ci

* update metric, batch_fn error handler

* update context logic, add default metric

* [MXNet-1340][Fit API]Update train stats (apache#14494)

* add train history

* update history

* update test

* avoid calling empty methods

* remove train history object

* fix pylint

* add unit test

* fix test

* update categorize handlers

* [MXNet-1375][Fit API]Added RNN integration test for fit() API (apache#14547)

* Added RNN integration test for fit() API

* Addressed review comments: change in JenkinFile, tmp directory, ctx with condense if/else, renamed imports

* CPU test doesn't require nvidiadocker container

* Modified the structure by removing the redundant code

* [MXNet-1343][Fit API]Add CNN integration test for fit() API (apache#14405)

* added cnn intg tests for fit api

* updated cnn intg tests

* added functions for nightly test

* updated runtime_function

* updated intg tests

* updated init, datapath, refs

* added validation data

* update cpu test

* refactor code

* updated context

* [MXNET-1344, 1346][FIT API] Retrieve Batch size and Logging verbose support for Gluon fit() API (apache#14587)

* Retrieve Batch size and Logging verbose support for Gluon fit() API

* NIT changes

* Addressed review comments: shifted the batch size code to a separate method, sentence correction

* Modified unittest

* removed redundant parameter

* Resolve CI test failure

* only support DataLoader for now, future PRs will include DataIter to DataLoader converter

* Get the number of samples from shape attribute instead of length due to low space complexity

* Simplified batch size retrieval code

* removed batch_size parameter from fit() method and fixed the tests

* Verbose exception handling

* Assigning constant to a verbose

* Modified exception message

* Resolved undefined class reference

* Addressed review comments: Modified verbose level names, docs, variable names

* Update estimator.py

* move estimator to contrib (apache#14633)

* move to gluon contrib (apache#14635)

* [Fit API] improve event handlers (apache#14685)

* improve event handlers

* update tests

* passing weakref of estimator

* fix unit test

* fix test

* fix pylint

* fix test

* fix pylint

* move default metric logic

* combine nightly tests

* [MXNET-1396][Fit-API] Update default handler logic (apache#14765)

* move to nightly for binaries

* update default handler

* fix pylint

* trigger ci

* trigger ci

* [Fit API] update estimator (apache#14849)

* address comments

* add comment

* check available context

* fix bug

* change cpu check

* [Fit-API] Adress PR comments (apache#14885)

* address comments

* update checkpoint

* test symbol save

* address comments

* add resume

* update doc and resume checkpoint

* update docs

* trigger ci

* trigger ci
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* address comments

* update checkpoint

* test symbol save

* address comments

* add resume

* update doc and resume checkpoint

* update docs

* trigger ci

* trigger ci
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
pr-awaiting-review PR is waiting for code review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants