Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNet-1343][Fit API]Add CNN integration test for fit() API #14405

Merged
merged 13 commits into from
Apr 3, 2019

Conversation

abhinavs95
Copy link
Contributor

@abhinavs95 abhinavs95 commented Mar 12, 2019

Description

Add nightly integration tests for fit() API using CNN models.
This PR depends on the parent PR for fit() API #14346
JIRA epic: https://issues.apache.org/jira/projects/MXNET/issues/MXNET-1333

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with MXNet-1343
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • test_estimator_cnn_cpu: test on cpu using synthetic data on alexnet, resnet18_v1, FCN
  • test_estimator_cnn_gpu: test on gpu using MNIST dataset on resnet18_v1 and verify training accuracy

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

set -ex
cd /work/mxnet/tests/nightly/estimator
export PYTHONPATH=/work/mxnet/python/
python tests/nightly/estimator/test_estimator_cnn_gpu.py
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the test_estimator_cnn_gpu.py file path correct? The current shell location is already inside /work/mxnet/tests/nightly/estimator directory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed!

'estimator: CNN CPU': {
node(NODE_LINUX_CPU) {
ws('workspace/estimator-test-cnn-cpu') {
utils.unpack_and_init('gpu', mx_lib)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: 'cpu' ??

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed!

trainers=trainer,
context=context)
# Call fit() to begin training
logging_handler = event_handler.LoggingHandler(est, model_name+'_log', model_name+'_log')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LoggingHandler Class constructor says def __init__(self, estimator, file_name=None, file_location=None, ):, is file_location name correct?(e.g.: alexnet_log/alexnet_log). I think you have to provide file_location as a directory path or you can leave it empty since code takes care of it. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing it out. Removing logging handler from integration tests for now as its undergoing changes. It will be covered in unit tests.

trainers=trainer,
context=context)
# Call fit() to begin training
logging_handler = event_handler.LoggingHandler(est, model_name+'_log', model_name+'_log')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

@karan6181
Copy link
Contributor

@mxnet-label-bot add [Gluon, pr-work-in-progress]

@marcoabreu marcoabreu added Gluon pr-work-in-progress PR is still work in progress labels Mar 13, 2019
Copy link
Member

@roywei roywei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank for your contribution, as these tests are not running in PR checks, let's test them before merging.

context=context)
# Call fit() to begin training
est.fit(train_data=train_data,
# val_data=test_data,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add validation data here once #14442 merged

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added in 810faef

dataset = gluon.data.dataset.ArrayDataset(mx.nd.random.uniform(shape=(batch_size, 1, 224, 224)),
mx.nd.zeros(batch_size))
loss = gluon.loss.SoftmaxCrossEntropyLoss()
net.initialize(mx.init.MSRAPrelu(), ctx=context)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's use a more common initializer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to Xavier initializer


def bilinear_kernel(in_channels, out_channels, kernel_size):
'''
Bilinear interpolation using transposed convolution
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reference for this implementation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added link to reference code

from mxnet.gluon.model_zoo import vision

def load_data_mnist(batch_size, resize=None, num_workers=None,
root=os.path.join('~', '.mxnet', 'datasets', 'mnist')):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove root, let's not keep the data under ~/.mxnet in nightly tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@nswamy nswamy changed the title [MXNet-1343][WIP][Fit API]Add CNN integration test for fit() API [MXNet-1343][Fit API]Add CNN integration test for fit() API Mar 19, 2019
context=context)
# Call fit() to begin training
est.fit(train_data=train_data,
val_data=train_data,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please create another random validation data and pass it here instead of using same train_data for both train and validation. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Copy link
Member

@roywei roywei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LTGM! let wait for CI back to normal. also try to run this locally to make sure it passes

@abhinavs95
Copy link
Contributor Author

@mxnet-label-bot update [Gluon, Test, pr-awaiting-merge]

@marcoabreu marcoabreu added pr-awaiting-merge Review and CI is complete. Ready to Merge Test and removed pr-work-in-progress PR is still work in progress labels Mar 26, 2019
val_dataset = gluon.data.dataset.ArrayDataset(mx.nd.random.uniform(shape=(batch_size, 3, 320, 480)),
mx.nd.zeros(shape=(batch_size, 320, 480)))
loss = gluon.loss.SoftmaxCrossEntropyLoss(axis=1)
net[-1].initialize(init.Constant(bilinear_kernel(num_classes, num_classes, 64)), ctx=context)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not get the net from directly from FCN method?, why split the logic?.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated logic, please have a look again

input_size = 224
lr = 0.001
# Set context
if mx.context.num_gpus() > 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use all GPUs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The GPU test just runs 5 epochs on MNIST, I think 1 GPU is enough.

from mxnet.gluon.estimator import estimator, event_handler
from mxnet.gluon.model_zoo import vision

def load_data_mnist(batch_size, resize=None, num_workers=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you have we need 2 different files?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

combined tests into 1 file

Test estimator by doing one pass over each model with synthetic data
'''
models = ['resnet18_v1',
'alexnet',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think we need alexnet here since the fit is not very different than resnet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed alexnet

set -ex
cd /work/mxnet/tests/nightly/estimator
export PYTHONPATH=/work/mxnet/python/
python test_estimator_cnn.py --type gpu
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the reason to run a python script instead of using nosetest and assert accuracy at the end?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are executing a single test here, running the python script suffices. We can shift to nosetests if we add more tests in the future.

@@ -106,6 +106,22 @@ core_logic: {
utils.docker_run('ubuntu_nightly_gpu', 'nightly_tutorial_test_ubuntu_python3_gpu', true, '1500m')
}
}
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, This should be in nightly/Jenkinsfile and not the one for Binaries

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should work in either Jenkins file, I placed it in binaries since it contained similar existing nightly tests. I will keep your suggestion in mind and update with a follow up PR in case there are any issues with the current setup.

model_name = 'resnet18_v1'
batch_size = 128
num_epochs = 5
if mx.context.num_gpus() > 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should not fallback to mx.cpu here since you want to test on gpu only in this test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not want the test to fail if GPU is not available, it should pass on either context.

Copy link
Member

@nswamy nswamy Apr 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want it to fail and know if there is an issue, could you please change it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleted "Also use all the GPUs, so if we run this same code elsewhere we can make use of it without having to change this code."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignore the comment on using all GPUs, we can do it as required.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated context to use GPU

epochs=num_epochs,
batch_size=batch_size)

assert est.train_stats['train_'+acc.name][num_epochs-1] > 0.75
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this only 75%?

Copy link
Contributor Author

@abhinavs95 abhinavs95 Apr 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After 5 epochs the training accuracy is ~85%. I have made it more strict by setting it to 80%. What do you think?

@nswamy nswamy merged commit b1ef99a into apache:fit-api Apr 3, 2019
piyushghai pushed a commit to piyushghai/incubator-mxnet that referenced this pull request Apr 5, 2019
…4405)

* added cnn intg tests for fit api

* updated cnn intg tests

* added functions for nightly test

* updated runtime_function

* updated intg tests

* updated init, datapath, refs

* added validation data

* update cpu test

* refactor code

* updated context
nswamy pushed a commit to nswamy/incubator-mxnet that referenced this pull request Apr 5, 2019
…4405)

* added cnn intg tests for fit api

* updated cnn intg tests

* added functions for nightly test

* updated runtime_function

* updated intg tests

* updated init, datapath, refs

* added validation data

* update cpu test

* refactor code

* updated context
roywei pushed a commit to roywei/incubator-mxnet that referenced this pull request May 15, 2019
…4405)

* added cnn intg tests for fit api

* updated cnn intg tests

* added functions for nightly test

* updated runtime_function

* updated intg tests

* updated init, datapath, refs

* added validation data

* update cpu test

* refactor code

* updated context
roywei pushed a commit to roywei/incubator-mxnet that referenced this pull request May 15, 2019
…4405)

* added cnn intg tests for fit api

* updated cnn intg tests

* added functions for nightly test

* updated runtime_function

* updated intg tests

* updated init, datapath, refs

* added validation data

* update cpu test

* refactor code

* updated context
szha pushed a commit that referenced this pull request May 18, 2019
* [MXNet-1334][Fit API]base class for estimator and eventhandler (#14346)

* base class for estimator and eventhandler

* add license

* add event handlers

* fix pylint

* improve arg check

* fix pylint

* add unit tests

* Fixed issue where the estimator was printing beyond the dataset size … (#14464)

* Fixed issue where the estimator was printing beyond the dataset size for the last batch

* Added comments

* Nudge to CI

* [MXNet-1349][Fit API]Add validation support and unit tests for fit() API (#14442)

* added estimator unittests

* add more tests for estimator

* added validation logic

* added error handlers, unittests

* improve val stats

* fix pylint

* fix pylint

* update unit test

* fix tests

* fix tests

* updated metrics, val logic

* trigger ci

* trigger ci

* update metric, batch_fn error handler

* update context logic, add default metric

* [MXNet-1340][Fit API]Update train stats (#14494)

* add train history

* update history

* update test

* avoid calling empty methods

* remove train history object

* fix pylint

* add unit test

* fix test

* update categorize handlers

* [MXNet-1375][Fit API]Added RNN integration test for fit() API (#14547)

* Added RNN integration test for fit() API

* Addressed review comments: change in JenkinFile, tmp directory, ctx with condense if/else, renamed imports

* CPU test doesn't require nvidiadocker container

* Modified the structure by removing the redundant code

* [MXNet-1343][Fit API]Add CNN integration test for fit() API (#14405)

* added cnn intg tests for fit api

* updated cnn intg tests

* added functions for nightly test

* updated runtime_function

* updated intg tests

* updated init, datapath, refs

* added validation data

* update cpu test

* refactor code

* updated context

* [MXNET-1344, 1346][FIT API] Retrieve Batch size and Logging verbose support for Gluon fit() API (#14587)

* Retrieve Batch size and Logging verbose support for Gluon fit() API

* NIT changes

* Addressed review comments: shifted the batch size code to a separate method, sentence correction

* Modified unittest

* removed redundant parameter

* Resolve CI test failure

* only support DataLoader for now, future PRs will include DataIter to DataLoader converter

* Get the number of samples from shape attribute instead of length due to low space complexity

* Simplified batch size retrieval code

* removed batch_size parameter from fit() method and fixed the tests

* Verbose exception handling

* Assigning constant to a verbose

* Modified exception message

* Resolved undefined class reference

* Addressed review comments: Modified verbose level names, docs, variable names

* Update estimator.py

* move estimator to contrib (#14633)

* move to gluon contrib (#14635)

* [Fit API] improve event handlers (#14685)

* improve event handlers

* update tests

* passing weakref of estimator

* fix unit test

* fix test

* fix pylint

* fix test

* fix pylint

* move default metric logic

* combine nightly tests

* [MXNET-1396][Fit-API] Update default handler logic (#14765)

* move to nightly for binaries

* update default handler

* fix pylint

* trigger ci

* trigger ci

* [Fit API] update estimator (#14849)

* address comments

* add comment

* check available context

* fix bug

* change cpu check

* [Fit-API] Adress PR comments (#14885)

* address comments

* update checkpoint

* test symbol save

* address comments

* add resume

* update doc and resume checkpoint

* update docs

* trigger ci

* trigger ci
szha pushed a commit that referenced this pull request May 20, 2019
* added cnn intg tests for fit api

* updated cnn intg tests

* added functions for nightly test

* updated runtime_function

* updated intg tests

* updated init, datapath, refs

* added validation data

* update cpu test

* refactor code

* updated context
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* [MXNet-1334][Fit API]base class for estimator and eventhandler (apache#14346)

* base class for estimator and eventhandler

* add license

* add event handlers

* fix pylint

* improve arg check

* fix pylint

* add unit tests

* Fixed issue where the estimator was printing beyond the dataset size … (apache#14464)

* Fixed issue where the estimator was printing beyond the dataset size for the last batch

* Added comments

* Nudge to CI

* [MXNet-1349][Fit API]Add validation support and unit tests for fit() API (apache#14442)

* added estimator unittests

* add more tests for estimator

* added validation logic

* added error handlers, unittests

* improve val stats

* fix pylint

* fix pylint

* update unit test

* fix tests

* fix tests

* updated metrics, val logic

* trigger ci

* trigger ci

* update metric, batch_fn error handler

* update context logic, add default metric

* [MXNet-1340][Fit API]Update train stats (apache#14494)

* add train history

* update history

* update test

* avoid calling empty methods

* remove train history object

* fix pylint

* add unit test

* fix test

* update categorize handlers

* [MXNet-1375][Fit API]Added RNN integration test for fit() API (apache#14547)

* Added RNN integration test for fit() API

* Addressed review comments: change in JenkinFile, tmp directory, ctx with condense if/else, renamed imports

* CPU test doesn't require nvidiadocker container

* Modified the structure by removing the redundant code

* [MXNet-1343][Fit API]Add CNN integration test for fit() API (apache#14405)

* added cnn intg tests for fit api

* updated cnn intg tests

* added functions for nightly test

* updated runtime_function

* updated intg tests

* updated init, datapath, refs

* added validation data

* update cpu test

* refactor code

* updated context

* [MXNET-1344, 1346][FIT API] Retrieve Batch size and Logging verbose support for Gluon fit() API (apache#14587)

* Retrieve Batch size and Logging verbose support for Gluon fit() API

* NIT changes

* Addressed review comments: shifted the batch size code to a separate method, sentence correction

* Modified unittest

* removed redundant parameter

* Resolve CI test failure

* only support DataLoader for now, future PRs will include DataIter to DataLoader converter

* Get the number of samples from shape attribute instead of length due to low space complexity

* Simplified batch size retrieval code

* removed batch_size parameter from fit() method and fixed the tests

* Verbose exception handling

* Assigning constant to a verbose

* Modified exception message

* Resolved undefined class reference

* Addressed review comments: Modified verbose level names, docs, variable names

* Update estimator.py

* move estimator to contrib (apache#14633)

* move to gluon contrib (apache#14635)

* [Fit API] improve event handlers (apache#14685)

* improve event handlers

* update tests

* passing weakref of estimator

* fix unit test

* fix test

* fix pylint

* fix test

* fix pylint

* move default metric logic

* combine nightly tests

* [MXNET-1396][Fit-API] Update default handler logic (apache#14765)

* move to nightly for binaries

* update default handler

* fix pylint

* trigger ci

* trigger ci

* [Fit API] update estimator (apache#14849)

* address comments

* add comment

* check available context

* fix bug

* change cpu check

* [Fit-API] Adress PR comments (apache#14885)

* address comments

* update checkpoint

* test symbol save

* address comments

* add resume

* update doc and resume checkpoint

* update docs

* trigger ci

* trigger ci
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
…4405)

* added cnn intg tests for fit api

* updated cnn intg tests

* added functions for nightly test

* updated runtime_function

* updated intg tests

* updated init, datapath, refs

* added validation data

* update cpu test

* refactor code

* updated context
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Gluon pr-awaiting-merge Review and CI is complete. Ready to Merge Test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants