[MXNet-1343][Fit API]Add CNN integration test for fit() API #14405

abhinavs95 · 2019-03-12T22:59:58Z

Description

Add nightly integration tests for fit() API using CNN models.
This PR depends on the parent PR for fit() API #14346
JIRA epic: https://issues.apache.org/jira/projects/MXNET/issues/MXNET-1333

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with MXNet-1343
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

test_estimator_cnn_cpu: test on cpu using synthetic data on alexnet, resnet18_v1, FCN
test_estimator_cnn_gpu: test on gpu using MNIST dataset on resnet18_v1 and verify training accuracy

Comments

If this change is a backward incompatible change, why must this change be made.
Interesting edge cases to note here

karan6181 · 2019-03-12T23:45:46Z

ci/docker/runtime_functions.sh

+    set -ex
+    cd /work/mxnet/tests/nightly/estimator
+    export PYTHONPATH=/work/mxnet/python/
+    python tests/nightly/estimator/test_estimator_cnn_gpu.py


Is the test_estimator_cnn_gpu.py file path correct? The current shell location is already inside /work/mxnet/tests/nightly/estimator directory.
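To illustrate the concern, here is a minimal sketch of how the relative path resolves once the working directory is already the estimator folder. It only manipulates path strings with `posixpath` and does not touch the filesystem; the `resolve` helper is hypothetical and not part of this PR.

```python
import posixpath

def resolve(cwd, relative_path):
    """Return the absolute path a relative path refers to when used from cwd."""
    return posixpath.normpath(posixpath.join(cwd, relative_path))

# Directory entered by the `cd` in the quoted diff.
cwd = "/work/mxnet/tests/nightly/estimator"

# Path as written in the diff: the directory components get applied twice.
doubled = resolve(cwd, "tests/nightly/estimator/test_estimator_cnn_gpu.py")

# Bare file name: resolves inside the directory the shell is already in.
direct = resolve(cwd, "test_estimator_cnn_gpu.py")

print(doubled)
print(direct)
```

If the doubled path does not exist, the `python` call would fail to find the script, so either the `cd` or the directory prefix probably has to go.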

karan6181 · 2019-03-12T23:47:06Z

tests/nightly/JenkinsfileForBinaries

+    'estimator: CNN CPU': {
+      node(NODE_LINUX_CPU) {
+        ws('workspace/estimator-test-cnn-cpu') {
+          utils.unpack_and_init('gpu', mx_lib)


nit: 'cpu' ??

karan6181 · 2019-03-12T23:59:49Z

tests/nightly/estimator/test_estimator_cnn_cpu.py

+                                  trainers=trainer,
+                                  context=context)
+        # Call fit() to begin training
+        logging_handler = event_handler.LoggingHandler(est, model_name+'_log', model_name+'_log')


The LoggingHandler constructor is declared as def __init__(self, estimator, file_name=None, file_location=None, ):. Is the file_location value correct here (it would give e.g. alexnet_log/alexnet_log)? I think you have to provide file_location as a directory path, or you can leave it empty since the code takes care of it. Thanks!

Thanks for pointing it out. Removing the logging handler from the integration tests for now as it's undergoing changes. It will be covered in unit tests.
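A small sketch of why passing the same string for both arguments looks suspicious. The `build_log_path` helper and its defaults are assumptions made for illustration; they only mimic the `(file_name, file_location)` pair from the quoted signature and are not the handler's actual implementation.

```python
import posixpath

def build_log_path(file_name=None, file_location=None):
    """Hypothetical helper: join a log directory and a log file name."""
    file_name = file_name or "estimator_log"  # assumed default file name
    file_location = file_location or "."      # assumed default: current directory
    return posixpath.join(file_location, file_name)

model_name = "alexnet"

# Same string used as both the file name and the directory.
same_for_both = build_log_path(model_name + "_log", model_name + "_log")

# File name only, leaving the location to the default.
name_only = build_log_path(model_name + "_log")

print(same_for_both)
print(name_only)
```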

karan6181 · 2019-03-13T00:02:06Z

tests/nightly/estimator/test_estimator_cnn_gpu.py

+                              trainers=trainer,
+                              context=context)
+    # Call fit() to begin training
+    logging_handler = event_handler.LoggingHandler(est, model_name+'_log', model_name+'_log')


same as above comment.

same as above

karan6181 · 2019-03-13T00:06:36Z

@mxnet-label-bot add [Gluon, pr-work-in-progress]

roywei

Thanks for your contribution. As these tests are not running in PR checks, let's test them before merging.

roywei · 2019-03-18T18:30:02Z

tests/nightly/estimator/test_estimator_cnn_gpu.py

+                              context=context)
+    # Call fit() to begin training
+    est.fit(train_data=train_data,
+            # val_data=test_data,


add validation data here once #14442 is merged

added in 810faef

roywei · 2019-03-18T18:34:30Z

tests/nightly/estimator/test_estimator_cnn_cpu.py

+            dataset = gluon.data.dataset.ArrayDataset(mx.nd.random.uniform(shape=(batch_size, 1, 224, 224)),
+                                                      mx.nd.zeros(batch_size))
+            loss = gluon.loss.SoftmaxCrossEntropyLoss()
+            net.initialize(mx.init.MSRAPrelu(), ctx=context)


let's use a more common initializer

Changed to Xavier initializer

roywei · 2019-03-18T18:35:12Z

tests/nightly/estimator/test_estimator_cnn_cpu.py

+
+def bilinear_kernel(in_channels, out_channels, kernel_size):
+    '''
+    Bilinear interpolation using transposed convolution


reference for this implementation?

Added link to reference code
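For readers without the linked reference at hand, this is a dependency-free sketch of the bilinear kernel construction commonly used to initialize transposed-convolution upsampling layers in FCN tutorials. Whether it matches the referenced code line for line is an assumption; it returns nested Python lists rather than an NDArray, and `bilinear_filter` is a helper name introduced here.

```python
def bilinear_filter(kernel_size):
    """Return a kernel_size x kernel_size bilinear interpolation filter."""
    factor = (kernel_size + 1) // 2
    # The centre lies on a pixel for odd sizes and between pixels for even sizes.
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    ramp = [1 - abs(i - center) / factor for i in range(kernel_size)]
    # Outer product of the 1-D ramp with itself gives the 2-D filter.
    return [[r * c for c in ramp] for r in ramp]

def bilinear_kernel(in_channels, out_channels, kernel_size):
    """Build weights shaped (in_channels, out_channels, k, k) as nested lists.

    Only the channel-matched (i, i) slices carry the filter; all others are zero.
    """
    filt = bilinear_filter(kernel_size)
    weight = []
    for i in range(in_channels):
        row = []
        for j in range(out_channels):
            if i == j:
                row.append([line[:] for line in filt])  # independent copy
            else:
                row.append([[0.0] * kernel_size for _ in range(kernel_size)])
        weight.append(row)
    return weight

weight = bilinear_kernel(2, 2, 4)
print(weight[0][0][0])
```

Each channel is upsampled independently, which is why the off-diagonal slices stay at zero.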

roywei · 2019-03-18T18:39:31Z

tests/nightly/estimator/test_estimator_cnn_gpu.py

+from mxnet.gluon.model_zoo import vision
+
+def load_data_mnist(batch_size, resize=None, num_workers=None,
+                    root=os.path.join('~', '.mxnet', 'datasets', 'mnist')):


remove root, let's not keep the data under ~/.mxnet in nightly tests.

karan6181 · 2019-03-20T21:53:31Z

tests/nightly/estimator/test_estimator_cnn_cpu.py

+                                  context=context)
+        # Call fit() to begin training
+        est.fit(train_data=train_data,
+                val_data=train_data,


Can you please create another random validation data and pass it here instead of using same train_data for both train and validation. Thanks!
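A rough sketch of the suggestion, using only the standard library so it runs anywhere. In the actual test the two splits would presumably be built with `gluon.data.dataset.ArrayDataset` and `mx.nd.random.uniform` as in the quoted diff; `make_synthetic_split` and the seeds are hypothetical.

```python
import random

def make_synthetic_split(num_samples, num_features, seed):
    """Return (features, labels) drawn independently for one data split."""
    rng = random.Random(seed)
    features = [[rng.uniform(0.0, 1.0) for _ in range(num_features)]
                for _ in range(num_samples)]
    labels = [0] * num_samples  # all-zero labels, as in the synthetic CPU test
    return features, labels

# Two independent draws, so validation does not just replay the training samples.
train_data = make_synthetic_split(num_samples=4, num_features=3, seed=0)
val_data = make_synthetic_split(num_samples=4, num_features=3, seed=1)
```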

roywei

LGTM! Let's wait for CI to get back to normal. Also try to run this locally to make sure it passes.

abhinavs95 · 2019-03-26T21:39:22Z

@mxnet-label-bot update [Gluon, Test, pr-awaiting-merge]

nswamy · 2019-03-27T19:21:25Z

tests/nightly/estimator/test_estimator_cnn_cpu.py

+            val_dataset = gluon.data.dataset.ArrayDataset(mx.nd.random.uniform(shape=(batch_size, 3, 320, 480)),
+                                                      mx.nd.zeros(shape=(batch_size, 320, 480)))
+            loss = gluon.loss.SoftmaxCrossEntropyLoss(axis=1)
+            net[-1].initialize(init.Constant(bilinear_kernel(num_classes, num_classes, 64)), ctx=context)


Why not get the net directly from the FCN method? Why split the logic?

Updated logic, please have a look again

nswamy · 2019-03-27T19:22:13Z

tests/nightly/estimator/test_estimator_cnn_gpu.py

+    input_size = 224
+    lr = 0.001
+    # Set context
+    if mx.context.num_gpus() > 0:


why not use all GPUs?

The GPU test just runs 5 epochs on MNIST, I think 1 GPU is enough.

nswamy · 2019-03-27T19:23:59Z

tests/nightly/estimator/test_estimator_cnn_gpu.py

+from mxnet.gluon.estimator import estimator, event_handler
+from mxnet.gluon.model_zoo import vision
+
+def load_data_mnist(batch_size, resize=None, num_workers=None):


Why do we need 2 different files?

combined tests into 1 file

nswamy · 2019-03-27T19:30:04Z

tests/nightly/estimator/test_estimator_cnn_cpu.py

+    Test estimator by doing one pass over each model with synthetic data
+    '''
+    models = ['resnet18_v1',
+              'alexnet',


I don't think we need alexnet here since the fit is not very different from resnet.

removed alexnet

roywei · 2019-03-28T16:45:09Z

ci/docker/runtime_functions.sh

+    set -ex
+    cd /work/mxnet/tests/nightly/estimator
+    export PYTHONPATH=/work/mxnet/python/
+    python test_estimator_cnn.py --type gpu


what's the reason to run a python script instead of using nosetest and assert accuracy at the end?

Since we are executing a single test here, running the python script suffices. We can shift to nosetests if we add more tests in the future.
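For context, a script invoked as `python test_estimator_cnn.py --type gpu` could dispatch on its flag roughly as below. Only the `--type gpu` invocation comes from the diff; the `cpu` choice, the `TESTS` table and the stand-in functions are assumptions, and the real script's internals are not shown in this thread.

```python
import argparse

def parse_args(argv=None):
    """Parse the command line; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="fit() API CNN integration test")
    parser.add_argument("--type", choices=["cpu", "gpu"], required=True,
                        help="which variant of the test to run")
    return parser.parse_args(argv)

def run(test_type, tests):
    """Call the selected test function and return its result."""
    return tests[test_type]()

# Stand-in test functions; real ones would train a model and assert on accuracy.
TESTS = {"cpu": lambda: "ran cpu test", "gpu": lambda: "ran gpu test"}

args = parse_args(["--type", "gpu"])
result = run(args.type, TESTS)
print(result)
```

With a test runner such as nosetests the dispatch table would be replaced by test discovery, which is the trade-off being discussed here.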

piyushghai · 2019-03-28T18:59:03Z

tests/nightly/JenkinsfileForBinaries

@@ -106,6 +106,22 @@ core_logic: {
          utils.docker_run('ubuntu_nightly_gpu', 'nightly_tutorial_test_ubuntu_python3_gpu', true, '1500m')
        }
      }
+    },


IMO, This should be in nightly/Jenkinsfile and not the one for Binaries

I think it should work in either Jenkins file, I placed it in binaries since it contained similar existing nightly tests. I will keep your suggestion in mind and update with a follow up PR in case there are any issues with the current setup.

piyushghai · 2019-03-28T19:02:03Z

tests/nightly/estimator/test_estimator_cnn.py

+    model_name = 'resnet18_v1'
+    batch_size = 128
+    num_epochs = 5
+    if mx.context.num_gpus() > 0:


You should not fall back to mx.cpu here since you want to test on GPU only in this test.

I do not want the test to fail if GPU is not available, it should pass on either context.

we want it to fail and know if there is an issue, could you please change it.

deleted "Also use all the GPUs, so if we run this same code elsewhere we can make use of it without having to change this code."

ignore the comment on using all GPUs, we can do it as required.

Updated context to use GPU
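The behaviour requested above (fail loudly instead of silently falling back to CPU) can be sketched as follows. This is not the code from the PR: `num_gpus` stands in for the value of `mx.context.num_gpus()`, and the returned string is a placeholder for `mx.gpu(0)` so the snippet runs without MXNet installed.

```python
def select_gpu_context(num_gpus):
    """Return a GPU context label, or raise if no GPU is visible."""
    if num_gpus < 1:
        # Failing here surfaces a misconfigured GPU node instead of hiding it.
        raise RuntimeError("GPU test requires at least one GPU, found none")
    return "gpu(0)"  # placeholder for mx.gpu(0)

print(select_gpu_context(1))
```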

nswamy · 2019-04-03T21:09:30Z

tests/nightly/estimator/test_estimator_cnn.py

+            epochs=num_epochs,
+            batch_size=batch_size)
+
+    assert est.train_stats['train_'+acc.name][num_epochs-1] > 0.75


why is this only 75%?

After 5 epochs the training accuracy is ~85%. I have made it more strict by setting it to 80%. What do you think?
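A sketch of the stricter check being proposed, with a plain dict standing in for `est.train_stats`. The per-epoch values are made up for illustration (only the roughly 85% final figure is taken from the comment above), and `final_epoch_accuracy` is a hypothetical helper.

```python
def final_epoch_accuracy(train_stats, metric_name, num_epochs):
    """Return the training accuracy recorded for the last epoch."""
    return train_stats["train_" + metric_name][num_epochs - 1]

# Made-up per-epoch accuracies, for illustration only.
train_stats = {"train_accuracy": [0.41, 0.63, 0.74, 0.81, 0.85]}
num_epochs = 5

acc = final_epoch_accuracy(train_stats, "accuracy", num_epochs)
assert acc > 0.80, "training accuracy %.2f is below the 0.80 threshold" % acc
```

Keeping a few points of margin below the observed accuracy helps avoid flaky nightly failures from run-to-run variance.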

…4405) * added cnn intg tests for fit api * updated cnn intg tests * added functions for nightly test * updated runtime_function * updated intg tests * updated init, datapath, refs * added validation data * update cpu test * refactor code * updated context

* [MXNet-1334][Fit API]base class for estimator and eventhandler (#14346)
  * base class for estimator and eventhandler
  * add license
  * add event handlers
  * fix pylint
  * improve arg check
  * fix pylint
  * add unit tests
* Fixed issue where the estimator was printing beyond the dataset size … (#14464)
  * Fixed issue where the estimator was printing beyond the dataset size for the last batch
  * Added comments
  * Nudge to CI
* [MXNet-1349][Fit API]Add validation support and unit tests for fit() API (#14442)
  * added estimator unittests
  * add more tests for estimator
  * added validation logic
  * added error handlers, unittests
  * improve val stats
  * fix pylint
  * fix pylint
  * update unit test
  * fix tests
  * fix tests
  * updated metrics, val logic
  * trigger ci
  * trigger ci
  * update metric, batch_fn error handler
  * update context logic, add default metric
* [MXNet-1340][Fit API]Update train stats (#14494)
  * add train history
  * update history
  * update test
  * avoid calling empty methods
  * remove train history object
  * fix pylint
  * add unit test
  * fix test
  * update categorize handlers
* [MXNet-1375][Fit API]Added RNN integration test for fit() API (#14547)
  * Added RNN integration test for fit() API
  * Addressed review comments: change in JenkinFile, tmp directory, ctx with condense if/else, renamed imports
  * CPU test doesn't require nvidiadocker container
  * Modified the structure by removing the redundant code
* [MXNet-1343][Fit API]Add CNN integration test for fit() API (#14405)
  * added cnn intg tests for fit api
  * updated cnn intg tests
  * added functions for nightly test
  * updated runtime_function
  * updated intg tests
  * updated init, datapath, refs
  * added validation data
  * update cpu test
  * refactor code
  * updated context
* [MXNET-1344, 1346][FIT API] Retrieve Batch size and Logging verbose support for Gluon fit() API (#14587)
  * Retrieve Batch size and Logging verbose support for Gluon fit() API
  * NIT changes
  * Addressed review comments: shifted the batch size code to a separate method, sentence correction
  * Modified unittest
  * removed redundant parameter
  * Resolve CI test failure
  * only support DataLoader for now, future PRs will include DataIter to DataLoader converter
  * Get the number of samples from shape attribute instead of length due to low space complexity
  * Simplified batch size retrieval code
  * removed batch_size parameter from fit() method and fixed the tests
  * Verbose exception handling
  * Assigning constant to a verbose
  * Modified exception message
  * Resolved undefined class reference
  * Addressed review comments: Modified verbose level names, docs, variable names
  * Update estimator.py
* move estimator to contrib (#14633)
* move to gluon contrib (#14635)
* [Fit API] improve event handlers (#14685)
  * improve event handlers
  * update tests
  * passing weakref of estimator
  * fix unit test
  * fix test
  * fix pylint
  * fix test
  * fix pylint
  * move default metric logic
  * combine nightly tests
* [MXNET-1396][Fit-API] Update default handler logic (#14765)
  * move to nightly for binaries
  * update default handler
  * fix pylint
  * trigger ci
  * trigger ci
* [Fit API] update estimator (#14849)
  * address comments
  * add comment
  * check available context
  * fix bug
  * change cpu check
* [Fit-API] Adress PR comments (#14885)
  * address comments
  * update checkpoint
  * test symbol save
  * address comments
  * add resume
  * update doc and resume checkpoint
  * update docs
  * trigger ci
  * trigger ci


abhinavs95 added 4 commits March 11, 2019 12:56

added cnn intg tests for fit api

12549e7

updated cnn intg tests

67c3cb2

added functions for nightly test

424c8e7

updated runtime_function

1f839ce

karan6181 reviewed Mar 12, 2019

karan6181 reviewed Mar 13, 2019

marcoabreu added the Gluon and pr-work-in-progress labels Mar 13, 2019

updated intg tests

a632b11

roywei reviewed Mar 18, 2019

abhinavs95 added 2 commits March 18, 2019 13:01

Merge branch 'fit-api' of https://github.com/apache/incubator-mxnet into fit-api-test

996c1bc

updated init, datapath, refs

e227d00

nswamy changed the title ~~[MXNet-1343][WIP][Fit API]Add CNN integration test for fit() API~~ [MXNet-1343][Fit API]Add CNN integration test for fit() API Mar 19, 2019

added validation data

810faef

karan6181 reviewed Mar 20, 2019

update cpu test

e626320

roywei approved these changes Mar 20, 2019

Merge branch 'fit-api' of https://github.com/apache/incubator-mxnet into fit-api-test

f565737

marcoabreu added the pr-awaiting-merge and Test labels and removed the pr-work-in-progress label Mar 26, 2019

nswamy reviewed Mar 27, 2019

refactor code

619b786

roywei reviewed Mar 28, 2019

piyushghai reviewed Mar 28, 2019

nswamy reviewed Apr 3, 2019

abhinavs95 added 2 commits April 3, 2019 14:45

updated context

9bd83c5

Merge branch 'fit-api' into fit-api-test

f4abea1

nswamy merged commit b1ef99a into apache:fit-api Apr 3, 2019
