
CI test randomness 3 #9791

Merged: 12 commits merged into apache:master on Feb 18, 2018

Conversation

@DickJC123 (Contributor) commented Feb 14, 2018

Description

This is a rebasing and partial squashing of the stale "ci test randomness2" PR #8526. It is based on the premise that the CI should test the framework the way users do: with lots of different data. Having tests run with random data has been problematic, though, with flaky tests a constant annoyance. This PR supplies the tools to track down the flakiness by explicitly seeding all tests that use random data and reporting those seeds if a test fails. The first commit of the PR rolls out the with_seed() decorator to tests under the tests/python/unittest directory, with some hard-coded seeds designed to expose existing flakiness. Follow-up commits supply known fixes. Further commits touch the tests under the tests/python/gpu directory.

From PR #8526:

This PR introduces a new, simple level of control over unittest random seeds while providing inter-test random number generator (RNG) isolation. It is an improved replacement for the pending PR #8313. The improvements over that PR are:

  • A unittest that fails via an exception will have its seed reported in the test log. Reproducing the failure with the same-seeded data is simple and immediate.
  • The mx.random and python random seeds are also set (identically to np.random), giving deterministic behavior and test isolation for the mxnet CPU and GPU RNGs.
  • A unittest failure via a core-dump can also be reproduced after the module test is re-run with debugging enabled.

To provide this functionality, a custom decorator, "@with_seed()", was created. This was considered more powerful than the nosetests "@with_setup()" facility and less disruptive than changing all tests to become methods of a nosetests test class. The proposed new approach is demonstrated on a simple "test_module.py" test file of three tests. Assuming that the second test needs a set seed for robustness, the file might currently appear as:

def test_op1():
    <op1 test>

def test_op2():
    np.random.seed(1234)
    <op2 test>

def test_op3():
    <op3 test>

Even though test_op3() is OK with nondeterministic data, it will see only a single dataset because it runs after test_op2(), which sets the seed. Also, if test_op1() were to fail, there would be no way to reproduce the failure except by running the test individually and hoping to produce a new, similar failure.

With the proposed approach, the test file becomes:

from common import setup_module, with_seed

@with_seed()
def test_op1():
    <op1 test>

@with_seed(1234)
def test_op2():
    <op2 test>

@with_seed()
def test_op3():
    <op3 test>

By importing unittests/common.py, the seeds of the numpy and mxnet RNGs are set initially to a random "module seed" that is output to the log file. The initial RNGs can be thought of as module-level RNGs that are isolated from the individual tests and that provide the string of "test seeds" that determine the behavior of each test's RNGs. The "@with_seed()" test function decorator requests a test seed from the module RNG, sets the numpy, python and mxnet seeds appropriately and runs the test. Should the test fail, the seed for that test is output to the log file. Pass or fail, the decorator reinstates the module RNG state before the next test's decorator is executed, effectively isolating the tests. Debugging a failing test_op3 in the example would proceed as follows:

$ nosetests --verbose -s test_module.py
[INFO] Setting module np/mx random seeds, use MXNET_MODULE_SEED=3444154063 to reproduce.
test_module.test_op1 ... ok
test_module.test_op2 ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1234 to reproduce.
ok
test_module.test_op3 ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=2096230603 to reproduce.
FAIL
======================================================================
FAIL: test_module.test_op3
----------------------------------------------------------------------
Traceback (most recent call last):
<stack trace appears here>
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=2096230603 to reproduce.
--------------------- >> end captured logging << ---------------------
----------------------------------------------------------------------
Ran 3 tests in 1.354s
FAILED (failures=1)

Because test_op3 failed, its seed appeared in the log file. The test_op2 seed was also displayed, as a reminder that the test needs more work before it is robust enough for random data. The command to reproduce the problem can be assembled simply by cutting and pasting from the log:

$ MXNET_TEST_SEED=2096230603 nosetests --verbose -s test_module.py:test_op3

If test_op3 instead dumped core, the test seed would not be initially apparent. Assuming the core-dump is repeatable based on the data, the module would first be re-run with the command:

$ MXNET_MODULE_SEED=3444154063 nosetests --logging-level=DEBUG --verbose -s test_module.py

The log would now include the test seeds for all tests before they are run, so the test could then be run in isolation as before with MXNET_TEST_SEED=2096230603.

Let's assume that test_op3 was altered by increasing a tolerance. How robust is the test now? This can be explored by repeating the test many times as in:

$ MXNET_TEST_COUNT=10000 nosetests --logging-level=DEBUG --verbose -s test_module.py:test_op3

Finally, this PR adds the @with_seed() decorator for all tests in modules that use random numbers. Also, it includes many specific test robustness fixes that were exposed once this new methodology was adopted internally by the author.
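
For reference, here is a minimal sketch of how a with_seed()-style decorator could implement the behavior described above. It is illustrative only; the actual code is in tests/python/unittest/common.py and may differ in detail, for example in how hard-coded seeds are reported even on success.

import functools
import logging
import os
import random

import numpy as np
import mxnet as mx

def with_seed(seed=None):
    def decorator(orig_test):
        @functools.wraps(orig_test)
        def test_new(*args, **kwargs):
            # MXNET_TEST_COUNT allows the test to be repeated under many seeds.
            for _ in range(int(os.getenv('MXNET_TEST_COUNT', '1'))):
                # Seed priority: MXNET_TEST_SEED env var, then a hard-coded
                # seed, then a fresh seed drawn from the module-level numpy RNG.
                env_seed = os.getenv('MXNET_TEST_SEED')
                if env_seed is not None:
                    this_seed = int(env_seed)
                elif seed is not None:
                    this_seed = seed
                else:
                    this_seed = np.random.randint(0, np.iinfo(np.int32).max)
                # Remember the module-level numpy RNG state so it can be
                # reinstated afterwards, isolating this test from the others.
                module_state = np.random.get_state()
                np.random.seed(this_seed)
                mx.random.seed(this_seed)
                random.seed(this_seed)
                try:
                    orig_test(*args, **kwargs)
                except Exception:
                    # A failing test always reports its seed for reproduction.
                    logging.getLogger('common').info(
                        'Setting test np/mx/python random seeds, use '
                        'MXNET_TEST_SEED=%d to reproduce.', this_seed)
                    raise
                finally:
                    np.random.set_state(module_state)
        return test_new
    return decorator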

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@DickJC123 requested a review from szha as a code owner on February 14, 2018 01:48
@marcoabreu (Contributor)

Very nice feature, thanks a lot for putting so much effort into this!

"""

try:
next_seed = np.random.randint(0, np.iinfo(np.int32).max)
Contributor

Could you elaborate on how this restores the pre-block state? To me, it looks like it will always pick a new random seed after this block finishes; I rather expected the old sequence generator to be restored. Imagine two sequences:
1: 100 -> 101 -> 102 -> ...
2: 200 -> 201 -> 202 -> ...

What I would expect is something along the lines of:

100
101
START: [random_seed(seq_2)]
200
201
202
END: [random_seed(seq_2)]
102
103
...

Contributor Author

You're asking about the 'with random_seed(NNN)' construct (as opposed to the with_seed() decorator). One issue faced here is that the mxnet RNG state is not save-and-restorable, so we can't achieve exactly what you were expecting. What's promised is this:

    Upon conclusion of the block, the RNGs are returned to
    a state that is a function of their pre-block state, so
    any prior non-determinism is preserved.

So 'with random_seed(NNN)' forces determinism on the random values seen inside its block, without changing the determinism/non-determinism of the values seen outside the block. Likewise, 'with random_seed()' (with no numeric arg) forces non-determinism on the random values seen inside its block, again without changing the determinism/non-determinism of the values seen outside the block.

Let's take a use case that explores this feature: a test that initially gets both random shapes and random data to populate tensors of those shapes. Let's say the test is flaky, and it's determined that it's best to work with a known set of shapes. The test could then be coded as:

@with_seed()
def test_stuff():
    # seed 1234 produces a variety of shapes known to work with this test
    with random_seed(1234):
        shapes = <get 'random' shapes>
    # fill shapes with random data
    tensors = populate_tensors(shapes)
    <perform test>

Let's say now that the test is seen to fail sporadically, and test seed 987654321 exposes the problem. The methodology of this PR should still work:

MXNET_TEST_SEED=987654321 nosetests --verbose -s test_stuff_module.py:test_stuff

will repro the failure.
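
For concreteness, here is roughly what such a context manager could look like. This is a sketch only, written to match the promise quoted above; the real helper in tests/python/unittest/common.py may differ.

import contextlib
import random

import numpy as np
import mxnet as mx

@contextlib.contextmanager
def random_seed(seed=None):
    try:
        # The seed used to leave the block is drawn from the current numpy RNG,
        # so the post-block state is a function of the pre-block state (the
        # mxnet RNG state itself cannot be saved and restored directly).
        next_seed = np.random.randint(0, np.iinfo(np.int32).max)
        if seed is None:
            # No argument: force non-determinism inside the block.
            np.random.seed()
            seed = np.random.randint(0, np.iinfo(np.int32).max)
        np.random.seed(seed)
        mx.random.seed(seed)
        random.seed(seed)
        yield
    finally:
        # Re-seed from next_seed: values seen after the block depend only on
        # the pre-block state, preserving any prior (non-)determinism.
        np.random.seed(next_seed)
        mx.random.seed(next_seed)
        random.seed(next_seed)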

Contributor

Thanks for elaborating, this sounds very reasonable.

I just imagined a few nested blocks with random and hard-coded seeds, and in my mind your process adds up. Nonetheless, could you please create a few unit tests to ensure that this behaviour works as expected and preserves a consistent sequence of numbers even if multiple blocks are nested, some seeds are hard-coded, and so on (see the sketch below for the kind of check I mean)? I think it's reasonable to validate our assumptions in case future changes are introduced.

Please also make sure to test the GPU context intensively, as we've experienced (known) issues on multi-GPU systems.
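
For instance, a nesting check might look something like this. This is a hypothetical sketch, not the PR's actual test; it assumes random_seed is importable from the unittest common module and follows the "post-block state is a function of the pre-block state" rule described above.

import numpy as np
from common import random_seed

def test_nested_random_seed_blocks():
    # First pass: record the values produced with hard-coded seeds.
    with random_seed(1):
        outer_first = np.random.randint(0, 1 << 30)
        with random_seed(2):
            inner = np.random.randint(0, 1 << 30)
        outer_second = np.random.randint(0, 1 << 30)
    # Second pass: the same hard-coded seeds must reproduce the same sequence,
    # both inside the nested block and after it.
    with random_seed(1):
        assert np.random.randint(0, 1 << 30) == outer_first
        with random_seed(2):
            assert np.random.randint(0, 1 << 30) == inner
        assert np.random.randint(0, 1 << 30) == outer_second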

def test_sequential_warning():
    with warnings.catch_warnings(record=True) as w:
        # The following line permits the test to pass if run multiple times
        warnings.simplefilter('always')
Contributor

In what case would a test run multiple times, and why would it fail in that case?

Contributor Author

This PR provides an easy way to run a test multiple times under different seeds. I found that this particular test would fail under those circumstances without this fix, when invoked as:

MXNET_TEST_COUNT=100 nosetests --verbose tests/python/unittest/test_gluon.py

Contributor

I think this sounds like a valid failure that should be resolved. Imagine a user calling the same method on a server and it failing after some time because we might have a memory leak or something else causing this error.

Contributor Author

The default behavior of 'warnings' is to collapse identical messages. This test purposefully does something that should generate a warning message; in the 'MXNET_TEST_COUNT=100' example above, repeated runs would not see that message without this fix. Thus, the test should ensure, independent of the global policy, that identical messages are not squashed. Try the following:

$ python
Python 2.7.12 (default, Nov 20 2017, 18:23:56) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet as mx
>>> import warnings
>>> from mxnet import gluon
>>> warnings.simplefilter('always')
>>> b = gluon.nn.Sequential()
>>> b.add(gluon.nn.Dense(20))
>>> b.hybridize()
/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/gluon/nn/basic_layers.py:83: UserWarning: All children of this Sequential layer are HybridBlocks. Consider using HybridSequential for the best performance.
  warnings.warn('All children of this Sequential layer are HybridBlocks. Consider ' \
>>> b.hybridize()
/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/gluon/nn/basic_layers.py:83: UserWarning: All children of this Sequential layer are HybridBlocks. Consider using HybridSequential for the best performance.
  warnings.warn('All children of this Sequential layer are HybridBlocks. Consider ' \
>>> 

If you start over and repeat this sequence but leave out the warnings.simplefilter('always') line, then the final b.hybridize() line will output nothing.
I invite you to pursue making warnings.simplefilter('always') part of all gluon usage, but independent of that, I think this test should keep this line, since it's what the test needs to pass.
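
For completeness, here is a sketch of what the full test might look like, given the quoted lines above. It is hypothetical; the actual test in tests/python/unittest/test_gluon.py may differ.

import warnings
from mxnet import gluon

def test_sequential_warning():
    with warnings.catch_warnings(record=True) as w:
        # 'always' prevents the duplicate-message filter from suppressing the
        # warning when the test is run repeatedly (e.g. MXNET_TEST_COUNT=100).
        warnings.simplefilter('always')
        b = gluon.nn.Sequential()
        b.add(gluon.nn.Dense(20))
        b.hybridize()
        # The warning text is taken from the session above; the exact assertion
        # used in the real test is an assumption here.
        assert any('HybridSequential' in str(x.message) for x in w)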

Contributor

I understand, thanks for the elaboration.

@@ -360,8 +373,12 @@ def backward(self, dm, dn):

assert_almost_equal(x.grad.asnumpy(), dx1)
assert_almost_equal(y.grad.asnumpy(), dy1)
# atol = 1e-6
# assert_almost_equal(x.grad.asnumpy(), dx1, atol=atol)
# assert_almost_equal(y.grad.asnumpy(), dy1, atol=atol)
Contributor

TODO: Restore?

Contributor Author

Will do. My mistake for leaving this in.

data = mx.symbol.Variable('data')
out = mx.symbol.L2Normalization(data=data, mode=mode, eps=norm_eps)
# TODO(szha): Seeding this masks failures. We need to do a deep dive for failures without this seed.
Contributor

Don't remove the TODO

Contributor Author

I moved both the TODO and the seed-setting to the beginning of the test.

Contributor

Ah I see, thanks

@marcoabreu (Contributor)

======================================================================
FAIL: test_operator.test_rcbrt_op
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Anaconda3\envs\py2\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\jenkins_slave\workspace\ut-python-cpu\tests\python\unittest\common.py", line 152, in test_new
    orig_test(*args, **kwargs)
  File "C:\jenkins_slave\workspace\ut-python-cpu\tests\python\unittest\test_operator.py", line 3807, in test_rcbrt_op
    check_numeric_gradient(test, [data_tmp])
  File "C:\jenkins_slave\workspace\ut-python-cpu\pkg_vc14_cpu\python\mxnet\test_utils.py", line 914, in check_numeric_gradient
    ("NUMERICAL_%s"%name, "BACKWARD_%s"%name))
  File "C:\jenkins_slave\workspace\ut-python-cpu\pkg_vc14_cpu\python\mxnet\test_utils.py", line 493, in assert_almost_equal
    raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 1.311829 exceeds tolerance rtol=0.010000, atol=0.000000.  Location of maximum error:(2, 1), a=-84.507408, b=-83.413170
 NUMERICAL_data: array([[ -1.68704987,  -0.4118681 ,  -0.0692606 ,  -0.00718236],
       [ -0.02065301,  -0.03251433,  -0.16596913,  -0.01351535],
       [ -0.37240982, -84.50740814,  -0.02904981,  -0.08755922]], dtype=float32)
 BACKWARD_data: array([[ -1.68707681,  -0.41187629,  -0.06918965,  -0.00718444],
       [ -0.02061475,  -0.03253891,  -0.16597052,  -0.01352422],
       [ -0.3723917 , -83.41316986,  -0.02905715,  -0.08757117]], dtype=float32)
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=788174893 to reproduce.
--------------------- >> end captured logging << ---------------------

Awesome!

@marcoabreu (Contributor)

Would it be possible to write a test for this feature? Especially around the 'with random_seed()' construct.

@marcoabreu (Contributor)

I'd like to see this feature and the stated assumptions tested, but otherwise LGTM.

@marcoabreu (Contributor) left a comment

LGTM


# To repro, comment-out skip line, then execute:
# MXNET_TEST_SEED=990952066 nosetests --verbose tests/python/unittest/test_operator.py:test_dropout
@unittest.skip("test fails with seed 990952066: 0 output seen with dropout ratio=0. Not yet tracked.")
Contributor

Could you fix the test? Dropout is quite important and we shouldn't disable it.

Contributor Author

I think the problem is better solved by someone closer to the modeling community. I'll explain what I know of the issue and put the test on a fixed seed.

Contributor Author

Created issue #9816.

@marcoabreu (Contributor)

Approved for merge. Please resolve the test failure if possible.

@marcoabreu merged commit f33591f into apache:master on Feb 18, 2018
@marcoabreu (Contributor)

@DickJC123 would you mind creating another PR which adds documentation for these features?

Good places could be the readme in the tests dir or maybe even somewhere on the website. I'll leave it up to you.

@DickJC123 (Contributor, Author) commented Feb 18, 2018 via email

rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
* Adds with_seed() decorator to unittests.  Will fail CI due to demo of known failing seeds.

* Fix failing tests, remove hard-coded bad seeds.

* Adds with_seed() decorator to gpu tests.  Will fail CI due to demo of known failing seeds.

* Fix failing test, add test of 'with random_seed()'

* Add with_seed() to test_gluon_model_zoo_gpy.py

* Increase atol for test_adam

* Added more 'with random_seed()' testing.  Hardcode bad test_training seed.

* test_training put back on random seeding

* Switched test_dropout to fixed seed, created issue

* Hardcoding seed that caused a test_rsp_push_pull CI core-dump

* test_rsp_push_pull CI core-dump not reproduced, so unsetting seed
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018 (same commit list as above).