
CI test randomness 3 #9791

Merged: 12 commits merged into apache:master on Feb 18, 2018

Conversation

@DickJC123 (Contributor) commented Feb 14, 2018

Description

This is a rebasing and partial squashing of the stale "ci test randomness2" PR #8526. It is based on the premise that the CI should test the framework the way users do: with lots of different data. Having tests run with random data has been problematic, though, with flaky tests a constant annoyance. This PR supplies the tools to track down the flakiness by explicitly seeding all tests that use random data and reporting those seeds if a test fails. The first commit of the PR rolls out the with_seed() decorator to tests under the tests/python/unittest directory, with some hard-coded seeds designed to expose existing flakiness. Follow-up commits supply known fixes. Further commits touch the tests under the tests/python/gpu directory.

From PR #8526:

This PR introduces a new, simple level of control over unittest random seeds while providing inter-test random number generator (RNG) isolation. It is an improved replacement for the pending PR #8313. The improvements over that PR are:

  • A unittest that fails via an exception will have its seed reported in the test log. Reproducing the failure with the same-seeded data is simple and immediate.
  • The mx.random and python random seeds are also set (identically to np.random), giving deterministic behavior and test isolation for the mxnet CPU and GPU RNGs.
  • A unittest failure via a core-dump can also be reproduced after the module test is re-run with debugging enabled.

To provide this functionality, a custom decorator, "@with_seed()", was created. This was considered more powerful than the nosetests "@with_setup()" facility and less disruptive than changing all tests to become methods of a nosetests test class. The proposed new approach is demonstrated on a simple "test_module.py" test file of three tests. Assuming that the second test needs a set seed for robustness, the file might currently appear as:

def test_op1():
    <op1 test>

def test_op2():
    np.random.seed(1234)
    <op2 test>

def test_op3():
    <op3 test>

Even though test_op3() is OK with nondeterministic data, it will see only a single dataset because it runs after test_op2(), which sets the seed. Also, if test_op1() were to fail, there would be no way to reproduce the failure except by running the test individually and hoping to produce a new, similar failure.

With the proposed approach, the test file becomes:

from common import setup_module, with_seed

@with_seed()
def test_op1():
    <op1 test>

@with_seed(1234)
def test_op2():
    <op2 test>

@with_seed()
def test_op3():
    <op3 test>

By importing unittests/common.py, the seeds of the numpy and mxnet RNGs are set initially to a random "module seed" that is output to the log file. The initial RNGs can be thought of as module-level RNGs that are isolated from the individual tests and that provide the string of "test seeds" that determine the behavior of each test's RNGs. The "@with_seed()" test function decorator requests a test seed from the module RNG, sets the numpy, python and mxnet seeds appropriately and runs the test. Should the test fail, the seed for that test is output to the log file. Pass or fail, the decorator reinstates the module RNG state before the next test's decorator is executed, effectively isolating the tests. Debugging a failing test_op3 in the example would proceed as follows:

$ nosetests --verbose -s test_module.py
[INFO] Setting module np/mx random seeds, use MXNET_MODULE_SEED=3444154063 to reproduce.
test_module.test_op1 ... ok
test_module.test_op2 ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1234 to reproduce.
ok
test_module.test_op3 ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=2096230603 to reproduce.
FAIL
======================================================================
FAIL: test_module.test_op3
----------------------------------------------------------------------
Traceback (most recent call last):
<stack trace appears here>
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=2096230603 to reproduce.
--------------------- >> end captured logging << ---------------------
----------------------------------------------------------------------
Ran 3 tests in 1.354s
FAILED (failures=1)

Because test_op3 failed, its seed appeared in the log file. The test_op2 seed was also displayed, as a reminder that the test needs more work before it is robust enough for random data. The command to reproduce the problem can be assembled simply by cutting and pasting from the log:

$ MXNET_TEST_SEED=2096230603 nosetests --verbose -s test_module.py:test_op3

If test_op3 instead dumped core, the test seed would not be initially apparent. Assuming the core-dump is repeatable based on the data, the module would first be re-run with the command:

$ MXNET_MODULE_SEED=3444154063 nosetests --logging-level=DEBUG --verbose -s test_module.py

The log would now include the test seeds for all tests before they are run, so the test could then be run in isolation as before with MXNET_TEST_SEED=2096230603.

Let's assume that test_op3 was altered by increasing a tolerance. How robust is the test now? This can be explored by repeating the test many times as in:

$ MXNET_TEST_COUNT=10000 nosetests --logging-level=DEBUG --verbose -s test_module.py:test_op3

Finally, this PR adds the @with_seed() decorator for all tests in modules that use random numbers. Also, it includes many specific test robustness fixes that were exposed once this new methodology was adopted internally by the author.
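
For reference, here is a minimal sketch of how a with_seed()-style decorator could implement the behavior described above. It is illustrative only; the actual code is in tests/python/unittest/common.py and may differ in detail, for example in how hard-coded seeds are reported even on success.

import functools
import logging
import os
import random

import numpy as np
import mxnet as mx

def with_seed(seed=None):
    def decorator(orig_test):
        @functools.wraps(orig_test)
        def test_new(*args, **kwargs):
            # MXNET_TEST_COUNT allows the test to be repeated under many seeds.
            for _ in range(int(os.getenv('MXNET_TEST_COUNT', '1'))):
                # Seed priority: MXNET_TEST_SEED env var, then a hard-coded
                # seed, then a fresh seed drawn from the module-level numpy RNG.
                env_seed = os.getenv('MXNET_TEST_SEED')
                if env_seed is not None:
                    this_seed = int(env_seed)
                elif seed is not None:
                    this_seed = seed
                else:
                    this_seed = np.random.randint(0, np.iinfo(np.int32).max)
                # Remember the module-level numpy RNG state so it can be
                # reinstated afterwards, isolating this test from the others.
                module_state = np.random.get_state()
                np.random.seed(this_seed)
                mx.random.seed(this_seed)
                random.seed(this_seed)
                try:
                    orig_test(*args, **kwargs)
                except Exception:
                    # A failing test always reports its seed for reproduction.
                    logging.getLogger('common').info(
                        'Setting test np/mx/python random seeds, use '
                        'MXNET_TEST_SEED=%d to reproduce.', this_seed)
                    raise
                finally:
                    np.random.set_state(module_state)
        return test_new
    return decorator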

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@DickJC123 requested a review from szha as a code owner on February 14, 2018 01:48
@marcoabreu (Contributor)

Very nice feature, thanks a lot for putting so much effort into this!

"""

try:
next_seed = np.random.randint(0, np.iinfo(np.int32).max)
Contributor

Could you elaborate on how this restores the pre-block state? To me, it looks like it will always pick a new random seed after this block finishes; I rather expected the old sequence generator to be restored. Imagine two sequences:
1: 100 -> 101 -> 102 -> ...
2: 200 -> 201 -> 202 -> ...

What I would expect is something along the lines of:

100
101
START: [random_seed(seq_2)]
200
201
202
END: [random_seed(seq_2)]
102
103
...

Contributor Author

You're asking about the 'with random_seed(NNN)' construct (as opposed to the with_seed() decorator). One issue faced here is that the mxnet RNG state is not save-and-restorable, so we can't achieve exactly what you were expecting. What's promised is this:

    Upon conclusion of the block, the RNGs are returned to
    a state that is a function of their pre-block state, so
    any prior non-determinism is preserved.

So 'with random_seed(NNN)' forces determinism on the random values seen inside its block, without changing the determinism/non-determinism of the values seen outside the block. Likewise, 'with random_seed()' (with no numeric arg) forces non-determinism on the random values seen inside its block, again without changing the determinism/non-determinism of the values seen outside the block.

Let's take a use case that explores this feature: a test that initially gets both random shapes and random data to populate tensors of those shapes. Let's say the test is flaky, and it's determined that it's best to work with a known set of shapes. The test could then be coded as:

@with_seed()
def test_stuff():
    # seed 1234 produces a variety of shapes known to work with this test
    with random_seed(1234):
        shapes = <get 'random' shapes>
    # fill shapes with random data
    tensors = populate_tensors(shapes)
    <perform test>

Let's say now that the test is seen to fail sporadically, and test seed 987654321 exposes the problem. The methodology of this PR should still work:

MXNET_TEST_SEED=987654321 nosetests --verbose -s test_stuff_module.py:test_stuff

will repro the failure.
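
For concreteness, here is roughly what such a context manager could look like. This is a sketch only, written to match the promise quoted above; the real helper in tests/python/unittest/common.py may differ.

import contextlib
import random

import numpy as np
import mxnet as mx

@contextlib.contextmanager
def random_seed(seed=None):
    try:
        # The seed used to leave the block is drawn from the current numpy RNG,
        # so the post-block state is a function of the pre-block state (the
        # mxnet RNG state itself cannot be saved and restored directly).
        next_seed = np.random.randint(0, np.iinfo(np.int32).max)
        if seed is None:
            # No argument: force non-determinism inside the block.
            np.random.seed()
            seed = np.random.randint(0, np.iinfo(np.int32).max)
        np.random.seed(seed)
        mx.random.seed(seed)
        random.seed(seed)
        yield
    finally:
        # Re-seed from next_seed: values seen after the block depend only on
        # the pre-block state, preserving any prior (non-)determinism.
        np.random.seed(next_seed)
        mx.random.seed(next_seed)
        random.seed(next_seed)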

Contributor

Thanks for elaborating, this sounds very reasonable.

I just imagined a few nested blocks with random and hard-coded seeds, and in my mind your process adds up. Nonetheless, could you please create a few unit tests to ensure that this behaviour works as expected and preserves a consistent sequence of numbers even if multiple blocks are nested, some seeds are hard-coded, and so on (see the sketch below for the kind of check I mean)? I think it's reasonable to validate our assumptions in case future changes are introduced.

Please also make sure to test the GPU context intensively, as we've experienced (known) issues on multi-GPU systems.
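
For instance, a nesting check might look something like this. This is a hypothetical sketch, not the PR's actual test; it assumes random_seed is importable from the unittest common module and follows the "post-block state is a function of the pre-block state" rule described above.

import numpy as np
from common import random_seed

def test_nested_random_seed_blocks():
    # First pass: record the values produced with hard-coded seeds.
    with random_seed(1):
        outer_first = np.random.randint(0, 1 << 30)
        with random_seed(2):
            inner = np.random.randint(0, 1 << 30)
        outer_second = np.random.randint(0, 1 << 30)
    # Second pass: the same hard-coded seeds must reproduce the same sequence,
    # both inside the nested block and after it.
    with random_seed(1):
        assert np.random.randint(0, 1 << 30) == outer_first
        with random_seed(2):
            assert np.random.randint(0, 1 << 30) == inner
        assert np.random.randint(0, 1 << 30) == outer_second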

def test_sequential_warning():
    with warnings.catch_warnings(record=True) as w:
        # The following line permits the test to pass if run multiple times
        warnings.simplefilter('always')
Contributor

In what case would a test run multiple times, and why would it fail in that case?

Contributor Author

This PR provides an easy way to run a test multiple times under different seeds. I found that this particular test would fail under those circumstances without this fix, when invoked as:

MXNET_TEST_COUNT=100 nosetests --verbose tests/python/unittest/test_gluon.py

Contributor

I think this sounds like a valid failure that should be resolved. Imagine a user calling the same method on a server and it failing after some time because we might have a memory leak or something else causing this error.

Contributor Author

The default behavior of 'warnings' is to collapse identical messages. This test purposefully does something that should generate a warning message; in the 'MXNET_TEST_COUNT=100' example above, repeated runs would not see that message without this fix. Thus, the test should ensure, independent of the global policy, that identical messages are not squashed. Try the following:

$ python
Python 2.7.12 (default, Nov 20 2017, 18:23:56) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet as mx
>>> import warnings
>>> from mxnet import gluon
>>> warnings.simplefilter('always')
>>> b = gluon.nn.Sequential()
>>> b.add(gluon.nn.Dense(20))
>>> b.hybridize()
/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/gluon/nn/basic_layers.py:83: UserWarning: All children of this Sequential layer are HybridBlocks. Consider using HybridSequential for the best performance.
  warnings.warn('All children of this Sequential layer are HybridBlocks. Consider ' \
>>> b.hybridize()
/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/gluon/nn/basic_layers.py:83: UserWarning: All children of this Sequential layer are HybridBlocks. Consider using HybridSequential for the best performance.
  warnings.warn('All children of this Sequential layer are HybridBlocks. Consider ' \
>>> 

If you start over and repeat this sequence but leave out the warnings.simplefilter('always') line, then the final b.hybridize() line will output nothing.
I invite you to pursue making warnings.simplefilter('always') part of all gluon usage, but independent of that, I think this test should keep this line, since it's what the test needs to pass.
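
For completeness, here is a sketch of what the full test might look like, given the quoted lines above. It is hypothetical; the actual test in tests/python/unittest/test_gluon.py may differ.

import warnings
from mxnet import gluon

def test_sequential_warning():
    with warnings.catch_warnings(record=True) as w:
        # 'always' prevents the duplicate-message filter from suppressing the
        # warning when the test is run repeatedly (e.g. MXNET_TEST_COUNT=100).
        warnings.simplefilter('always')
        b = gluon.nn.Sequential()
        b.add(gluon.nn.Dense(20))
        b.hybridize()
        # The warning text is taken from the session above; the exact assertion
        # used in the real test is an assumption here.
        assert any('HybridSequential' in str(x.message) for x in w)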

Contributor

I understand, thanks for the elaboration.

@@ -360,8 +373,12 @@ def backward(self, dm, dn):

assert_almost_equal(x.grad.asnumpy(), dx1)
assert_almost_equal(y.grad.asnumpy(), dy1)
# atol = 1e-6
# assert_almost_equal(x.grad.asnumpy(), dx1, atol=atol)
# assert_almost_equal(y.grad.asnumpy(), dy1, atol=atol)
Contributor

TODO: Restore?

Contributor Author

Will do. My mistake for leaving this in.

data = mx.symbol.Variable('data')
out = mx.symbol.L2Normalization(data=data, mode=mode, eps=norm_eps)
# TODO(szha): Seeding this masks failures. We need to do a deep dive for failures without this seed.
Contributor

Don't remove the TODO

Contributor Author

I moved both the TODO and the seed-setting to the beginning of the test.

Contributor

Ah I see, thanks

@marcoabreu (Contributor)

======================================================================
FAIL: test_operator.test_rcbrt_op
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Anaconda3\envs\py2\lib\site-packages\nose\case.py", line 197, in runTest
    self.test(*self.arg)
  File "C:\jenkins_slave\workspace\ut-python-cpu\tests\python\unittest\common.py", line 152, in test_new
    orig_test(*args, **kwargs)
  File "C:\jenkins_slave\workspace\ut-python-cpu\tests\python\unittest\test_operator.py", line 3807, in test_rcbrt_op
    check_numeric_gradient(test, [data_tmp])
  File "C:\jenkins_slave\workspace\ut-python-cpu\pkg_vc14_cpu\python\mxnet\test_utils.py", line 914, in check_numeric_gradient
    ("NUMERICAL_%s"%name, "BACKWARD_%s"%name))
  File "C:\jenkins_slave\workspace\ut-python-cpu\pkg_vc14_cpu\python\mxnet\test_utils.py", line 493, in assert_almost_equal
    raise AssertionError(msg)
AssertionError:
Items are not equal:
Error 1.311829 exceeds tolerance rtol=0.010000, atol=0.000000.  Location of maximum error:(2, 1), a=-84.507408, b=-83.413170
 NUMERICAL_data: array([[ -1.68704987,  -0.4118681 ,  -0.0692606 ,  -0.00718236],
       [ -0.02065301,  -0.03251433,  -0.16596913,  -0.01351535],
       [ -0.37240982, -84.50740814,  -0.02904981,  -0.08755922]], dtype=float32)
 BACKWARD_data: array([[ -1.68707681,  -0.41187629,  -0.06918965,  -0.00718444],
       [ -0.02061475,  -0.03253891,  -0.16597052,  -0.01352422],
       [ -0.3723917 , -83.41316986,  -0.02905715,  -0.08757117]], dtype=float32)
-------------------- >> begin captured logging << --------------------
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=788174893 to reproduce.
--------------------- >> end captured logging << ---------------------

Awesome!

@marcoabreu (Contributor)

Would it be possible to write a test for this feature? Especially around the 'with random_seed()' construct.

@marcoabreu (Contributor)

I'd like to see this feature and the stated assumptions tested, but otherwise LGTM.

@marcoabreu (Contributor) left a comment

LGTM


# To repro, comment-out skip line, then execute:
# MXNET_TEST_SEED=990952066 nosetests --verbose tests/python/unittest/test_operator.py:test_dropout
@unittest.skip("test fails with seed 990952066: 0 output seen with dropout ratio=0. Not yet tracked.")
Contributor

Could you fix the test? Dropout is quite important and we shouldn't disable it.

Contributor Author

I think the problem is better solved by someone closer to the modeling community. I'll explain what I know of the issue and put the test on a fixed seed.

Contributor Author

Created issue #9816.

@marcoabreu (Contributor)

Approved for merge. Please resolve the test failure if possible.

@marcoabreu merged commit f33591f into apache:master on Feb 18, 2018
@marcoabreu (Contributor)

@DickJC123 would you mind creating another PR which adds documentation for these features?

Good places could be the readme in the tests dir or maybe even somewhere on the website. I'll leave it up to you.

@DickJC123 (Contributor, Author) commented Feb 18, 2018 via email

rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
* Adds with_seed() decorator to unittests.  Will fail CI due to demo of known failing seeds.

* Fix failing tests, remove hard-coded bad seeds.

* Adds with_seed() decorator to gpu tests.  Will fail CI due to demo of known failing seeds.

* Fix failing test, add test of 'with random_seed()'

* Add with_seed() to test_gluon_model_zoo_gpy.py

* Increase atol for test_adam

* Added more 'with random_seed()' testing.  Hardcode bad test_training seed.

* test_training put back on random seeding

* Switched test_dropout to fixed seed, created issue

* Hardcoding seed that caused a test_rsp_push_pull CI core-dump

* test_rsp_push_pull CI core-dump not reproduced, so unsetting seed
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018 (same commit list as above).