Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Add support for fast variable-length LSTM #14208

Merged
merged 36 commits into from
May 7, 2019

Conversation

stephenrawls
Copy link
Contributor

@stephenrawls stephenrawls commented Feb 19, 2019

Description

This is very much a work in progress (WIP).

I am trying to add support for variable length LSTMs in MxNet, and need help with someone to help me complete the PR.

This PR currently adds support for variable length sequences using the built-in support of that by cuDNN. It currently does not add CPU support but probably should.

I had to make a few changes to get it working that definitely need to change before we think about merging:

  1. I changed the way gluon handles kwargs to model blocks. Originally I did this because I thought I needed to pass in use_sequence_length=True, sequence_length=sequence_length the way the current SequenceLast/SequenceReverse/etc operators do. Now I think I probably don't need this since I moved the use_sequence_length parameter to the constructor. (Although perhaps we don't need this at all? it seems clunky to me, but I was trying to follow existing conventions).

  2. Currently I hard-coded the F.RNN call in rnn_layer.py to look like: .RNN(inputs, params, states[0], states[1], sequence_length, ... because this was giving me python syntax errors: .RNN(inputs, params, *states, sequence_length,...

  3. optimizer.py had a bunch of imports that were failing for me (e.g. multi_sgd_update). Are they not failing for everyone else? Not sure why they just failed for me, but I had to remove to get my test script to run.

  4. I had to modify imperative_utils.h to temporarily remove the check that all ndarrays are on the same context, because cudnn expects the sequence length array to live on the CPu.

Still need to add unit tests to test this feature. currently I am using the following test script:
https://gist.github.com/stephenrawls/8a4f4cb941f0c3f32bf5a8b3b12d32c0

Unfortunately it looks like the test I have is failing. Not sure why might need help from cudnn guys to help track down.

I have tested with cudnn v7.4.2.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@ptrendx
Copy link
Member

ptrendx commented Feb 19, 2019

3\. optimizer.py had a bunch of imports that were failing for me (e.g. `multi_sgd_update`). Are they not failing for everyone else? Not sure why they just failed for me, but I had to remove to get my test script to run.

This is very strange - are you sure you are importing the right version of .so? Those functions are not present in 1.3.1 or 1.4.0 but should be present on master.

4\. I had to modify imperative_utils.h to temporarily remove the check that all ndarrays are on the same context, because cudnn expects the sequence length array to live on the CPu.

This is going to be a bigger issue since I don't think NNVM currently allows you to do inputs on different contexts. I think @sandeep-krishnamurthy was going to look at it at some point for handling of nvJPEG.

Copy link
Contributor

@drivanov drivanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lines are in wrong order and it leads to compilation problems.
The lines, currently ordered as

          << inputs[i]->ctx();
    */
    }

need to be replaced by

          << inputs[i]->ctx();
       }
      */

@stephenrawls
Copy link
Contributor Author

Sorry will fix shortly. Was a snafu when trying to clean up my code before pushing to GitHub. Thanks for the catch!

@stephenrawls
Copy link
Contributor Author

@szha maybe this deserves another PR but I also noticed that cuDNN supports both sequence-major and batch-major sequences:

From: https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#cudnnRNNForwardInferenceEx

With unpacked layout, both sequence major (i.e. time major) and batch major are supported

Currently gluon is supporting batch-major sequences by calling swapaxes:
/~https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/rnn/rnn_layer.py#L241-L242

This ends up calling mshadow's swapaxis function, which I assume must be doing a copy (?) but to be honest I couldn't totally follow what the code is doing at a quick glance:
/~https://github.com/dmlc/mshadow/blob/3dc80815d965b56b9a975dc27229361955bf66fe/mshadow/extension/swapaxis.h

If it is doing a copy for swapaxes, it would be preferable for Gluon to just let cudnn handle the batch-major layout directly w/o calling swapaxes.

@szha
Copy link
Member

szha commented Feb 20, 2019

@stephenrawls swapaxis is doing an actual transpose

@TaoLv
Copy link
Member

TaoLv commented Feb 20, 2019

@stephenrawls Still cannot get the whole picture of this PR. Can you post a proposal for this task with below information? Then we can think about how to support it on CPU side. Thank you.

  • What does the front end API look like? What's the typical usage of the API?
  • What's done in Gluon blocks/hybridblocks and sym.RNN to support that?
  • What's done inside cudnn?
  • From the title, only LSTM is supported?

@stephenrawls
Copy link
Contributor Author

Hi @TaoLv ,

Basically my motivation is: we want to use stacked bidirectional LSTM's with variable sequence length. Currently this is slow in MxNet because to do it right you either have to (1) Use the LSTMCell and unroll which doesn't take advantage of cuDNN on the GPU; (2) Use cuDNN one layer at a time, and do a lot of reversing of your output to pass into the backward-direction lstm, so you can be sure that padding doesn't effect the result.

This all seemed a bit silly since cuDNN directly provides support for variable length stacked bidirectional LSTMs efficiently.

Beyond my direct use case, for the community I guess it would probably be nice if:

  • All RNN types are supported. (This happens already via cuDNN even though I am focussed on LSTM)
  • Both CPU and GPU are supported

I guess the API I have is currently like this:

lstm = mx.gluon.LSTM(..normal params..., use_sequence_length=True)
lstm(input, sequence_length)

I am not tied to this api however. For example if we want to not have a use_sequence_length param in initialization then I'm happy to remove it. I only put it there in the first place to be similar to existing Sequence* family of operators.

As far as what changes need to happen on the symbol side, I am less sure, I primarily use gluon. Happy for folks to offer suggestions.

Thanks,
Stephen

@stephenrawls
Copy link
Contributor Author

P.S. There was a GitHub issue for this last year but it sort of morphed into other things and this particular piece didn't get accomplished;

#9543

@drivanov
Copy link
Contributor

@stephenrawls: With following simple changes in your test_varlength_lstm.py:

ctx = mx.gpu()
X = mx.nd.random.randn(timesteps, batch_size, nchannels).as_in_context(ctx)
valid_lengths = mx.nd.array([10,10,10,7,6]).as_in_context(ctx) # [10,10,10,10,6] seems to work fine

you don't need to modify imperative_utils.h

@stephenrawls
Copy link
Contributor Author

@drivanov -- yes if I move the sequence length array to the gpu then I don't have to change the logic the verifies everything is in the same context. The reason I am keeping the sequence length array on the cpu even though the data is on the gpu is because the cudnn api is that the sequence length array is on cpu, i.e. host memory.

If I pass in the sequence length array on GPU memory then I would have to do a GPU -> CPU copy inside of the lstm operator, right?

@stephenrawls
Copy link
Contributor Author

Latest push from me now gets variable length lstms working correctly in cudnn. Turns out that in addition to setting the layout to UNPACKED when creating input descriptors, we also need to call a function to enable unpacked padded LSTMs prior to calculating the operator's workspace size.

Now the forward pass works correctly and my rudimentary tests pass.

Still need help figuring out what to do with passing sequence_length input on the CPU. Any ideas?

@roywei
Copy link
Member

roywei commented Feb 21, 2019

@mxnet-label-bot add[RNN, Backend, Gluon, pr-work-in-progress]

@marcoabreu marcoabreu added Backend Issues related to the backend of MXNet Gluon pr-work-in-progress PR is still work in progress RNN labels Feb 21, 2019
@szha
Copy link
Member

szha commented Feb 21, 2019

@stephenrawls for now you can do a CHECK(!use_seq_len) in CPU kernel. @pengzhao-intel @TaoLv will see how to add it for CPU kernels.

@stephenrawls
Copy link
Contributor Author

@szha okay I will do that and update PR shortly.

My question though (sorry for not being clear) is about the GPU kernel. Currently the cuDNN api requires the sequence_length array to be in host memory, so I am passing the sequence_length array on the CPU even though the data array is on the GPU.

Currently the imperative_utils.h requires that all ndarrays passed to an operator are on the same context. To get around this I just commented out that check so I could at least test that my changes were working.

My question is--(1) What's the right way to modify the context check in imperative_utils.h? (2) Is there something similar that happens on the symbol api side that would have to change?

The alternative is to just pass the sequence_length array in on the GPU, but this would require an extra GPU -> CPU data transfer inside the operator that I would prefer to avoid.

@szha
Copy link
Member

szha commented Feb 21, 2019

@stephenrawls the length array is generally small so I think it's fine to do a copy inside the operator. @reminisce @junrushao1994 are looking into how to refurbish the backend so that such cases can be better supported.

@ptrendx
Copy link
Member

ptrendx commented Feb 21, 2019

To answer your question (2) Is there something similar that happens on the symbol api side that would have to change?: yes, place device graph pass will just error out saying that it is impossible to mix devices.

I had a similar problem in aggregate SGD case, where I wanted to put learning rates and weight decay rates as CPU side NDArray. I ended up passing them as a Tuple argument to the op in Python.

@stephenrawls
Copy link
Contributor Author

@szha Okay I re-worked the code to take in sequence_length on the GPU and copy to the CPU inside the kernel. I added a unit test to test the functionality and it appears to be working.

For some reason I am getting errors when trying to pass an int32 sequence_length array. Not sure why, because I changed infer_shape() to allow it. Can look into it later but if anyone has suggestions please let me know.

@ptrendx -- I figured out the weird import error and reverted those changes to optimizer.py in my PR. Also as mentioned above, since I am now doing an extra GPU->CPU copy in the kernel code, I have also reverted the change to imperative_utils.h.

@szha -- I think all that is remaining in this PR is to add the CHECK(!use_seq_len) for the CPU case, to figure out why int32 sequence_length is not working, and any cleanup / modifications you think are appropriate. Also--I added a unit test for gluon, but should I also add one for the Symbol api code, or is it sufficient to test this functionality in Gluon?

@pinaraws
Copy link

@szha Could you please respond to questions posted by @stephenrawls ?

@piyushghai
Copy link
Contributor

@stephenrawls Can you rebase with the latest master to resolve the merge conflicts ?

@stephenrawls
Copy link
Contributor Author

FYI I am looking at this. I have my changes rebased against the latest changes from master now, and updated to fit the new pattern. I just need to build and test. Since things have changed significantly since before I just need to re-setup my build environment, should have an update in a day or so.

@stephenrawls
Copy link
Contributor Author

@szha finally got around to rebasing my PR and pushing the update.

The centos-gpu build is failing. The new var-length unit test is failing with this error:

ERROR: test_gluon_gpu.test_layer_bidirectional_varseqlength
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 177, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/../unittest/common.py", line 112, in test_new
    assert_raises((MXNetError, RuntimeError), orig_test, *args, **kwargs)
  File "/usr/lib64/python3.6/unittest/case.py", line 750, in assertRaises
    return context.handle('assertRaises', args, kwargs)
  File "/usr/lib64/python3.6/unittest/case.py", line 195, in handle
    callable_obj(*args, **kwargs)
  File "/work/mxnet/tests/python/gpu/test_gluon_gpu.py", line 290, in test_layer_bidirectional_varseqlength
    check_layer_bidirectional_varseqlen(7, 5)
  File "/work/mxnet/tests/python/gpu/test_gluon_gpu.py", line 267, in check_layer_bidirectional_varseqlen
    net_output = net(data, sequence_length).asnumpy()
  File "/work/mxnet/python/mxnet/gluon/block.py", line 540, in __call__
    out = self.forward(*args)
  File "/work/mxnet/python/mxnet/gluon/block.py", line 912, in forward
    self._deferred_infer_shape(x, *args)
  File "/work/mxnet/python/mxnet/gluon/block.py", line 793, in _deferred_infer_shape
    raise ValueError(error_msg)
ValueError: Deferred initialization failed because shape cannot be inferred. Error in operator lstm_rnn0: Shape inconsistent, Provided = [10], inferred shape=(2,10,7)
-------------------- >> begin captured stdout << ---------------------
infer_shape error. Arguments:
  data0: (11, 10, 5)
  data1: (10,)

I'm not sure why yet, because when I run the unit test code on my own machine with my own build it works.

Also the unix-gpu build is failing for the unit test test_operator.test_custom_op_exc which seems unrelated to my PR.

I'll see if I can figure out why the centos-gpu build is failing (even though it works for me on my own machine) tomorrow. But if you have any ideas / advice please let me know.

@stephenrawls
Copy link
Contributor Author

rebased against mainline (requiring a force-push), which now gets all unit tests passing.

src/operator/rnn-inl.h Outdated Show resolved Hide resolved

# Only compare the valid sections for each batch entry
for b in range(batch_size):
assert_allclose(net_output[:sequence_length_np[b], b], ref_net_output[:sequence_length_np[b], b])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't seem to test if the length-based masking is working properly, because the reference implementation also relies on sequence length feature. consider using LSTMCell as reference instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reference net is not using the sequence length feature of cudnn, because use_sequence_length defaults to false.

The reference net is actually manually implementing the variable sequence length support by using two LSTMs for forward/backward direction and manually handling reversing them and concatenating the forward/backward directions. That is, it is doing it a slower way than via cudnn, but in a way we know should produce correct results.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I mistook the use_sequence_length flag to be in rnn op. Still, whether the returned state is of the right step or not is not tested, which is also an important aspect of variable length RNN support. It may be hard to test it using RNN layer as reference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good point. I am only testing the returned output, not the return state.

I guess I could just loop over the batch elements one-by-one, passing them each in turn to the reference lstm. That way each input is correctly sized and I can easily grab the right return state.

Should I do that now or in the follow-on PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine for it to be in a follow-up PR.

@stephenrawls
Copy link
Contributor Author

All unit tests except for the following passed:

AssertionError: Diff. between MXNet & TensorRT accuracy too high:
           MXNet = 99.150000, TensorRT = 99.140000

So I pushed a new minor change to trigger unit tests to run again and get this flaky unit test to pass.

@szha szha merged commit 527573e into apache:master May 7, 2019
access2rohit pushed a commit to access2rohit/incubator-mxnet that referenced this pull request May 14, 2019
* initial commit for variable length sequence support w/ cudnn

* removing check about all vectors on same context (need to add back in)

* fixing commented-out code to actually coment-out what I wanted

* fixing cudnn layout type to be unpacked in var-length seq case

* looks like param.batch_size_ etc weren't previousy getting set in cudnn operator code. still doesn't fix cudnn error though

* must call cudnnSetRNNPaddingMode() to enable unpacked padded sequences

* cleaning up & adding unit tests

* cleanign up

* cleanign up

* removing stringstream and checking for cudnn >= 7.2

* fixing whitespace formatting errors; adding ifdef version guard for cudnn padding

* fixing a few syntax errors

* changing order of arguments in hybird_forward for backward compatibility

* more build validation fixes

* using emplace_back to make linter happy

* adding import of mxnet.ndarray

* switching order of sequence_length in hybrid_forward again

* adding __call__ override to rnn layer to handle optional sequence_length argument

* whoops swapped order of args in one place but not the other

* changing type() to isinstance() to make linter happy

* changing lstm var seq length call to explciitly name sequence_length parameter

* fixing bad scope of if-statement checking state outputs

* resolving reviewer comments

* making linter happy by putting var definitions in appropriate ifdef

* fixing linter again

* fixing whitespace issues with linter

* fixing whitespace issues with linter

* fixing some typos that emerged fixing linter

* linter

* fixing more whitespace issues

* only access kTempSpace if on gpu

* removing tabs that slipped in

* fixing too-long line

* changing ifdef guard to be more generic

* reverting change so whitespace stays same w/ master

* adding todo comment
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* initial commit for variable length sequence support w/ cudnn

* removing check about all vectors on same context (need to add back in)

* fixing commented-out code to actually coment-out what I wanted

* fixing cudnn layout type to be unpacked in var-length seq case

* looks like param.batch_size_ etc weren't previousy getting set in cudnn operator code. still doesn't fix cudnn error though

* must call cudnnSetRNNPaddingMode() to enable unpacked padded sequences

* cleaning up & adding unit tests

* cleanign up

* cleanign up

* removing stringstream and checking for cudnn >= 7.2

* fixing whitespace formatting errors; adding ifdef version guard for cudnn padding

* fixing a few syntax errors

* changing order of arguments in hybird_forward for backward compatibility

* more build validation fixes

* using emplace_back to make linter happy

* adding import of mxnet.ndarray

* switching order of sequence_length in hybrid_forward again

* adding __call__ override to rnn layer to handle optional sequence_length argument

* whoops swapped order of args in one place but not the other

* changing type() to isinstance() to make linter happy

* changing lstm var seq length call to explciitly name sequence_length parameter

* fixing bad scope of if-statement checking state outputs

* resolving reviewer comments

* making linter happy by putting var definitions in appropriate ifdef

* fixing linter again

* fixing whitespace issues with linter

* fixing whitespace issues with linter

* fixing some typos that emerged fixing linter

* linter

* fixing more whitespace issues

* only access kTempSpace if on gpu

* removing tabs that slipped in

* fixing too-long line

* changing ifdef guard to be more generic

* reverting change so whitespace stays same w/ master

* adding todo comment
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Backend Issues related to the backend of MXNet Gluon pr-work-in-progress PR is still work in progress RNN
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants