Add support for fast variable-length LSTM #14208
Conversation
This is very strange - are you sure you are importing the right version of .so? Those functions are not present in 1.3.1 or 1.4.0 but should be present on master.
This is going to be a bigger issue, since I don't think NNVM currently allows you to have inputs on different contexts. I think @sandeep-krishnamurthy was going to look at it at some point for handling of nvJPEG.
The lines are in the wrong order, which leads to compilation problems. The lines, currently ordered as

```cpp
<< inputs[i]->ctx();
*/
}
```

need to be replaced by

```cpp
<< inputs[i]->ctx();
}
*/
```
Sorry, will fix shortly. It was a snafu when trying to clean up my code before pushing to GitHub. Thanks for the catch!
@szha Maybe this deserves another PR, but I also noticed that cuDNN supports both sequence-major and batch-major sequences (from https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#cudnnRNNForwardInferenceEx).
Currently Gluon supports batch-major sequences by calling swapaxes. This ends up calling mshadow's swapaxis function, which I assume must be doing a copy(?), but to be honest I couldn't totally follow what the code is doing at a quick glance. If it is doing a copy for swapaxes, it would be preferable for Gluon to just let cuDNN handle the batch-major layout directly, without calling swapaxes.
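To see the layout cost concretely, here is a small numpy stand-in (not MXNet code; shapes are illustrative) showing that converting batch-major (NTC) to sequence-major (TNC) via swapaxes yields a non-contiguous view that must be copied before a kernel can consume it:

```python
import numpy as np

# Hypothetical sizes: batch-major (NTC) input with batch=2, seq_len=3, channels=4.
batch_major = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)

# Convert to the sequence-major (TNC) layout the packed cuDNN path expects.
seq_major = np.swapaxes(batch_major, 0, 1)

# The swapped result is a non-contiguous view; materializing it for a
# kernel forces a real transpose/copy -- the cost discussed above.
assert not seq_major.flags['C_CONTIGUOUS']
contiguous = np.ascontiguousarray(seq_major)
assert contiguous.shape == (3, 2, 4)
```

If cuDNN consumes the batch-major layout directly, this copy disappears from both the forward and backward passes.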
@stephenrawls swapaxis is doing an actual transpose.
@stephenrawls Still cannot get the whole picture of this PR. Can you post a proposal for this task with the information below? Then we can think about how to support it on the CPU side. Thank you.
Hi @TaoLv, Basically my motivation is: we want to use stacked bidirectional LSTMs with variable sequence lengths. Currently this is slow in MXNet, because to do it right you either have to (1) use LSTMCell and unroll, which doesn't take advantage of cuDNN on the GPU; or (2) use cuDNN one layer at a time and do a lot of reversing of your output to pass into the backward-direction LSTM, so you can be sure that padding doesn't affect the result. This all seemed a bit silly, since cuDNN directly provides efficient support for variable-length stacked bidirectional LSTMs. Beyond my direct use case, for the community I guess it would probably be nice if:
I guess the API I have is currently like this:
I am not tied to this API, however. (For example, if we want to not have a …) As far as what changes need to happen on the symbol side, I am less sure; I primarily use Gluon. Happy for folks to offer suggestions. Thanks,
P.S. There was a GitHub issue for this last year, but it sort of morphed into other things and this particular piece didn't get accomplished.
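The "lot of reversing" in option (2) above amounts to reversing only the valid prefix of each padded batch entry before feeding the backward-direction LSTM. A minimal numpy sketch of that masking-aware reverse (function name and shapes are hypothetical, not from this PR):

```python
import numpy as np

def sequence_reverse(data, lengths):
    """Reverse only the first lengths[b] steps of each batch entry.

    data: (seq_len, batch, feat) padded, sequence-major array.
    Padding positions are left untouched, so the backward-direction LSTM
    sees the real tokens first and the padding last.
    """
    out = data.copy()
    for b, valid_len in enumerate(lengths):
        out[:valid_len, b] = data[:valid_len, b][::-1]
    return out

data = np.arange(8, dtype=np.float32).reshape(4, 2, 1)  # seq_len=4, batch=2
lengths = [4, 2]
rev = sequence_reverse(data, lengths)
# Entry 0 is fully reversed; entry 1 is reversed only over its 2 valid steps.
```

Doing this (and the matching un-reverse on the outputs) per layer is the overhead that cuDNN's native variable-length support avoids.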
@stephenrawls: With the following simple changes in your … you don't need to modify …
@drivanov -- Yes, if I move the sequence length array to the GPU, then I don't have to change the logic that verifies everything is in the same context. The reason I am keeping the sequence length array on the CPU, even though the data is on the GPU, is that the cuDNN API expects the sequence length array on the CPU, i.e. in host memory. If I pass in the sequence length array in GPU memory, then I would have to do a GPU -> CPU copy inside of the LSTM operator, right?
Latest push from me now gets variable-length LSTMs working correctly in cuDNN. It turns out that, in addition to setting the layout to UNPACKED when creating input descriptors, we also need to call a function to enable unpacked padded LSTMs before computing the operator's workspace size. Now the forward pass works correctly and my rudimentary tests pass. I still need help figuring out what to do about passing the sequence_length input on the CPU. Any ideas?
@mxnet-label-bot add[RNN, Backend, Gluon, pr-work-in-progress]
@stephenrawls for now you can do a
@szha Okay, I will do that and update the PR shortly. My question, though (sorry for not being clear), is about the GPU kernel. Currently the cuDNN API requires the sequence_length array to be in host memory, so I am passing the sequence_length array on the CPU even though the data array is on the GPU. Currently imperative_utils.h requires that all ndarrays passed to an operator are on the same context. To get around this I just commented out that check, so I could at least test that my changes were working. My questions are: (1) What's the right way to modify the context check in imperative_utils.h? (2) Is there something similar on the symbol API side that would have to change? The alternative is to just pass the sequence_length array in on the GPU, but this would require an extra GPU -> CPU data transfer inside the operator, which I would prefer to avoid.
@stephenrawls the length array is generally small, so I think it's fine to do a copy inside the operator. @reminisce @junrushao1994 are looking into how to refurbish the backend so that such cases can be better supported.
To answer your question: I had a similar problem in the aggregate SGD case, where I wanted to pass learning rates and weight decay rates as CPU-side NDArrays. I ended up passing them as a tuple argument to the op in Python.
@szha Okay, I re-worked the code to take in sequence_length on the GPU and copy it to the CPU inside the kernel. I added a unit test for the functionality, and it appears to be working. For some reason I am getting errors when trying to pass an int32 sequence_length array. Not sure why, because I changed infer_shape() to allow it. I can look into it later, but if anyone has suggestions please let me know. @ptrendx -- I figured out the weird import error and reverted those changes to optimizer.py in my PR. Also, as mentioned above, since I am now doing an extra GPU->CPU copy in the kernel code, I have reverted the change to imperative_utils.h as well. @szha -- I think all that remains in this PR is to add the
@szha Could you please respond to the questions posted by @stephenrawls?
@stephenrawls Can you rebase with the latest master to resolve the merge conflicts?
FYI, I am looking at this. I have my changes rebased against the latest master now, and updated to fit the new pattern. I just need to build and test. Since things have changed significantly since before, I need to re-set up my build environment; I should have an update in a day or so.
Force-pushed from 8cbd124 to 0edb5c2.
@szha finally got around to rebasing my PR and pushing the update. The centos-gpu build is failing. The new var-length unit test is failing with this error:
I'm not sure why yet, because when I run the unit test code on my own machine with my own build, it works. The unix-gpu build is also failing for the unit test. I'll see if I can figure out tomorrow why the centos-gpu build is failing (even though it works for me on my own machine), but if you have any ideas or advice please let me know.
Force-pushed from 889148f to 6b14793.
Rebased against mainline (requiring a force-push); all unit tests now pass.
```python
# Only compare the valid sections for each batch entry
for b in range(batch_size):
    assert_allclose(net_output[:sequence_length_np[b], b], ref_net_output[:sequence_length_np[b], b])
```
this doesn't seem to test if the length-based masking is working properly, because the reference implementation also relies on sequence length feature. consider using LSTMCell as reference instead.
The reference net is not using the sequence length feature of cudnn, because use_sequence_length defaults to false.
The reference net is actually manually implementing the variable sequence length support by using two LSTMs for forward/backward direction and manually handling reversing them and concatenating the forward/backward directions. That is, it is doing it a slower way than via cudnn, but in a way we know should produce correct results.
I see. I mistook the use_sequence_length flag to be in rnn op. Still, whether the returned state is of the right step or not is not tested, which is also an important aspect of variable length RNN support. It may be hard to test it using RNN layer as reference.
Ah, good point. I am only testing the returned output, not the return state.
I guess I could just loop over the batch elements one-by-one, passing them each in turn to the reference lstm. That way each input is correctly sized and I can easily grab the right return state.
Should I do that now or in the follow-on PR?
I'm fine for it to be in a follow-up PR.
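One way to check the returned state against a reference, per the discussion above, is to gather each batch entry's output at its final valid step. A numpy sketch (helper name and shapes are hypothetical, not from this PR):

```python
import numpy as np

def last_valid_state(outputs, lengths):
    """Gather the hidden output at each sequence's final valid step.

    outputs: (seq_len, batch, hidden) sequence-major LSTM outputs.
    A correct variable-length RNN should return this, per batch entry,
    as its final state -- not the output at the padded last step.
    """
    batch = outputs.shape[1]
    return np.stack([outputs[lengths[b] - 1, b] for b in range(batch)])

outputs = np.arange(12, dtype=np.float32).reshape(3, 2, 2)  # seq_len=3, batch=2
states = last_valid_state(outputs, [3, 1])
```

Running the reference LSTM one batch element at a time, as suggested above, sidesteps the gather entirely, since each input is then exactly its valid length.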
All unit tests except for the following passed:
So I pushed a new minor change to trigger unit tests to run again and get this flaky unit test to pass.
* initial commit for variable-length sequence support w/ cuDNN
* removing check that all vectors are on the same context (need to add back in)
* fixing commented-out code to actually comment out what I wanted
* fixing cuDNN layout type to be unpacked in the var-length seq case
* looks like param.batch_size_ etc. weren't previously getting set in the cuDNN operator code; still doesn't fix the cuDNN error though
* must call cudnnSetRNNPaddingMode() to enable unpacked padded sequences
* cleaning up & adding unit tests
* removing stringstream and checking for cuDNN >= 7.2
* fixing whitespace formatting errors; adding ifdef version guard for cuDNN padding
* fixing a few syntax errors
* changing order of arguments in hybrid_forward for backward compatibility
* more build validation fixes
* using emplace_back to make the linter happy
* adding import of mxnet.ndarray
* switching order of sequence_length in hybrid_forward again
* adding __call__ override to the RNN layer to handle the optional sequence_length argument
* fixing swapped order of args in one place but not the other
* changing type() to isinstance() to make the linter happy
* changing the LSTM var-seq-length call to explicitly name the sequence_length parameter
* fixing bad scope of if-statement checking state outputs
* resolving reviewer comments
* putting var definitions in the appropriate ifdef to make the linter happy
* assorted linter, typo, and whitespace fixes
* only access kTempSpace if on GPU
* removing tabs that slipped in
* fixing a too-long line
* changing ifdef guard to be more generic
* reverting change so whitespace stays the same as master
* adding TODO comment
Description
This is very much a work in progress (WIP).
I am trying to add support for variable-length LSTMs in MXNet, and need help completing the PR.
This PR currently adds support for variable-length sequences using cuDNN's built-in support for them. It does not yet add CPU support, but probably should.
I had to make a few changes to get it working that definitely need to change before we think about merging:
1. I changed the way Gluon handles kwargs to model blocks. Originally I did this because I thought I needed to pass in `use_sequence_length=True, sequence_length=sequence_length` the way the current SequenceLast/SequenceReverse/etc. operators do. Now I think I probably don't need this, since I moved the `use_sequence_length` parameter to the constructor. (Although perhaps we don't need this at all? It seems clunky to me, but I was trying to follow existing conventions.)
2. Currently I hard-coded the `F.RNN` call in rnn_layer.py to look like `.RNN(inputs, params, states[0], states[1], sequence_length, ...` because `.RNN(inputs, params, *states, sequence_length, ...` was giving me Python syntax errors.
3. optimizer.py had a bunch of imports that were failing for me (e.g. `multi_sgd_update`). Are they not failing for everyone else? Not sure why they failed only for me, but I had to remove them to get my test script to run.
4. I had to modify imperative_utils.h to temporarily remove the check that all ndarrays are on the same context, because cuDNN expects the sequence length array to live on the CPU.
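On the syntax error mentioned above: a positional argument after an unpacked `*iterable` in a call is only valid in Python 3.5+ (PEP 448); earlier interpreters raise a SyntaxError at parse time, which would explain why spelling out the states worked while the `*states` form did not. A sketch with illustrative names (not the real `F.RNN` signature):

```python
# Hypothetical stand-in for the RNN call; names are illustrative only.
def rnn_call(inputs, params, state_h, state_c, sequence_length):
    return (inputs, params, state_h, state_c, sequence_length)

states = ('h0', 'c0')

# Works on every Python version: index the states explicitly.
explicit = rnn_call('x', 'w', states[0], states[1], 'lens')

# Requires Python 3.5+ (PEP 448): unpack, then a trailing positional arg.
unpacked = rnn_call('x', 'w', *states, 'lens')

assert explicit == unpacked == ('x', 'w', 'h0', 'c0', 'lens')
```

Passing `sequence_length` by keyword (`rnn_call('x', 'w', *states, sequence_length='lens')`) also avoids the issue on all versions.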
I still need to add unit tests for this feature. Currently I am using the following test script:
https://gist.github.com/stephenrawls/8a4f4cb941f0c3f32bf5a8b3b12d32c0
Unfortunately, it looks like the test I have is failing. Not sure why; I might need help from the cuDNN folks to track it down.
I have tested with cudnn v7.4.2.