High memory usage with bucketing #5035
Comments
Diving a little deeper into this issue. This is my current understanding of memory sharing:
Given the above, one potential problem occurs when we first see a small bucket and subsequently a larger one. In this scenario curr_module in BucketingModule points to the small bucket, whose data_pool_ is smaller than the data_pool_ of the default bucket module. I therefore tried modifying BucketingModule to always pass the default module as the shared module (my assumption being that the default module always occupies more space than any other module). This actually works, but only when all memory optimizations are turned off (NNVM_EXEC_ENABLE_INPLACE=false NNVM_EXEC_MATCH_RANGE=0). If they are not turned off, we still allocate new memory in InitDataEntryMemory because the shapes across buckets no longer match up. |
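For anyone who wants to repeat the experiment with the optimizations turned off, one way is to set the two variables from Python before MXNet is imported. This is only a minimal sketch: the variable names and values come from the comment above, while setting them before the import is an assumption about when they are read.

```python
import os

# Names and values are taken from the comment above.
# Assumption: the variables are read no earlier than the mxnet import,
# so they are set first.
os.environ["NNVM_EXEC_ENABLE_INPLACE"] = "false"
os.environ["NNVM_EXEC_MATCH_RANGE"] = "0"

# import mxnet as mx  # import only after the variables are set
```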
Thanks for the analysis. Haibin will be working on this. In the meantime, the cudnn bucketing example should work better since fewer memory blobs are allocated: /~https://github.com/dmlc/mxnet/pull/5004/files. Also see #4795. |
@tdomhan I have run into the same problem. In the nnvm version, even when the default bucket key is the largest, some memory is still allocated for new, smaller keys (possibly caused by the memory planning strategy), so memory usage increases after some batches.
|
The problem still exists in the non-nnvm version, but the increase is less noticeable. |
Here are the counts of different ndarray sizes in the default bucket's memory pool (for some arbitrary model and not the RNN example, but that shouldn't change the point):
vs the ones from a different bucket of smaller size.
The numbers above mean, for example, that the default bucket has 1 ndarray of size 256 (bytes). One can see that the two sets are not really compatible with each other: for example, the 23 arrays of 655360 bytes cannot be fit into the ndarrays of the default bucket. This is probably because memory planning is done independently for each bucket, leading to incompatible sets of ndarrays. |
@tdomhan Thanks for the analysis. The problem with bucketing is that, for each bucket, we currently plan its memory first, unaware of what is in the shared memory pool. Only after the memory is planned do we try to reuse what is available in the shared pool. Because the graphs of a big bucket and a small bucket don't necessarily produce the same memory plan, there is no guarantee that no extra memory is allocated. I'm working on a fix. |
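To make the "plan first, reuse afterwards" problem concrete, here is a small toy model. It is not the GraphExecutor code: the function name, the best-fit matching rule and all array sizes are made up for illustration. It shows that a bucket whose plan needs fewer bytes in total can still force new allocations when its array sizes don't line up with the arrays already in the shared pool.

```python
import bisect


def bind_with_shared_pool(shared_pool, requests):
    """Toy model: serve each requested size from the smallest free pool
    array that is large enough; otherwise count it as a new allocation.
    Returns the number of bytes that had to be newly allocated."""
    free = sorted(shared_pool)
    extra = 0
    for size in requests:
        i = bisect.bisect_left(free, size)  # smallest free array >= size
        if i < len(free):
            free.pop(i)      # reuse an array from the shared pool
        else:
            extra += size    # nothing fits: allocate new memory
    return extra


# Hypothetical plans (bytes); neither comes from a real model.
default_pool = [4096, 4096, 256]          # plan of the default (largest) bucket
small_bucket = [1024, 1024, 1024, 1024]   # independently planned smaller bucket

extra = bind_with_shared_pool(default_pool, small_bucket)
print(sum(default_pool), sum(small_bucket), extra)
```

In this made-up case the small bucket needs less than half the bytes of the default pool, yet two of its four arrays cannot be served from the pool, because each 1024-byte request occupies a whole 4096-byte array.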
Setting the storage type to naive and not keeping an executor for each bucket key will help. Memory usage then no longer grows, although it fluctuates within a small range. @tdomhan |
As the naive storage manager does a cudaMalloc for each ndarray created (unlike PooledStorage), I'm guessing this will decrease speed quite a bit. I don't have any numbers on this, though. What's your experience with that? From my understanding, PooledStorage is useful for use cases where you create many short-lived executors (e.g. minpy or your use case), so that memory from previously created ndarrays can be reused. For memory sharing between different buckets, the ndarrays haven't been deallocated yet, but we still want to share them between different graphs. |
@feiyulv Regarding PooledStorage, I believe its purpose is to avoid an excessive number of cudaMalloc calls for the NDArrays created: if the pool holds a previously malloc'ed block, we should try to reuse it. Excessive cudaMallocs can lead to much slower run-time performance, depending on the workload. |
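The difference between the two storage strategies discussed here can be sketched with two toy allocators that only count "device mallocs". The class names, the exact-size reuse rule and the sizes are assumptions for illustration and do not mirror MXNet's actual storage managers.

```python
from collections import defaultdict


class ToyNaiveStorage:
    """Every alloc is a fresh device malloc; free returns memory to the device."""

    def __init__(self):
        self.mallocs = 0

    def alloc(self, size):
        self.mallocs += 1
        return size

    def free(self, size):
        pass


class ToyPooledStorage:
    """Freed blocks are kept per size and handed out again on later allocs."""

    def __init__(self):
        self.mallocs = 0
        self.free_blocks = defaultdict(int)  # size -> number of cached blocks

    def alloc(self, size):
        if self.free_blocks[size] > 0:
            self.free_blocks[size] -= 1      # reuse a cached block
        else:
            self.mallocs += 1                # pool miss: real malloc
        return size

    def free(self, size):
        self.free_blocks[size] += 1


def run(storage, sizes, rounds):
    """Create and destroy a short-lived 'executor' `rounds` times."""
    for _ in range(rounds):
        blocks = [storage.alloc(s) for s in sizes]
        for b in blocks:
            storage.free(b)
    return storage.mallocs


sizes = [256, 1024, 1024]  # hypothetical ndarray sizes in bytes
naive_calls = run(ToyNaiveStorage(), sizes, rounds=10)
pooled_calls = run(ToyPooledStorage(), sizes, rounds=10)
print(naive_calls, pooled_calls)
```

Under these assumptions the pooled allocator only pays for the first round, which is the short-lived-executor case. It does not by itself help with arrays that are still alive in another bucket's executor, which is the sharing problem this issue is about.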
@eric-haibin-lin @tdomhan thx |
Found some inefficiency in the system and made a few changes related to memory allocation in MXNet:
Fixing 1 and 2 reduces memory usage quite a lot, while 3 and 4 bring only a marginal further reduction (5% ~ 10%) once 1 and 2 are fixed. Benchmark result on LSTM workload:
Benchmark result on Neural Style workload:
|
* Always pass in the bucket with default_bucket_key as the shared_module while binding new buckets
* Imbalance version of shared pool during plan memory
* Auto search and updated shared mem pool
* Cleanup unused code
* Fix compilation error
How long does 4 take? |
@eric-haibin-lin Is the bug fixed? Which version contains the fix? I'll check out the version you mention and take a look. |
@piiswrong @whaozl I'll add one more test case before merging this in. Will keep you posted. |
The memory planning is only run once per bucket, no? So the overhead shouldn't be per epoch but just a fixed time at the beginning of training. I assume that for a use case like minpy or any other use case where you dynamically construct a graph the overhead should be much higher. So does it take about 1s to do the sweep? |
If you are building graphs for trees for parsing, then you can have a different graph for every batch of data. In that case a 1/40s overhead per bucket is non-negligible. I think a better strategy is to run more graph optimization for the first 100-1000 binds and then turn it off. |
One other problem other than what was listed in 1-4 is the order of allocations in GraphExecutor::InitDataEntryMemory. Arrays were allocated in the order they're encountered in the graph. This could lead to situations where large ndarrays from the shared pool were used for much smaller ndarrays. Allocating the largest ndarrays first will further reduce the memory consumption. Here's a PR with the change: |
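The effect of the allocation order can be illustrated with a toy first-fit model. The function name, the first-fit rule and the sizes are assumptions made for this sketch, not the actual InitDataEntryMemory logic.

```python
def extra_bytes(pool, requests):
    """Toy first-fit: each request takes the first free pool array that is
    large enough; if none fits, it counts as newly allocated bytes."""
    free = list(pool)
    extra = 0
    for size in requests:
        for i, capacity in enumerate(free):
            if capacity >= size:
                del free[i]      # reuse this shared array
                break
        else:
            extra += size        # no free array fits: allocate new memory
    return extra


# Hypothetical sizes in bytes.
pool = [4096, 1024]
graph_order = [1024, 4096]  # order in which arrays are met in the graph

in_graph_order = extra_bytes(pool, graph_order)
largest_first = extra_bytes(pool, sorted(graph_order, reverse=True))
print(in_graph_order, largest_first)
```

In graph order the 1024-byte request grabs the 4096-byte array, so the later 4096-byte request finds nothing large enough. Sorting the requests from largest to smallest avoids that in this example.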
Yeah that's a good point! I actually made the same change in my PR over the weekend. /~https://github.com/dmlc/mxnet/pull/5133/files#diff-d8b5a5b027d00584737fb6486cba38b9R488 |
Oh ok. I wasn't aware of that. Glad we came to similar conclusions and looking forward to seeing the changes merged to master :)
|
Are there any remaining blockers for merging #5133? |
* Imbalance version of shared pool during plan memory
* Bug fix for no shared_pool case
* Auto search and updated shared mem pool
* Cleanup unused code
* Cleanup logging code
* Add unit test for shared storage
* Remove shared pool in PlanMemory. Fix lint warnings
* Fix lint warnings
* Use reference instead of ptrs
When using the bucketing module, I'd expect the memory usage to be about the same as when using the normal module unrolled to the largest bucket size. However, we observe unusually high GPU memory usage in MXNet when using multiple buckets.
This can be reproduced/observed with the lstm_bucketing.py example from the latest MXNet commit as such:
in examples/rnn/lstm_bucketing.py change:
When using multiple buckets (see line 49), overall memory usage is 1419MB.
When changing line 49 to only use a single bucket (e.g. 60), overall memory usage is only 1185MB.
It should be noted that the initial memory usage for bucketing is the same (1185MB), but after a couple of batches the memory usage increases. We suspect this is because the BucketingModule binds another sub-module when the data iterator yields a new bucket size, and memory sharing across modules isn't working properly.
While for this model the difference is only about 234MB (1419MB vs. 1185MB), we observed much higher differences in practice, making it difficult to train any reasonably sized model with bucketing.
Note: the default bucket key is of course the largest bucket.