This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Gluon RNN memory leaks with extra variables #13951

Closed
yifeim opened this issue Jan 21, 2019 · 10 comments · Fixed by #14365 or #14480
Labels
Backend, Bug, CUDA, Gluon, Performance

Comments

@yifeim
Contributor

yifeim commented Jan 21, 2019

Description

Gluon allows a model to take extra input variables that do not contribute to the model output. However, passing them may cause a memory leak.

Environment info (Required)

----------Python Info----------
Version      : 3.6.5
Compiler     : GCC 7.2.0
Build        : ('default', 'Apr 29 2018 16:14:56')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 10.0.1
Directory    : /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version      : 1.3.1
Directory    : /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet
Commit Hash   : 19c501680183237d52a862e6ae1dc4ddc296305b
----------System Info----------
Platform     : Linux-4.14.77-70.82.amzn1.x86_64-x86_64-with-glibc2.9
system       : Linux
node         : ip-172-16-95-144
release      : 4.14.77-70.82.amzn1.x86_64
version      : #1 SMP Mon Dec 3 20:01:27 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               2706.669
BogoMIPS:              4600.11
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-7
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0020 sec, LOAD: 1.0198 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0912 sec, LOAD: 0.1530 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.5845 sec, LOAD: 0.1434sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0089 sec, LOAD: 0.1170 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0100 sec, LOAD: 0.3888 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0104 sec, LOAD: 0.0782 sec.

Package used (Python/R/Scala/Julia): Python

Error Message:

If you run watch -n0.1 nvidia-smi, you may observe GPU memory usage growing by 2MB every few seconds.

Minimum reproducible example

See mxnet-memory-leak.tar.gz
The main differences between the attachment and examples/gluon/language_model/ are the addition of extra on Line 56 in model.py and the addition of mx.nd.array([], ctx=context) on Lines 166 and 183 in train.py.

Steps to reproduce

1. python train.py --cuda --tied --nhid 200 --emsize 200 --epochs 20 --dropout 0.2 &
2. watch -n0.1 nvidia-smi
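
Watching nvidia-smi by eye works, but the same check can be scripted so the growth is recorded. The following is a minimal sketch, not part of the original repro: the function names are hypothetical, it assumes nvidia-smi is on PATH, and it only looks at GPU 0.

```python
import subprocess
import time


def parse_used_mib(smi_output):
    """Parse used-memory values (MiB), one line per GPU, from nvidia-smi CSV output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def read_used_mib():
    """Query used memory per GPU. Assumes nvidia-smi is installed and on PATH."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_used_mib(out)


def is_steadily_growing(samples):
    """True if usage never decreases and ends higher than it started."""
    if len(samples) < 2:
        return False
    non_decreasing = all(b >= a for a, b in zip(samples, samples[1:]))
    return non_decreasing and samples[-1] > samples[0]


def watch(seconds=30, interval=1.0, gpu_index=0):
    """Sample one GPU's used memory for roughly `seconds` and return the samples."""
    samples = []
    for _ in range(int(seconds / interval)):
        samples.append(read_used_mib()[gpu_index])
        time.sleep(interval)
    return samples
```

Calling watch() while train.py runs in the background and passing the result to is_steadily_growing() gives a rough yes/no signal. Normal allocator warm-up in the first iterations can also look like growth, so a longer observation window is more trustworthy.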

What have you tried to solve it?

  1. Add a dummy link between all inputs and outputs. However, this may not always be possible / convenient / readable.
  2. I previously submitted a feature request to allow None input types in Gluon models. I communicated with @szha, who indicated that this would not be fundamentally challenging. However, it has not been acted upon and may be low-hanging fruit alongside the memory leak fix.

Related: #13247

@piyushghai
Contributor

@mxnet-label-bot Add [Gluon, Performance]

@apeforest
Contributor

@mxnet-label-bot add [backend, cuda]

@marcoabreu added the Backend and CUDA labels on Jan 22, 2019
@apeforest
Contributor

@yifeim I am looking into this issue.

@yifeim
Contributor Author

yifeim commented Feb 1, 2019

@apeforest Why is this not a bug?

@apeforest
Contributor

@yifeim Sorry, I got too busy and haven't had a chance to dive deep into this. Yes, I think it's a bug. @mxnet-label-bot add [Bug]

@yuxihu
Member

yuxihu commented Mar 7, 2019

The memory leak is related to the extra unused variable you passed into your RNN model, but it is NOT specific to RNN. In your repro script, you created a size-zero ndarray in each loop iteration, which caused the memory leak.

for epoch in range(args.epochs):
    ...
    for i, (data, target) in enumerate(train_data):
        ...
        with autograd.record():
            ...
            output, hidden = model(data, hidden, mx.nd.array([], ctx=context))

However, since the size-zero ndarray is not used anywhere, it is better coding practice to create it once outside the loop and reuse it throughout your training. The same change applies to the eval() function in your repro script.

extra = mx.nd.array([], ctx=context)
for epoch in range(args.epochs):
    ...
    for i, (data, target) in enumerate(train_data):
        ...
        with autograd.record():
            ...
            output, hidden = model(data, hidden, extra)

With this change, I ran your repro script for 10 epochs with the mxnet_cu90mkl 1.3.1 and 1.4.0 packages and did not see any memory leak.

However, there is indeed an underlying memory leak, which is the root cause of this issue. Please refer to #14358 for more details.

@yuxihu
Member

yuxihu commented Mar 7, 2019

@yifeim After a little more digging, I think the issue is specifically related to the use of a size-zero ndarray for your extra variable. If you just use mx.nd.array([1], ctx=context) as the extra variable in the loop of your repro script, you will not observe any memory leak. The true problem is creating a size-zero ndarray in a loop.
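
One way to check a claim like this is a small A/B harness that runs an allocation step many times and reports how much a memory reading changed. The sketch below is an illustration only, not code from the repro script: memory_growth and its parameters are hypothetical names, and the caller has to supply a real memory reader (for example one that parses nvidia-smi output).

```python
def memory_growth(step, read_memory, iterations=1000):
    """Run `step` repeatedly and return the change in `read_memory()`.

    step        -- zero-argument callable performing one allocation
    read_memory -- zero-argument callable returning current usage as a number
    """
    before = read_memory()
    for _ in range(iterations):
        step()
    return read_memory() - before


# Hypothetical usage with MXNet (assumes `mx`, `context`, and a reader
# function `read_gpu_mib` are defined by the caller):
#
#   def empty_step():
#       mx.nd.array([], ctx=context)
#       mx.nd.waitall()   # wait for asynchronous work before measuring
#
#   def one_element_step():
#       mx.nd.array([1], ctx=context)
#       mx.nd.waitall()
#
#   print(memory_growth(empty_step, read_gpu_mib))
#   print(memory_growth(one_element_step, read_gpu_mib))
```

If the size-zero case shows growth that scales with the iteration count while the one-element case stays flat, that would be consistent with the observation above.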

@yifeim
Contributor Author

yifeim commented Mar 8, 2019

Very interesting. Thanks a lot for the insights!

@lupesko
Contributor

lupesko commented Mar 11, 2019

Thanks for handling @yuxihu!

@yuxihu
Member

yuxihu commented Mar 20, 2019

@anirudh2290 Could you please reopen this? The original fix has been reverted due to test flakiness. I am working on an alternative fix.
