Out of memory error in 3D Conv for matrix splits > 10, cuDNN strange behaviour #14029
Thanks, I was able to repro the issue. With a slice of 11, it looks like it goes through a different CUDA code path. See details below for the GPU stack trace for :10,:10 vs. :11,:11.
Details of GPU allocation with a = net(x[:, :, :, :10, :10]):
With a = net(x[:, :, :, :11, :11]):
This looks like a cuDNN implementation bug. For case 11, void fft3d_r2c_16x16x16 is called; for case 10 or less, void cudnn::detail::implicit_convolveNd_sgemm is called. @ptrendx @DickJC123 would you be able to help here, in case you have any idea about this behavior of cuDNN, or point this to the right people?
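The kernel name hints at why the workspace balloons: fft3d_r2c_16x16x16 suggests the FFT path pads each spatial dimension up to 16-point transforms and keeps complex frequency-domain copies of the input and filters as workspace. A back-of-the-envelope sketch (every shape below is hypothetical; the thread never shows the definitions of net and x, so this only illustrates the scaling, not the actual 10 GB figure):

```python
# Rough, hypothetical estimate of FFT-based 3D conv workspace.
# All shapes are invented for illustration; the real net/x are not in the thread.
N, C_in, C_out = 8, 64, 64          # batch, input channels, output channels
D, H, W = 32, 11, 11                # spatial dims after the [:, :, :, :11, :11] slice

def fft_tiles(n, tile=16):
    """Number of 16-point FFT tiles needed to cover one dimension."""
    return -(-n // tile)            # ceiling division

# An R2C FFT over a 16-point dim keeps 16//2 + 1 = 9 complex outputs.
tile_elems = 16 * 16 * (16 // 2 + 1)
complex_bytes = 2 * 4               # complex64: two float32 values

# Frequency-domain copies of input and filters, a large part of what an
# FFT convolution holds as temporary workspace.
input_ws = N * C_in * fft_tiles(D) * fft_tiles(H) * fft_tiles(W) * tile_elems * complex_bytes
filter_ws = C_out * C_in * tile_elems * complex_bytes

print(f"~{(input_ws + filter_ws) / 2**20:.1f} MiB for these toy shapes")
```

The implicit-GEMM path needs no such frequency-domain buffers, which is consistent with the cliff between the two algorithm choices.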
We will investigate and loop in the cuDNN team.
@mxnet-label-bot add [Bug, Cuda, memory]
@mxnet-label-bot remove [Cuda]
I was able to repro this OOM on a 12G Pascal. At the point of failure, it was asking for a 10G temporary workspace! Since you've set MXNET_CUDNN_AUTOTUNE_DEFAULT=0, doesn't that say you're willing to accept whatever algo cudnnGet() returns, regardless of its workspace needs?
"Since you've set MXNET_CUDNN_AUTOTUNE_DEFAULT=0, doesn't that say you're willing to accept what cudnnGet() returns for the algo, regardless of workspace needs?" Yes. But to me it is still strange that the amount of memory required increases by such a huge factor when going from 10 to 11, and I wanted to check with the cuDNN team whether this is expected, known behavior of cuDNN or whether some bug is causing this memory bloat.
As pointed out earlier, going from 10 to 11 is the threshold at which cuDNN thinks the FFT implementation is fastest. That algo apparently has a huge workspace requirement, probably related to the convolution being 3D. There is no cuDNN bug here. You have a couple of remedies:
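One remedy the thread itself points at (a sketch; whether autotuning overhead is acceptable depends on the workload) is to leave MXNET_CUDNN_AUTOTUNE_DEFAULT at its default of 1 instead of 0, so MXNet runs cudnnFind() and only benchmarks algorithms whose workspace fits the limit, rather than accepting cudnnGet()'s FFT choice unconditionally:

```python
import os

# Re-enable cuDNN autotuning. This must be set before mxnet is imported.
# 1 = benchmark algorithms that fit the workspace limit, pick the fastest.
os.environ["MXNET_CUDNN_AUTOTUNE_DEFAULT"] = "1"

print(os.environ["MXNET_CUDNN_AUTOTUNE_DEFAULT"])
```

Equivalently, `MXNET_CUDNN_AUTOTUNE_DEFAULT=1 python train.py` on the command line.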
There is currently no way to limit algos by workspace size without also running cudnnFind(). We could add this functionality in a backward-compatible way by adding a new supported value to MXNET_CUDNN_AUTOTUNE_DEFAULT. There would be a locally set equivalent to this in the Convolution parameters. While we're at it, I'm not fond of the compiled-in default workspace size of 1 GB. I'd suggest adding an environment variable: MXNET_CUDNN_WORKSPACE_LIMIT_DEFAULT # If not set, then limit = 1024 (MB)
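The proposed default-with-override could behave as sketched below (a Python illustration rather than the C++ dmlc::GetEnv call MXNet would actually use; MXNET_CUDNN_WORKSPACE_LIMIT_DEFAULT is the variable proposed above and does not exist in MXNet):

```python
import os

def cudnn_workspace_limit_mb():
    # Proposed behavior: if MXNET_CUDNN_WORKSPACE_LIMIT_DEFAULT is not set,
    # fall back to the current compiled-in default of 1024 MB.
    return int(os.environ.get("MXNET_CUDNN_WORKSPACE_LIMIT_DEFAULT", "1024"))

print(cudnn_workspace_limit_mb())
```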
Great, thanks for the detailed explanation. Closing this.
Description
Memory bloat (OOM) in 3D Conv when the matrix split size is greater than 10. If I run this code on an EC2 p2.xl with 12 GB of GPU memory, the program runs well for
a = net(x[:, :, :, :10, :10])
print(a.shape)
a = net(x[:, :, :, :9, :9])
print(a.shape)
but starts getting a CUDA OOM error for:
a = net(x[:, :, :, :11, :11])
print(a.shape)
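The thread never shows the definitions of net and x, but the slicing pattern itself can be illustrated with NumPy on an invented 5D NCDHW shape. The point is that the slice only trims the last two spatial dims by one element, so the 10-vs-11 cliff comes from cuDNN switching algorithms, not from the input suddenly growing:

```python
import numpy as np

# Hypothetical NCDHW input; the real x in the report is not shown.
x = np.zeros((1, 3, 16, 32, 32), dtype=np.float32)

ok = x[:, :, :, :10, :10]    # the slice that ran fine in the report
oom = x[:, :, :, :11, :11]   # the slice that OOM'd on the 12 GB GPU

print(ok.shape, oom.shape)   # (1, 3, 16, 10, 10) (1, 3, 16, 11, 11)
```

The input grows by roughly 20% here, while the reported workspace jumps to 10 GB.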
Environment info (Required)
mxnet-cu92==1.3.1, gluoncv==0.3.0
Cuda 9.2, cudnn 7.1
instance used: p2.xl on EC2, 12 GB of GPU memory