
MxNet R, CNN, VRAM consumption explodes dramatically in dependence of number of filters #10721

Closed
thomasmooon opened this issue Apr 27, 2018 · 6 comments

Comments

@thomasmooon

thomasmooon commented Apr 27, 2018

Description

I have a toy dataset of 360 samples with 4096 data points each, leading to a tensor of shape (4096,1,360). Hence, each observation has a size of ~ 4 kB. The CNN is very simple: Conv -> flatten -> fully connected -> fully connected -> softmax:
[Image: network architecture diagram (Conv -> flatten -> fully connected -> fully connected -> softmax)]

The VRAM consumption explodes depending on the number of filters: please see the table and the related picture below. The kernel size and the batch size have only a very small influence; I have tested several combinations, but omit these details for now. The table was measured in a setting using 2 GPUs of my environment (described in the environment section below). As one can see, the VRAM demand of each card increases, as expected, linearly with the number of convolution filters. But once the number of filters exceeds 10, the GPUs run out of their 8 GB VRAM. What the hell...?

It is also remarkable that a setting with 1 GPU and 8 kernels is not possible: it exhausts the 8 GB VRAM of the single card. But using 2 GPUs with everything else unchanged, each GPU consumes only 0.477 GB, so 2 x 0.477 = 0.95 GB in total. This is far below what is consumed when using only 1 card. How can this be??

Something else I tested, without any effect: the workspace argument of the mx.symbol.Convolution() function. I tried several values: 1, 64, 128, 512 MB. This had absolutely no effect, regardless of the combination with any number of filters. Here is the definition of workspace:

long (non-negative), optional, default=1024 Maximum temporary workspace allowed for convolution (MB)
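
For illustration, the test amounted to nothing more than changing this one argument of the convolution call from the full example below (a minimal sketch; 512 MB is shown as one of the values tried):

require(mxnet)
# Minimal illustration of the workspace setting that was varied (1, 64, 128, 512 MB);
# the rest of the network is unchanged from the full example further below.
data   <- mx.symbol.Variable("data")
conv_1 <- mx.symbol.Convolution(data = data, kernel = c(64 * 12, 1),
                                num_filter = 10, workspace = 512)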

VRAM consumption in dependence of the number of filters, using 2 GPUs:

n_filter    VRAM consumption per card (MB)
1           313
2           339
4           385
8           477
10          523
11          out of memory

[Image: plot of VRAM consumption per card vs. number of filters]
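
For reference, here is my rough back-of-envelope sketch of how the parameter count of this network scales with the number of filters, assuming a "valid" convolution with stride 1 and float32 weights (and ignoring the small BatchNorm parameters). The weight matrix of the first fully connected layer dominates and grows linearly with num_filter, which matches the linear part of the table, but does not explain the sudden out-of-memory step at 11 filters:

# Rough parameter-count estimate in MB (float32 weights only; gradients,
# optimizer state and activations come on top). Assumes a "valid" convolution
# with stride 1 on a 4096 x 1 input and kernel c(64*12, 1).
param_mb <- function(num_filter, input_len = 4096, kernel_len = 64 * 12,
                     n_hidden = 500, n_class = 40) {
  conv_out <- input_len - kernel_len + 1                   # 3329 outputs per filter
  n_conv   <- num_filter * kernel_len + num_filter         # conv weights + biases
  n_fc1    <- conv_out * num_filter * n_hidden + n_hidden  # dominates, linear in num_filter
  n_fc2    <- n_hidden * n_class + n_class
  (n_conv + n_fc1 + n_fc2) * 4 / 1024^2
}
round(sapply(c(1, 2, 4, 8, 10, 11), param_mb), 1)
# approx. 6.4  12.8  25.5  50.9  63.6  70.0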

In addition, I measured the RAM consumption when the device is the CPU, i.e. no GPUs are used. I tried 10, 11 and 20 filters. As you can see, the RAM consumption increases linearly, in particular when going from 10 to 11 filters, rather than exploding as it does on the GPUs. This is confusing. Moreover, the RAM consumption with 10 filters is 9 GB, which is in line with the observation that the 8 GB VRAM of a single GPU is insufficient, but again in contradiction to the 0.95 GB total if 2 GPUs are used.

[Image: plot of RAM consumption (CPU device) vs. number of filters]

For R users, please provide R sessionInfo():

R version 3.4.3 (2017-11-30)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] bindrcpp_0.2 mxnet_0.10.1

loaded via a namespace (and not attached):
[1] Rcpp_0.12.12 compiler_3.4.3 RColorBrewer_1.1-2 influenceR_0.1.0
[5] plyr_1.8.4 bindr_0.1 viridis_0.4.0 tools_3.4.3
[9] digest_0.6.12 jsonlite_1.5 tibble_1.3.3 gtable_0.2.0
[13] viridisLite_0.2.0 rgexf_0.15.3 pkgconfig_2.0.1 rlang_0.1.1
[17] igraph_1.1.2 rstudioapi_0.6 yaml_2.1.14 gridExtra_2.2.1
[21] DiagrammeR_0.9.0 dplyr_0.7.2 stringr_1.2.0 htmlwidgets_0.9
[25] grid_3.4.3 glue_1.1.1 R6_2.2.2 Rook_1.1-1
[29] XML_3.98-1.9 ggplot2_2.2.1 magrittr_1.5 codetools_0.2-15
[33] scales_0.4.1 htmltools_0.3.6 assertthat_0.2.0 colorspace_1.3-2
[37] brew_1.0-6 stringi_1.1.5 visNetwork_2.0.0 lazyeval_0.2.0
[41] munsell_0.4.3

Hardware

8 x 1080 TI
60 GB RAM
12 Cores

CUDA version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85

Minimum reproducible example


require(mxnet)

# create toy data
#-------------------------------------------------------------------------------

nSamp <- 360
nObs <- 64*64 # =4096

# create labels
nLabel <- 40
set.seed(1) # seed for label sampling
label <- sample(seq_len(nLabel), nSamp, replace = TRUE)

# create training data set
train <- sapply(label, function(x) rpois(nObs, x)) # dim = 4096 x 360 = nObs x nSamp
dim(train) <- c(4096,1,1,360)

trainIter <-
  mx.io.arrayiter(
    data = train,
    label = label,
    batch.size = 128,
    shuffle = TRUE
  )

# measure influence to VRAM / RAM demand
#-------------------------------------------------------------------------------

The results in this example differ slightly from the numbers above, simply because above I used my real data, whereas here the data is sampled from random numbers for reproducibility. The effect of exploding VRAM in this example likewise starts when exceeding 10 filters.

kernel <- c(64*12,1) 

# example with 1 GPU ####
  # nGPU <- 1
  # array.batch.size <- 1
  # workspace <- 1024
  # num_filter <- 1 # 3.8 GB VRAM 
  # num_filter <- 2  # 6.4 GB VRAM
  # num_filter <- 3 # out of memory

# example with 2 GPU ###
  nGPU <- 2
  array.batch.size <- 1
  workspace <- 1024
  # num_filter <- 1 # 0.313 GB VRAM / Card
  # num_filter <- 2  # 0.339 GB VRAM / Card
  # num_filter <- 4  # 0.385
  # num_filter <- 8  # 0.477
  num_filter <- 10  # 0.523
  # num_filter <- 11  # out of memory
  # num_filter <- 16  # out of memory


# device setup
#-------------------------------------------------------------------------------
devices <- lapply(seq(nGPU)-1, mx.gpu)
# devices <- mx.cpu()


# Set up the symbolic model
#-------------------------------------------------------------------------------
data <- mx.symbol.Variable('data')
# convolution
conv_1 <- mx.symbol.Convolution(data = data, kernel = kernel, num_filter = num_filter, workspace = workspace) 
tanh_1 <- mx.symbol.Activation(data = conv_1, act_type = "tanh")
# 1st fully connected layer
bn_2 <- mx.symbol.BatchNorm(tanh_1)
flatten <- mx.symbol.Flatten(data = bn_2)
fc_1 <- mx.symbol.FullyConnected(data = flatten, num_hidden = 500)
tanh_3 <- mx.symbol.Activation(data = fc_1, act_type = "tanh")
# 2nd fully connected layer
bn_3 <- mx.symbol.BatchNorm(tanh_3)
fc_2 <- mx.symbol.FullyConnected(data = bn_3, num_hidden = 40)
# Output. Softmax output since we'd like to get some probabilities.
NN_model <- mx.symbol.SoftmaxOutput(data = fc_2)


# graph.viz(NN_model)

# Pre-training set up
#-------------------------------------------------------------------------------

# Set seed for reproducibility
mx.set.seed(100)


# Training
#-------------------------------------------------------------------------------

# Train the model
model <- mx.model.FeedForward.create(
  NN_model,
  kvstore = "local",
  X = trainIter,
  ctx = devices,
  num.round = 150,
  learning.rate = 0.01,
  momentum = 0.9,
  eval.metric = mx.metric.accuracy,
  epoch.end.callback = mx.callback.log.train.metric(array.batch.size)
)

Steps to reproduce

Comment / uncomment the lines in the section

# measure influence to VRAM / RAM demand
#-------------------------------------------------------------------------------

and use nvidia-smi -l 3 to monitor the memory consumption. I recommend running the script from the shell rather than inside an interactive R session, for convenience (R will crash once the VRAM is exhausted).

To measure the RAM consumption when using the CPU, comment/uncomment the device lines in this section (sketched below) and monitor e.g. with htop

# device setup
#-------------------------------------------------------------------------------
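
Concretely, the change is just swapping the device list for a CPU context, i.e. the lines that are already present (commented) in the example above:

# devices <- lapply(seq(nGPU) - 1, mx.gpu)  # GPU setup: comment this out ...
devices <- mx.cpu()                          # ... and use the CPU context instead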

What have you tried to solve it?

Varied these parameters:

  • batch size (1,2,64,128)
  • different kernels: c(64*12,1), c(64,64,1), c(64/2,64 * 2,1), ...
  • workspace : 1, 64, 128, 512, 1024 MB
  • single-GPU / multi-GPU / CPU device tests
  • asked colleagues
  • asked google
  • 4D tensor with a 2D kernel, e.g. (4096,1,1,360) x (64*12,1)
  • 3D tensor with a 1D kernel, e.g. (4096,1,360) x (64*12) (both layouts are sketched below)
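
A sketch of the two layouts from the last two items (only the data dimensions and the kernel argument differ; everything else in the example stays the same):

# 4D tensor with a 2D kernel (the layout used in the example above):
dim(train) <- c(4096, 1, 1, 360)
kernel     <- c(64 * 12, 1)

# 3D tensor with a 1D kernel (the alternative layout that was also tried):
dim(train) <- c(4096, 1, 360)
kernel     <- c(64 * 12)
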
@thomasmooon thomasmooon changed the title MxNet R, CNN, VRAM consumption explodes dramatically with filters MxNet R, CNN, VRAM consumption explodes dramatically in dependence of number of filters Apr 27, 2018
@jeremiedb
Contributor

@thomasmooon maybe you can test whether #11374 effectively solves this RAM consumption issue?

@jeremiedb
Contributor

@thomasmooon I just ran your example with num_filter = 32 and no workspace parameter, and the model ran properly on a single 1060, staying stable at around 2.7 GB RAM on the GPU.
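
For clarity, that variant corresponds to the following change of the convolution line in the example (a sketch; only this line differs, with the workspace argument left at its documented default of 1024 MB):

# 32 filters, workspace left at its default (1024 MB):
conv_1 <- mx.symbol.Convolution(data = data, kernel = kernel, num_filter = 32)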

@nswamy Can you close this issue?

@nswamy
Member

nswamy commented Jun 29, 2018

thanks @jeremiedb

@nswamy nswamy closed this as completed Jun 29, 2018
@thomasmooon
Author

@jeremiedb I was on vacation leave and have only just read your posts. Thanks for your suggestion. But in the meantime, a few weeks after I opened the issue, I switched to another DL framework for several reasons.

@jeremiedb
Contributor

@thomasmooon Sure, I understand, as the support for the R package hasn't been great. May I ask whether there were other specific features you found lacking? Thanks!

@thomasmooon
Author

@jeremiedb Well, in general my experience is that better documentation is desirable, especially with minimal, reproducible, runnable R examples for each layer / method. Hence, if I were to restart with MXNet, I'd first learn Python and then use MXNet's Python API. This doesn't answer your "specific feature" question; there were / are a lot of small things in my use cases that demanded a lot of hacking around with MXNet, whereas in my framework of current choice this is not the case.
Special hallmarks of MXNet, like its relatively high speed, are of course valuable in general, but not that critical in my case.
