This repository has been archived by the owner on May 24, 2018. It is now read-only.

Introduction

This page describes the layer-related configuration options of cxxnet.

Layer Specification

All layer configurations are placed between netconfig = start and netconfig = end:

netconfig = start
layer[from->to] = layer_type:name
netconfig = end
  • from is the name of the source node; 0 denotes the input data.
  • to is the name of the destination node.
  • layer_type is one of the layer types described below.
  • name is optional. However, if you need to fine-tune the network for another task, a name is required, since it identifies which layers to copy.
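
To make the declaration syntax concrete, here is a minimal Python sketch that splits one layer line into its parts. The helper `parse_layer_line` and its regular expression are hypothetical illustrations, not part of cxxnet, and they only cover the forms shown on this page.

```python
import re

# Hypothetical helper (not part of cxxnet): parses one declaration of the
# form  layer[from->to] = layer_type:name  where ":name" is optional.
LAYER_RE = re.compile(r"^layer\[([^\]]+?)->([^\]]+)\]\s*=\s*(\w+)(?::(\w+))?$")


def parse_layer_line(line):
    match = LAYER_RE.match(line.strip())
    if match is None:
        raise ValueError("not a layer declaration: %r" % line)
    sources, targets, layer_type, name = match.groups()
    return {
        # Several nodes may be listed on either side, separated by commas.
        "from": [s.strip() for s in sources.split(",")],
        "to": [t.strip() for t in targets.split(",")],
        "type": layer_type,
        "name": name,  # None when the optional name is omitted
    }


if __name__ == "__main__":
    print(parse_layer_line("layer[0->1] = fullc:fc1"))
    print(parse_layer_line("layer[15->16,17] = split"))
```

A line without the `->` separator is rejected with a `ValueError`, which is a convenient way to catch typos in the node specification.
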
Weight Initialization

Fully connected layers and convolution layers require random weight initialization. We provide two initialization methods: gaussian and xavier. The Gaussian method is configured with:

random_type = gaussian
init_sigma = 0.01

We also provide the Xavier initialization method [1], enabled with the configuration:

random_type = xavier
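
The two methods can be sketched in plain Python as follows. This is only an illustration: the Xavier bound follows the normalized initialization formula from [1], and the exact scaling used inside cxxnet is an assumption here. The function names are hypothetical.

```python
import math
import random


def gaussian_init(fan_in, fan_out, init_sigma=0.01, rng=random):
    # Zero-mean Gaussian weights with standard deviation init_sigma.
    return [[rng.gauss(0.0, init_sigma) for _ in range(fan_in)]
            for _ in range(fan_out)]


def xavier_bound(fan_in, fan_out):
    # Normalized initialization bound from [1]: sqrt(6 / (fan_in + fan_out)).
    return math.sqrt(6.0 / (fan_in + fan_out))


def xavier_init(fan_in, fan_out, rng=random):
    # Uniform weights in [-bound, bound]; no sigma needs to be chosen by hand.
    bound = xavier_bound(fan_in, fan_out)
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]


if __name__ == "__main__":
    rng = random.Random(0)
    weights = xavier_init(100, 50, rng=rng)
    print(len(weights), len(weights[0]), xavier_bound(100, 50))
```

The practical difference is that the Gaussian method needs a hand-picked init_sigma, while the Xavier method derives its scale from the layer's input and output sizes.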

Global settings can be overridden in the layer configuration, e.g.:

# global setting
random_type = gaussian
netconfig = start
wmat:lr  = 0.01
wmat:wd  = 0.0005
bias:wd  = 0.000
bias:lr  = 0.02
layer[0->1] = fullc:fc1
  # local setting start
  nhidden = 50
  random_type = xavier
  # local setting end 
layer[1->2] = relu
layer[2->3] = fullc
  # local setting start
  nhidden = 6
  init_sigma = 0.005
  wmat:lr = 0.1
  # local setting end
netconfig = end

With this configuration, the fc1 layer is initialized with the Xavier method, while the unnamed fully connected layer is initialized with Gaussian random numbers with mu=0, sigma=0.005. The unnamed fully connected layer also uses a weight learning rate (wmat:lr = 0.1) that differs from the global setting (wmat:lr = 0.01).

Layer Types

  • Connection Layer
  • Activation Layer
  • Loss Layer
  • Computation Layers
  • Pooling Layers
  • Other Layers

Connection Layer

Flatten Layer
  • Flatten Layer flattens the output of a convolution layer so that it can be used in a feed-forward neural network. Namely, the shape of the output node is transformed from (batch, channel, width, height) to (batch, 1, 1, num_feature). Here is an example:
layer[15->16] = flatten
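
The shape change can be written down as a small Python sketch. The function name is hypothetical, and it assumes that num_feature is the product of the channel, width and height sizes.

```python
def flatten_shape(shape):
    # Input shape follows the page's convention: (batch, channel, width, height).
    batch, channel, width, height = shape
    # Assumption: all non-batch dimensions are merged into num_feature.
    return (batch, 1, 1, channel * width * height)


if __name__ == "__main__":
    print(flatten_shape((100, 32, 6, 6)))
```
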
Split Layer
  • Split Layer is used for one-to-many connections. It duplicates the input node in the forward pass and accumulates the gradients from the output nodes in the backward pass.
layer[15->16,17] = split
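
The forward and backward behaviour described above can be sketched in Python on flat lists of numbers. The function names are hypothetical and the sketch ignores the tensor layout used by cxxnet.

```python
def split_forward(x, num_outputs):
    # Forward pass: every output node receives its own copy of the input.
    return [list(x) for _ in range(num_outputs)]


def split_backward(output_grads):
    # Backward pass: the input gradient is the elementwise sum of the
    # gradients arriving from all output nodes.
    return [sum(parts) for parts in zip(*output_grads)]


if __name__ == "__main__":
    copies = split_forward([1.0, 2.0], 2)
    print(copies)
    print(split_backward([[1.0, 2.0], [3.0, 4.0]]))
```
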
Concat Layer
  • Concat Layer concatenates the outputs of two nodes along the last dimension (namely, num_feature). It is usually used together with fully connected layers.
layer[18,19->20] = concat
Channel Concat Layer
  • Channel Concat Layer concatenates the outputs of two nodes along the second dimension (namely, channel). It is usually used together with convolution layers.
layer[18,19->20] = ch_concat
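
The difference between the two concatenation layers is easiest to see in the resulting shapes. The Python sketch below uses hypothetical helper names and assumes that all dimensions other than the concatenated one must match.

```python
def concat_shape(a, b):
    # concat: join along the last dimension (num_feature).
    if a[:3] != b[:3]:
        raise ValueError("all dimensions except the last must match")
    return a[:3] + (a[3] + b[3],)


def ch_concat_shape(a, b):
    # ch_concat: join along the second dimension (channel).
    if a[0] != b[0] or a[2:] != b[2:]:
        raise ValueError("all dimensions except channel must match")
    return (a[0], a[1] + b[1]) + a[2:]


if __name__ == "__main__":
    print(concat_shape((100, 1, 1, 512), (100, 1, 1, 256)))
    print(ch_concat_shape((100, 64, 28, 28), (100, 32, 28, 28)))
```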

Activation Layer

We provide common activation layers, including Rectified Linear (RELU), Sigmoid, Tanh, and Parametric RELU (pRELU).

Rectified Linear
  • The output of Rectified Linear is max(0, x). It is the most commonly used activation function in modern deep learning.
layer[15->16] = relu

Tanh
  • Tanh uses tanh as the activation function. It maps the input into the range (-1, 1).
layer[15->16] = tanh

Sigmoid
  • Sigmoid uses the sigmoid as the activation function. It maps the input into the range (0, 1).
layer[15->16] = sigmoid
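
For reference, the three fixed activation functions above can be written as scalar Python functions. This is a plain mathematical sketch, not cxxnet code.

```python
import math


def relu(x):
    # Rectified Linear: max(0, x).
    return max(0.0, x)


def tanh(x):
    # Hyperbolic tangent, bounded between -1 and 1.
    return math.tanh(x)


def sigmoid(x):
    # Logistic sigmoid, bounded between 0 and 1. The two branches avoid
    # overflow in exp() for inputs of large magnitude.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


if __name__ == "__main__":
    for value in (-2.0, 0.0, 2.0):
        print(value, relu(value), tanh(value), sigmoid(value))
```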

Parametric Rectified Linear
  • pRELU is an implementation of [2]. In addition, we provide a parameter that adds noise to the negative slope to reduce overfitting.
layer[15->16] = prelu
  random=0.5
  • random[optional] is the standard deviation of the Gaussian noise added to the negative part of pRELU. At test time, this noise is discarded.
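
One way to read the noise parameter is sketched below in Python. It assumes that the noise is drawn from a zero-mean Gaussian and added to the learned negative slope during training only; the exact way cxxnet applies the noise is not stated on this page, and the function name and arguments are hypothetical.

```python
import random


def prelu(x, slope, noise_std=0.0, training=False, rng=random):
    # Positive inputs pass through unchanged, as in plain RELU.
    if x >= 0:
        return x
    # Assumption: during training the negative slope is perturbed by
    # Gaussian noise with standard deviation noise_std (the "random" option).
    if training and noise_std > 0:
        slope = slope + rng.gauss(0.0, noise_std)
    # At test time the learned slope is used without noise.
    return slope * x


if __name__ == "__main__":
    print(prelu(-2.0, 0.25))
    print(prelu(-2.0, 0.25, noise_std=0.5, training=True, rng=random.Random(0)))
```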

Loss Layer

Loss layers are self-loop layers. A loss layer defines the loss function used for training.

  • Common Parameters:
  • grad_scale[optional]: scale the gradient generated by loss layer

Softmax
  • Softmax Loss Layer implements the multi-class softmax loss function.

Euclidean
  • Euclidean Loss Layer implements the elementwise L2 loss function.

Elementwise Logistic
  • Elementwise Logistic Loss Layer implements the elementwise logistic loss function. It is suitable for multi-label classification problems.
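
The three loss functions can be sketched in Python for a single example. These are textbook definitions, and the normalization (no averaging and no factor of 1/2 in the L2 loss) is an assumption rather than a statement about cxxnet's implementation.

```python
import math


def softmax(scores):
    # Subtracting the maximum keeps exp() from overflowing.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(scores, label):
    # Multi-class loss: negative log-probability of the true class.
    return -math.log(softmax(scores)[label])


def euclidean_loss(pred, target):
    # Elementwise squared differences, summed over all elements.
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def logistic_loss(scores, targets):
    # Independent binary cross-entropy per element, written in a
    # numerically stable form; targets are 0 or 1 for each label.
    total = 0.0
    for x, t in zip(scores, targets):
        total += max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
    return total


if __name__ == "__main__":
    print(softmax_loss([2.0, 1.0, 0.1], 0))
    print(euclidean_loss([1.0, 2.0], [1.0, 4.0]))
    print(logistic_loss([0.5, -1.0, 2.0], [1.0, 0.0, 1.0]))
```

Because the logistic loss treats every output independently, several labels can be active for one example, which is why it fits multi-label problems.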

Computation Layers

Fully Connected Layer
  • The fully connected layer is the basic building block of a feed-forward neural network.
layer[18->19] = fullc
  nhidden = 1024
  • nhidden denotes the number of hidden units in the layer.
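
As an illustration of what nhidden controls, the Python sketch below computes the forward pass of a fully connected layer for one input vector. It assumes the usual affine form (weights times input plus bias); the function name is hypothetical.

```python
def fullc_forward(x, weight, bias):
    # weight has one row per hidden unit, so len(weight) plays the role
    # of nhidden; each row holds one weight per input feature.
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


if __name__ == "__main__":
    x = [1.0, 2.0]                       # two input features
    weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # nhidden = 3
    bias = [0.0, 0.5, -1.0]
    print(fullc_forward(x, weight, bias))
```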

Convolution Layer

If cxxnet is built with CuDNN, convolution uses CuDNN R2 by default. Without CuDNN R2, convolution runs on our own kernel. The configuration looks like:

layer[0->1] = conv
  kernel_size = 11
  stride = 4
  nchannel = 96
  pad = 1
  • kernel_size is the size of the convolution kernel
  • stride is the stride of the convolution operation
  • nchannel is the number of output channels
  • pad is the padding size
  • temp_col_max[optional] is the maximum size of the temporary expansion buffer (temp_col) used in the convolution operation. The default value is 64, which means temp_col is at most 64MB. Adjusting this value may speed up training, especially when the input to the convolution network is small. Note that this only takes effect when CuDNN is not used.
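
To see how kernel_size, stride and pad interact, the Python sketch below applies the commonly used output size formula. It assumes that pad is added on both sides of each spatial dimension and that the division rounds down; neither is confirmed for cxxnet on this page, and the input size of 227 is a hypothetical value chosen for the example.

```python
def conv_output_size(in_size, kernel_size, stride, pad):
    # Common convention: pad both sides, then slide the kernel with the
    # given stride and round down.
    return (in_size + 2 * pad - kernel_size) // stride + 1


if __name__ == "__main__":
    # Settings from the configuration above, applied to a hypothetical
    # 227-pixel-wide input.
    print(conv_output_size(227, kernel_size=11, stride=4, pad=1))
```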

Pooling Layer

Currently we provide three pooling methods: Sum Pooling, Max Pooling, and Average Pooling. All pooling layers share the same parameters: stride and kernel_size.

Sum Pooling
  • Sum Pooling sums the values in the pooling region, e.g.
layer[4->5] = sum_pooling
  kernel_size = 3
  stride = 2
Max Pooling
  • Max Pooling takes the maximum value in the pooling region, e.g.
layer[4->5] = max_pooling
  kernel_size = 3
  stride = 2
Average Pooling
  • Average Pooling averages the values in the pooling region, e.g.
layer[4->5] = avg_pooling
  kernel_size = 3
  stride = 2
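
The three pooling methods differ only in how a window is reduced, which the one-dimensional Python sketch below makes explicit. It is a simplification: real pooling works on two spatial dimensions, and this sketch only uses windows that fit entirely inside the input, which may differ from how cxxnet handles borders. The function name is hypothetical.

```python
def pool1d(values, kernel_size, stride, mode):
    # One reducer per pooling method.
    reducers = {
        "sum": sum,
        "max": max,
        "avg": lambda window: sum(window) / len(window),
    }
    reduce_window = reducers[mode]
    out = []
    # Assumption: only windows that fit entirely inside the input are used.
    for start in range(0, len(values) - kernel_size + 1, stride):
        out.append(reduce_window(values[start:start + kernel_size]))
    return out


if __name__ == "__main__":
    data = [1, 3, 2, 5, 4, 6, 0]
    for mode in ("sum", "max", "avg"):
        print(mode, pool1d(data, kernel_size=3, stride=2, mode=mode))
```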

Other Layers

Dropout
  • Note that the Dropout Layer is a self-loop layer: to must be set equal to from, e.g.
layer[3->3] = dropout:dp
  threshold = 0.5
  • threshold is the probability to drop an output.
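
The role of threshold can be sketched in Python as follows. The rescaling of kept values by 1 / (1 - threshold) is a common convention and an assumption here, since this page does not say how cxxnet scales the surviving outputs. The function name is hypothetical.

```python
import random


def dropout(values, threshold, training=True, rng=random):
    # At test time the layer passes its input through unchanged.
    if not training:
        return list(values)
    keep = 1.0 - threshold
    # Each output is dropped with probability `threshold`; kept outputs
    # are rescaled (assumption) so the expected value stays the same.
    return [v / keep if rng.random() >= threshold else 0.0 for v in values]


if __name__ == "__main__":
    rng = random.Random(0)
    print(dropout([1.0, 1.0, 1.0, 1.0], threshold=0.5, rng=rng))
```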

Local Response Normalization

LRN normalizes the response of nearby kernels. Details can be found in Krizhevsky et al. [3].

layer[3->4] = lrn
  local_size = 5
  alpha = 0.001
  beta = 0.75
  knorm = 1
  • local_size is the number of nearby kernels included in the normalization
  • alpha, beta, and knorm are the normalization parameters.
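
The Python sketch below applies the normalization formula from [3] to the responses of all channels at one spatial position. Whether cxxnet additionally divides alpha by local_size, as some implementations do, is not stated on this page, so this sketch uses alpha as is. The function name is hypothetical.

```python
def lrn(channels, local_size=5, alpha=0.001, beta=0.75, knorm=1.0):
    # channels: responses of all kernels at a single spatial position.
    count = len(channels)
    half = local_size // 2
    out = []
    for i, value in enumerate(channels):
        # Neighbouring kernels within the window, clipped at the borders.
        lo = max(0, i - half)
        hi = min(count, i + half + 1)
        square_sum = sum(v * v for v in channels[lo:hi])
        out.append(value / (knorm + alpha * square_sum) ** beta)
    return out


if __name__ == "__main__":
    print(lrn([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]))
```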

Batch Normalization Layer

The BN layer is an implementation of [4]. The difference is that at test time we use only the mini-batch statistics, instead of the global statistics of the training data as in the original paper. It is an experimental layer that may not be stable. To use the layer, set:

layer[3->4] = batch_norm

There is no parameter for this layer.
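
The normalization with mini-batch statistics can be sketched in Python for a single feature across a batch. The scale `gamma`, shift `beta` and the constant `eps` follow the formulation in [4]; their names and default values here are assumptions for illustration and are not cxxnet configuration options.

```python
import math


def batch_norm(batch_values, gamma=1.0, beta=0.0, eps=1e-5):
    # Statistics are computed from the current mini-batch only, which is
    # what this layer does both in training and in testing.
    count = len(batch_values)
    mean = sum(batch_values) / count
    var = sum((v - mean) ** 2 for v in batch_values) / count
    inv_std = 1.0 / math.sqrt(var + eps)
    # Normalize, then apply the scale and shift from [4].
    return [gamma * (v - mean) * inv_std + beta for v in batch_values]


if __name__ == "__main__":
    print(batch_norm([1.0, 2.0, 3.0]))
```

Because the statistics come from the mini-batch at test time as well, the output for one example depends on the other examples in the same batch.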

References

[1] Xavier Glorot and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." AISTATS. 2010.

[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." arXiv preprint arXiv:1502.01852. 2015.

[3] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks." NIPS. 2012.

[4] Sergey Ioffe and Christian Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." arXiv preprint arXiv:1502.03167. 2015.