Adding the Xavier Initializer #5270

Merged 4 commits on Nov 2, 2017
131 changes: 130 additions & 1 deletion python/paddle/v2/framework/initializer.py
@@ -1,6 +1,10 @@
import paddle.v2.framework.framework as framework
import numpy as np

__all__ = ['ConstantInitializer', 'UniformInitializer']
__all__ = [
'ConstantInitializer', 'UniformInitializer', 'NormalInitializer',
'XavierInitializer'
]


class Initializer(object):
@@ -20,6 +24,41 @@ def __call__(self, param, block):
"""
raise NotImplementedError()

def _compute_fans(self, var):
"""Compute the fan_in and the fan_out for layers

This method computes the fan_in and the fan_out
for neural network layers, if not specified. It is
not possible to perfectly estimate fan_in and fan_out.
This method estimates them correctly for matrix multiply and
convolutions.

Args:
var: variable for which fan_in and fan_out have to be computed

Returns:
tuple of two integers (fan_in, fan_out)
"""
shape = var.shape
if not shape or len(shape) == 0:
fan_in = fan_out = 1
elif len(shape) == 1:
fan_in = fan_out = shape[0]
elif len(shape) == 2:
# This is the case for simple matrix multiply
fan_in = shape[0]
fan_out = shape[1]
else:
# Assume this to be a convolutional kernel
# In PaddlePaddle, the shape of the kernel is like:
# [num_filters, num_filter_channels, ...] where the remaining
# dimensions are the filter_size
receptive_field_size = np.prod(shape[2:])
fan_in = shape[1] * receptive_field_size
fan_out = shape[0] * receptive_field_size

return (fan_in, fan_out)
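
# Illustration only, not part of this diff: for a weight matrix of shape
# (5, 10), _compute_fans returns fan_in = 5 and fan_out = 10. For a conv
# kernel of shape (5, 10, 15, 20), laid out as
# [num_filters, num_filter_channels, filter_h, filter_w], the receptive
# field size is 15 * 20 = 300, so fan_in = 10 * 300 = 3000 and
# fan_out = 5 * 300 = 1500.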


class ConstantInitializer(Initializer):
"""Implements the constant initializer
@@ -156,3 +195,93 @@ def __call__(self, var, block):
})
var.op = op
return op


class XavierInitializer(Initializer):
"""Implements the Xavier initializer

This class implements the Xavier weight initializer from the paper
Understanding the difficulty of training deep feedforward neural
networks[1] by Xavier Glorot and Yoshua Bengio.

This initializer is designed to keep the scale of the gradients
approximately the same in all layers. For the uniform distribution,
the range is [-x, x], where x = sqrt(6 / (fan_in + fan_out)).
For the normal distribution, the mean is 0 and the standard deviation
is sqrt(2 / (fan_in + fan_out)).
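
(A worked example, not part of the original docstring: a 5 x 10 weight
matrix has fan_in + fan_out = 15, giving a uniform range of roughly
[-0.63, 0.63] and a normal standard deviation of roughly 0.37.)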
Contributor commented:
The recommended scale/std differs across nonlinearity functions. I think we can borrow the idea from PyTorch to make this initializer more general:
/~https://github.com/pytorch/pytorch/blob/master/torch/nn/init.py#L8
/~https://github.com/pytorch/pytorch/blob/master/torch/nn/init.py#L184

Contributor Author replied:
The original paper does not talk about ReLUs or gain, because it predates the time when ReLUs became popular. I also looked at the TensorFlow and Keras source code; they keep this initialization as defined in the paper. Maybe we can merge this for now and add the gain later in a separate PR, when we have more knowledge about it. I can read a few more papers to see how the gain attribute is used; I do not think it is right to borrow the idea directly without researching it thoroughly. We can merge this and look at it in more detail after the refactoring is complete. What do you suggest?
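
For readers curious about the PyTorch idea referenced above, here is a minimal sketch of gain-scaled Xavier initialization; the gain values follow PyTorch's calculate_gain convention, and these helper names are illustrative, not part of this PR:

    import numpy as np

    def calculate_gain(nonlinearity):
        # Recommended scaling factor per activation (PyTorch convention):
        # linear/sigmoid -> 1.0, tanh -> 5/3, relu -> sqrt(2).
        gains = {'linear': 1.0, 'sigmoid': 1.0, 'tanh': 5.0 / 3.0,
                 'relu': np.sqrt(2.0)}
        return gains[nonlinearity]

    def xavier_uniform_limit(fan_in, fan_out, gain=1.0):
        # Gain-scaled Xavier bound: x = gain * sqrt(6 / (fan_in + fan_out)).
        # With gain=1.0 this reduces to the formula used in this PR.
        return gain * np.sqrt(6.0 / float(fan_in + fan_out))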


References:
[1] Understanding the difficulty of training deep feedforward neural
networks. International conference on artificial intelligence and
statistics.
(http://proceedings.mlr.press/v9/glorot10a.html)
"""
lcy-seso (Contributor) commented on Nov 1, 2017:
I like to read your comments ~ quite easy to follow.

Just out of personal interest: I remember this is the recommended initializer for tanh; the usually recommended initializer varies with the activation.

I think we can just add all these helpful initializers into our framework as the first goal, and later choose a much better initialization strategy to encode the best practice.

Contributor Author replied:
Thank you for the feedback. I agree with you.


def __init__(self, uniform=True, fan_in=None, fan_out=None, seed=0):
Contributor commented:
The normal distribution is used more frequently than the uniform distribution. Shall we change the default value of uniform to False?

lcy-seso (Contributor) commented on Nov 1, 2017:
A uniform distribution is acceptable to me; my concern is that we are not yet at a stage where we are intensively encoding best practices from the community into our framework. We can carefully tune this later.

But a uniform distribution over [-1, 1] as the default initialization seems terrible (only from my personal experience; it seems too large). Maybe we can reduce the range limit to 1e-3?

Contributor Author replied:
I was looking at TensorFlow, where they keep uniform as the default. That is why I chose uniform as the default.

Contributor replied:
Got it ~ thank you for the information.

"""Constructor for XavierInitializer

Args:
Contributor commented:
Please add descriptions for the default values.

Contributor Author replied:
Thank you for the feedback. Could you give me an example of the description you are referring to? I have described fan_in and fan_out in the Args section of my docstring.

pengli09 (Contributor) commented on Nov 2, 2017:
Just ignore this comment; I didn't realize that the signature of the function would be displayed in the doc.

For example, "uniform: .... Default value is True.". Otherwise, one may need to dig into the code to find the default values.

uniform: whether to use uniform or normal distribution
fan_in: fan_in for Xavier initialization. If None, it is
inferred from the variable.
fan_out: fan_out for Xavier initialization. If None, it is
inferred from the variable.
seed: random seed

Note: It is recommended to set fan_in and fan_out to None for
most cases.
"""
assert uniform is not None
assert seed is not None
super(XavierInitializer, self).__init__()
self._uniform = uniform
self._fan_in = fan_in
self._fan_out = fan_out
self._seed = seed

def __call__(self, var, block):
"""Add xavier initialization ops for a variable

Args:
var: Variable that needs to be initialized
block: The block in which initialization ops
should be added

Returns:
the initialization op
"""
assert isinstance(var, framework.Variable)
assert isinstance(block, framework.Block)
f_in, f_out = self._compute_fans(var)

# If fan_in and fan_out are passed, use them
fan_in = f_in if self._fan_in is None else self._fan_in
fan_out = f_out if self._fan_out is None else self._fan_out

if self._uniform:
limit = np.sqrt(6.0 / float(fan_in + fan_out))
op = block.prepend_op(
type="uniform_random",
outputs={"Out": var},
attrs={
"shape": var.shape,
"data_type": int(var.data_type),
Contributor commented:
I'm not familiar with Block. Is int correct?

Contributor Author replied:
Yes, this is correct. We pass the data type as an integer, which is then mapped to a concrete type.

"min": -limit,
"max": limit,
"seed": self._seed
})

else:
std = np.sqrt(2.0 / float(fan_in + fan_out))
op = block.prepend_op(
type="gaussian_random",
outputs={"Out": var},
attrs={
"shape": var.shape,
"data_type": int(var.data_type),
"mean": 0.0,
"std": std,
"seed": self._seed
})
var.op = op
return op
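
A quick usage sketch for context, mirroring the test file below (the program/block setup is taken from the tests):

    program = framework.Program()
    block = program.global_block()
    param = block.create_parameter(
        dtype="float32",
        shape=[5, 10],
        lod_level=0,
        name="param",
        initializer=initializer.XavierInitializer())  # uniform by default, fans inferred from shape

    # Variants exercised by the tests:
    #   initializer.XavierInitializer(uniform=False)                    # adds a gaussian_random op
    #   initializer.XavierInitializer(fan_in=12, fan_out=23, seed=134)  # explicit fans and seed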
107 changes: 107 additions & 0 deletions python/paddle/v2/framework/tests/test_initializer.py
@@ -1,3 +1,4 @@
import numpy as np
import unittest

import paddle.v2.framework.framework as framework
@@ -116,5 +117,111 @@ def test_normal_initializer(self):
self.assertEqual(init_op.attr('seed'), 123)


class TestXavierInitializer(unittest.TestCase):
def test_uniform_xavier_initializer(self):
"""Test Xavier initializer with uniform distribution on
for matrix multiply.
"""
program = framework.Program()
block = program.global_block()
param = block.create_parameter(
dtype="float32",
shape=[5, 10],
lod_level=0,
name="param",
initializer=initializer.XavierInitializer())
self.assertEqual(len(block.ops), 1)
init_op = block.ops[0]
self.assertEqual(init_op.type, 'uniform_random')
limit = np.sqrt(6.0 / (param.shape[0] + param.shape[1]))
self.assertAlmostEqual(init_op.attr('min'), -limit, delta=DELTA)
self.assertAlmostEqual(init_op.attr('max'), limit, delta=DELTA)
self.assertEqual(init_op.attr('seed'), 0)
Contributor commented:
I think we should also test whether the seed can be set properly.

Contributor Author replied:
Fixed in 4e1fa1b


def test_uniform_xavier_initializer_conv(self):
"""Test Xavier initializer with uniform distribution on
for convolutions.
"""
program = framework.Program()
block = program.global_block()
param = block.create_parameter(
dtype="float32",
shape=[5, 10, 15, 20],
lod_level=0,
name="param",
initializer=initializer.XavierInitializer())
self.assertEqual(len(block.ops), 1)
init_op = block.ops[0]
self.assertEqual(init_op.type, 'uniform_random')
receptive_field_size = float(15 * 20)
limit = np.sqrt(6.0 / (
(param.shape[0] + param.shape[1]) * receptive_field_size))
self.assertAlmostEqual(init_op.attr('min'), -limit, delta=DELTA)
self.assertAlmostEqual(init_op.attr('max'), limit, delta=DELTA)
self.assertEqual(init_op.attr('seed'), 0)

def test_normal_xavier_initializer(self):
"""Test Xavier initializer with normal distribution on
for matrix multiply.
"""
program = framework.Program()
block = program.global_block()
param = block.create_parameter(
dtype="float32",
shape=[5, 10],
lod_level=0,
name="param",
initializer=initializer.XavierInitializer(uniform=False))
self.assertEqual(len(block.ops), 1)
init_op = block.ops[0]
self.assertEqual(init_op.type, 'gaussian_random')
std = np.sqrt(2.0 / (param.shape[0] + param.shape[1]))
self.assertAlmostEqual(init_op.attr('mean'), 0.0, delta=DELTA)
self.assertAlmostEqual(init_op.attr('std'), std, delta=DELTA)
self.assertEqual(init_op.attr('seed'), 0)

def test_normal_xavier_initializer_conv(self):
"""Test Xavier initializer with normal distribution on
for convolutions.
"""
program = framework.Program()
block = program.global_block()
param = block.create_parameter(
dtype="float32",
shape=[5, 10, 15, 20],
lod_level=0,
name="param",
initializer=initializer.XavierInitializer(uniform=False))
self.assertEqual(len(block.ops), 1)
init_op = block.ops[0]
self.assertEqual(init_op.type, 'gaussian_random')
receptive_field_size = float(15 * 20)
std = np.sqrt(2.0 / (
(param.shape[0] + param.shape[1]) * receptive_field_size))
self.assertAlmostEqual(init_op.attr('mean'), 0.0, delta=DELTA)
self.assertAlmostEqual(init_op.attr('std'), std, delta=DELTA)
self.assertEqual(init_op.attr('seed'), 0)

def test_xavier_initializer_supplied_arguments(self):
"""Test the Xavier initializer with supplied arguments
"""
program = framework.Program()
block = program.global_block()
block.create_parameter(
dtype="float32",
shape=[5, 10],
lod_level=0,
name="param",
initializer=initializer.XavierInitializer(
fan_in=12, fan_out=23, seed=134))
self.assertEqual(len(block.ops), 1)
init_op = block.ops[0]
self.assertEqual(init_op.type, 'uniform_random')
limit = np.sqrt(6.0 / (12 + 23))
self.assertAlmostEqual(init_op.attr('min'), -limit, delta=DELTA)
self.assertAlmostEqual(init_op.attr('max'), limit, delta=DELTA)
self.assertEqual(init_op.attr('seed'), 134)
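
# Worked check, not part of this diff: with fan_in=12 and fan_out=23,
# limit = sqrt(6 / 35), which is about 0.414, so 'min' and 'max' above
# should come out near -0.414 and 0.414.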


if __name__ == '__main__':
unittest.main()