Adding the Xavier Initializer #5270
Changes from all commits
@@ -1,6 +1,10 @@
import paddle.v2.framework.framework as framework
import numpy as np

-__all__ = ['ConstantInitializer', 'UniformInitializer']
+__all__ = [
+    'ConstantInitializer', 'UniformInitializer', 'NormalInitializer',
+    'XavierInitializer'
+]


class Initializer(object):
@@ -20,6 +24,41 @@ def __call__(self, param, block):
        """
        raise NotImplementedError()

    def _compute_fans(self, var):
        """Compute the fan_in and the fan_out for layers

        This method computes the fan_in and the fan_out
        for neural network layers, if not specified. It is
        not possible to perfectly estimate fan_in and fan_out.
        This method will estimate it correctly for matrix multiply and
        convolutions.

        Args:
            var: variable for which fan_in and fan_out have to be computed

        Returns:
            tuple of two integers (fan_in, fan_out)
        """
        shape = var.shape
        if not shape or len(shape) == 0:
            fan_in = fan_out = 1
        elif len(shape) == 1:
            fan_in = fan_out = shape[0]
        elif len(shape) == 2:
            # This is the case for simple matrix multiply
            fan_in = shape[0]
            fan_out = shape[1]
        else:
            # Assume this to be a convolutional kernel
            # In PaddlePaddle, the shape of the kernel is like:
            # [num_filters, num_filter_channels, ...] where the remaining
            # dimensions are the filter_size
            receptive_field_size = np.prod(shape[2:])
            fan_in = shape[1] * receptive_field_size
            fan_out = shape[0] * receptive_field_size

        return (fan_in, fan_out)
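
As a quick sanity check of the logic above, here is a standalone, hypothetical snippet (not part of this diff) that replays the matrix-multiply and convolution branches on two example shapes:

```python
import numpy as np

# Hypothetical shapes, only to illustrate the fan computation above.
def compute_fans(shape):
    if len(shape) == 2:  # simple matrix multiply
        return shape[0], shape[1]
    # conv kernel: [num_filters, num_filter_channels, *filter_size]
    receptive_field_size = np.prod(shape[2:])
    return int(shape[1] * receptive_field_size), int(shape[0] * receptive_field_size)

print(compute_fans([128, 64]))       # (128, 64)
print(compute_fans([32, 16, 3, 3]))  # (144, 288)
```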


class ConstantInitializer(Initializer):
    """Implements the constant initializer
@@ -156,3 +195,93 @@ def __call__(self, var, block):
            })
        var.op = op
        return op


class XavierInitializer(Initializer):
    """Implements the Xavier initializer

    This class implements the Xavier weight initializer from the paper
    Understanding the difficulty of training deep feedforward neural
    networks[1] by Xavier Glorot and Yoshua Bengio.

    This initializer is designed to keep the scale of the gradients
    approximately the same in all the layers. In the case of the uniform
    distribution, the range is [-x, x], where x = sqrt(6 / (fan_in + fan_out)).
    In the case of the normal distribution, the mean is 0 and the standard
    deviation is sqrt(2 / (fan_in + fan_out)).

    References:
        [1] Understanding the difficulty of training deep feedforward neural
            networks. International conference on artificial intelligence and
            statistics.
            (http://proceedings.mlr.press/v9/glorot10a.html)
    """

Review comment: I like reading your comments, quite easy to follow. Just out of personal interest: I remember this is the recommended initializer for tanh, and the usually recommended initializer varies with the activation. I think we can add all these helpful initializers into our framework as a first goal, and later choose a better initialization strategy to encode the best practice.

Author reply: Thank you for the feedback. I agree with you.
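
To make the two formulas concrete, here are worked numbers (an illustration only, matching the [5, 10] parameter shape used in the unit tests further down):

```python
import numpy as np

fan_in, fan_out = 5, 10
limit = np.sqrt(6.0 / (fan_in + fan_out))  # uniform range is [-limit, limit], about [-0.632, 0.632]
std = np.sqrt(2.0 / (fan_in + fan_out))    # normal std, about 0.365
print(limit, std)
```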

    def __init__(self, uniform=True, fan_in=None, fan_out=None, seed=0):

Review comment: Normal distribution is used more frequently than uniform distribution. Shall we change the default value of uniform?

Review comment: I think the uniform distribution is acceptable for me; my concern is that we are not yet at a stage where we are intensively encoding best practices from the community into our framework. We can carefully tune this later. But a uniform distribution over [-1, 1] as a default initialization seems terrible (only from my personal experience; it seems too large). Maybe we can reduce the region limit to 1e-3?

Author reply: I was looking at TensorFlow, where they keep uniform as the default. That is why I chose uniform as the default.

Review comment: Got it, thank you for the information.
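
(For reference, the normal-distribution variant can already be selected explicitly, as the tests below do; `initializer` here refers to this module, imported as in the tests:)

```python
# Gaussian Xavier instead of the uniform default:
init = initializer.XavierInitializer(uniform=False)  # std = sqrt(2 / (fan_in + fan_out))
```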

        """Constructor for XavierInitializer

        Args:

Review comment: Please add descriptions for the default values.

Author reply: Thank you for the feedback. Could you give me an example of the description you are referring to? I have talked about fan_in and fan_out in the docstring.

Review comment: Just ignore this comment, I didn't realize that the signature of the function will be displayed in the doc.
            uniform: whether to use uniform or normal distribution
            fan_in: fan_in for Xavier initialization. If None, it is
                    inferred from the variable.
            fan_out: fan_out for Xavier initialization. If None, it is
                    inferred from the variable.
            seed: random seed

        Note: It is recommended to set fan_in and fan_out to None for
              most cases.
        """
        assert uniform is not None
        assert seed is not None
        super(XavierInitializer, self).__init__()
        self._uniform = uniform
        self._fan_in = fan_in
        self._fan_out = fan_out
        self._seed = seed

    def __call__(self, var, block):
        """Add xavier initialization ops for a variable

        Args:
            var: Variable that needs to be initialized
            block: The block in which initialization ops
                   should be added

        Returns:
            the initialization op
        """
        assert isinstance(var, framework.Variable)
        assert isinstance(block, framework.Block)
        f_in, f_out = self._compute_fans(var)

        # If fan_in and fan_out are passed, use them
        fan_in = f_in if self._fan_in is None else self._fan_in
        fan_out = f_out if self._fan_out is None else self._fan_out

        if self._uniform:
            limit = np.sqrt(6.0 / float(fan_in + fan_out))
            op = block.prepend_op(
                type="uniform_random",
                outputs={"Out": var},
                attrs={
                    "shape": var.shape,
                    "data_type": int(var.data_type),

Review comment: I'm not familiar with Block. Is it correct to pass data_type as an integer here?

Author reply: Yes, this is correct. We pass the data type as an integer, which is then mapped to a given type.
                    "min": -limit,
                    "max": limit,
                    "seed": self._seed
                })

        else:
            std = np.sqrt(2.0 / float(fan_in + fan_out))
            op = block.prepend_op(
                type="gaussian_random",
                outputs={"Out": var},
                attrs={
                    "shape": var.shape,
                    "data_type": int(var.data_type),
                    "mean": 0.0,
                    "std": std,
                    "seed": self._seed
                })
        var.op = op
        return op
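
Taken together, a minimal usage sketch (mirroring the unit tests below; the initializer module path is an assumption here, since its import line is not shown in this diff):

```python
import paddle.v2.framework.framework as framework
import paddle.v2.framework.initializer as initializer  # assumed module path

program = framework.Program()
block = program.global_block()
param = block.create_parameter(
    dtype="float32",
    shape=[5, 10],
    lod_level=0,
    name="param",
    initializer=initializer.XavierInitializer())

# The initializer prepends a uniform_random op to the block.
print(block.ops[0].type)  # 'uniform_random'
```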
@@ -1,3 +1,4 @@
import numpy as np
import unittest

import paddle.v2.framework.framework as framework
@@ -116,5 +117,111 @@ def test_normal_initializer(self):
        self.assertEqual(init_op.attr('seed'), 123)


class TestXavierInitializer(unittest.TestCase):
    def test_uniform_xavier_initializer(self):
        """Test Xavier initializer with uniform distribution
        for matrix multiply.
        """
        program = framework.Program()
        block = program.global_block()
        param = block.create_parameter(
            dtype="float32",
            shape=[5, 10],
            lod_level=0,
            name="param",
            initializer=initializer.XavierInitializer())
        self.assertEqual(len(block.ops), 1)
        init_op = block.ops[0]
        self.assertEqual(init_op.type, 'uniform_random')
        limit = np.sqrt(6.0 / (param.shape[0] + param.shape[1]))
        self.assertAlmostEqual(init_op.attr('min'), -limit, delta=DELTA)
        self.assertAlmostEqual(init_op.attr('max'), limit, delta=DELTA)
        self.assertEqual(init_op.attr('seed'), 0)

Review comment: I think we should also test whether the seed can be set properly.

Author reply: Fixed in 4e1fa1b

    def test_uniform_xavier_initializer_conv(self):
        """Test Xavier initializer with uniform distribution
        for convolutions.
        """
        program = framework.Program()
        block = program.global_block()
        param = block.create_parameter(
            dtype="float32",
            shape=[5, 10, 15, 20],
            lod_level=0,
            name="param",
            initializer=initializer.XavierInitializer())
        self.assertEqual(len(block.ops), 1)
        init_op = block.ops[0]
        self.assertEqual(init_op.type, 'uniform_random')
        receptive_field_size = float(15 * 20)
        limit = np.sqrt(6.0 / (
            (param.shape[0] + param.shape[1]) * receptive_field_size))
        self.assertAlmostEqual(init_op.attr('min'), -limit, delta=DELTA)
        self.assertAlmostEqual(init_op.attr('max'), limit, delta=DELTA)
        self.assertEqual(init_op.attr('seed'), 0)

    def test_normal_xavier_initializer(self):
        """Test Xavier initializer with normal distribution
        for matrix multiply.
        """
        program = framework.Program()
        block = program.global_block()
        param = block.create_parameter(
            dtype="float32",
            shape=[5, 10],
            lod_level=0,
            name="param",
            initializer=initializer.XavierInitializer(uniform=False))
        self.assertEqual(len(block.ops), 1)
        init_op = block.ops[0]
        self.assertEqual(init_op.type, 'gaussian_random')
        std = np.sqrt(2.0 / (param.shape[0] + param.shape[1]))
        self.assertAlmostEqual(init_op.attr('mean'), 0.0, delta=DELTA)
        self.assertAlmostEqual(init_op.attr('std'), std, delta=DELTA)
        self.assertEqual(init_op.attr('seed'), 0)

    def test_normal_xavier_initializer_conv(self):
        """Test Xavier initializer with normal distribution
        for convolutions.
        """
        program = framework.Program()
        block = program.global_block()
        param = block.create_parameter(
            dtype="float32",
            shape=[5, 10, 15, 20],
            lod_level=0,
            name="param",
            initializer=initializer.XavierInitializer(uniform=False))
        self.assertEqual(len(block.ops), 1)
        init_op = block.ops[0]
        self.assertEqual(init_op.type, 'gaussian_random')
        receptive_field_size = float(15 * 20)
        std = np.sqrt(2.0 / (
            (param.shape[0] + param.shape[1]) * receptive_field_size))
        self.assertAlmostEqual(init_op.attr('mean'), 0.0, delta=DELTA)
        self.assertAlmostEqual(init_op.attr('std'), std, delta=DELTA)
        self.assertEqual(init_op.attr('seed'), 0)

    def test_xavier_initializer_supplied_arguments(self):
        """Test the Xavier initializer with supplied arguments
        """
        program = framework.Program()
        block = program.global_block()
        block.create_parameter(
            dtype="float32",
            shape=[5, 10],
            lod_level=0,
            name="param",
            initializer=initializer.XavierInitializer(
                fan_in=12, fan_out=23, seed=134))
        self.assertEqual(len(block.ops), 1)
        init_op = block.ops[0]
        self.assertEqual(init_op.type, 'uniform_random')
        limit = np.sqrt(6.0 / (12 + 23))
        self.assertAlmostEqual(init_op.attr('min'), -limit, delta=DELTA)
        self.assertAlmostEqual(init_op.attr('max'), limit, delta=DELTA)
        self.assertEqual(init_op.attr('seed'), 134)


if __name__ == '__main__':
    unittest.main()
Review comment: The recommended scale/std for different nonlinearity functions are different. I think we can borrow the idea from PyTorch to make this initializer more general:
/~https://github.com/pytorch/pytorch/blob/master/torch/nn/init.py#L8
/~https://github.com/pytorch/pytorch/blob/master/torch/nn/init.py#L184

Author reply: The original paper does not talk about ReLUs and gain because it predates the time when ReLUs became popular. I also looked at the TensorFlow and Keras source code; they have kept this initialization as it is defined in the paper. Maybe we can merge this for now and then add the gain later in a separate PR when we have more knowledge about it. I can read a few more papers to see how the gain attribute is used. I do not think it is right to borrow the idea directly without researching it thoroughly. We can merge this and look at it in more detail after the refactoring is complete. What do you suggest?
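
Purely to illustrate the PyTorch-style generalization discussed above, here is a hypothetical sketch (not part of this PR and not PaddlePaddle API; the gain values follow torch.nn.init.calculate_gain):

```python
import numpy as np

# Hypothetical: scale the Xavier bounds by an activation-dependent gain.
GAINS = {"linear": 1.0, "tanh": 5.0 / 3.0, "relu": np.sqrt(2.0)}

def xavier_params(fan_in, fan_out, gain=1.0, uniform=True):
    if uniform:
        limit = gain * np.sqrt(6.0 / (fan_in + fan_out))
        return {"min": -limit, "max": limit}  # attrs for a uniform_random op
    return {"mean": 0.0,
            "std": gain * np.sqrt(2.0 / (fan_in + fan_out))}  # attrs for a gaussian_random op

print(xavier_params(5, 10, gain=GAINS["relu"]))
```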