Adding the Xavier Initializer #5270
Conversation
    fan_out = shape[1]
else:
    # Assume this to be a convolutional kernel
    # In Paddlepaddle, the shape of the kernel is like:
s/Paddlepaddle/PaddlePaddle/g
Fixed in 4e1fa1b
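For context, a minimal sketch of the fan computation this hunk belongs to, assuming the common convention that a convolutional kernel's fans are the channel counts times the receptive field size (the kernel layout shown is my assumption, not copied from the PR):

import numpy as np

def compute_fans(shape):
    if len(shape) == 2:
        # Fully connected weight: [fan_in, fan_out]
        fan_in, fan_out = shape[0], shape[1]
    else:
        # Convolutional kernel, assumed layout [C_out, C_in, H, W]
        receptive_field = int(np.prod(shape[2:]))
        fan_in = shape[1] * receptive_field
        fan_out = shape[0] * receptive_field
    return fan_in, fan_out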
(http://proceedings.mlr.press/v9/glorot10a.html)
"""

def __init__(self, uniform=True, fan_in=None, fan_out=None, seed=0):
Normal distribution is used more frequently than Uniform distribution. Shall we change the default value of uniform to False?
A uniform distribution is acceptable to me; my concern is that we are not yet at a stage where we are intensively encoding the best practices from the community into our framework. We can carefully tune this later.
But a uniform distribution over [-1, 1] as a default initialization seems terrible (only from my personal experience; it seems too large). Maybe we can reduce the range limit to 1e-3?
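For scale, a quick back-of-the-envelope comparison (my own illustration; the 1000x1000 layer size is an arbitrary assumption, not something discussed in the PR):

import numpy as np

fan_in, fan_out = 1000, 1000
xavier_limit = np.sqrt(6.0 / (fan_in + fan_out))  # ~0.055
# Xavier uniform draws from roughly [-0.055, 0.055] for this layer,
# far narrower than a naive uniform [-1, 1] default.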
I was looking at tensorflow where they keep uniform as the default. That is why I chose uniform as the default.
Got it~ thank you for the information.
def __init__(self, uniform=True, fan_in=None, fan_out=None, seed=0):
    """Constructor for XavierInitializer

    Args:
Please add descriptions for the default values.
Thank you for the feedback. Could you give me an example of the description you are referring to? I have talked about fan_in and fan_out in the Args section of my docstring.
Just ignore this comment; I didn't realize that the signature of the function will be displayed in the doc.
For example, "uniform: .... Default value is True.". Otherwise, one may need to dig into the code to find out what the default values are.
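One possible way to phrase the requested default-value descriptions (a sketch only; the wording actually adopted in the PR may differ):

def __init__(self, uniform=True, fan_in=None, fan_out=None, seed=0):
    """Constructor for XavierInitializer

    Args:
        uniform: whether to use uniform or normal distribution.
            Default value is True.
        fan_in: fan_in for Xavier initialization. Inferred from the
            variable if None. Default value is None.
        fan_out: fan_out for Xavier initialization. Inferred from the
            variable if None. Default value is None.
        seed: random seed. Default value is 0.
    """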
f_in, f_out = self._compute_fans(var)

# If fan_in and fan_out are passed, use them
fan_in = self._fan_in or f_in
f_in will be used if self._fan_in is 0, which is counter-intuitive. Therefore I think f_in if self._fan_in is None else self._fan_in is better. 0 is not a proper value for self._fan_in, which means that when encountering a 0, either the user is using it intentionally for debugging purposes or an error has occurred. Therefore, I think failing deterministically would be a better choice.
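A small sketch of the behavior being discussed (plain variables stand in for self._fan_in and the computed f_in; the value 128 is arbitrary):

f_in = 128        # fan computed from the variable's shape
user_fan_in = 0   # caller explicitly passed fan_in=0

fan_in = user_fan_in or f_in                           # -> 128, the 0 is silently dropped
fan_in = f_in if user_fan_in is None else user_fan_in  # -> 0, the odd value surfaces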
Fixed in 4e1fa1b
# If fan_in and fan_out are passed, use them
fan_in = self._fan_in or f_in
fan_out = self._fan_out or f_out
see above
outputs={"Out": var},
attrs={
    "shape": var.shape,
    "data_type": int(var.data_type),
I'm not familiar with Block. Is int correct?
Yes, this is correct. We pass the data type as an integer, which is then mapped to a given type.
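A rough sketch of the idea (a hypothetical stand-in, not PaddlePaddle's actual enum or values): the dtype is stored as an enum on the Python side, cast to a plain int for the op attribute, and mapped back to a concrete type when the op runs.

from enum import IntEnum

class DataType(IntEnum):   # hypothetical stand-in for the framework's dtype enum
    FP32 = 5
    FP64 = 6

attrs = {"data_type": int(DataType.FP32)}   # stored as a plain int on the op
restored = DataType(attrs["data_type"])     # mapped back to DataType.FP32 later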
limit = np.sqrt(6.0 / (param.shape[0] + param.shape[1]))
self.assertAlmostEqual(init_op.attr('min'), -limit, delta=DELTA)
self.assertAlmostEqual(init_op.attr('max'), limit, delta=DELTA)
self.assertEqual(init_op.attr('seed'), 0)
I think we should also test whether seed can be set properly.
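A minimal sketch of such a check, assuming the same program/block/parameter setup as the existing uniform test above, with the initializer constructed using a non-default seed (the value 134 is arbitrary):

# e.g. parameter created with XavierInitializer(uniform=True, seed=134)
limit = np.sqrt(6.0 / (param.shape[0] + param.shape[1]))
self.assertAlmostEqual(init_op.attr('min'), -limit, delta=DELTA)
self.assertAlmostEqual(init_op.attr('max'), limit, delta=DELTA)
self.assertEqual(init_op.attr('seed'), 134)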
Fixed in 4e1fa1b
approximately same in all the layers. In case of Uniform distribution,
the range is [-x, x], where x = sqrt(6 / (fan_in + fan_out)).
In case of Normal distribution, the mean is 0 and the standard deviation
is sqrt(2 / (fan_in + fan_out)).
The recommended scale/std differs for different nonlinearity functions. I think we can borrow the idea from PyTorch to make this initializer more general.
/~https://github.com/pytorch/pytorch/blob/master/torch/nn/init.py#L8
/~https://github.com/pytorch/pytorch/blob/master/torch/nn/init.py#L184
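A sketch of the gain idea those links describe (the gain values follow PyTorch's calculate_gain; this is an illustration of a possible future extension, not something implemented in this PR):

import numpy as np

GAINS = {"linear": 1.0, "tanh": 5.0 / 3.0, "relu": np.sqrt(2.0)}

def xavier_std(fan_in, fan_out, nonlinearity="tanh"):
    # Xavier/Glorot normal std scaled by an activation-dependent gain
    return GAINS[nonlinearity] * np.sqrt(2.0 / (fan_in + fan_out))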
The original paper does not talk about ReLUs or gain because it predates the time when ReLUs became popular. I also looked at the TensorFlow and Keras source code; they have kept this initialization as it is defined in the paper. Maybe we can merge this for now and add the gain later in a separate PR when we have more knowledge about it. I can read a few more papers to see how the gain attribute is used. I do not think it is right to borrow the idea directly without researching it thoroughly. We can merge this and look at it in more detail after the refactoring is complete. What do you suggest?
I like reading your doc/comments; they are quite easy to follow, thank you.
networks. International conference on artificial intelligence and
statistics.
(http://proceedings.mlr.press/v9/glorot10a.html)
"""
I like reading your comments~ Quite easy to follow.
Just out of personal interest: I remember this is the recommended initializer for tanh, and the usually recommended initializer varies with the activation function.
I think we can just add all these helpful initializers into our framework as a first goal, and later choose a better initialization strategy to encode the best practice.
Thank you for the feedback. I agree with you.
LGTM.
Not changing the default uniform distribution is acceptable to me. (Personally, this is not a top concern for me at the moment, and I do not have strong evidence to support any other choice; they all seem reasonable.)
No description provided.