support offload in sharding stage2 #37904
Conversation
Thanks for your contribution!
if param.name not in self._master_params.keys():
    self._master_params[param.name] = core.VarBase(
        name=param.name,
        value=param.cast(dtype=Type.fp32.value).numpy())
Please change this one to `.value().get_tensor()` as well.
OK.
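For clarity, a minimal sketch of the change being requested, using the names from the quoted diff (`self._master_params`, `core.VarBase`, `Type.fp32`). Whether `core.VarBase` accepts a dense tensor as `value` directly depends on the Paddle version, so read this as the reviewer's intent rather than the merged code:

```python
# Before: cast to fp32, then round-trip through host memory via numpy().
self._master_params[param.name] = core.VarBase(
    name=param.name,
    value=param.cast(dtype=Type.fp32.value).numpy())

# Suggested: hand over the casted tensor directly via .value().get_tensor(),
# skipping the extra host-side numpy copy. (Hedged: whether a LoDTensor is
# accepted as `value` varies across Paddle versions.)
self._master_params[param.name] = core.VarBase(
    name=param.name,
    value=param.cast(dtype=Type.fp32.value).value().get_tensor())
```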
for param in self._local_params:
    if param.name in self._master_params.keys():
        param.set_value(self._master_params[param.name].cuda(dev_id))
This spot increases GPU memory usage: you need to release the param first, and then share_data the master parameter into it.
OK.
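A sketch of the ordering the reviewer is asking for: free the parameter's device storage before materializing the fp32 master copy on the GPU, so both buffers never coexist. `_clear_data()` and `_share_buffer_to()` are internal dygraph helpers whose names and availability vary across Paddle versions; treat this as an illustration of the idea, not the merged implementation:

```python
# Illustrative only: release each param's stale buffer before loading the
# master weight, instead of calling set_value() on top of the live buffer.
for param in self._local_params:
    if param.name in self._master_params.keys():
        master_fp32 = self._master_params[param.name]
        # Free the old device storage first so peak memory never holds
        # the fp16 param and the fp32 master copy at the same time.
        param._clear_data()  # internal API; name may differ by version
        # Move and cast the master weight, then share its buffer into param.
        master_fp32.cuda(dev_id).cast(
            dtype=param.dtype)._share_buffer_to(param)
```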
LGTM
PR types
Function optimization
PR changes
Others
Describe
Support offload, grad_clip and loss_scaler in dygraph sharding stage2
Optimize the performance of offload in PR-38064
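For context, a hedged usage sketch of how a training script might turn these features on with dygraph sharding stage2. The import paths, class names, and the `offload` flag are assumptions based on Paddle's sharding modules from around this PR; verify them against your installed release:

```python
import paddle
from paddle.distributed import fleet
# Assumed module paths for this Paddle era; check your version.
from paddle.distributed.fleet.meta_optimizers.dygraph_optimizer.sharding_optimizer_stage2 import (
    ShardingOptimizerStage2)
from paddle.distributed.fleet.meta_parallel.sharding.sharding_stage2 import (
    ShardingStage2)

fleet.init(is_collective=True)
model = paddle.nn.Sequential(paddle.nn.Linear(1024, 1024))  # toy stand-in

# grad_clip flows through the wrapped inner optimizer.
clip = paddle.nn.ClipGradByGlobalNorm(clip_norm=1.0)
optimizer = paddle.optimizer.AdamW(
    learning_rate=1e-4, parameters=model.parameters(), grad_clip=clip)

# offload=True keeps the fp32 master weights off the GPU (the feature
# this PR adds); the exact argument name is an assumption.
optimizer = ShardingOptimizerStage2(
    params=model.parameters(), optim=optimizer, offload=True)
model = ShardingStage2(model, optimizer)

# Loss scaling for pure-fp16 training, also supported by this PR.
scaler = paddle.amp.GradScaler(init_loss_scaling=2**15)
```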
PaddleNLP GPT-3 model, sharding stage2 + pfp16, with and without offload:
1> PaddleNLP GPT-3 model, 0.31B parameters
Single machine, two GPUs, sharding stage2 + pfp16 without offload: peak GPU memory 5319 MiB. (Memory-usage curve was attached as an image.)
Single machine, two GPUs, sharding stage2 + pfp16 with offload: peak GPU memory 3137 MiB, saving 2182 MiB (about 41%). (Memory-usage curve was attached as an image.)
2> PaddleNLP GPT-3 model, 1.02B parameters
Single machine, two GPUs, sharding stage2 + pfp16 without offload: peak GPU memory 11941 MiB.
Single machine, two GPUs, sharding stage2 + pfp16 with offload: peak GPU memory 5369 MiB, saving 6572 MiB (about 55%).