In the PPO training code, the outer loop manually implements gradient accumulation, while the actual gradient accumulation and learning rate decay are handled inside the TRL library. I have a few questions about this.
1. Since version 0.4.5, the TRL library zeroes the gradients inside the gradient accumulation loop:
/~https://github.com/lvwerra/trl/blob/388bdc03ac40a42dfb77dbbc416b31a3d076b18e/trl/trainer/ppo_trainer.py#L685
With the default ppo_config settings, where ppo_epochs=1 and batch_size = mini_batch_size = training_args.per_device_train_batch_size, doesn't this make gradient accumulation ineffective? The earlier gradients are all zeroed out, and only the gradients from the last backward pass are kept.
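Here is a minimal, self-contained PyTorch sketch (not the TRL code itself; the model, data, and accumulation count are made up for illustration) of why calling `zero_grad()` inside the accumulation loop keeps only the last mini-batch's gradients:

```python
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
mini_batches = [torch.randn(2, 4) for _ in range(4)]  # 4 accumulation micro-steps

# Pattern A: true accumulation -- gradients from all mini-batches are summed.
optimizer.zero_grad()
for x in mini_batches:
    loss = model(x).mean() / len(mini_batches)
    loss.backward()
accumulated = model.weight.grad.clone()

# Pattern B: zero_grad() inside the loop -- earlier gradients are wiped out,
# so the eventual optimizer step only reflects the final mini-batch.
for x in mini_batches:
    optimizer.zero_grad()
    loss = model(x).mean() / len(mini_batches)
    loss.backward()
last_only = model.weight.grad.clone()

print(torch.allclose(accumulated, last_only))  # False in general
```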
2. Regarding learning rate decay, the TRL library steps the learning rate scheduler here: /~https://github.com/lvwerra/trl/blob/388bdc03ac40a42dfb77dbbc416b31a3d076b18e/trl/trainer/ppo_trainer.py#L746
However, in the PPO workflow.py code, the number of training steps is computed with gradient accumulation taken into account. Wouldn't this cause the step counts to disagree? In practice, during gradient accumulation the learning rate is decayed on every loop iteration.
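Below is a minimal sketch of that mismatch (the numbers and the linear-decay schedule are hypothetical, not the project's actual config): if the scheduler is stepped once per mini-batch while the workflow counts one training step per `gradient_accumulation_steps` mini-batches, the decay finishes `gradient_accumulation_steps` times too early.

```python
import torch

gradient_accumulation_steps = 4    # hypothetical value
total_training_steps = 8           # what the workflow would compute for the whole run
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
# linear decay to zero over the intended number of training steps
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: max(0.0, 1.0 - step / total_training_steps)
)

for training_step in range(total_training_steps):
    for _ in range(gradient_accumulation_steps):
        # ... loss.backward() on one mini-batch would happen here ...
        optimizer.step()   # the inner loop steps the optimizer per mini-batch
        scheduler.step()   # and decays the learning rate per mini-batch as well
    # the LR reaches zero after only
    # total_training_steps / gradient_accumulation_steps outer steps
    print(f"training step {training_step}: lr = {scheduler.get_last_lr()[0]}")
```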