Merge pull request #122 from ninehills/create-pull-request/patch
Changes by create-pull-request action
Showing 2 changed files with 31 additions and 0 deletions.
# DeepSeek R1 Reading List

> Author: **ninehills**
> Labels: **blog**
> Created: **2025-01-29T04:48:48Z**
> Link and comments: </~https://github.com/ninehills/blog/issues/121>

With the release of DeepSeek R1, if you want to reproduce R1 or try RFT (Reinforcement Fine-Tuning) in a specific domain, this curated list may be useful; it will be updated continuously. I will also post the results of my own experiments here.

> Last updated: 2025.1.29
- Papers
    - [DeepSeek R1](/~https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf): the DeepSeek R1 paper itself; a captivating read.
    - [Kimi K1.5](https://arxiv.org/pdf/2501.12599v1): the Kimi K1.5 reasoning model takes an approach similar to R1, with more detail on the data and reward functions.
    - [DeepSeek Math](https://arxiv.org/pdf/2402.03300): introduces the GRPO algorithm; compared to PPO, GRPO drops the value model, which lowers the GPU memory needed for training (a formula sketch follows after this list).
- Open-source GRPO implementations: the key requirement is support for custom reward functions (a minimal usage sketch follows after this list).
    - [trl grpo trainer](https://huggingface.co/docs/trl/main/en/grpo_trainer): TRL's GRPOTrainer implementation; not yet released, so you need to install trl from the main branch.
    - [veRL](/~https://github.com/volcengine/verl): ByteDance's open-source RL framework, which also supports GRPO reward functions.
- R1 reproduction projects and datasets
    - [open-r1](/~https://github.com/huggingface/open-r1/): includes code for data synthesis, SFT, and GRPO RL.
    - [TinyZero](/~https://github.com/Jiayi-Pan/TinyZero): reproduces the R1 RL paradigm on a simple 24-game-style task.
    - [SkyT1](/~https://github.com/NovaSky-AI/SkyThought): an o1-like model trained on data distilled from QwQ; it still relies on a reward model, but the code is worth studying.
    - [HuatuoGPT-o1](/~https://github.com/FreedomIntelligence/HuatuoGPT-o1): an o1 reproduction in the medical domain (code, data, paper, and models are all open), but it still uses a reward model and the gains are small. It would be worth trying the R1 RL paradigm to see whether it brings a clear improvement.
    - [simpleRL-reason](/~https://github.com/hkust-nlp/simpleRL-reason): reproduces the R1-Zero paradigm on 8k MATH examples.
    - [open-r1-multimodal](/~https://github.com/EvolvingLMMs-Lab/open-r1-multimodal): a multimodal reproduction of R1.
    - [open-thoughts](/~https://github.com/open-thoughts/open-thoughts): **[Highlight]** the most mature R1 reproduction project; it has released the [Bespoke-Stratos-17k dataset](https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k) and the [OpenThoughts-114k dataset](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k), and SFT alone gets close to the R1-distill models.
    - [R1-Distill-SFT](https://huggingface.co/datasets/ServiceNow-AI/R1-Distill-SFT): a dataset of 1.68M R1-distilled examples.
    - [grpo_demo.py](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb): an RL demo built on a 0.5B model, useful for learning how the training works.
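
To make the GRPO item in the paper list concrete: the core idea, as described in the DeepSeek Math paper, is to replace PPO's learned value model with a group-relative advantage. For each prompt, $G$ completions are sampled and scored by the reward function, and each completion's advantage is simply its reward normalized within the group:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}
$$

Because no separate value network has to be trained or held in memory, the GPU memory requirement drops relative to PPO.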
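
The "support a reward function" requirement above looks roughly like the following with TRL's GRPOTrainer. This is a minimal sketch, not an authoritative recipe: the trainer was unreleased at the time of writing (install trl from the main branch), so the exact API may differ, and the dataset, model choice, and reward logic here are illustrative placeholders.

```python
# Minimal GRPO sketch with a rule-based reward function (no reward model).
# Assumes trl installed from main: pip install git+/~https://github.com/huggingface/trl.git
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Hypothetical reward: favor completions that contain a boxed final answer.
# A real setup would parse the answer and compare it against a reference label.
def format_reward(completions, **kwargs):
    return [1.0 if "\\boxed{" in completion else 0.0 for completion in completions]

# Placeholder prompt dataset; substitute your own (e.g. math problems with answers).
dataset = load_dataset("trl-lib/tldr", split="train")

training_args = GRPOConfig(output_dir="qwen-grpo-demo", num_generations=8)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # a small model, in the spirit of grpo_demo.py above
    reward_funcs=format_reward,          # callable (or list of callables) returning per-completion rewards
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

veRL exposes the same idea through its own reward-function interface; the common thread is that the reward is computed by code you control rather than by a learned reward model.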