Merge pull request #122 from ninehills/create-pull-request/patch
Changes by create-pull-request action
Showing 2 changed files with 31 additions and 0 deletions.
# DeepSeek R1 Reading List

> Author: **ninehills**
> Labels: **blog**
> Created: **2025-01-29T04:48:48Z**
> Link and comments: </~https://github.com/ninehills/blog/issues/121>

With the release of DeepSeek R1, if you want to reproduce R1 or try RFT (Reinforcement Fine-Tuning) in a specific domain, this curated list may be useful; it will be updated continuously. I will also post the results of my own experiments here.

> Last updated: 2025.1.29
- Papers
    - [DeepSeek R1](/~https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf): the DeepSeek R1 paper itself; a captivating read.
    - [Kimi K1.5](https://arxiv.org/pdf/2501.12599v1): the Kimi K1.5 reasoning model takes an approach similar to R1, with more detail on the data and reward functions.
    - [DeepSeek Math](https://arxiv.org/pdf/2402.03300): introduces the GRPO algorithm; compared to PPO, GRPO drops the value model, which lowers the GPU memory needed for training (a formula sketch follows after this list).
- Open-source GRPO implementations: the key requirement is support for custom reward functions (a minimal usage sketch follows after this list).
    - [trl grpo trainer](https://huggingface.co/docs/trl/main/en/grpo_trainer): TRL's GRPOTrainer implementation; not yet released, so you need to install trl from the main branch.
    - [veRL](/~https://github.com/volcengine/verl): ByteDance's open-source RL framework, which also supports GRPO reward functions.
- R1 reproduction projects and datasets
    - [open-r1](/~https://github.com/huggingface/open-r1/): includes code for data synthesis, SFT, and GRPO RL.
    - [TinyZero](/~https://github.com/Jiayi-Pan/TinyZero): reproduces the R1 RL paradigm on a simple 24-game-style task.
    - [SkyT1](/~https://github.com/NovaSky-AI/SkyThought): an o1-like model trained on data distilled from QwQ; it still relies on a reward model, but the code is worth studying.
    - [HuatuoGPT-o1](/~https://github.com/FreedomIntelligence/HuatuoGPT-o1): an o1 reproduction in the medical domain (code, data, paper, and models are all open), but it still uses a reward model and the gains are small. It would be worth trying the R1 RL paradigm to see whether it brings a clear improvement.
    - [simpleRL-reason](/~https://github.com/hkust-nlp/simpleRL-reason): reproduces the R1-Zero paradigm on 8k MATH examples.
    - [open-r1-multimodal](/~https://github.com/EvolvingLMMs-Lab/open-r1-multimodal): a multimodal reproduction of R1.
    - [open-thoughts](/~https://github.com/open-thoughts/open-thoughts): **[Highlight]** the most mature R1 reproduction project; it has released the [Bespoke-Stratos-17k dataset](https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k) and the [OpenThoughts-114k dataset](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k), and SFT alone gets close to the R1-distill models.
    - [R1-Distill-SFT](https://huggingface.co/datasets/ServiceNow-AI/R1-Distill-SFT): a dataset of 1.68M R1-distilled examples.
    - [grpo_demo.py](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb): an RL demo built on a 0.5B model, useful for learning how the training works.
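
To make the GRPO item in the paper list concrete: the core idea, as described in the DeepSeek Math paper, is to replace PPO's learned value model with a group-relative advantage. For each prompt, $G$ completions are sampled and scored by the reward function, and each completion's advantage is simply its reward normalized within the group:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}
$$

Because no separate value network has to be trained or held in memory, the GPU memory requirement drops relative to PPO.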
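
The "support a reward function" requirement above looks roughly like the following with TRL's GRPOTrainer. This is a minimal sketch, not an authoritative recipe: the trainer was unreleased at the time of writing (install trl from the main branch), so the exact API may differ, and the dataset, model choice, and reward logic here are illustrative placeholders.

```python
# Minimal GRPO sketch with a rule-based reward function (no reward model).
# Assumes trl installed from main: pip install git+/~https://github.com/huggingface/trl.git
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Hypothetical reward: favor completions that contain a boxed final answer.
# A real setup would parse the answer and compare it against a reference label.
def format_reward(completions, **kwargs):
    return [1.0 if "\\boxed{" in completion else 0.0 for completion in completions]

# Placeholder prompt dataset; substitute your own (e.g. math problems with answers).
dataset = load_dataset("trl-lib/tldr", split="train")

training_args = GRPOConfig(output_dir="qwen-grpo-demo", num_generations=8)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # a small model, in the spirit of grpo_demo.py above
    reward_funcs=format_reward,          # callable (or list of callables) returning per-completion rewards
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

veRL exposes the same idea through its own reward-function interface; the common thread is that the reward is computed by code you control rather than by a learned reward model.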