
Logic-RL-Lite: Lightweight Replication of DeepSeek-R1-Zero

Logic-RL-Lite is a lightweight replication study of the DeepSeek-R1-Zero framework. This project investigates the use of pure reinforcement learning (RL), without supervised fine-tuning (SFT), to post-train base models for reasoning capabilities. It is a follow-up to the Logic-RL project.

It leverages the following key components:

  1. veRL Framework (developed by ByteDance)
  2. Knights and Knaves (K&K) Logic Puzzle Dataset (provided by the Logic-RL project; a minimal rule-based reward sketch for this setup follows the list)
  3. Small-Scale Base Models, including:
    • Qwen2.5 (1.5B, 3B)
    • Llama3.2 (3B)
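
For context, DeepSeek-R1-Zero-style training on K&K puzzles scores rollouts with a simple rule-based reward (format check plus answer check) rather than a learned reward model. The sketch below is a hypothetical, minimal version of such a reward; the function name, tag format, and score values are illustrative assumptions, not this repository's actual implementation.

```python
import re

def compute_reward(response: str, ground_truth: dict) -> float:
    """Hypothetical rule-based reward for one K&K rollout.

    ground_truth maps character names to roles,
    e.g. {"Alice": "knight", "Bob": "knave"}.
    """
    # Format reward: reasoning and final answer must be wrapped in
    # <think>...</think> and <answer>...</answer> tags.
    if not re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", response, re.DOTALL):
        return -1.0  # malformed output (illustrative penalty)

    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL).group(1)

    # Answer reward: every character must be assigned the correct role.
    for name, role in ground_truth.items():
        if not re.search(rf"{re.escape(name)}\s+is\s+a\s+{role}", answer, re.IGNORECASE):
            return -0.5  # well-formed but wrong or incomplete answer
    return 1.0  # correct format and correct assignment
```

Because the reward is fully rule-based, there is no reward model to train or hack; the only learning signal is whether the sampled chain of thought ends in a verifiably correct assignment.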

Key Findings

1. Smallest Model Capable of Learning Reasoning via Pure RL

  • 1.5B Models:
    • Neither instruction-tuned nor pretrained-only models learn reasoning at this scale.
  • 3B Models:
    • Instruction-tuned models (e.g., Qwen2.5-3B) can learn reasoning.
    • Pretrained-only models (e.g., Llama3.2-3B) struggle to learn reasoning.
    • Hypothesis: Qwen2.5-3B-Pretrain is probably somewhat instruction tuned, making it significantly more capable than Llama3.2-3B-Pretrain.
  • 7B Models and Larger:
    • Models of this scale consistently and easily learn reasoning.

2. No "Aha Moment" During Pure RL

  • Self-reflection and rethinking behaviors appear at epoch 0 (or even step 0) in instruction-tuned base models.
  • These behaviors likely stem from instruction tuning, rather than emergent properties of pure RL.
  • This aligns with findings from the oat-zero replication. A minimal way to check for such behavior is sketched below.
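
One common check is to count self-reflection markers in sampled rollouts at step 0, before any RL update, and compare against later steps. The snippet below is only an illustrative sketch; the phrase list and data format are assumptions, not the project's evaluation code.

```python
# Hypothetical check: how often do rollouts contain self-reflection phrases?
REFLECTION_PHRASES = ("wait", "let me re-check", "let me reconsider", "aha", "on second thought")

def reflection_rate(rollouts: list) -> float:
    """Fraction of rollouts containing at least one reflection phrase."""
    hits = sum(
        any(phrase in text.lower() for phrase in REFLECTION_PHRASES)
        for text in rollouts
    )
    return hits / max(len(rollouts), 1)

# If the rate is already high at step 0 (before any policy update), the
# behavior comes from the base/instruct model rather than from RL.
```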

3. Longer Chain-of-Thought (CoT) ≠ Higher Accuracy

  • While longer CoT responses often earn higher rewards, they do not necessarily translate into higher accuracy.
  • This points to superficial self-reflection in the model's reasoning process; one way to check for it is sketched below.
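
A straightforward check is to bucket validation rollouts by response length and compare accuracy per bucket. The sketch below illustrates that analysis; the data format and bucket size are assumptions.

```python
from collections import defaultdict

def accuracy_by_length(samples: list, bucket_size: int = 256) -> dict:
    """samples: (response_length_in_tokens, is_correct) pairs.

    Returns mean accuracy per length bucket, e.g. bucket 0 covers [0, 256) tokens.
    """
    buckets = defaultdict(list)
    for length, correct in samples:
        buckets[length // bucket_size].append(correct)
    return {b: sum(flags) / len(flags) for b, flags in sorted(buckets.items())}

# Flat or falling accuracy across buckets, despite rising average reward,
# suggests the extra tokens are superficial self-reflection rather than
# genuinely better reasoning.
```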

4. Language Mixing and Nonsense Outputs

  • Instruction-Tuned Models:
    • Rare instances of language mixing during reasoning tasks.
  • Pretrained Models:
    • Prevalent language mixing and nonsensical outputs during reasoning tasks.

5. Stability of Reinforcement Learning Algorithms

  • REINFORCE++ appears to be more stable than GRPO for pure-RL fine-tuning of reasoning capabilities.
  • Further experiments are required to confirm this observation; the sketch below illustrates how the two estimators differ in computing advantages.
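
One plausible factor in the stability gap is how each algorithm normalizes advantages: GRPO normalizes each reward within the group of responses sampled for the same prompt, whereas REINFORCE++ uses a global, batch-level normalization of REINFORCE-style returns. The NumPy sketch below is a simplified illustration of that contrast only; it omits KL-shaped rewards, token-level credit assignment, and clipping, and is not veRL's implementation.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """rewards: shape (num_prompts, group_size), one scalar reward per response.

    GRPO-style: normalize each reward against the other responses sampled
    for the same prompt.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

def reinforce_pp_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """REINFORCE++-style (simplified): whiten returns over the whole batch
    instead of per-prompt groups."""
    flat = rewards.reshape(-1)
    return ((flat - flat.mean()) / (flat.std() + eps)).reshape(rewards.shape)

# When all responses to a prompt receive nearly identical rewards, the
# per-group std in GRPO becomes tiny and small reward differences are
# scaled up, which is one hypothesized source of instability that
# batch-level whitening avoids.
```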

Acknowledgements

This project builds upon and references several open-source frameworks and datasets, including the veRL framework (ByteDance), the Knights and Knaves (K&K) logic puzzle dataset, and the Logic-RL project.
