Environment
- Follow the instructions in /~https://github.com/FanqingM/R1-Multimodal-Journey
- Update transformers to 4.49.0.dev0 to support Qwen2.5-VL
- `pip install vllm==0.7.2 trl==0.15.0.dev0` to support vLLM
What we did
- Fixed some bugs in Open-R1-Multimodal (such as the `get_per_token_logps` function; a corrected sketch follows this list)
- Added vLLM support to speed up training (5-6x faster than Open-R1-Multimodal)
- Explored the effectiveness of R1-like reinforcement learning (RL) on more challenging problems (geometry problems)
- For now, we use 20k samples from Geo170K
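For reference, here is a minimal sketch of what a correct `get_per_token_logps` helper does (the exact fix in the repo may differ; the key point is that the logit at position t scores token t+1, so logits and labels must be shifted before gathering):

```python
import torch
import torch.nn.functional as F

def get_per_token_logps(model, input_ids, attention_mask):
    """Log-probability the model assigns to each token of the sequence."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    logits = logits[:, :-1, :]   # the logit at position t predicts token t+1
    labels = input_ids[:, 1:]    # so the first token has no prediction
    log_probs = F.log_softmax(logits, dim=-1)
    # Gather the log-prob assigned to each actual next token.
    return torch.gather(log_probs, dim=-1, index=labels.unsqueeze(-1)).squeeze(-1)
```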
Some Findings
- The "aha moment" can emerge even at the very beginning of training, even for a model as small as 3B and in the instruct model setting.
Here is a sample when we just training within 10 steps
Content: <think> In the second quadrant, the x-coordinate is negative and the y-coordinate is positive. The cosine of an angle in a right triangle is equal to the adjacent side over the hypotenuse. Since the cosine of $\alpha$ is $-\frac{3}{5}$, we can determine that the adjacent side (which is the x-coordinate) is $-3$ (since the hypotenuse is 1 for the unit circle). The hypotenuse is always 1, so the radius of the circle is 1. Using the Pythagorean theorem, we can find the y-coordinate (the opposite side), which is $y = \sqrt{1^2 - (-3)^2} = \sqrt{1 - 9} = \sqrt{-8}$.
**This gives us a complex number, indicating that there was a mistake in the initial assumption that we could solve for the y-coordinate directly from the cosine value without considering the unit circle and the quadrant.**
Since $\cos \alpha = -\frac{3}{5}$ in the second quadrant, and knowing that $\cos^2 \alpha + \sin^2 \alpha = 1$, we can find $\sin \alpha$. $\sin \alpha = \sqrt{1 - \cos^2 \alpha} = \sqrt{1 - (-\frac{3}{5})^2} = \sqrt{1 - \frac{9}{25}} = \sqrt{\frac{16}{25}} = \frac{4}{5}$. Therefore, the coordinates of point A are $(-\frac{3}{5}, \frac{4}{5})$. </think><answer> The coordinates of point $A$ are $(-\frac{3}{5}, \frac{4}{5})$. </answer>
- VLMs seem to struggle to exhibit length-increase patterns similar to LLMs. In our experiments, we tested various datasets but failed to replicate the length-increase curve observed in [simpleRL-reason](/~https://github.com/hkust-nlp/simpleRL-reason). However, performance improvements were indeed observed in evaluation.
- Although R1-style RL only provides rewards based on answer correctness, it still demonstrates higher data efficiency than SFT. We compared the results on the MathVista testmini set. (A minimal reward sketch follows this list.)
- We believe the reason VLMs find it difficult to achieve an R1 moment similar to LLMs is the lack of high-quality data. Multimodal reasoning data is currently significantly scarcer than language data, which makes it difficult for the model to show length growth on simple datasets such as Geo170K and makes it easy to overfit.
- We found that the slow speed of Open-R1-Multimodal was due to slow generation, which left the other processes waiting. To address this, we replaced the generation backend with vLLM, significantly reducing training time (on 8x H100 GPUs with the 20k Geo170K samples). See the generation sketch below.
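As a reference for the correctness-based reward mentioned above, here is a minimal sketch (the tag format follows the `<answer>` convention in the sample output; the repo's actual reward may additionally check formatting or answer equivalence):

```python
import re

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the <answer>...</answer> span matches the reference."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0  # no parsable answer, no reward
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0
```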
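And a minimal sketch of the vLLM-based rollout generation that replaced the slow generate loop (the model name, prompt, and sampling settings are assumptions; in training, images are additionally passed through vLLM's multimodal inputs and the integration runs inside the trainer):

```python
from vllm import LLM, SamplingParams

# One batched vLLM call replaces a per-sample model.generate() loop.
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", gpu_memory_utilization=0.8)
sampling = SamplingParams(temperature=1.0, max_tokens=1024, n=8)  # 8 rollouts per prompt (the GRPO group)

prompts = ["In triangle ABC, angle B is 90 degrees and sin A = 3/5. Find cos C."]
for request_output in llm.generate(prompts, sampling):
    for rollout in request_output.outputs:
        print(rollout.text)
```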
Data Preparation
- Modify the necessary input/output paths in `local_scripts/gen_dataset.py`
- Prepare the dataset with `python local_scripts/gen_dataset.py` (unlike Open-R1-Multimodal, which stores images directly in PIL format, we store the image path (URL or local path) to enable vLLM support; see the sketch below)
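A hypothetical sketch of the resulting record layout (field names and paths are illustrative, not the script's actual schema):

```python
from datasets import Dataset

records = [
    {
        # Path or URL instead of a decoded PIL image, so vLLM can load it itself.
        "image": "data/geo170k/images/0001.png",
        "problem": "In triangle ABC, angle B is 90 degrees and sin A = 3/5. Find cos C.",
        "solution": "3/5",  # ground truth consumed by the correctness reward
    }
]
Dataset.from_list(records).save_to_disk("data/geo170k_grpo")  # output path is an assumption
```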
Training
- Modify the necessary input/output paths in `local_scripts/train_qwen2_5_3b.sh`
- Run `sh local_scripts/train_qwen2_5_3b.sh`
- By default, vLLM uses `cuda:7` for generation while the remaining 7 GPUs are used for training (see the config sketch below)
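A sketch of the relevant trainer settings (argument names follow trl 0.15's `GRPOConfig`; the script's actual values may differ):

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="outputs/qwen2_5_vl_3b_grpo",  # hypothetical path
    use_vllm=True,                 # generate rollouts with vLLM
    vllm_device="cuda:7",          # pin generation to the last GPU
    vllm_gpu_memory_utilization=0.8,
    num_generations=8,             # rollouts per prompt (GRPO group size)
    per_device_train_batch_size=1,
)
```

The training processes then run on the remaining GPUs (e.g., launched via `torchrun --nproc_per_node=7`).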
Evaluation
- Run `python eval/evaluate_mathvista.py --checkpoint ${CHECKPOINT} --datasets MathVista_testmini` to evaluate your checkpoint on MathVista testmini
Acknowledgement
Our codebase builds upon Open-R1-Multimodal and integrates open-source contributions from vLLM, Open-R1, and trl. We also extend our gratitude to DeepSeek-R1 and Qwen2.5-VL for their open-source techniques and base models, which have enabled us to further our exploration.
Core Contributors (equal contribution, alphabetical order)
Lingxiao Du, Xiangyan Liu, Fanqing Meng.
Project Leader
Wenqi Shao, Qiaosheng Zhang
Hiring Announcement
We are hiring interns at the Shanghai AI Lab, with a focus on multimodal reasoning models. If you are interested, please contact weqish@gmail.com.