
Video-R1: Towards Super Reasoning Ability in Video Understanding

This work aims to integrate deep thinking capabilities into video understanding tasks through the R1 paradigm.

For the first time, we achieve a simultaneous increase in both accuracy and thinking length in the video understanding domain.

This is a preliminary repo, and we will continue to develop our Video-R1 model in the future.

Updates

  • [2025/02/23] We release the training code and data of Video-R1.

Findings

Shared Growth of Accuracy and Thinking Length Is Possible in Video

In many previous multimodal R1 repositories, the thinking length either showed little to no increase (e.g., Open R1 Video) or even decreased (e.g., R1-V).

In this work, we demonstrate that this issue can be addressed by using an appropriate base model and a strong reasoning dataset. We train Qwen2-VL-7B-Instruct using GRPO with accuracy and format rewards on the DVD-counting dataset. Training the 7B model for 900 steps can be completed in approximately 10 hours using 4 x A100 (80G) GPUs. The training curve is as follows:

[Figure: 7B training curve]
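For illustration, here is a minimal sketch of the two rewards used during GRPO training; the function names and exact matching rules are our assumptions, not the repository's actual implementation:

import re

def format_reward(completion: str) -> float:
    # Reward 1.0 if the output follows the <think>...</think><answer>...</answer> template.
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    # Reward 1.0 if the content of <answer>...</answer> matches the ground-truth count.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

In GRPO, such per-completion rewards are normalized within each group of sampled completions to form the advantage signal.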

Weak Base Model Hinders the Emergence of Deep Thinking in Video

We train Qwen2-VL-2B-Instruct with the same settings on the DVD-counting dataset. In contrast, this model shows a decrease in thinking length.

In some cases, the model even skips the thinking process entirely and outputs responses like this: <think>\n</think>\n<answer>2</answer>.

[Figure: 2B training curve]

Weak Reasoning Data May Not Be Beneficial for Reinforcing Deep Thinking

We train Qwen2-VL-7B-Instruct on a subset of the NExT-QA dataset, which requires little reasoning. There is almost no increase in thinking length, which indicates that reinforcing deep thinking may require strong reasoning data.

[Figure: 7B training curve on NExT-QA]

Datasets

The video files are provided in a zip archive, and the train/test splits are provided as JSONL files.

🤗 Video-R1 Dataset: DVD-counting

This dataset is extracted from "DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue".
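Each JSONL line is one sample, so a split can be inspected with a few lines of Python; the file name and record keys below are illustrative, so check the downloaded files for the exact names:

import json

# Read a split line by line; each line is a standalone JSON record.
with open("src/r1-v/data/DVD-counting/train.jsonl") as f:
    samples = [json.loads(line) for line in f]

print(len(samples))
print(samples[0].keys())  # e.g. video path, question, ground-truth count (exact keys may differ)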

Performance

We observe that RL training yields an accuracy gain of around 10 points on DVD-counting-test (accuracy, %):

| Dataset | Qwen2-VL-7B-Instruct | Video-R1-7B |
| --- | --- | --- |
| DVD-counting-test | 25.0 | 34.5 |

Reasoning Samples:

[Figures: example reasoning outputs]

Setup

git clone /~https://github.com/tulerfeng/Video-R1
cd Video-R1

# build environment
conda create -n video-r1 python=3.11 
conda activate video-r1
bash setup.sh

# qwen video extraction setting
cd src/qwen-vl-utils
pip install -e .
cd ../..  # return to the repository root

# download dataset
git lfs install
git clone https://huggingface.co/datasets/Video-R1/DVD-counting

Please place the downloaded dataset in src/r1-v/data/
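The qwen-vl-utils package handles frame extraction from video files for Qwen2-VL. A minimal usage sketch, where the video path and question are placeholders:

from qwen_vl_utils import process_vision_info

# Build a Qwen2-VL-style message containing a video input.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/video.mp4", "fps": 1.0},
        {"type": "text", "text": "How many red objects appear in the video?"},
    ],
}]

# Extract frames; the returned inputs are then fed to the Qwen2-VL processor.
image_inputs, video_inputs = process_vision_info(messages)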

Training

Train Qwen2-VL-7B-Instruct with GRPO

bash src/scripts/run_grpo_video.sh

Evaluation

Evaluation on the video counting task

python ./src/eval/test_qwen2vl_video_counting.py
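Conceptually, evaluation extracts the predicted count from each model output and compares it with the label. A simplified sketch of that logic (not the actual script's code):

import re

def extract_count(output: str):
    # Pull the numeric answer out of the <answer>...</answer> tag.
    match = re.search(r"<answer>\s*(\d+)\s*</answer>", output)
    return int(match.group(1)) if match else None

def counting_accuracy(outputs, labels):
    # Fraction of predictions that exactly match the ground-truth count.
    correct = sum(extract_count(o) == int(l) for o, l in zip(outputs, labels))
    return correct / len(labels)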

Acknowledgements

We sincerely appreciate the contributions of the open-source community; this work builds on related projects such as R1-V and Open R1 Video.
