Vision-R1-32B / README.md

nielsr HF Staff

Add model card for Vision-R1-32B

930e07b verified 9 days ago

3.05 kB

license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
  - multimodal
  - reasoning
  - math
  - r1

Vision-R1-32B

Vision-R1-32B is a multimodal reasoning model introduced in the paper Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models. It is based on the Qwen2.5-VL-32B architecture and is specifically optimized to enhance reasoning capabilities (such as self-reflection and questioning) in multimodal tasks.

Paper: Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Repository: https://github.com/Osilly/Vision-R1

Model Description

Vision-R1 addresses the difficulty of activating complex reasoning in MLLMs without human-annotated reasoning data. The model was developed using a two-stage pipeline:

Cold-start Initialization: Fine-tuning on a 200K multimodal Chain-of-Thought (CoT) dataset (Vision-R1-cold).
Reinforcement Learning (RL): Utilizing Group Relative Policy Optimization (GRPO) with a Progressive Thinking Suppression Training (PTST) strategy. This strategy gradually increases the reasoning length (4K -> 8K -> 16K) to refine the model's ability to learn complex reasoning processes.

Performance

Vision-R1-32B demonstrates strong performance across various multimodal math reasoning benchmarks, significantly outperforming its base model:

Model	MathVista	MathVerse	MathVerse (mini)	MM-Math	DynaMath (Avg)	AVG.
Qwen2.5-VL-32B	72.9	52.3	47.6	34.9	55.5	52.6
Vision-R1-32B (Ours)	76.4	62.1	59.0	55.3	65.6	63.7

Quickstart

Inference via Transformers

You can use the inference script provided in the official repository.

# Inference script for Vision-R1-32B model
MODEL_PATH="Osilly/Vision-R1-32B"
IMAGE_PATH="path/to/your/image.png"
PROMPT="Your math problem or question here."

python3 inference.py \
    --model_path ${MODEL_PATH}  \
    --enable_flash_attn True \
    --image_path ${IMAGE_PATH} \
    --prompt "${PROMPT}" \
    --max_tokens 4096 \
    --temperature 0.6 \
    --top_p 0.95

The model is also compatible with vLLM (version > 0.7.2) for faster deployment and local inference.

Citation

If you find Vision-R1 useful, please cite the following paper:

@article{huang2025visionr1,
  title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models},
  author={Huang, Wenxuan and Jia, Bohan and Zhai, Zijie and Cao, Shaosheng and Ye, Zheyu and Zhao, Fei and Hu, Yao and Lin, Shaohui},
  journal={arXiv preprint arXiv:2503.06749},
  year={2025}
}