| | --- |
| | license: apache-2.0 |
| | library_name: transformers |
| | pipeline_tag: image-text-to-text |
| | tags: |
| | - multimodal |
| | - reasoning |
| | - math |
| | - r1 |
| | --- |
| | |
| | # Vision-R1-32B |
| |
|
| | Vision-R1-32B is a multimodal reasoning model introduced in the paper [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749). It is based on the Qwen2.5-VL-32B architecture and is specifically optimized to enhance reasoning capabilities (such as self-reflection and questioning) in multimodal tasks. |
| |
|
| | - **Paper:** [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749) |
| | - **Repository:** [https://github.com/Osilly/Vision-R1](https://github.com/Osilly/Vision-R1) |
| |
|
| | ## Model Description |
| |
|
| | Vision-R1 addresses the difficulty of activating complex reasoning in MLLMs without human-annotated reasoning data. The model was developed using a two-stage pipeline: |
| | 1. **Cold-start Initialization**: Fine-tuning on a 200K multimodal Chain-of-Thought (CoT) dataset (Vision-R1-cold). |
| | 2. **Reinforcement Learning (RL)**: Utilizing Group Relative Policy Optimization (GRPO) with a Progressive Thinking Suppression Training (PTST) strategy. This strategy gradually increases the reasoning length (4K -> 8K -> 16K) to refine the model's ability to learn complex reasoning processes. |
| |
|
| | ## Performance |
| |
|
| | Vision-R1-32B demonstrates strong performance across various multimodal math reasoning benchmarks, significantly outperforming its base model: |
| |
|
| | | Model | MathVista | MathVerse | MathVerse (mini) | MM-Math | DynaMath (Avg) | AVG. | |
| | | -------------------------- | ----------- | ------------ | ---------------- | ------------ | -------------- | ------------ | |
| | | Qwen2.5-VL-32B | 72.9 | 52.3 | 47.6 | 34.9 | 55.5 | 52.6 | |
| | | **Vision-R1-32B (Ours)** | **76.4** | **62.1** | **59.0** | **55.3** | **65.6** | **63.7** | |
| |
|
| | ## Quickstart |
| |
|
| | ### Inference via Transformers |
| |
|
| | You can use the inference script provided in the [official repository](https://github.com/Osilly/Vision-R1). |
| |
|
| | ```bash |
| | # Inference script for Vision-R1-32B model |
| | MODEL_PATH="Osilly/Vision-R1-32B" |
| | IMAGE_PATH="path/to/your/image.png" |
| | PROMPT="Your math problem or question here." |
| | |
| | python3 inference.py \ |
| | --model_path ${MODEL_PATH} \ |
| | --enable_flash_attn True \ |
| | --image_path ${IMAGE_PATH} \ |
| | --prompt "${PROMPT}" \ |
| | --max_tokens 4096 \ |
| | --temperature 0.6 \ |
| | --top_p 0.95 |
| | ``` |
| |
|
| | The model is also compatible with **vLLM** (version > 0.7.2) for faster deployment and local inference. |
| |
|
| | ## Citation |
| |
|
| | If you find Vision-R1 useful, please cite the following paper: |
| |
|
| | ```bibtex |
| | @article{huang2025visionr1, |
| | title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models}, |
| | author={Huang, Wenxuan and Jia, Bohan and Zhai, Zijie and Cao, Shaosheng and Ye, Zheyu and Zhao, Fei and Hu, Yao and Lin, Shaohui}, |
| | journal={arXiv preprint arXiv:2503.06749}, |
| | year={2025} |
| | } |
| | ``` |