---
license: apache-2.0
datasets:
- lmms-lab/LLaVA-OneVision-Data
- BAAI/Infinity-MM
language:
- en
- zh
base_model:
- apple/aimv2-huge-patch14-448
- Qwen/Qwen2-1.5B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
---
# FlashVL-2B-Dynamic

[\[📜 FlashVL\]](https://www.arxiv.org/abs/2505.09498)



## Introduction

We introduce **FlashVL**, an approach to optimizing Vision-Language Models (VLMs) for real-time applications that targets ultra-low latency and high throughput without sacrificing accuracy. Flash-VL 2B combines tailored architectural choices, token compression mechanisms, data curation, training schemes, and a novel image processing technique called implicit semantic stitching, which balances computational load against model performance. Evaluated on 11 standard VLM benchmarks, Flash-VL 2B achieves state-of-the-art results in both speed and accuracy, making it a promising option for deployment in resource-constrained environments and large-scale real-time applications.
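### Environment Setup

```bash
pip install torch==2.1.2
# 4.50.0.dev0 is a development version; if pip cannot find it on PyPI,
# it likely needs to be installed from the transformers source repository.
pip install transformers==4.50.0.dev0
```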
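To confirm that the active environment matches the pins above, a small check such as the following can help. This is only a sketch: `PINNED` and `find_mismatches` are illustrative names that are not part of FlashVL, and the comparison ignores local build suffixes such as `+cu121`, which is an assumption about how the installed wheel may be labeled.

```python
from importlib import metadata

# Versions taken from the pip commands above.
PINNED = {"torch": "2.1.2", "transformers": "4.50.0.dev0"}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, installed)} for every pin that is not met.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            # Drop a local build suffix such as "+cu121" before comparing.
            installed = get_version(name).split("+")[0]
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means both pins are satisfied; anything else lists the expected and installed versions side by side.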
### How to Use

```python
from io import BytesIO

import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

model_path = "FlashVL/FlashVL-2B-Dynamic"

# trust_remote_code=True loads the custom model code shipped in the repository.
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="cuda",
)
# Attach the tokenizer and image processor to the model for use by model.chat().
model.tokenizer = AutoTokenizer.from_pretrained(model_path)
model.im_trans = CLIPImageProcessor.from_pretrained(model_path)

# Single-image, single-round conversation
image_url = "https://s3plus.meituan.net/automl-datasets/mlm/0516.png"
response = requests.get(image_url, timeout=30)
response.raise_for_status()  # fail early if the download did not succeed
pil_image = Image.open(BytesIO(response.content)).convert("RGB")

# Prompt: "Generate a recipe for the dish in the image"
messages = [{"role": "user", "content": "生成图中菜品的菜谱"}]
answer = model.chat(pil_image, messages, do_sample=False, max_new_tokens=256)
print(answer)

# Single-image, multi-round conversation: earlier turns are passed back in `messages`.
# Turn 1 asks "What is this?"; the recorded reply describes a dessert resembling
# white fungus and lotus seed soup; turn 2 asks for a calorie analysis of the dish.
previous_answer = (
    "这是一道看起来像是银耳莲子汤的甜品。"
    "银耳是一种常见的食材,通常用于制作甜品和汤品,具有软糯的口感和清润的口感。"
    "莲子是莲子的干燥部分,常用于中医和食疗中,具有补脾止泻的功效。"
    "图片中还可以看到一些枸杞和核桃,枸杞富含维生素和抗氧化物质,核桃则提供丰富的蛋白质和健康脂肪。"
    "整体来看,这道甜品不仅美味,还具有一定的营养价值。"
)
messages = [
    {"role": "user", "content": "这是什么"},
    {"role": "assistant", "content": previous_answer},
    {"role": "user", "content": "对图中菜品卡路里分析"},
]
answer = model.chat(pil_image, messages, do_sample=False, max_new_tokens=512)
print(answer)

# Text-only, single-round conversation: pass None in place of the image
messages = [{"role": "user", "content": "who are you"}]
answer = model.chat(None, messages, do_sample=False, max_new_tokens=256)
print(answer)
```
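The multi-round pattern above amounts to appending each answer to `messages` before asking the next question. A minimal sketch of that loop follows. `run_conversation` is a hypothetical helper, not part of the FlashVL code, and it assumes that `model.chat` returns the reply as a string, as the `print(answer)` calls above suggest.

```python
import copy


def run_conversation(model, image, questions, max_new_tokens=256):
    """Ask `questions` in order, feeding each reply back as conversation history.

    `model` is any object exposing chat(image, messages, **generation_kwargs),
    the call shape used in the examples above. Returns the full message list.
    """
    messages = []
    for question in questions:
        messages.append({"role": "user", "content": question})
        # Pass a copy in case chat() modifies its input (an assumption;
        # the model card does not say whether it does).
        answer = model.chat(
            image,
            copy.deepcopy(messages),
            do_sample=False,
            max_new_tokens=max_new_tokens,
        )
        # Record the reply so the next question is answered in context.
        messages.append({"role": "assistant", "content": answer})
    return messages
```

With the model and `pil_image` loaded as above, `run_conversation(model, pil_image, ["这是什么", "对图中菜品卡路里分析"], max_new_tokens=512)` would run the same two-turn exchange, with the first reply generated by the model rather than hard-coded.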
### Evaluation

| Benchmark | Qwen2-VL-2B | Aquila-VL-2B | InternVL2.5-2B | Flash-VL-2B<sub>s</sub> | Flash-VL-2B<sub>d</sub> | Flash-VL-2B<sub>d-ISS</sub> |
| :-------------: | :-------------: | :-------------: | :-------------: | :-------------: | :-------------: | :-------------: |
| MMMU<sub>val</sub> | 41.9 | 44.4 | 41.8 | 43.6 | 42.9 | 42.9 |
| MMBench<sup>en</sup> | 74.9 | 78.6 | 74.7 | 78.4 | 78.4 | 79.1 |
| MMBench<sup>cn</sup> | 73.5 | 76.3 | 71.6 | 74.7 | 74.9 | 76.7 |
| MMStar | 48.0 | 54.9 | 54.1 | 53.8 | 54.4 | 54.1 |
| MathVista<sub>testmini</sub> | 43.0 | 59.4 | 50.9 | 59.3 | 58.1 | 61.5 |
| AI2D<sub>test</sub> | 74.1 | 75.0 | 75.1 | 74.2 | 74.1 | 74.4 |
| MMVet | 49.5 | 40.9 | 61.7 | 47.3 | 52.7 | 50.7 |
| HallusionBench | 39.2 | 38.5 | 42.7 | 43.5 | 45.5 | 49.0 |
| OCRBench | 794 | 773 | 800 | 764 | 831 | 843 |
| MME | 1872 | 1813 | 2091 | 1715 | 1866 | 1850 |
| SEEDBench | 71.5 | 78.9 | 73.2 | 73.6 | 73.6 | 74.5 |
| Average | 60.2 | 62.6 | 63.6 | 62.4 | 64.0 | 64.8 |

We use [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) to evaluate FlashVL-2B-Dynamic.
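
The Average row mixes percentage scores with the raw OCRBench (out of 1000) and MME totals, so some rescaling is needed before averaging. The sketch below assumes OCRBench is divided by 10 and MME by 28 (that is, expressed as a percentage of a 2800-point maximum). The card does not state this, but the assumption reproduces the reported averages of the three Flash-VL columns. The names `SCORES` and `normalized_average` are illustrative.

```python
# Scores copied from the Flash-VL columns of the table above, in row order:
# MMMU, MMBench-en, MMBench-cn, MMStar, MathVista, AI2D, MMVet,
# HallusionBench, OCRBench, MME, SEEDBench.
SCORES = {
    "Flash-VL-2B-s": [43.6, 78.4, 74.7, 53.8, 59.3, 74.2, 47.3, 43.5, 764, 1715, 73.6],
    "Flash-VL-2B-d": [42.9, 78.4, 74.9, 54.4, 58.1, 74.1, 52.7, 45.5, 831, 1866, 73.6],
    "Flash-VL-2B-d-ISS": [42.9, 79.1, 76.7, 54.1, 61.5, 74.4, 50.7, 49.0, 843, 1850, 74.5],
}

# Assumed divisors that bring raw scores onto a 0-100 scale (not stated in the card).
OCRBENCH_INDEX, MME_INDEX = 8, 9
DIVISORS = {OCRBENCH_INDEX: 10.0, MME_INDEX: 28.0}


def normalized_average(scores):
    """Average the scores after rescaling OCRBench and MME to a 0-100 range."""
    rescaled = [value / DIVISORS.get(i, 1.0) for i, value in enumerate(scores)]
    return sum(rescaled) / len(rescaled)


for name, scores in SCORES.items():
    print(f"{name}: {normalized_average(scores):.1f}")
```

Under these assumptions the script prints 62.4, 64.0, and 64.8, matching the Average row for the static, dynamic, and dynamic-ISS variants.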

## Citation

If you find this project useful in your research, please consider citing:

```BibTeX
@misc{zhang2025flashvl2boptimizingvisionlanguage,
  title={Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput},
  author={Bo Zhang and Shuo Li and Runhe Tian and Yang Yang and Jixin Tang and Jinhao Zhou and Lin Ma},
  year={2025},
  eprint={2505.09498},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.09498},
}
```