Instructions for using FlashVL/FlashVL-2B-Dynamic with libraries and local apps.

Libraries

How to use FlashVL/FlashVL-2B-Dynamic with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="FlashVL/FlashVL-2B-Dynamic", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("FlashVL/FlashVL-2B-Dynamic", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use FlashVL/FlashVL-2B-Dynamic with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "FlashVL/FlashVL-2B-Dynamic"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FlashVL/FlashVL-2B-Dynamic",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
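
The same request can be sent from Python. The sketch below uses only the standard library and assumes the server started above is reachable at `http://localhost:8000`. The helper names `build_chat_payload` and `post_chat` are illustrative and are not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "FlashVL/FlashVL-2B-Dynamic"


def build_chat_payload(prompt, image_url, model=MODEL_ID):
    """Build the same JSON body that the curl example sends."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def post_chat(payload, base_url="http://localhost:8000"):
    """POST the payload to the chat completions endpoint and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)
```

`post_chat(build_chat_payload(prompt, image_url))` needs a running server. `build_chat_payload` on its own can be used to inspect the request body offline.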

Use Docker

docker model run hf.co/FlashVL/FlashVL-2B-Dynamic

SGLang

How to use FlashVL/FlashVL-2B-Dynamic with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "FlashVL/FlashVL-2B-Dynamic" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FlashVL/FlashVL-2B-Dynamic",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
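
The server's reply is a JSON object. Assuming it follows the usual OpenAI chat-completion layout (an assumption about the response shape, since the source shows only the request), the generated text can be pulled out as sketched below. `extract_reply` is a hypothetical helper name, and the sample dict is hand-written, not real model output.

```python
def extract_reply(completion):
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = completion.get("choices") or []
    if not choices:
        raise ValueError("completion contains no choices")
    return choices[0]["message"]["content"]


# Hand-written sample in the assumed response shape (placeholder text, not model output).
sample = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "<model reply>"}}
    ]
}
print(extract_reply(sample))
```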

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "FlashVL/FlashVL-2B-Dynamic" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "FlashVL/FlashVL-2B-Dynamic",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
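
The requests above point at a public image URL. For a local file, OpenAI-style servers commonly accept a base64 `data:` URL in the same `image_url.url` field. Whether a particular vLLM or SGLang version does so is an assumption to verify against its documentation. The helper `to_data_url` below is illustrative only.

```python
import base64
import mimetypes


def to_data_url(image_bytes, filename="image.jpg"):
    """Encode raw image bytes as a data URL, guessing the MIME type from the file name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {filename!r}")
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

To use it, read the file in binary mode (`open(path, "rb").read()`), pass the bytes and the file name to `to_data_url`, and place the result where the examples put the image URL.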

Docker Model Runner
How to use FlashVL/FlashVL-2B-Dynamic with Docker Model Runner:
```
docker model run hf.co/FlashVL/FlashVL-2B-Dynamic
```

	---
	license: apache-2.0
	datasets:
	- lmms-lab/LLaVA-OneVision-Data
	- BAAI/Infinity-MM
	language:
	- en
	- zh
	base_model:
	- apple/aimv2-huge-patch14-448
	- Qwen/Qwen2-1.5B-Instruct
	pipeline_tag: image-text-to-text
	library_name: transformers
	---

	# FlashVL-2B-Dynamic
	[\[📜 FlashVL\]](https://www.arxiv.org/abs/2505.09498)

	![image/png](https://s3plus.meituan.net/automl-datasets/mlm/logo.jpg)

	## Introduction

	We are excited to introduce FlashVL, a novel approach to optimizing Vision-Language Models (VLMs) for real-time applications, targeting ultra-low latency and high throughput without sacrificing accuracy. Leveraging advanced architectural enhancements and efficient computational strategies, Flash-VL 2B is designed to maximize throughput by reducing processing time while maintaining competitive performance across multiple vision-language benchmarks. Our approach includes tailored architectural choices, token compression mechanisms, data curation, training schemes, and a novel image processing technique called implicit semantic stitching that effectively balances computational load and model performance. Through extensive evaluations on 11 standard VLM benchmarks, we demonstrate that Flash-VL 2B achieves state-of-the-art results in both speed and accuracy, making it a promising solution for deployment in resource-constrained environments and large-scale real-time applications.


	### Environment Setup

	```bash
	pip install torch==2.1.2
	pip install transformers==4.50.0.dev0
	```
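
Before loading the model, it can help to confirm that the environment matches the pins above. The following is a minimal sketch using `importlib.metadata`. The names `PINNED`, `installed_version`, and `find_mismatches` are illustrative, and ignoring local build tags such as `+cu121` is a choice made here, not something the source specifies.

```python
from importlib import metadata

# Versions pinned in the setup commands above.
PINNED = {"torch": "2.1.2", "transformers": "4.50.0.dev0"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned=PINNED, lookup=installed_version):
    """Map each package whose installed version differs from its pin to (wanted, found)."""
    mismatches = {}
    for package, wanted in pinned.items():
        found = lookup(package)
        # Local build tags such as "+cu121" are ignored when comparing.
        if found is None or found.split("+")[0] != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in find_mismatches().items():
        print(f"{package}: expected {wanted}, found {found}")
```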


	### How to use it?

	```python
	import torch
	from PIL import Image
	import requests
	from io import BytesIO
	from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

	model_path = "FlashVL/FlashVL-2B-Dynamic"
	model = AutoModel.from_pretrained(
	    model_path,
	    torch_dtype=torch.bfloat16,
	    trust_remote_code=True,
	    device_map="cuda",
	)
	# Attach the tokenizer and image processor to the model before calling model.chat.
	model.tokenizer = AutoTokenizer.from_pretrained(model_path)
	model.im_trans = CLIPImageProcessor.from_pretrained(model_path)

	# Single-image, single-round conversation
	image_url = "https://s3plus.meituan.net/automl-datasets/mlm/0516.png"
	response = requests.get(image_url, timeout=30)
	response.raise_for_status()  # fail early if the download did not succeed
	pil_image = Image.open(BytesIO(response.content)).convert("RGB")
	# Prompt: "Generate a recipe for the dish in the image"
	messages = [{"role": "user", "content": "生成图中菜品的菜谱"}]
	answer = model.chat(pil_image, messages, do_sample=False, max_new_tokens=256)
	print(answer)

	# Single-image, multi-round conversation: pass the full history, ending with the new user turn
	messages = [
	    {"role": "user", "content": "这是什么"},  # "What is this?"
	    {
	        "role": "assistant",
	        # Earlier answer: describes the dish as a white fungus and lotus seed dessert soup
	        "content": (
	            "这是一道看起来像是银耳莲子汤的甜品。"
	            "银耳是一种常见的食材，通常用于制作甜品和汤品，具有软糯的口感和清润的口感。"
	            "莲子是莲子的干燥部分，常用于中医和食疗中，具有补脾止泻的功效。"
	            "图片中还可以看到一些枸杞和核桃，枸杞富含维生素和抗氧化物质，核桃则提供丰富的蛋白质和健康脂肪。"
	            "整体来看，这道甜品不仅美味，还具有一定的营养价值。"
	        ),
	    },
	    {"role": "user", "content": "对图中菜品卡路里分析"},  # "Analyze the calories of the dish in the image"
	]
	answer = model.chat(pil_image, messages, do_sample=False, max_new_tokens=512)
	print(answer)

	# Text-only, single-round conversation: pass None in place of the image
	messages = [{"role": "user", "content": "who are you"}]
	answer = model.chat(None, messages, do_sample=False, max_new_tokens=256)
	print(answer)

	```
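
The multi-round example above hard-codes the earlier assistant answer. In an interactive session the history is usually built up turn by turn. The sketch below shows one way to do that, assuming `model.chat(image, messages, **kwargs)` returns the answer as a string, as the `print(answer)` calls above suggest. `chat_round` and `fake_chat` are hypothetical names; `fake_chat` stands in for `model.chat` so the sketch runs without a GPU.

```python
def chat_round(chat_fn, image, history, prompt, **generation_kwargs):
    """Append a user turn, call chat_fn on the full history, and record the reply."""
    history.append({"role": "user", "content": prompt})
    answer = chat_fn(image, history, **generation_kwargs)
    history.append({"role": "assistant", "content": answer})
    return answer


# Stand-in for model.chat so the sketch runs without the model weights.
def fake_chat(image, messages, **kwargs):
    return f"reply to turn {len(messages)}"


history = []
chat_round(fake_chat, None, history, "What is this?", max_new_tokens=256)
chat_round(fake_chat, None, history, "Estimate its calories.", max_new_tokens=512)
print([turn["role"] for turn in history])
```

With the real model, `chat_fn` would be `model.chat` and `image` the PIL image loaded earlier.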

	### Evaluation

	| Benchmark | Qwen2-VL-2B | Aquila-VL-2B | InternVL2.5-2B | Flash-VL-2B<sub>s</sub> | Flash-VL-2B<sub>d</sub> | Flash-VL-2B<sub>d-ISS</sub> |
	| :-------------: | :-------------: | :-------------: | :-------------: | :-------------: | :-------------: | :-------------: |
	| MMMU<sub>val</sub> | 41.9 | 44.4 | 41.8 | 43.6 | 42.9 | 42.9 |
	| MMBench<sup>en</sup> | 74.9 | 78.6 | 74.7 | 78.4 | 78.4 | 79.1 |
	| MMBench<sup>cn</sup> | 73.5 | 76.3 | 71.6 | 74.7 | 74.9 | 76.7 |
	| MMStar | 48.0 | 54.9 | 54.1 | 53.8 | 54.4 | 54.1 |
	| MathVista<sub>testmini</sub> | 43.0 | 59.4 | 50.9 | 59.3 | 58.1 | 61.5 |
	| AI2D<sub>test</sub> | 74.1 | 75.0 | 75.1 | 74.2 | 74.1 | 74.4 |
	| MMVet | 49.5 | 40.9 | 61.7 | 47.3 | 52.7 | 50.7 |
	| HallusionBench | 39.2 | 38.5 | 42.7 | 43.5 | 45.5 | 49.0 |
	| OCRBench | 794 | 773 | 800 | 764 | 831 | 843 |
	| MME | 1872 | 1813 | 2091 | 1715 | 1866 | 1850 |
	| SEEDBench | 71.5 | 78.9 | 73.2 | 73.6 | 73.6 | 74.5 |
	| Average | 60.2 | 62.6 | 63.6 | 62.4 | 64.0 | 64.8 |


	We use [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) to evaluate FlashVL-2B-Static.
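
OCRBench and MME are reported on larger scales than the other benchmarks, so the Average row cannot be a plain mean of the raw numbers. The sketch below checks one plausible normalization: divide OCRBench by 10 and MME by 28 so that every score lies in a 0-100 range. These scale factors are inferred from the numbers in the table and are not stated in the source.

```python
# Scores copied from the table above for the two dynamic Flash-VL columns.
SCORES = {
    "Flash-VL-2B d": {
        "MMMU": 42.9, "MMBench-en": 78.4, "MMBench-cn": 74.9, "MMStar": 54.4,
        "MathVista": 58.1, "AI2D": 74.1, "MMVet": 52.7, "HallusionBench": 45.5,
        "OCRBench": 831, "MME": 1866, "SEEDBench": 73.6,
    },
    "Flash-VL-2B d-ISS": {
        "MMMU": 42.9, "MMBench-en": 79.1, "MMBench-cn": 76.7, "MMStar": 54.1,
        "MathVista": 61.5, "AI2D": 74.4, "MMVet": 50.7, "HallusionBench": 49.0,
        "OCRBench": 843, "MME": 1850, "SEEDBench": 74.5,
    },
}

# Assumed rescaling to a 0-100 range (inferred, not stated in the source).
SCALE = {"OCRBench": 1 / 10, "MME": 1 / 28}


def normalized_average(scores, scale=SCALE):
    """Mean of the scores after rescaling the benchmarks listed in `scale`."""
    values = [score * scale.get(name, 1.0) for name, score in scores.items()]
    return sum(values) / len(values)


for column, scores in SCORES.items():
    print(column, round(normalized_average(scores), 1))
```

Under this assumption the two columns come out at 64.0 and 64.8, which agrees with the Average row for those columns.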



	## Citation
	If you find this project useful in your research, please consider citing:

	```BibTeX
	@misc{zhang2025flashvl2boptimizingvisionlanguage,
	title={Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput},
	author={Bo Zhang and Shuo Li and Runhe Tian and Yang Yang and Jixin Tang and Jinhao Zhou and Lin Ma},
	year={2025},
	eprint={2505.09498},
	archivePrefix={arXiv},
	primaryClass={cs.CV},
	url={https://arxiv.org/abs/2505.09498},
	}
	```