EVA: Efficient Reinforcement Learning for End-to-End Video Agent
This repository contains the official evaluation code for the model proposed in our paper. The code is available on GitHub and the model weights are available on Hugging Face.
1. Paper and Model
- Paper Title: EVA: Efficient Reinforcement Learning for End-to-End Video Agent
- Paper Link: https://arxiv.org/abs/2603.22918
- GitHub Repository: https://github.com/wangruohui/EfficientVideoAgent
- Model Link: https://huggingface.co/WRHC/EfficientVideoAgent/
2. Reference Results
Reference result files are provided in this repository under `results-12k`.
You can compute accuracy with `accuracy.py`:
python accuracy.py <result_jsonl_path>
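For reference, the accuracy computation over such a result file can be sketched in a few lines (a minimal sketch only; the `pred` and `answer` field names are assumptions and may differ from the actual schema consumed by `accuracy.py`):

```python
import json

def compute_accuracy(jsonl_path):
    """Fraction of samples whose predicted option matches the ground truth."""
    correct = total = 0
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            sample = json.loads(line)
            total += 1
            # Field names "pred" and "answer" are assumptions, not the real schema.
            if sample.get("pred") == sample.get("answer"):
                correct += 1
    return correct / total if total else 0.0
```

Each non-empty line is parsed as one JSON sample; the function returns the fraction of exact matches.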
Main results:
| Dataset | Acc (%) | Avg. Rounds | Avg. Tokens |
|---|---|---|---|
| VideoMME | 60.15 | 2.42 | 16911 |
| LongVideoBench | 54.97 | 2.57 | 19042 |
| MLVU | 68.26 | 2.42 | 16570 |
| LSDBench | 49.31 | 2.48 | 13914 |
| VideoHolmes | 37.18 | 2.75 | 9085 |
| LVBench | 43.32 | 2.62 | 20412 |
The token counts include both text tokens and image tokens.
3. Run Your Own Evaluation
Step 1. Clone the Repository
git clone https://github.com/wangruohui/EfficientVideoAgent.git
cd EfficientVideoAgent
Step 2. Download Model and Install Dependencies
Download the model weights from https://huggingface.co/WRHC/EfficientVideoAgent/ to `hf_model/`:

huggingface-cli download WRHC/EfficientVideoAgent --local-dir hf_model

Install FFmpeg following https://www.ffmpeg.org/download.html, ensure `ffprobe` is in `PATH`, and ensure the FFmpeg shared libraries are in `LD_LIBRARY_PATH`.

Install dependencies from `requirements.txt` (recommended: `uv`):

uv venv .venv
source .venv/bin/activate
uv pip install -r requirements.txt
Step 3. Download Evaluation Datasets and Update Dataset Paths
`eval-eva.py` reads dataset metadata from `DATASET_CONFIG`. Before running evaluation, make sure each dataset is available locally and the paths are correct.
- Download and extract the video datasets (VideoMME / LSDBench / LVBench / VideoHolmes / LongVideoBench / MLVU).
- Annotation JSONL files are already provided in `data/*.jsonl` and have been normalized to a unified format.
- Edit `DATASET_CONFIG` in `eval-eva.py`: only `video_root` needs to be changed to your local video directory.
Example:
DATASET_CONFIG = {
"videomme": {
"jsonl": "data/videomme_test_wosubtitles_raw_list_full.jsonl",
"video_root": "/path/to/VideoMME/video",
"cache": "cache_videomme.jsonl",
"result": "result_videomme.jsonl",
},
}
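Since only `video_root` typically needs editing, it can be worth sanity-checking the edited paths before a long run. A small check along these lines (a hypothetical helper, mirroring the `DATASET_CONFIG` structure shown above) catches missing files early:

```python
import os

def check_dataset_config(config):
    """Return a list of path problems found in a DATASET_CONFIG-style dict."""
    problems = []
    for name, cfg in config.items():
        if not os.path.isfile(cfg["jsonl"]):
            problems.append(f"{name}: annotation file not found: {cfg['jsonl']}")
        if not os.path.isdir(cfg["video_root"]):
            problems.append(f"{name}: video_root is not a directory: {cfg['video_root']}")
    return problems
```

An empty return value means every configured dataset has its annotation file and video directory in place.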
Step 4. Serve the Model with vLLM (Multi-GPU Data Parallel)
vllm serve <MODEL_PATH_OR_HF_ID> \
--data-parallel-size <NUM_GPUS> \
--limit-mm-per-prompt '{"image": 9999, "video":0}' \
--mm_processor_cache_gb 20 \
--attention-backend FLASH_ATTN \
--allowed-local-media-path <LOCAL_MEDIA_ROOT>
Reproducibility Note
With vLLM, even when `temperature=0`, final accuracy can still fluctuate by a few tenths of a percentage point across runs.
Step 5. Configure eval-eva.py Runtime Settings and Run Evaluation
Before running, edit the config section at the top of eval-eva.py:
- `BASE_URL`: OpenAI-compatible endpoint for your vLLM server (for example, `http://localhost:8000/v1`).
- `API_KEY`: API key used by the client (can be a dummy value for local vLLM setups if authentication is disabled).
- `MODEL_TOKENIZER_PATH`: tokenizer path; it should point to the HF model weights downloaded in Step 2 from https://huggingface.co/WRHC/EfficientVideoAgent/.
- `FRAME_TOOL_PATH`: path to the frame selection tool script (default is `select_frame_fallback.py`).
- `FRAME_SAVE_ROOT`: directory where extracted frames are saved during tool calls.
- `DATASET_CONFIG`: per-dataset I/O configuration.
  - `DATASET_CONFIG[*].video_root`: root directory containing the raw video files.
  - `DATASET_CONFIG[*].cache`: incremental cache file used during the run.
  - `DATASET_CONFIG[*].result`: final merged output file written at the end.

Also make sure:

- The `FRAME_SAVE_ROOT` directory exists and is writable (or set it to a writable path).
- The vLLM `--allowed-local-media-path` covers your dataset `video_root` directories.
Run one dataset:
python eval-eva.py --dataset videomme
python eval-eva.py --dataset lsdbench
python eval-eva.py --dataset lvbench
python eval-eva.py --dataset videoholmes
python eval-eva.py --dataset longvideobench
python eval-eva.py --dataset mlvu
You can control the per-tool-call visual token budget via `-v/--max-visual-tokens`.
When a tool call would exceed this budget, `eval-eva.py` automatically reduces the resolution and frame count before extraction.
python eval-eva.py --dataset videomme -v 12000
python eval-eva.py --dataset videomme -v 32000
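The budget enforcement can be illustrated with a simple model in which the token cost is frames × tokens-per-frame (an illustration only; the actual accounting in `eval-eva.py`, which also depends on the tokenizer and image patching, may differ):

```python
import math

def fit_visual_budget(num_frames, tokens_per_frame, max_visual_tokens, min_frames=4):
    """Shrink per-frame resolution (token count) first, then frame count,
    until num_frames * tokens_per_frame fits within the budget.
    Illustrative model only; not the actual eval-eva.py implementation."""
    total = num_frames * tokens_per_frame
    if total <= max_visual_tokens:
        return num_frames, tokens_per_frame
    # Scale per-frame tokens down proportionally (i.e. lower the resolution).
    scale = max_visual_tokens / total
    tokens_per_frame = max(1, math.floor(tokens_per_frame * scale))
    # If still over budget (e.g. very small budgets), drop frames too.
    if num_frames * tokens_per_frame > max_visual_tokens:
        num_frames = max(min_frames, max_visual_tokens // tokens_per_frame)
    return num_frames, tokens_per_frame
```

Under this model, `-v 12000` versus `-v 32000` trades answer-time visual detail against the token cost reported in the results table.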
Run all supported datasets with batch.sh:
bash batch.sh
4. Output Files and Cache/Resume Mechanism
- Output naming is controlled by `DATASET_CONFIG` in `eval-eva.py`.
- If the process is interrupted, rerunning the same command resumes from the cache and skips finished samples.
- By default, each dataset writes:
  - `cache_*.jsonl`: online cache (appended sample-by-sample)
  - `result_*.jsonl`: final merged output
- Useful options:
  - `--retry-error`: retry only failed/errored cached samples
  - `--new-cache`: recreate the cache from scratch
  - `--output-dir`: redirect cache/result outputs to another directory
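The resume behavior described above can be sketched as follows (an illustration, not the actual implementation; the `id` and `error` field names are assumptions):

```python
import json
import os

def load_finished_ids(cache_path):
    """Collect IDs of samples already completed in the append-only cache."""
    finished = set()
    if not os.path.exists(cache_path):
        return finished
    with open(cache_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            sample = json.loads(line)
            # Samples with an error are not "finished", so a retry pass can redo them.
            if sample.get("error") is None:
                finished.add(sample["id"])
    return finished

def iter_pending(samples, cache_path):
    """Yield only the samples whose IDs are not yet in the cache."""
    finished = load_finished_ids(cache_path)
    for sample in samples:
        if sample["id"] not in finished:
            yield sample
```

Because the cache is append-only, an interrupted run loses at most the sample in flight; rerunning the same command processes only what `iter_pending` yields.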
Citation
@misc{zhang2026evaefficientreinforcementlearning,
title={EVA: Efficient Reinforcement Learning for End-to-End Video Agent},
author={Yaolun Zhang and Ruohui Wang and Jiahao Wang and Yepeng Tang and Xuanyu Zheng and Haonan Duan and Hao Lu and Hanming Deng and Lewei Lu},
year={2026},
eprint={2603.22918},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.22918},
}