V-STaR Leaderboard
"Can Video-LLMs โreason through a sequential spatio-temporal logicโ in videos?"
Welcome to the leaderboard of V-STaR, a spatio-temporal reasoning benchmark for Video-LLMs!
- Comprehensive Dimensions: We evaluate a Video-LLM's spatio-temporal reasoning ability by requiring it to answer questions explicitly in terms of "when", "where", and "what".
- Human Alignment: We conducted extensive experiments and human annotation to validate the robustness of V-STaR.
- New Metrics: We propose the Arithmetic Mean (AM) and a modified logarithmic Geometric Mean (LGM) to measure the spatio-temporal reasoning capability of Video-LLMs. AM and LGM are computed from the "Accuracy" of VQA, the "m_tIoU" of temporal grounding, and the "m_vIoU" of spatial grounding; the mean AM (mAM) and mean LGM (mLGM) are then obtained by averaging over our two proposed RSTR question chains (see the sketch after this list).
- Valuable Insights: V-STaR reveals a fundamental weakness in existing Video-LLMs regarding causal spatio-temporal reasoning.
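Below is a minimal sketch of how these aggregate metrics could be computed. It assumes AM is the plain arithmetic mean of the three sub-task scores and that LGM is a geometric mean taken in log space with a small constant added for numerical stability; the exact modification used by V-STaR's LGM is not specified on this page, so treat this as illustrative rather than the official implementation (which lives in the V-STaR GitHub repository).

```python
import math

def am(acc: float, m_tiou: float, m_viou: float) -> float:
    """Arithmetic Mean of the three sub-task scores (all on a 0-100 scale)."""
    return (acc + m_tiou + m_viou) / 3.0

def lgm(acc: float, m_tiou: float, m_viou: float, eps: float = 1.0) -> float:
    """Illustrative logarithmic Geometric Mean: average the scores in log space.

    `eps` guards against log(0) when a model scores zero on a sub-task; the
    exact smoothing used by V-STaR is an assumption here.
    """
    logs = [math.log(s + eps) for s in (acc, m_tiou, m_viou)]
    return math.exp(sum(logs) / len(logs)) - eps

def chain_means(chain1: tuple, chain2: tuple) -> tuple:
    """mAM and mLGM: average AM and LGM over the two RSTR question chains."""
    m_am = (am(*chain1) + am(*chain2)) / 2.0
    m_lgm = (lgm(*chain1) + lgm(*chain2)) / 2.0
    return m_am, m_lgm
```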
Join Leaderboard: Please contact us to update your results.
Credits: This leaderboard is updated and maintained by the team of V-STaR Contributors.
Results on RSTR question chain 1 ("what" → "when" → "where": VQA, then temporal grounding, then spatial grounding; a sketch of the temporal-grounding metrics follows this table):

Model | LGM | AM | Score | Acc | R1@0.3 | R1@0.5 | R1@0.7 | m_tIoU | AP@0.1 | AP@0.3 | AP@0.5 | m_vIoU |
---|---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4o | 39.51 | 27.97 | 1.71 | 60.78 | 23.14 | 10.35 | 5.10 | 16.67 | 19.92 | 8.36 | 2.75 | 6.47 |
Gemini-2-Flash | 36.14 | 27.39 | 1.59 | 53.01 | 31.63 | 15.84 | 9.45 | 24.54 | 15.67 | 3.82 | 0.93 | 4.63 |
Qwen2.5-VL | 35.20 | 26.53 | 1.61 | 54.53 | 17.03 | 8.92 | 3.72 | 11.48 | 35.89 | 19.92 | 8.36 | 13.59 |
Video-CCAM-v1.2 | 30.51 | 20.28 | 1.75 | 59.35 | 1.15 | 0.00 | 0.00 | 1.50 | 0.00 | 0.00 | 0.00 | 0.00 |
Video-Llama3 | 27.12 | 21.93 | 1.38 | 41.94 | 35.73 | 19.80 | 8.68 | 22.97 | 3.17 | 0.76 | 0.11 | 0.89 |
Llava-Video | 27.11 | 20.64 | 1.50 | 49.48 | 15.12 | 6.30 | 1.43 | 10.52 | 5.23 | 0.94 | 0.18 | 1.92 |
VTimeLLM | 24.18 | 19.60 | 1.45 | 41.46 | 25.24 | 10.88 | 3.15 | 17.13 | 0.62 | 0.14 | 0.03 | 0.21 |
InternVL-2.5 | 22.69 | 17.85 | 1.46 | 44.18 | 11.98 | 4.87 | 2.34 | 8.72 | 2.18 | 0.27 | 0.04 | 0.65 |
VideoChat2 | 20.74 | 17.47 | 1.27 | 36.21 | 20.47 | 13.07 | 6.49 | 13.69 | 10.06 | 1.31 | 0.14 | 2.51 |
Qwen2-VL | 20.35 | 18.13 | 1.03 | 25.91 | 27.96 | 17.94 | 9.16 | 19.18 | 28.59 | 12.21 | 3.89 | 9.31 |
Sa2VA | 19.00 | 16.26 | 0.70 | 16.36 | 0.10 | 0.00 | 0.00 | 0.11 | 52.16 | 42.68 | 34.18 | 32.31 |
Oryx-1.5 | 16.05 | 14.72 | 0.94 | 20.47 | 17.03 | 4.48 | 1.72 | 13.54 | 35.58 | 11.60 | 2.17 | 10.14 |
TimeChat | 14.47 | 12.80 | 1.06 | 26.38 | 17.80 | 8.68 | 3.48 | 12.01 | 0.00 | 0.00 | 0.00 | 0.00 |
TRACE | 13.78 | 12.45 | 0.90 | 17.60 | 28.53 | 14.17 | 6.73 | 19.74 | 0.00 | 0.00 | 0.00 | 0.00 |
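The temporal-grounding columns above can be read with the usual moment-retrieval conventions: R1@k is the fraction of questions whose predicted segment overlaps the ground-truth segment with temporal IoU ≥ k, and m_tIoU is the mean temporal IoU. The sketch below follows that standard reading; the official evaluation code in the V-STaR GitHub repository is authoritative.

```python
def t_iou(pred: tuple, gt: tuple) -> float:
    """Temporal IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def temporal_metrics(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Compute R1@k (in %) and m_tIoU (in %) over paired predicted/GT segments."""
    ious = [t_iou(p, g) for p, g in zip(preds, gts)]
    r1 = {k: 100.0 * sum(iou >= k for iou in ious) / len(ious) for k in thresholds}
    m_tiou = 100.0 * sum(ious) / len(ious)
    return r1, m_tiou
```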
Results on RSTR question chain 2 ("what" → "where" → "when": VQA, then spatial grounding, then temporal grounding; a sketch of the spatial-grounding metric follows this table):

Model | LGM | AM | Score | Acc | AP@0.1 | AP@0.3 | AP@0.5 | m_vIoU | R1@0.3 | R1@0.5 | R1@0.7 | m_tIoU |
---|---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4o | 36.79 | 25.53 | 1.71 | 60.78 | 9.29 | 4.18 | 1.19 | 3.01 | 17.13 | 10.04 | 7.25 | 12.82 |
Gemini-2-Flash | 34.99 | 26.35 | 1.59 | 53.01 | 7.49 | 1.89 | 0.58 | 2.21 | 31.58 | 15.22 | 8.54 | 23.83 |
Video-CCAM-v1.2 | 30.88 | 20.54 | 1.75 | 59.35 | 0.00 | 0.00 | 0.00 | 0.00 | 2.19 | 0.00 | 0.00 | 2.26 |
Qwen2.5-VL | 29.58 | 21.38 | 1.61 | 54.53 | 5.15 | 2.87 | 1.40 | 2.00 | 11.02 | 5.39 | 2.48 | 7.61 |
Llava-Video | 27.54 | 21.00 | 1.50 | 49.48 | 4.29 | 1.23 | 0.25 | 1.31 | 16.89 | 5.49 | 2.00 | 12.21 |
InternVL-2.5 | 27.15 | 17.36 | 1.46 | 44.18 | 0.42 | 0.03 | 0.00 | 0.14 | 10.83 | 3.77 | 1.57 | 7.75 |
Video-Llama3 | 26.96 | 21.76 | 1.38 | 41.94 | 0.62 | 0.17 | 0.02 | 0.19 | 35.11 | 20.42 | 9.21 | 23.14 |
Sa2VA | 21.61 | 17.95 | 0.70 | 16.36 | 58.47 | 49.47 | 40.42 | 37.48 | 0.00 | 0.00 | 0.00 | 0.00 |
VTimeLLM | 19.90 | 15.81 | 1.45 | 41.46 | 0.00 | 0.00 | 0.00 | 0.00 | 8.44 | 4.53 | 2.10 | 5.96 |
VideoChat2 | 19.77 | 16.56 | 1.27 | 36.21 | 3.08 | 0.91 | 0.30 | 0.97 | 18.08 | 12.07 | 6.20 | 12.50 |
Qwen2-VL | 17.23 | 15.28 | 1.03 | 25.91 | 7.11 | 3.55 | 1.14 | 2.41 | 24.62 | 16.32 | 8.25 | 17.52 |
TimeChat | 15.08 | 13.33 | 1.06 | 26.38 | 0.00 | 0.00 | 0.00 | 0.00 | 20.42 | 8.54 | 2.53 | 13.60 |
Oryx-1.5 | 14.16 | 12.93 | 0.94 | 20.47 | 11.50 | 4.32 | 0.96 | 3.50 | 18.99 | 5.58 | 2.72 | 14.81 |
TRACE | 12.71 | 11.57 | 0.90 | 17.60 | 0.00 | 0.00 | 0.00 | 0.00 | 24.52 | 12.02 | 5.73 | 17.11 |
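The spatial-grounding columns are built from per-frame bounding-box IoU: m_vIoU is the mean per-video spatial IoU, and AP@k can be read as thresholding that IoU at k. The sketch below uses a common convention (average box IoU over the annotated frames); the exact normalization used by V-STaR is defined by its official evaluation code and may differ, so this is an assumption for illustration only.

```python
def box_iou(a, b) -> float:
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def v_iou(pred_boxes: dict, gt_boxes: dict) -> float:
    """Per-video spatial IoU: mean box IoU over the ground-truth annotated frames.

    `pred_boxes` / `gt_boxes` map frame index -> box; frames without a
    prediction count as IoU 0. (The official metric may normalize differently,
    e.g. over the temporal union of predicted and ground-truth tracks.)
    """
    if not gt_boxes:
        return 0.0
    total = sum(box_iou(pred_boxes[f], g) if f in pred_boxes else 0.0
                for f, g in gt_boxes.items())
    return total / len(gt_boxes)
```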
Per-category results (mAM and mLGM for each video domain):

Model | Animals mAM | Animals mLGM | Nature mAM | Nature mLGM | Shows mAM | Shows mLGM | Daily Life mAM | Daily Life mLGM | Sports mAM | Sports mLGM | Entertainments mAM | Entertainments mLGM | Vehicles mAM | Vehicles mLGM | Indoor mAM | Indoor mLGM | Tutorial mAM | Tutorial mLGM |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4o | 29.18 | 42.13 | 31.04 | 45.14 | 26.96 | 38.50 | 26.61 | 37.82 | 26.50 | 36.36 | 28.47 | 46.41 | 26.89 | 37.29 | 24.01 | 31.16 | 16.68 | 22.30 |
Gemini-2-Flash | 27.78 | 36.26 | 23.42 | 28.58 | 28.79 | 38.39 | 23.73 | 29.81 | 24.35 | 31.26 | 33.35 | 52.00 | 24.80 | 31.59 | 22.28 | 27.43 | 38.33 | 56.62 |
Qwen2.5-VL | 26.32 | 36.26 | 27.07 | 38.12 | 23.74 | 32.13 | 26.10 | 35.92 | 23.24 | 31.39 | 20.52 | 27.36 | 29.94 | 33.52 | 25.61 | 35.04 | 2.76 | 2.87 |
Video-CCAM-v1.2 | 23.47 | 39.49 | 26.19 | 50.25 | 28.54 | 56.27 | 19.96 | 29.73 | 21.99 | 35.18 | 16.42 | 22.06 | 25.59 | 46.08 | 21.22 | 32.89 | 15.03 | 19.89 |
Video-Llama3 | 21.43 | 26.10 | 24.37 | 31.35 | 24.13 | 31.12 | 18.74 | 22.10 | 22.41 | 28.51 | 24.70 | 31.58 | 23.15 | 29.17 | 20.00 | 24.43 | 24.68 | 33.74 |
Sa2VA | 21.01 | 27.03 | 13.62 | 16.60 | 23.38 | 29.13 | 17.89 | 23.25 | 17.01 | 19.69 | 13.28 | 14.90 | 21.36 | 26.25 | 18.44 | 22.96 | 10.91 | 11.97 |
InternVL-2.5 | 20.98 | 28.59 | 17.74 | 22.44 | 20.10 | 28.76 | 17.70 | 22.67 | 18.86 | 24.98 | 15.34 | 18.59 | 20.31 | 27.10 | 17.72 | 22.49 | 9.35 | 10.76 |
Llava-Video | 20.34 | 25.96 | 24.57 | 33.88 | 25.03 | 37.03 | 20.08 | 25.58 | 21.33 | 28.78 | 20.24 | 27.10 | 23.76 | 34.20 | 20.07 | 25.40 | 21.97 | 32.82 |
VideoChat2 | 18.50 | 22.58 | 20.23 | 25.05 | 19.11 | 24.67 | 16.03 | 18.98 | 16.88 | 20.23 | 18.27 | 21.46 | 15.91 | 18.93 | 15.68 | 18.65 | 7.09 | 7.60 |
TimeChat | 18.26 | 22.52 | 12.61 | 14.14 | 13.75 | 15.86 | 12.48 | 14.02 | 13.63 | 15.64 | 11.17 | 12.27 | 15.46 | 18.22 | 13.70 | 15.64 | 6.05 | 6.53 |
Qwen2-VL | 16.57 | 18.34 | 16.46 | 18.37 | 21.85 | 27.75 | 12.80 | 13.81 | 16.34 | 18.59 | 20.75 | 24.82 | 16.41 | 18.59 | 15.10 | 16.69 | 17.29 | 22.23 |
VTimeLLM | 14.54 | 16.93 | 20.28 | 26.24 | 22.93 | 33.50 | 20.33 | 27.43 | 16.12 | 19.99 | 15.65 | 18.46 | 18.12 | 23.40 | 18.36 | 22.88 | 8.67 | 9.56 |
Oryx-1.5 | 13.58 | 14.70 | 11.05 | 11.83 | 20.00 | 24.40 | 9.82 | 10.39 | 14.39 | 16.03 | 16.14 | 18.85 | 14.09 | 15.43 | 14.30 | 15.60 | 10.34 | 11.78 |
TRACE | 12.35 | 13.77 | 10.28 | 11.23 | 18.85 | 22.75 | 8.59 | 9.25 | 15.31 | 17.41 | 13.99 | 15.72 | 12.87 | 14.32 | 10.91 | 11.92 | 15.19 | 17.34 |
V-STaR is a comprehensive spatio-temporal reasoning benchmark for video large language models (Video-LLMs). We construct a fine-grained reasoning dataset with coarse-to-fine CoT questions, enabling a structured evaluation of spatio-temporal reasoning. Specifically, we introduce a Reverse Spatio-Temporal Reasoning (RSTR) task to quantify models' spatio-temporal reasoning ability. Experiments on V-STaR reveal that although many models perform well on "what", some struggle to ground their answers in time and location. This finding highlights a fundamental weakness in existing Video-LLMs regarding causal spatio-temporal reasoning and motivates research on improving trustworthy spatio-temporal understanding in future Video-LLMs.
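To make the RSTR setup concrete, here is a hypothetical question chain in the coarse-to-fine order of chain 1 ("what" → "when" → "where"). The field names, values, and format are illustrative only and are not the actual V-STaR dataset schema.

```python
# Hypothetical RSTR chain-1 sample (illustrative only; not the real V-STaR schema).
rstr_chain_1 = {
    "video": "example.mp4",
    "what":  {"question": "What does the dog pick up?",
              "answer": "a frisbee"},
    "when":  {"question": "When does the dog pick up the frisbee?",
              "answer_segment": [12.4, 15.1]},               # seconds (start, end)
    "where": {"question": "Where is the frisbee when the dog picks it up?",
              "answer_boxes": {310: [220, 140, 300, 210]}},  # frame -> (x1, y1, x2, y2)
}
```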
Submit to the V-STaR Benchmark
Please note that you need to obtain the results/*.json files by running V-STaR from its GitHub repository. You may run an offline evaluation before submitting.
Then, please contact us via email1 or email2 to update your results.
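Before emailing, a quick check that your results/*.json files at least parse and are non-empty can save a round trip. The snippet below only verifies JSON validity and counts entries; it makes no assumption about the exact schema expected by the official evaluator.

```python
import glob
import json

# Minimal pre-submission check: every results/*.json file parses and is non-empty.
for path in sorted(glob.glob("results/*.json")):
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)  # raises if the file is not valid JSON
    n = len(data) if isinstance(data, (list, dict)) else 1
    print(f"{path}: OK ({n} entries)")
```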