V-STaR Leaderboard
"Can Video-LLMs โreason through a sequential spatio-temporal logicโ in videos?"
Welcome to the leaderboard of V-STaR, a spatio-temporal reasoning benchmark for Video-LLMs!
- Comprehensive Dimensions: We evaluate Video-LLMs' spatio-temporal reasoning ability to answer questions explicitly in the context of “when”, “where”, and “what”.
- Human Alignment: We conducted extensive experiments and human annotations to validate the robustness of V-STaR.
- New Metrics: We propose the Arithmetic Mean (AM) and a modified logarithmic Geometric Mean (LGM) to measure the spatio-temporal reasoning capability of Video-LLMs. AM and LGM are computed from the Accuracy of VQA, the m_tIoU of temporal grounding, and the m_vIoU of spatial grounding; the mean AM (mAM) and mean LGM (mLGM) are then obtained by averaging the results of our two proposed RSTR question chains (see the sketch after this list).
- Valuable Insights: V-STaR reveals a fundamental weakness in existing Video-LLMs regarding causal spatio-temporal reasoning.
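
The snippet below is a minimal sketch (not the official V-STaR evaluation code) of how the leaderboard columns fit together: per-chain AM is the arithmetic mean of VQA Accuracy, m_tIoU, and m_vIoU (all in percent), and mAM / mLGM average the per-chain values over the two RSTR chains. The per-chain LGM is taken as given here, since its modified logarithmic form is defined in the V-STaR paper; the function names and the GPT-4o example numbers (read off the tables below) are for illustration only.

```python
# Minimal sketch (not the official V-STaR evaluation code): relate the
# per-chain scores to the leaderboard's AM, mAM and mLGM columns.
from statistics import mean


def chain_am(acc: float, m_tiou: float, m_viou: float) -> float:
    """Arithmetic Mean of VQA Accuracy, temporal m_tIoU and spatial m_vIoU (percent)."""
    return mean([acc, m_tiou, m_viou])


def aggregate(per_chain: list[tuple[float, float]]) -> tuple[float, float]:
    """Average per-chain (AM, LGM) pairs over the RSTR chains -> (mAM, mLGM)."""
    m_am = mean(am for am, _ in per_chain)
    m_lgm = mean(lgm for _, lgm in per_chain)
    return m_am, m_lgm


# GPT-4o's leaderboard numbers, reproduced up to rounding of the published values:
am1 = chain_am(60.78, 16.67, 6.47)   # chain with temporal grounding first
am2 = chain_am(60.78, 12.82, 3.01)   # chain with spatial grounding first
m_am, m_lgm = aggregate([(am1, 39.51), (am2, 36.79)])
print(f"AM1={am1:.2f}  AM2={am2:.2f}  mAM={m_am:.2f}  mLGM={m_lgm:.2f}")
```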
 
Join the leaderboard: Please contact us to update your results.
Credits: This leaderboard is updated and maintained by the V-STaR contributors.

Overall results (mAM and mLGM averaged over the two RSTR chains, with Short / Medium / Long breakdowns):

 Model     |  All-mLGM     |  All-mAM     |  Short-mAM     |  Short-mLGM     |  Medium-mAM     |  Medium-mLGM     |  Long-mAM     |  Long-mLGM     | 
|---|---|---|---|---|---|---|---|---|
 Video-CCAM-v1.2    |  38.15    |  26.75    |  27.49    |  38.56    |  26.96    |  40.58    |  14.86    |  19.28    | 

RSTR chain “what” → “when” → “where” (VQA accuracy, then temporal grounding, then spatial grounding):

 Model     |  LGM     |  AM     |  Score     |  Acc     |  R1@0.3     |  R1@0.5     |  R1@0.7     |  m_tIoU     |  AP@0.1     |  AP@0.3     |  AP@0.5     |  m_vIoU     | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|
 GPT-4o     |  39.51     |  27.97     |  1.71     |  60.78     |  23.14     |  10.35     |  5.10     |  16.67     |  19.92     |  8.36     |  2.75     |  6.47     | 
 Gemini-2-Flash     |  36.14     |  27.39     |  1.59     |  53.01     |  31.63     |  15.84     |  9.45     |  24.54     |  15.67     |  3.82     |  0.93     |  4.63     | 
 Qwen2.5-VL     |  35.20     |  26.53     |  1.61     |  54.53     |  17.03     |  8.92     |  3.72     |  11.48     |  35.89     |  19.92     |  8.36     |  13.59     | 
 Video-CCAM-v1.2     |  30.51     |  20.28     |  1.75     |  59.35     |  1.15     |  0.00     |  0.00     |  1.50     |  0.00     |  0.00     |  0.00     |  0.00     | 
 Video-Llama3     |  27.12     |  21.93     |  1.38     |  41.94     |  35.73     |  19.80     |  8.68     |  22.97     |  3.17     |  0.76     |  0.11     |  0.89     | 
 Llava-Video     |  27.11     |  20.64     |  1.50     |  49.48     |  15.12     |  6.30     |  1.43     |  10.52     |  5.23     |  0.94     |  0.18     |  1.92     | 
 VTimeLLM     |  24.18     |  19.60     |  1.45     |  41.46     |  25.24     |  10.88     |  3.15     |  17.13     |  0.62     |  0.14     |  0.03     |  0.21     | 
 InternVL-2.5     |  22.69     |  17.85     |  1.46     |  44.18     |  11.98     |  4.87     |  2.34     |  8.72     |  2.18     |  0.27     |  0.04     |  0.65     | 
 VideoChat2     |  20.74     |  17.47     |  1.27     |  36.21     |  20.47     |  13.07     |  6.49     |  13.69     |  10.06     |  1.31     |  0.14     |  2.51     | 
 Qwen2-VL     |  20.35     |  18.13     |  1.03     |  25.91     |  27.96     |  17.94     |  9.16     |  19.18     |  28.59     |  12.21     |  3.89     |  9.31     | 
 Sa2VA     |  19.00     |  16.26     |  0.70     |  16.36     |  0.10     |  0.00     |  0.00     |  0.11     |  52.16     |  42.68     |  34.18     |  32.31     | 
 Oryx-1.5     |  16.05     |  14.72     |  0.94     |  20.47     |  17.03     |  4.48     |  1.72     |  13.54     |  35.58     |  11.60     |  2.17     |  10.14     | 
 TimeChat     |  14.47     |  12.80     |  1.06     |  26.38     |  17.80     |  8.68     |  3.48     |  12.01     |  0.00     |  0.00     |  0.00     |  0.00     | 
 TRACE     |  13.78     |  12.45     |  0.90     |  17.60     |  28.53     |  14.17     |  6.73     |  19.74     |  0.00     |  0.00     |  0.00     |  0.00     | 

RSTR chain “what” → “where” → “when” (VQA accuracy, then spatial grounding, then temporal grounding):

 Model     |  LGM     |  AM     |  Score     |  Acc     |  AP@0.1     |  AP@0.3     |  AP@0.5     |  m_vIoU     |  R1@0.3     |  R1@0.5     |  R1@0.7     |  m_tIoU     | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|
 GPT-4o     |  36.79     |  25.53     |  1.71     |  60.78     |  9.29     |  4.18     |  1.19     |  3.01     |  17.13     |  10.04     |  7.25     |  12.82     | 
 Gemini-2-Flash     |  34.99     |  26.35     |  1.59     |  53.01     |  7.49     |  1.89     |  0.58     |  2.21     |  31.58     |  15.22     |  8.54     |  23.83     | 
 Video-CCAM-v1.2     |  30.88     |  20.54     |  1.75     |  59.35     |  0.00     |  0.00     |  0.00     |  0.00     |  2.19     |  0.00     |  0.00     |  2.26     | 
 Qwen2.5-VL     |  29.58     |  21.38     |  1.61     |  54.53     |  5.15     |  2.87     |  1.40     |  2.00     |  11.02     |  5.39     |  2.48     |  7.61     | 
 Llava-Video     |  27.54     |  21.00     |  1.50     |  49.48     |  4.29     |  1.23     |  0.25     |  1.31     |  16.89     |  5.49     |  2.00     |  12.21     | 
 InternVL-2.5     |  27.15     |  17.36     |  1.46     |  44.18     |  0.42     |  0.03     |  0.00     |  0.14     |  10.83     |  3.77     |  1.57     |  7.75     | 
 Video-Llama3     |  26.96     |  21.76     |  1.38     |  41.94     |  0.62     |  0.17     |  0.02     |  0.19     |  35.11     |  20.42     |  9.21     |  23.14     | 
 Sa2VA     |  21.61     |  17.95     |  0.70     |  16.36     |  58.47     |  49.47     |  40.42     |  37.48     |  0.00     |  0.00     |  0.00     |  0.00     | 
 VTimeLLM     |  19.90     |  15.81     |  1.45     |  41.46     |  0.00     |  0.00     |  0.00     |  0.00     |  8.44     |  4.53     |  2.10     |  5.96     | 
 VideoChat2     |  19.77     |  16.56     |  1.27     |  36.21     |  3.08     |  0.91     |  0.30     |  0.97     |  18.08     |  12.07     |  6.20     |  12.50     | 
 Qwen2-VL     |  17.23     |  15.28     |  1.03     |  25.91     |  7.11     |  3.55     |  1.14     |  2.41     |  24.62     |  16.32     |  8.25     |  17.52     | 
 TimeChat     |  15.08     |  13.33     |  1.06     |  26.38     |  0.00     |  0.00     |  0.00     |  0.00     |  20.42     |  8.54     |  2.53     |  13.60     | 
 Oryx-1.5     |  14.16     |  12.93     |  0.94     |  20.47     |  11.50     |  4.32     |  0.96     |  3.50     |  18.99     |  5.58     |  2.72     |  14.81     | 
 TRACE     |  12.71     |  11.57     |  0.90     |  17.60     |  0.00     |  0.00     |  0.00     |  0.00     |  24.52     |  12.02     |  5.73     |  17.11     | 

Per-domain results (mAM and mLGM for each video category):

 Model     |  Animals mAM     |  Animals mLGM     |  Nature mAM     |  Nature mLGM     |  Shows mAM     |  Shows mLGM     |  Daily Life mAM     |  Daily Life mLGM     |  Sports mAM     |  Sports mLGM     |  Entertainments mAM     |  Entertainments mLGM     |  Vehicles mAM     |  Vehicles mLGM     |  Indoor mAM     |  Indoor mLGM     |  Tutorial mAM     |  Tutorial mLGM     | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
 GPT-4o     |  29.18     |  42.13     |  31.04     |  45.14     |  26.96     |  38.50     |  26.61     |  37.82     |  26.50     |  36.36     |  28.47     |  46.41     |  26.89     |  37.29     |  24.01     |  31.16     |  16.68     |  22.30     | 
 Gemini-2-Flash     |  27.78     |  36.26     |  23.42     |  28.58     |  28.79     |  38.39     |  23.73     |  29.81     |  24.35     |  31.26     |  33.35     |  52.00     |  24.80     |  31.59     |  22.28     |  27.43     |  38.33     |  56.62     | 
 Qwen2.5-VL     |  26.32     |  36.26     |  27.07     |  38.12     |  23.74     |  32.13     |  26.10     |  35.92     |  23.24     |  31.39     |  20.52     |  27.36     |  29.94     |  33.52     |  25.61     |  35.04     |  2.76     |  2.87     | 
 Video-CCAM-v1.2     |  23.47     |  39.49     |  26.19     |  50.25     |  28.54     |  56.27     |  19.96     |  29.73     |  21.99     |  35.18     |  16.42     |  22.06     |  25.59     |  46.08     |  21.22     |  32.89     |  15.03     |  19.89     | 
 Video-Llama3     |  21.43     |  26.10     |  24.37     |  31.35     |  24.13     |  31.12     |  18.74     |  22.10     |  22.41     |  28.51     |  24.70     |  31.58     |  23.15     |  29.17     |  20.00     |  24.43     |  24.68     |  33.74     | 
 Sa2VA     |  21.01     |  27.03     |  13.62     |  16.60     |  23.38     |  29.13     |  17.89     |  23.25     |  17.01     |  19.69     |  13.28     |  14.90     |  21.36     |  26.25     |  18.44     |  22.96     |  10.91     |  11.97     | 
 InternVL-2.5     |  20.98     |  28.59     |  17.74     |  22.44     |  20.10     |  28.76     |  17.70     |  22.67     |  18.86     |  24.98     |  15.34     |  18.59     |  20.31     |  27.10     |  17.72     |  22.49     |  9.35     |  10.76     | 
 Llava-Video     |  20.34     |  25.96     |  24.57     |  33.88     |  25.03     |  37.03     |  20.08     |  25.58     |  21.33     |  28.78     |  20.24     |  27.10     |  23.76     |  34.20     |  20.07     |  25.40     |  21.97     |  32.82     | 
 VideoChat2     |  18.50     |  22.58     |  20.23     |  25.05     |  19.11     |  24.67     |  16.03     |  18.98     |  16.88     |  20.23     |  18.27     |  21.46     |  15.91     |  18.93     |  15.68     |  18.65     |  7.09     |  7.60     | 
 TimeChat     |  18.26     |  22.52     |  12.61     |  14.14     |  13.75     |  15.86     |  12.48     |  14.02     |  13.63     |  15.64     |  11.17     |  12.27     |  15.46     |  18.22     |  13.70     |  15.64     |  6.05     |  6.53     | 
 Qwen2-VL     |  16.57     |  18.34     |  16.46     |  18.37     |  21.85     |  27.75     |  12.80     |  13.81     |  16.34     |  18.59     |  20.75     |  24.82     |  16.41     |  18.59     |  15.10     |  16.69     |  17.29     |  22.23     | 
 VTimeLLM     |  14.54     |  16.93     |  20.28     |  26.24     |  22.93     |  33.50     |  20.33     |  27.43     |  16.12     |  19.99     |  15.65     |  18.46     |  18.12     |  23.40     |  18.36     |  22.88     |  8.67     |  9.56     | 
 Oryx-1.5     |  13.58     |  14.70     |  11.05     |  11.83     |  20.00     |  24.40     |  9.82     |  10.39     |  14.39     |  16.03     |  16.14     |  18.85     |  14.09     |  15.43     |  14.30     |  15.60     |  10.34     |  11.78     | 
 TRACE     |  12.35     |  13.77     |  10.28     |  11.23     |  18.85     |  22.75     |  8.59     |  9.25     |  15.31     |  17.41     |  13.99     |  15.72     |  12.87     |  14.32     |  10.91     |  11.92     |  15.19     |  17.34     | 
V-STaR is a comprehensive spatio-temporal reasoning benchmark for video large language models (Video-LLMs). We construct a fine-grained reasoning dataset with coarse-to-fine CoT questions, enabling a structured evaluation of spatio-temporal reasoning. Specifically, we introduce a Reverse Spatio-Temporal Reasoning (RSTR) task to quantify models' spatio-temporal reasoning ability. Experiments on V-STaR reveal that although many models perform well on “what”, some struggle to ground their answers in time and location. This finding highlights a fundamental weakness in existing Video-LLMs regarding causal spatio-temporal reasoning and motivates research on improving trustworthy spatio-temporal understanding in future Video-LLMs.
Submit on V-STaR Benchmark
⚠️ Please note that you need to obtain the file results/*.json by running V-STaR from its GitHub repository. You may conduct an Offline Eval before submitting.
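
As a purely illustrative sanity check before submitting, one could reload the results file and recompute the per-chain AM and the cross-chain mAM. The actual schema of results/*.json is defined by the V-STaR GitHub repository, so the path and field names below (chain1, chain2, acc, m_tiou, m_viou) are assumptions used only for illustration, not the official format.

```python
# Hypothetical offline sanity check. The real schema of results/*.json is
# defined by the V-STaR repository; the field names used here are
# illustrative assumptions, not the official format.
import json
from statistics import mean

with open("results/my_model.json") as f:  # illustrative path
    results = json.load(f)

chain_ams = [
    mean([results[c]["acc"], results[c]["m_tiou"], results[c]["m_viou"]])
    for c in ("chain1", "chain2")  # assumed keys for the two RSTR chains
]
print("per-chain AM:", [round(a, 2) for a in chain_ams])
print("mAM:", round(mean(chain_ams), 2))
```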
⚠️ Then, please contact us to update your results via email1 or email2.