V-STaR Leaderboard

"Can Video-LLMs reason through a sequential spatio-temporal logic in videos?"
🏆 Welcome to the leaderboard of V-STaR! 🎦 A spatio-temporal reasoning benchmark for Video-LLMs

  • Comprehensive Dimensions: We evaluate Video-LLMs' spatio-temporal reasoning ability by asking questions explicitly framed in terms of "when", "where", and "what".
  • Human Alignment: We conducted extensive experiments and human annotations to validate the robustness of V-STaR.
  • New Metrics: We propose the Arithmetic Mean (AM) and a modified logarithmic Geometric Mean (LGM) to measure the spatio-temporal reasoning capability of Video-LLMs. AM and LGM are computed from the "Accuracy" of VQA, the "m_tIoU" of temporal grounding, and the "m_vIoU" of spatial grounding; the mean AM (mAM) and mean LGM (mLGM) are then obtained by averaging over our two proposed RSTR question chains.
  • Valuable Insights: V-STaR reveals a fundamental weakness in existing Video-LLMs regarding causal spatio-temporal reasoning.
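To make the aggregation concrete, here is a minimal sketch of how AM, LGM, mAM, and mLGM could be computed from the three per-chain scores. The `log1p`-based LGM and the example scores are assumptions for illustration; consult the V-STaR paper and repository for the official definition of the modified logarithmic Geometric Mean.

```python
import math

def arithmetic_mean(scores):
    """AM: plain average of the three task scores (Accuracy, m_tIoU, m_vIoU)."""
    return sum(scores) / len(scores)

def log_geometric_mean(scores):
    """Hypothetical LGM: geometric mean computed in log space with a +1 shift
    so that a zero score does not collapse the whole mean. The exact
    modification used by V-STaR may differ from this sketch."""
    return math.exp(sum(math.log1p(s) for s in scores) / len(scores)) - 1

# Illustrative (made-up) scores for the two RSTR question chains:
# each chain contributes [Accuracy, m_tIoU, m_vIoU].
chains = [[0.42, 0.31, 0.18], [0.40, 0.28, 0.15]]

# mAM / mLGM: average the per-chain AM / LGM over both chains.
mAM = arithmetic_mean([arithmetic_mean(c) for c in chains])
mLGM = arithmetic_mean([log_geometric_mean(c) for c in chains])
```

Averaging in log space makes LGM penalize an imbalanced model (e.g. strong VQA but near-zero grounding) more than AM does, which is why both are reported.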

Join Leaderboard: Please contact us to update your results.

Credits: This leaderboard is updated and maintained by the team of V-STaR Contributors.

| Model | All-mLGM | All-mAM | Short-mAM | Short-mLGM | Medium-mAM | Medium-mLGM | Long-mAM | Long-mLGM |
|---|---|---|---|---|---|---|---|---|
| Video-CCAM-v1.2 | 38.15 | 26.75 | 27.49 | 38.56 | 26.96 | 40.58 | 14.86 | 19.28 |

Submit to the V-STaR Benchmark

โš ๏ธ Please note that you need to obtain the file results/*.json by running V-STaR in Github. You may conduct an Offline Eval before submitting.

โš ๏ธ Then, please contact us to update your results via email1 or email2.