V-STaR Leaderboard

"Can Video-LLMs “reason through a sequential spatio-temporal logic” in videos?"
🏆 Welcome to the V-STaR leaderboard! 🎦 V-STaR is a spatio-temporal reasoning benchmark for Video-LLMs.

Comprehensive Dimensions: We evaluate the spatio-temporal reasoning ability of Video-LLMs in answering questions explicitly framed around “when”, “where”, and “what”.
Human Alignment: We conducted extensive experiments and human annotations to validate the robustness of V-STaR.
New Metrics: We propose the Arithmetic Mean (AM) and a modified logarithmic Geometric Mean (LGM) to measure the spatio-temporal reasoning capability of Video-LLMs. AM and LGM are computed from the "Accuracy" of VQA, the "m_tIoU" of temporal grounding, and the "m_vIoU" of spatial grounding. The mean AM (mAM) and mean LGM (mLGM) are then averaged over the results of our two proposed RSTR question chains.
Valuable Insights: V-STaR reveals a fundamental weakness in existing Video-LLMs regarding causal spatio-temporal reasoning.
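
To make the metric definitions concrete, here is a minimal sketch of how AM, LGM, mAM, and mLGM could be computed from the three per-task scores. The LGM form used here, the mean of -ln(1 - x + eps) over scores scaled to [0, 1], is an assumption: it reproduces the GPT-4o rows in the tables below to within rounding, but it is not taken from the official evaluation code. The function names and the eps value are illustrative only.

```python
import math

def arithmetic_mean(acc, m_tiou, m_viou):
    """AM: plain average of the three scores, all given in percent."""
    return (acc + m_tiou + m_viou) / 3

def log_geometric_mean(acc, m_tiou, m_viou, eps=1e-8):
    """LGM (assumed form): mean of -ln(1 - x + eps), reported in percent.

    Scores are rescaled from percent to [0, 1] first; eps (an illustrative
    value) keeps the logarithm finite when a score equals 100.
    """
    scores = [acc / 100, m_tiou / 100, m_viou / 100]
    return -100 * sum(math.log(1 - s + eps) for s in scores) / len(scores)

# GPT-4o (Acc, m_tIoU, m_vIoU), copied from the two per-chain tables below.
chain_1 = (60.78, 16.67, 6.47)
chain_2 = (60.78, 12.82, 3.01)

am_1, lgm_1 = arithmetic_mean(*chain_1), log_geometric_mean(*chain_1)
am_2, lgm_2 = arithmetic_mean(*chain_2), log_geometric_mean(*chain_2)

# mAM and mLGM average the per-chain values over the two chains.
m_am = (am_1 + am_2) / 2
m_lgm = (lgm_1 + lgm_2) / 2

print(f"chain 1: AM={am_1:.2f} LGM={lgm_1:.2f}")
print(f"chain 2: AM={am_2:.2f} LGM={lgm_2:.2f}")
print(f"overall: mAM={m_am:.2f} mLGM={m_lgm:.2f}")
```

Because -ln(1 - x) is never smaller than x, this form of LGM is always at least as large as AM, and it rewards a high score on any single task more strongly than AM does. Small last-digit differences from the tables are expected, since the inputs used here are already rounded.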

Join Leaderboard: Please contact us to update your results.

Credits: This leaderboard is updated and maintained by the team of V-STaR Contributors.

Model	All-mLGM	All-mAM	Short-mAM	Short-mLGM	Medium-mAM	Medium-mLGM	Long-mAM	Long-mLGM
GPT-4o	38.15	26.75	27.49	38.56	26.96	40.58	14.86	19.28
Gemini-2-Flash	35.57	26.87	24.97	32.07	28.99	40.35	37.81	56.14
Qwen2.5-VL	32.39	23.96	25.51	34.84	23.67	32.87	2.20	2.27
Video-CCAM-v1.2	30.70	20.41	21.66	34.09	19.62	28.36	12.61	15.80
Llava-Video	27.33	20.82	22.37	30.23	18.28	22.77	18.23	25.23
Video-Llama3	27.04	21.66	21.68	26.62	21.84	27.23	22.46	28.83
InternVL-2.5	24.92	17.60	17.94	22.90	17.94	23.06	9.58	11.19
VTimeLLM	22.04	17.71	18.31	23.19	18.15	22.44	5.52	5.89
Sa2VA	20.31	17.11	18.14	22.01	16.32	18.92	8.85	9.70
VideoChat2	20.26	17.02	17.57	21.02	17.20	20.50	5.28	5.64
Qwen2-VL	18.79	16.71	15.78	17.50	18.47	21.22	14.09	17.53
Oryx-1.5	15.11	13.83	13.17	14.25	14.83	16.46	11.89	13.99
TimeChat	14.78	13.07	13.70	15.56	13.22	15.06	3.24	3.37
TRACE	13.25	12.01	11.77	12.96	12.49	13.87	13.59	15.30

Model	LGM	AM	Score	Acc	R1@0.3	R1@0.5	R1@0.7	m_tIoU	AP@0.1	AP@0.3	AP@0.5	m_vIoU
GPT-4o	39.51	27.97	1.71	60.78	23.14	10.35	5.10	16.67	19.92	8.36	2.75	6.47
Gemini-2-Flash	36.14	27.39	1.59	53.01	31.63	15.84	9.45	24.54	15.67	3.82	0.93	4.63
Qwen2.5-VL	35.20	26.53	1.61	54.53	17.03	8.92	3.72	11.48	35.89	19.92	8.36	13.59
Video-CCAM-v1.2	30.51	20.28	1.75	59.35	1.15	0.00	0.00	1.50	0.00	0.00	0.00	0.00
Video-Llama3	27.12	21.93	1.38	41.94	35.73	19.80	8.68	22.97	3.17	0.76	0.11	0.89
Llava-Video	27.11	20.64	1.50	49.48	15.12	6.30	1.43	10.52	5.23	0.94	0.18	1.92
VTimeLLM	24.18	19.60	1.45	41.46	25.24	10.88	3.15	17.13	0.62	0.14	0.03	0.21
InternVL-2.5	22.69	17.85	1.46	44.18	11.98	4.87	2.34	8.72	2.18	0.27	0.04	0.65
VideoChat2	20.74	17.47	1.27	36.21	20.47	13.07	6.49	13.69	10.06	1.31	0.14	2.51
Qwen2-VL	20.35	18.13	1.03	25.91	27.96	17.94	9.16	19.18	28.59	12.21	3.89	9.31
Sa2VA	19.00	16.26	0.70	16.36	0.10	0.00	0.00	0.11	52.16	42.68	34.18	32.31
Oryx-1.5	16.05	14.72	0.94	20.47	17.03	4.48	1.72	13.54	35.58	11.60	2.17	10.14
TimeChat	14.47	12.80	1.06	26.38	17.80	8.68	3.48	12.01	0.00	0.00	0.00	0.00
TRACE	13.78	12.45	0.90	17.60	28.53	14.17	6.73	19.74	0.00	0.00	0.00	0.00

Model	LGM	AM	Score	Acc	AP@0.1	AP@0.3	AP@0.5	m_vIoU	R1@0.3	R1@0.5	R1@0.7	m_tIoU
GPT-4o	36.79	25.53	1.71	60.78	9.29	4.18	1.19	3.01	17.13	10.04	7.25	12.82
Gemini-2-Flash	34.99	26.35	1.59	53.01	7.49	1.89	0.58	2.21	31.58	15.22	8.54	23.83
Video-CCAM-v1.2	30.88	20.54	1.75	59.35	0.00	0.00	0.00	0.00	2.19	0.00	0.00	2.26
Qwen2.5-VL	29.58	21.38	1.61	54.53	5.15	2.87	1.40	2.00	11.02	5.39	2.48	7.61
Llava-Video	27.54	21.00	1.50	49.48	4.29	1.23	0.25	1.31	16.89	5.49	2.00	12.21
InternVL-2.5	27.15	17.36	1.46	44.18	0.42	0.03	0.00	0.14	10.83	3.77	1.57	7.75
Video-Llama3	26.96	21.76	1.38	41.94	0.62	0.17	0.02	0.19	35.11	20.42	9.21	23.14
Sa2VA	21.61	17.95	0.70	16.36	58.47	49.47	40.42	37.48	0.00	0.00	0.00	0.00
VTimeLLM	19.90	15.81	1.45	41.46	0.00	0.00	0.00	0.00	8.44	4.53	2.10	5.96
VideoChat2	19.77	16.56	1.27	36.21	3.08	0.91	0.30	0.97	18.08	12.07	6.20	12.50
Qwen2-VL	17.23	15.28	1.03	25.91	7.11	3.55	1.14	2.41	24.62	16.32	8.25	17.52
TimeChat	15.08	13.33	1.06	26.38	0.00	0.00	0.00	0.00	20.42	8.54	2.53	13.60
Oryx-1.5	14.16	12.93	0.94	20.47	11.50	4.32	0.96	3.50	18.99	5.58	2.72	14.81
TRACE	12.71	11.57	0.90	17.60	0.00	0.00	0.00	0.00	24.52	12.02	5.73	17.11

Model	Animals-mAM	Animals-mLGM	Nature-mAM	Nature-mLGM	Shows-mAM	Shows-mLGM	Daily Life-mAM	Daily Life-mLGM	Sports-mAM	Sports-mLGM	Entertainments-mAM	Entertainments-mLGM	Vehicles-mAM	Vehicles-mLGM	Indoor-mAM	Indoor-mLGM	Tutorial-mAM	Tutorial-mLGM
GPT-4o	29.18	42.13	31.04	45.14	26.96	38.50	26.61	37.82	26.50	36.36	28.47	46.41	26.89	37.29	24.01	31.16	16.68	22.30
Gemini-2-Flash	27.78	36.26	23.42	28.58	28.79	38.39	23.73	29.81	24.35	31.26	33.35	52.00	24.80	31.59	22.28	27.43	38.33	56.62
Qwen2.5-VL	26.32	36.26	27.07	38.12	23.74	32.13	26.10	35.92	23.24	31.39	20.52	27.36	29.94	33.52	25.61	35.04	2.76	2.87
Video-CCAM-v1.2	23.47	39.49	26.19	50.25	28.54	56.27	19.96	29.73	21.99	35.18	16.42	22.06	25.59	46.08	21.22	32.89	15.03	19.89
Video-Llama3	21.43	26.10	24.37	31.35	24.13	31.12	18.74	22.10	22.41	28.51	24.70	31.58	23.15	29.17	20.00	24.43	24.68	33.74
Sa2VA	21.01	27.03	13.62	16.60	23.38	29.13	17.89	23.25	17.01	19.69	13.28	14.90	21.36	26.25	18.44	22.96	10.91	11.97
InternVL-2.5	20.98	28.59	17.74	22.44	20.10	28.76	17.70	22.67	18.86	24.98	15.34	18.59	20.31	27.10	17.72	22.49	9.35	10.76
Llava-Video	20.34	25.96	24.57	33.88	25.03	37.03	20.08	25.58	21.33	28.78	20.24	27.10	23.76	34.20	20.07	25.40	21.97	32.82
VideoChat2	18.50	22.58	20.23	25.05	19.11	24.67	16.03	18.98	16.88	20.23	18.27	21.46	15.91	18.93	15.68	18.65	7.09	7.60
TimeChat	18.26	22.52	12.61	14.14	13.75	15.86	12.48	14.02	13.63	15.64	11.17	12.27	15.46	18.22	13.70	15.64	6.05	6.53
Qwen2-VL	16.57	18.34	16.46	18.37	21.85	27.75	12.80	13.81	16.34	18.59	20.75	24.82	16.41	18.59	15.10	16.69	17.29	22.23
VTimeLLM	14.54	16.93	20.28	26.24	22.93	33.50	20.33	27.43	16.12	19.99	15.65	18.46	18.12	23.40	18.36	22.88	8.67	9.56
Oryx-1.5	13.58	14.70	11.05	11.83	20.00	24.40	9.82	10.39	14.39	16.03	16.14	18.85	14.09	15.43	14.30	15.60	10.34	11.78
TRACE	12.35	13.77	10.28	11.23	18.85	22.75	8.59	9.25	15.31	17.41	13.99	15.72	12.87	14.32	10.91	11.92	15.19	17.34

Submit on V-STaR Benchmark Introduction

⚠️ Please note that you first need to obtain the results/*.json files by running V-STaR from its GitHub repository. You may run an offline evaluation before submitting.

⚠️ Then, please contact us to update your results via email1 or email2.