- **2026.01.15** 🌐 Our official leaderboard is now live! Welcome to test and submit.
- **2026.01.14** 🏷️ We update `VideoDR.csv` with additional `Category` and `Difficulty` labels.
- **2026.01.12** 🌟 We release the VideoDR benchmark data. You can download it from there.
- **2026.01.11** 🌟 We are very proud to launch VideoDR, the first-ever video deep research benchmark!
🚀 VideoDR is the first video deep research benchmark!
It is designed to evaluate the capability of video agents to perform complex reasoning over video content while leveraging the Open Web 🌐.
- 🎞️ Multi-frame Visual Cues: Accurately identify key information that spans multiple video frames.
- 🌍 Interactive Search: Interact with a browser environment to perform multi-hop deep search.
- 🧩 Evidence Synthesis: Combine video clues and web evidence to produce a verifiable, factual answer (a minimal illustrative loop is sketched below).
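The sketch below is purely illustrative and is not part of the official harness: `sample_frames`, `web_search`, and `answer_with_llm` are hypothetical placeholders, but the three stages mirror the bullets above.

```python
# Illustrative skeleton of a video deep research agent; NOT the official VideoDR pipeline.
# sample_frames, web_search, and answer_with_llm are hypothetical placeholders.

def sample_frames(video_path: str, every_n_seconds: int = 5) -> list:
    """Placeholder: return key frames sampled from the video."""
    raise NotImplementedError

def web_search(query: str) -> list[str]:
    """Placeholder: return text snippets from an open-web search tool."""
    raise NotImplementedError

def answer_with_llm(question: str, frames: list, evidence: list[str]) -> str:
    """Placeholder: synthesize a final, verifiable answer from all clues."""
    raise NotImplementedError

def video_deep_research(video_path: str, question: str) -> str:
    frames = sample_frames(video_path)        # 1. multi-frame visual cues
    evidence = web_search(question)           # 2. interactive, multi-hop search
    return answer_with_llm(question, frames, evidence)  # 3. evidence synthesis
```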
We provide LLM-based evaluation tools (`llm_as_judge`) for model evaluation and failure analysis. First, install the dependencies:
```bash
cd llm_as_judge
pip install -r requirements.txt
```

Create a `.env` file in the `llm_as_judge` directory:
```
LLM_BASE_URL=your_api_base_url
LLM_API_KEY=your_api_key
```
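As a rough, non-authoritative sketch of how these credentials might be consumed (assuming an OpenAI-compatible endpoint and a placeholder model name; the actual client code in `judge_answers.py` may differ):

```python
# Illustrative sketch only: loading LLM_BASE_URL / LLM_API_KEY from .env.
# The real judge_answers.py may structure this differently.
import os

from dotenv import load_dotenv   # pip install python-dotenv
from openai import OpenAI        # assumes an OpenAI-compatible API

load_dotenv()  # reads the .env file in the current directory

client = OpenAI(
    base_url=os.environ["LLM_BASE_URL"],
    api_key=os.environ["LLM_API_KEY"],
)

response = client.chat.completions.create(
    model="your-judge-model",  # placeholder model name
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
)
print(response.choices[0].message.content)
```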
Run the LLM judge on your model predictions:

```bash
python llm_as_judge/src/judge_answers.py \
    --workers 5 \
    --predictions llm_as_judge/data/predictions.json
```
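The exact schema of the `--predictions` file is defined by the repository; purely for illustration, a minimal per-item shape (question, model prediction, reference answer) might be written like this, with field names adjusted to whatever `judge_answers.py` actually parses:

```python
# Hypothetical example: field names are assumptions, not the official schema.
import json

predictions = [
    {
        "id": "<item id>",
        "question": "<question text>",
        "prediction": "<model answer>",
        "ground_truth": "<reference answer>",
    },
]

with open("llm_as_judge/data/predictions.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False, indent=2)
```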
To analyze failure types:

```bash
python llm_as_judge/src/analyze_failure_types.py \
    --excel_file llm_as_judge/data/Video-LLM.xlsx \
    --trace_dir results/traces \
    --max_workers 4
```
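If you want a quick look at the judged results before running the analysis script, a simple tally is easy to do yourself; note that the output path and the `failure_type` field below are assumptions, not the tool's documented format:

```python
# Illustrative only: count failure types in a judged-results JSON file.
# The file path and the "failure_type" field name are assumptions.
import json
from collections import Counter

with open("results/judged_answers.json", encoding="utf-8") as f:
    records = json.load(f)

counts = Counter(r.get("failure_type", "unknown") for r in records)
for failure_type, n in counts.most_common():
    print(f"{failure_type}: {n}")
```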
If you find this benchmark useful for your research, please cite:
```bibtex
@article{liu2026watching,
  title={Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning},
  author={Liu, Chengwen and Yu, Xiaomin and Chang, Zhuoyue and Huang, Zhe and Zhang, Shuo and Lian, Heng and Wang, Kunyi and Xu, Rui and Hu, Sen and Hou, Jianheng and others},
  journal={arXiv preprint arXiv:2601.06943},
  year={2026}
}
```

Have a question or just want to say hi? Feel free to reach out:
📧 Email: yuxm02@gmail.com
