
VideoDR: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

[Figure: VideoDR task overview]


🔥 News

  • 2026.01.15 🌐 Our official leaderboard is now live! You are welcome to evaluate your models and submit results.
  • 2026.01.14 🏷️ We updated VideoDR.csv with additional Category and Difficulty labels.
  • 2026.01.12 🌟 We released the VideoDR benchmark data, which you can download from this repository.
  • 2026.01.11 🌟 We are proud to launch VideoDR, the first-ever video deep research benchmark!

🎥 Video Deep Research | VideoDR

🚀 VideoDR is the first video deep research benchmark!

It is designed to evaluate the ability of video agents to perform complex reasoning over video content while leveraging the open web 🌐.

👇 VideoDR requires an agent to possess the following core capabilities:

  • 🎞️ Multi-frame Visual Cues: Accurately identify continuous key information from multiple video frames.
  • 🌍 Interactive Search: Interact with a browser environment to perform multi-hop deep search.
  • 🧩 Evidence Synthesis: Combine video clues and web evidence to provide a verifiable factual answer.

🔧 Evaluation Tools

We provide an LLM-based judging toolkit (llm_as_judge) for scoring model answers and analyzing failures.

Installation

```bash
cd llm_as_judge
pip install -r requirements.txt
```

Configuration

Create a .env file in the llm_as_judge directory:

```
LLM_BASE_URL=your_api_base_url
LLM_API_KEY=your_api_key
```
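If you prefer not to add a dependency for reading the .env file, a minimal loader can be sketched as below. This is an illustrative sketch, not the repo's actual loading code (which may use a library such as python-dotenv); it assumes plain `KEY=VALUE` lines with `#` comments.

```python
import os

def load_env(path=".env"):
    """Parse a minimal .env file (KEY=VALUE lines) into os.environ.

    Blank lines and lines starting with '#' are ignored; values are
    taken verbatim after the first '='.
    """
    values = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    os.environ.update(values)
    return values
```

After calling `load_env()`, the judge scripts can read `LLM_BASE_URL` and `LLM_API_KEY` from the environment as usual.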

LLM as Judge

```bash
python llm_as_judge/src/judge_answers.py \
    --workers 5 \
    --predictions llm_as_judge/data/predictions.json
```
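Conceptually, the judge fans prediction records out to a pool of workers (matching the `--workers` flag) and collects a verdict per record. The sketch below illustrates that pattern only: the record schema is hypothetical, and a strict string comparison stands in for the actual LLM judge call.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical record shape; the real schema is defined by predictions.json.
predictions = [
    {"id": "q1", "prediction": "Paris", "ground_truth": "Paris"},
    {"id": "q2", "prediction": "Lyon", "ground_truth": "Paris"},
]

def judge(record):
    """Stand-in judge: the real tool asks an LLM to compare answers."""
    correct = record["prediction"].strip().lower() == record["ground_truth"].strip().lower()
    return {"id": record["id"], "correct": correct}

# ThreadPoolExecutor mirrors the --workers 5 concurrency flag.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(judge, predictions))
```

Threads suit this workload because each judgment is an I/O-bound API call rather than CPU-bound work.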

Failure analysis

```bash
python llm_as_judge/src/analyze_failure_types.py \
    --excel_file llm_as_judge/data/Video-LLM.xlsx \
    --trace_dir results/traces \
    --max_workers 4
```

📚 Citation

If you find this benchmark useful for your research, please cite:

```bibtex
@article{liu2026watching,
  title={Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning},
  author={Liu, Chengwen and Yu, Xiaomin and Chang, Zhuoyue and Huang, Zhe and Zhang, Shuo and Lian, Heng and Wang, Kunyi and Xu, Rui and Hu, Sen and Hou, Jianheng and others},
  journal={arXiv preprint arXiv:2601.06943},
  year={2026}
}
```

✉️ Contact

Have a question, or just want to say hi? Feel free to reach out:

📧 Email: yuxm02@gmail.com
