VidHal is a benchmark designed to evaluate and analyze video-based hallucinations in Vision-Language Models (VLMs). It features a diverse set of videos covering five key temporal aspects: Action, Attribute, Object, Event Order, and Direction. To facilitate fine-grained evaluation of video hallucinations, we introduce a novel task of caption ordering alongside multiple-choice question answering. For more details, refer to our paper: VidHal: Benchmarking Hallucinations in Vision LLMs.
- [02/12/2024] Inference and evaluation code for the models evaluated in our paper is now available.
- [25/11/2024] The VidHal dataset and evaluation pipeline are now available. Instructions for evaluating your model using our code can be found in the Model Evaluation section.
The annotations and pre-defined randomized option orders for VidHal are located under the `vidhal` folder. The benchmark dataset videos can be downloaded from this link and should be extracted to `vidhal/videos`.

We provide the essential libraries and tools for running our evaluation code in `requirements.txt`. Install these dependencies, along with those required by your models, using `pip`.
We provide code for inference and evaluation on the VidHal benchmark, which can be adapted to suit your model's requirements. Our evaluation pipeline consists of two steps: first, generating model predictions on the VidHal benchmark for a specified evaluation task, and second, comparing the predictions against the ground-truth answers.
The source code for generating model predictions on VidHal instances is located in `pipelines/inference`. The skeleton code, including the prompts used in our paper for all evaluation tasks and the interfaces for running inference, is already implemented. To perform inference on VidHal with your model of choice using our pipeline, you may simply override the following two methods in `pipelines/inference/base.py`:
```python
class VidHalInferencePipeline:
    ...
    def format_prompt(
        self,
        main_prompt,
        options_prompt,
        system_prompt=None,
        *args, **kwargs):
        """
        NOTE: Implement this according to your model requirements.

        Expected return type:
            prompts (tuple): Consisting of (main_prompt, system_prompt). If only one
                prompt is used, the system prompt can optionally be left empty.
        """
        raise NotImplementedError

    def generate_response(
        self,
        model,
        video,
        main_prompt, system_prompt=None,
        generation_config={},
        *args, **kwargs):
        """
        NOTE: Implement this according to your model requirements.

        Expected return type:
            response (str): Response generated by the model.
        """
        raise NotImplementedError
    ...
```
These two methods specify the prompt format for your model and the response generation logic, respectively. Alternatively, you can create custom inference code by subclassing `VidHalInferencePipeline` and its task-specific derivatives and implementing the two methods there (a sketch of such a subclass follows the registration example below). An example of this approach for random response generation is provided here. If you choose this implementation path, add your custom inference pipelines to `pipelines/inference/__init__.py` so they can be loaded by our driver scripts. Here's an example:
```python
def get_inference_pipeline(name, task) -> VidHalInferencePipeline:
    return {
        ...
        "my_model": {
            "mcqa": MyMCQAInferencePipeline,
            "naive_ordering": MyNaiveOrderingInferencePipeline,
            "relative_ordering": MyRelativeOrderingInferencePipeline
        },
        ...
    }[name][task]
```
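For illustration, here is a minimal sketch of what such a custom pipeline could look like. In practice you would subclass the task-specific derivatives registered above; for brevity this sketch subclasses the base class directly. The class name, the `model.generate(...)` call, and its keyword arguments are placeholders for your own model's API, not part of the VidHal codebase.

```python
# Hypothetical sketch of a custom inference pipeline. Only VidHalInferencePipeline
# comes from the VidHal codebase; everything else here is a placeholder to adapt.
from pipelines.inference.base import VidHalInferencePipeline

class MyModelInferencePipeline(VidHalInferencePipeline):
    def format_prompt(self, main_prompt, options_prompt, system_prompt=None, *args, **kwargs):
        # Combine the main prompt and the candidate-caption options into a single
        # user prompt, and pass the (possibly empty) system prompt through unchanged.
        return f"{main_prompt}\n{options_prompt}", system_prompt

    def generate_response(self, model, video, main_prompt, system_prompt=None,
                          generation_config={}, *args, **kwargs):
        # Replace this with your model's actual inference call; `model.generate`
        # with these keyword arguments is assumed purely for illustration.
        return model.generate(
            video=video,
            prompt=main_prompt,
            system_prompt=system_prompt,
            **generation_config,
        )
```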
Our inference pipeline code automatically loads the specified models for their corresponding inference tasks. To add your model to the model repository, simply place your model files in the `models` directory and update the `load_model` function in `models/__init__.py` accordingly. For more details, please refer to the code here.
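As a rough sketch (not the actual contents of `models/__init__.py`), registering a new model in `load_model` could look like the following; the `models.my_model` package, the `load_my_model` helper, and the exact signature of `load_model` are assumptions that should be adapted to the real code.

```python
# models/__init__.py (hypothetical sketch — match the real signature of load_model
# in the VidHal codebase; the names below are placeholders).
def load_model(name, *args, **kwargs):
    ...
    if name == "my_model":
        # Your own model package and loading function, placed under models/.
        from models.my_model import load_my_model
        return load_my_model(*args, **kwargs)
    ...
```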
Finally, model responses can be generated by running `inference.py` with the required arguments. An example run is provided below:
```sh
python inference.py \
    --model <my_model> \
    --task <task> \
    --annotations_path <annotations_path> \
    --videos_path <videos_path> \
    --save_path <save_path>
```
where `<task>` specifies the evaluation task to be run and is selected from `mcqa`, `naive_ordering`, or `relative_ordering`.
Command-line scripts for running `inference.py` with the desired arguments are also provided in the `scripts/inference` directory. `scripts/<task>/run_random_inference.sh` presents an example for generating random predictions, which can be referenced to create your own driver script.
Once predictions for the selected evaluation task are generated, you can evaluate these responses by running `evaluate.py`. An example run is shown below:
```sh
python evaluate.py \
    --task <task> \
    --annotations_path <annotations_path> \
    --predictions_path <path_to_my_model_predictions> # Same as <save_path> used during inference
```
As with the inference stage, command-line scripts for running `evaluate.py` are provided in the `scripts/evaluation` directory.
We provide the codebase for evaluating the VideoChat2, VideoLLaMA2, mPLUG-Owl3, and LLaVA-NeXT-Video models used in our paper. The required libraries for each model should be installed according to the instructions in their original repositories (refer to the acknowledgements section for links). Additionally, we include code for performing inference on VidHal with the proprietary GPT-4o and Gemini models, which can be used directly with your corresponding API keys.
We evaluate several state-of-the-art video VLMs on the VidHal benchmark and present their results below.
VLM | MCQA | Naive Caption Ordering | Relative Caption Ordering |
---|---|---|---|
VideoChat2 (Vicuna) | 0.410 | 0.490 | 0.573 |
VideoChat2 (Mistral) | 0.524 | 0.348 | 0.579 |
VideoChat2 (Phi3) | 0.468 | 0.552 | 0.522 |
mPLUG-Owl3 | 0.596 | 0.641 | 0.707 |
LLaVA-NeXT-Video (7B) | 0.509 | 0.518 | 0.620 |
LLaVA-NeXT-Video (32B) | 0.663 | 0.641 | 0.747 |
VideoLLaMA2 (7B) | 0.541 | 0.564 | 0.622 |
VideoLLaMA2 (72B) | 0.647 | 0.787 | 0.760 |
GPT-4o | 0.772 | 0.840 | 0.826 |
Gemini-1.5 Flash | 0.657 | 0.738 | 0.745 |
Gemini-1.5 Pro | 0.671 | 0.765 | 0.753 |
We sincerely thank the original authors of the following works for making their codebases publicly available, enabling the evaluation of their models on our VidHal benchmark:
If you find our work valuable or useful for your research, please consider citing it.
```bibtex
@article{choong2024vidhal,
    title={VidHal: Benchmarking Temporal Hallucinations in Vision LLMs},
    author={Wey Yeh Choong and Yangyang Guo and Mohan Kankanhalli},
    journal={arXiv preprint arXiv:2411.16771},
    year={2024}
}
```