# VidHal: Benchmarking Hallucinations in Vision LLMs

VidHal is a benchmark designed to evaluate and analyze video-based hallucinations in Vision-Language Models (VLMs). It features a diverse set of videos covering five key temporal aspects: _Action, Attribute, Object, Event Order_, and _Direction_. To facilitate fine-grained evaluation of video hallucinations, we introduce a novel task of **caption ordering** alongside multiple-choice question answering. For more details, refer to our paper: [VidHal: Benchmarking Hallucinations in Vision LLMs](https://arxiv.org/abs/2411.16771).

## Updates

- [02/12/2024] Inference and evaluation code for the models evaluated in our paper is now available.
- [25/11/2024] The VidHal dataset and evaluation pipeline are now available. Instructions for evaluating your model using our code can be found in the [Model Evaluation](#model-evaluation) section.

## Getting Started

### Dataset Download

The annotations and pre-defined randomized option orders for VidHal are located under the `vidhal` folder. The benchmark videos can be downloaded from this [link](https://drive.google.com/file/d/1Lrt-ZDv4V09uONAcW34ak7JUNUruHTZl/view?usp=sharing) and should be extracted to `vidhal/videos`.

### Environment Setup

We provide the essential libraries and tools for running our evaluation code in `requirements.txt`. Install these dependencies, along with those required by your models, using `pip`.

## Model Evaluation

We provide code for inference and evaluation on the VidHal benchmark, which can be adapted to suit your model's needs and requirements. Our evaluation pipeline consists of two steps: first, generating model predictions on the VidHal benchmark for a specified evaluation task, and second, comparing the predictions against the ground-truth answers.

### Inference

The source code for generating model predictions on VidHal instances is located in `pipelines/inference`. The skeleton code, including the prompts used in our paper for all evaluation tasks, along with interfaces for running inference, is already implemented. To perform inference on VidHal with your model of choice using our pipeline, you may simply override the following methods in `pipelines/inference/base.py`:

```
class VidHalInferencePipeline:
    ...
    def format_prompt(
        self,
        main_prompt, options_prompt,
        system_prompt=None,
        *args, **kwargs
    ):
        """
        NOTE: Implement this according to your model requirements

        Expected return type:
            prompts (tuple): Consisting of (main_prompt, system_prompt). If only one prompt
                is used, the system prompt can optionally be left empty.
        """
        raise NotImplementedError

    def generate_response(
        self,
        model, video,
        main_prompt, system_prompt=None,
        generation_config={},
        *args, **kwargs
    ):
        """
        NOTE: Implement this according to your model requirements

        Expected return type:
            response (str): Response generated by the model.
        """
        raise NotImplementedError
    ...
```

These two methods specify the prompt format for your model and the response generation logic, respectively. Alternatively, you can create custom inference code by subclassing `VidHalInferencePipeline` and its task-specific derivatives and implementing the two methods above in those subclasses. An example of this approach for random response generation is provided [here](https://github.com/Lookuz/VidHal/blob/master/pipelines/inference/random.py).
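For illustration, a minimal sketch of such a custom pipeline is shown below. It assumes a hypothetical model exposing a `generate(video=..., prompt=..., ...)` method; the class name, module path, and model call are placeholders rather than part of the VidHal codebase, so adapt them to your model's actual API.

```
# pipelines/inference/my_model.py -- illustrative sketch only; the model.generate(...)
# call and class name below are assumptions about your own model, not part of VidHal.
from pipelines.inference.base import VidHalInferencePipeline

class MyModelInferencePipeline(VidHalInferencePipeline):
    def format_prompt(self, main_prompt, options_prompt, system_prompt=None, *args, **kwargs):
        # Present the question followed by the candidate captions as a single user prompt.
        main_prompt = f"{main_prompt}\n{options_prompt}"
        # Return (main_prompt, system_prompt); system_prompt may be None if your model does not use one.
        return main_prompt, system_prompt

    def generate_response(self, model, video, main_prompt, system_prompt=None, generation_config={}, *args, **kwargs):
        # Hypothetical call -- replace with however your model consumes video and text inputs.
        response = model.generate(
            video=video,
            prompt=main_prompt,
            system_prompt=system_prompt,
            **generation_config,
        )
        return response  # Must be a plain string.
```

Since the prompts for all evaluation tasks are already implemented in the skeleton code, these two methods are typically all that needs to change for a new model.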
If you take this approach, add your custom inference pipelines to `pipelines/inference/__init__.py` so they can be loaded by our driver scripts. Here's an example:

```
def get_inference_pipeline(name, task) -> VidHalInferencePipeline:
    return {
        ...
        "my_model": {
            "mcqa": MyMCQAInferencePipeline,
            "naive_ordering": MyNaiveOrderingInferencePipeline,
            "relative_ordering": MyRelativeOrderingInferencePipeline
        },
        ...
    }[name][task]
```

Our inference pipeline code automatically loads the specified model for its corresponding inference task. To add your model to the model repository, place your model files in the `models` directory and update the `load_model` function in `models/__init__.py` accordingly (an illustrative sketch is shown after the [Models](#models) section below). For more details, please refer to the code [here](https://github.com/Lookuz/VidHal/blob/master/models/__init__.py).

Finally, model responses can be generated by running `inference.py` with the required arguments. An example run is provided below:

```
python inference.py \
    --model <my_model> \
    --task <task> \
    --annotations_path <annotations_path> \
    --videos_path <videos_path> \
    --save_path <save_path>
```

where `<task>` specifies the evaluation task to run and is one of `mcqa`, `naive_ordering`, or `relative_ordering`. Command-line scripts for running `inference.py` with the desired arguments are also provided in the `scripts/inference` directory. `scripts/<task>/run_random_inference.sh` presents an example for generating random predictions, which can be referenced to create your own driver script.

### Evaluation

Once predictions for the selected evaluation task have been generated, you can evaluate these responses by running `evaluate.py`. An example run is shown below:

```
python evaluate.py \
    --task <task> \
    --annotations_path <annotations_path> \
    --predictions_path <path_to_my_model_predictions> # This should be the same as <save_path> in inference
```

As with the inference stage, command-line scripts for running `evaluate.py` are provided in the `scripts/evaluation` directory.

## Models

We provide the codebase for evaluating the [VideoChat2](https://github.com/Lookuz/VidHal/blob/master/pipelines/inference/videochat2.py), [VideoLLaMA2](https://github.com/Lookuz/VidHal/blob/master/pipelines/inference/videochat2.py), [mPLUG-Owl3](https://github.com/Lookuz/VidHal/blob/master/pipelines/inference/mplug_owl3.py), and [LLaVA-NeXT-Video](https://github.com/Lookuz/VidHal/blob/master/pipelines/inference/llava.py) models used in our paper. The required libraries for each model should be installed according to the specifications in their original source code (refer to the acknowledgements section for links to these repositories). Additionally, we include code for performing inference on VidHal using the proprietary models GPT-4o and Gemini, which can be used directly with your corresponding API keys.
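If you are adding your own model alongside the ones above, the inference step described earlier requires extending `load_model` in `models/__init__.py`. A minimal sketch is shown below; the exact `load_model` signature, the existing branches, and the `MyModel` class are assumptions for illustration, so refer to the actual file linked in the [Inference](#inference) section.

```
# models/__init__.py -- illustrative sketch only. The real load_model signature and the
# branches for the provided models may differ; see the file linked in the Inference section.
def load_model(model_name, *args, **kwargs):
    ...
    if model_name == "my_model":  # Hypothetical name; must match the key used in get_inference_pipeline
        from models.my_model import MyModel  # Hypothetical module you place under models/
        return MyModel(*args, **kwargs)
    ...
    raise NotImplementedError(f"Model {model_name} is not supported.")
```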
## Evaluation Results

We evaluate several state-of-the-art video VLMs on the VidHal benchmark and present their results below.

| VLM | MCQA | Naive Caption Ordering | Relative Caption Ordering |
|:----------------------:|:-----:|:----------------------:|:-------------------------:|
| VideoChat2 (Vicuna) | 0.410 | 0.490 | 0.573 |
| VideoChat2 (Mistral) | 0.524 | 0.348 | 0.579 |
| VideoChat2 (Phi3) | 0.468 | 0.552 | 0.522 |
| mPLUG-Owl3 | 0.596 | 0.641 | 0.707 |
| LLaVA-NeXT-Video (7B) | 0.509 | 0.518 | 0.620 |
| LLaVA-NeXT-Video (32B) | 0.663 | 0.641 | 0.747 |
| VideoLLaMA2 (7B) | 0.541 | 0.564 | 0.622 |
| VideoLLaMA2 (72B) | 0.647 | 0.787 | 0.760 |
| GPT-4o | 0.772 | 0.840 | 0.826 |
| Gemini-1.5 Flash | 0.657 | 0.738 | 0.745 |
| Gemini-1.5 Pro | 0.671 | 0.765 | 0.753 |

# Acknowledgements

We sincerely thank the original authors of the following works for making their codebases publicly available, enabling the evaluation of their models on our VidHal benchmark:

- [VideoChat2](https://github.com/OpenGVLab/Ask-Anything)
- [VideoLLaMA2](https://github.com/DAMO-NLP-SG/VideoLLaMA2)
- [mPLUG-Owl3](https://github.com/X-PLUG/mPLUG-Owl)
- [LLaVA-NeXT-Video](https://github.com/LLaVA-VL/LLaVA-NeXT)

# Citation

If you find our work valuable or useful for your research, please consider citing it.

```
@article{choong2024vidhal,
    title={VidHal: Benchmarking Temporal Hallucinations in Vision LLMs},
    author={Wey Yeh Choong and Yangyang Guo and Mohan Kankanhalli},
    journal={arXiv preprint arXiv:2411.16771},
    year={2024}
}
```