VidHal: Benchmarking Hallucinations in Vision LLMs

VidHal is a benchmark designed to evaluate and analyze video-based hallucinations in Vision-Language Models (VLMs). It features a diverse set of videos covering five key temporal aspects: Action, Attribute, Object, Event Order, and Direction. To facilitate fine-grained evaluation of video hallucinations, we introduce a novel task of caption ordering alongside multiple-choice question answering. For more details, refer to our paper: VidHal: Benchmarking Hallucinations in Vision LLMs.

Updates

  • [02/12/2024] Inference and evaluation code for the models evaluated in our paper is now available.
  • [25/11/2024] The VidHal dataset and evaluation pipeline are now available. Instructions for evaluating your model using our code can be found in the Model Evaluation section.

Getting Started

Dataset Download

The annotations and pre-defined randomized option orders for VidHal are located under the vidhal folder. The benchmark videos can be downloaded from this link and should be extracted to vidhal/videos.
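After extraction, you can confirm the layout with a quick check along the lines of the sketch below (a minimal sketch; only the folder paths named above are used, and no annotation file names are assumed).

from pathlib import Path

# Illustrative sanity check only; paths follow the instructions above.
annotations_dir = Path("vidhal")        # annotations and pre-defined option orders
videos_dir = Path("vidhal/videos")      # benchmark videos extracted from the download

assert annotations_dir.is_dir(), "Missing the vidhal annotations folder"
assert videos_dir.is_dir(), "Extract the downloaded videos to vidhal/videos"
print(f"Found {sum(1 for _ in videos_dir.iterdir())} files under {videos_dir}")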

Environment Setup

The libraries and tools needed to run our evaluation code are listed in requirements.txt. Install these dependencies with pip, along with any additional ones required by your models.

Model Evaluation

We provide code for inference and evaluation on the VidHal benchmark, which can be adapted to suit your model's needs and requirements. Our evaluation pipeline consists of two steps: first, generating model predictions on the VidHal benchmark for a specified evaluation task, and second, comparing the predictions to the ground-truth answers.

Inference

The source code for generating model predictions on VidHal instances is located in pipelines/inference. The skeleton code is already implemented, including the prompts used in our paper for all evaluation tasks and the interfaces for running inference. To perform inference on VidHal with your model of choice using our pipeline, simply override the following methods in pipelines/inference/base.py:

class VidHalInferencePipeline:
    ...
    def format_prompt(
        self, 
        main_prompt, 
        options_prompt, 
        system_prompt=None, 
        *args, **kwargs):
        """
        NOTE: Implement this according to your model requirements

        Expected return type:
            prompts (tuple): Consisting of (main_prompt, system_prompt). If only one prompt is used, system prompt can be left optionally empty
        """
        raise NotImplementedError

    def generate_response(
        self, 
        model, 
        video, 
        main_prompt, system_prompt=None,
        generation_config={},
        *args, **kwargs):
        """
        NOTE: Implement this according to your model requirements

        Expected return type:
            response (str) : Response generated by the model.
        """
        raise NotImplementedError
    ...

which specify the prompt format for your model and the response generation logic, respectively. Alternatively, you can create custom inference code by subclassing VidHalInferencePipeline and its task-specific derivatives, implementing the two methods above in those subclasses; a minimal sketch of such a subclass is shown after the registration example below. An example of this approach for random response generation is provided here. If you choose this path, add your custom inference pipelines to pipelines/inference/__init__.py so they can be loaded by our driver scripts. Here's an example:

def get_inference_pipeline(name, task) -> VidHalInferencePipeline:
    return {
        ...
        "my_model": {
            "mcqa": MyMCQAInferencePipeline,
            "naive_ordering": MyNaiveOrderingInferencePipeline,
            "relative_ordering": MyRelativeOrderingInferencePipeline
        },
        ...
    }[name][task]
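For reference, a minimal sketch of such a subclass is shown below. MyMCQAInferencePipeline and the model.generate(...) call are hypothetical placeholders rather than interfaces provided by this repository; in practice, subclass the task-specific derivative that matches your evaluation task and substitute your model's actual generation API.

from pipelines.inference.base import VidHalInferencePipeline

class MyMCQAInferencePipeline(VidHalInferencePipeline):
    def format_prompt(self, main_prompt, options_prompt, system_prompt=None, *args, **kwargs):
        # Merge the question and candidate captions into a single user prompt,
        # and pass the (possibly empty) system prompt through unchanged.
        return f"{main_prompt}\n{options_prompt}", system_prompt

    def generate_response(
        self, model, video, main_prompt, system_prompt=None,
        generation_config={}, *args, **kwargs
    ):
        # Replace this with your model's own inference call; `model.generate`
        # is an assumed interface used purely for illustration.
        response = model.generate(
            video=video,
            prompt=main_prompt,
            system_prompt=system_prompt,
            **generation_config,
        )
        return response.strip()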

Our inference pipeline code automatically loads the specified models for their corresponding inference tasks. To add your model to the model repository, simply place your model files in the models directory and update the load_model function in the models/__init__.py file accordingly. For more details, please refer to the code here.
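A rough sketch of that update is shown below; the actual load_model signature and existing entries are omitted here, so treat this as an illustration of the pattern rather than the repository's real code.

# models/__init__.py (illustrative excerpt; the real signature may differ)
def load_model(name, *args, **kwargs):
    if name == "my_model":
        # Hypothetical module: place your implementation under the models directory
        from models.my_model import MyModel
        return MyModel(*args, **kwargs)
    ...  # existing models are handled as before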

Finally, model responses can be generated by running inference.py with the required arguments. An example run is provided below:

python inference.py \
    --model <my_model> \
    --task <task> \
    --annotations_path <annotations_path> \
    --videos_path <videos_path> \
    --save_path <save_path> 

where <task> specifies the evaluation task to run, one of mcqa, naive_ordering, or relative_ordering.

Command-line scripts for running inference.py with the desired arguments are also provided in the scripts/inference directory. scripts/<task>/run_random_inference.sh presents an example for generating random predictions, which can be referenced to create your own driver script.

Evaluation

Once predictions for the selected evaluation task are generated, you can evaluate these responses by running evaluate.py. An example run is shown below:

python evaluate.py \
    --task <task> \
    --annotations_path <annotations_path> \
    --predictions_path <path_to_my_model_predictions> # This should be the same as <save_path> in inference

As with the inference stage, command-line scripts for running evaluate.py are provided in the scripts/evaluation directory.

Models

We provide the codebase for evaluating the VideoChat2, VideoLLaMA2, mPLUG-Owl3, and LLaVA-NeXT-Video models reported in our paper. The required libraries for each model should be installed according to the specifications in their original source code (refer to the acknowledgements section for links to these repositories). Additionally, we include code for performing inference on VidHal using the proprietary models GPT-4o and Gemini, which can be used directly with your corresponding API keys.

Evaluation Results

We evaluate several state-of-the-art video VLMs on the VidHal benchmark and present their results below.

| VLM | MCQA | Naive Caption Ordering | Relative Caption Ordering |
| --- | --- | --- | --- |
| VideoChat2 (Vicuna) | 0.410 | 0.490 | 0.573 |
| VideoChat2 (Mistral) | 0.524 | 0.348 | 0.579 |
| VideoChat2 (Phi3) | 0.468 | 0.552 | 0.522 |
| mPLUG-Owl3 | 0.596 | 0.641 | 0.707 |
| LLaVA-NeXT-Video (7B) | 0.509 | 0.518 | 0.620 |
| LLaVA-NeXT-Video (32B) | 0.663 | 0.641 | 0.747 |
| VideoLLaMA2 (7B) | 0.541 | 0.564 | 0.622 |
| VideoLLaMA2 (72B) | 0.647 | 0.787 | 0.760 |
| GPT-4o | 0.772 | 0.840 | 0.826 |
| Gemini-1.5 Flash | 0.657 | 0.738 | 0.745 |
| Gemini-1.5 Pro | 0.671 | 0.765 | 0.753 |

Acknowledgements

We sincerely thank the original authors of VideoChat2, VideoLLaMA2, mPLUG-Owl3, and LLaVA-NeXT-Video for making their codebases publicly available, enabling the evaluation of their models on our VidHal benchmark.

Citation

If you find our work valuable or useful for your research, please consider citing it.

@article{choong2024vidhal,
    title={VidHal: Benchmarking Temporal Hallucinations in Vision LLMs}, 
    author={Wey Yeh Choong and Yangyang Guo and Mohan Kankanhalli},
    journal={arXiv preprint arXiv:2411.16771},
    year={2024}
}