ConVis: Contrastive Decoding with Hallucination Visualization for Mitigating Hallucinations in Multimodal Large Language Models
This repository provides the official PyTorch implementation of the following paper:
ConVis: Contrastive Decoding with Hallucination Visualization for Mitigating Hallucinations in Multimodal Large Language Models
Yeji Park†, Deokyeong Lee†, Junsuk Choe, Buru Chang
Sogang University
† These authors contributed equally to this work.
Our implementation is built upon several existing repositories. Specifically, we have borrowed and adapted code from the following sources:
- HALC: Used as the foundational base for our implementation.
- LLaVA: Used to evaluate models on the LLaVA-Bench benchmark.
- HallusionBench: Integrated to evaluate models on the HallusionBench benchmark.
We sincerely thank the authors for their foundational work.
- Installation
- Download Datasets
- Prepare T2I Generated Images
- Prepare MLLM Checkpoints
- Evaluation
- Demo Playground
- License
We provide a Dockerfile that sets up the required environment. The Docker image is based on Ubuntu 22.04 and CUDA 12.0.0.
To install, run the following commands:
- Clone the repository locally.
git clone <current repo>
- Build the Docker image.
docker build -t convis:<your_tag> .
- Run the container.
docker run -itd --name <container name> -v <local repo path>:/root/share/ -p 14352:8888 -p 14353:8889 -p 14354:8890 --shm-size=128G --gpus all -m "128G" --restart=always --ipc=host convis:<your_tag> /bin/bash -c "pip install -e /root/share/transformers-4.36.2 && tail -f /dev/null"
- Open the container.
docker exec -it <container name> /bin/bash
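The container's start-up command installs the bundled transformers-4.36.2 in editable mode. To confirm that step succeeded, you can query the installed version from inside the container. The helper below is our own sketch and is not part of the repository. It assumes the editable install reports the version string 4.36.2, which we infer from the folder name.

```python
from __future__ import annotations

from importlib.metadata import PackageNotFoundError, version

# Assumption: the editable install of /root/share/transformers-4.36.2
# reports this version string (inferred from the folder name).
EXPECTED_TRANSFORMERS = "4.36.2"


def check_package_version(name: str, expected: str) -> tuple[bool, str | None]:
    """Return (matches_expected, installed_version); installed_version is None if absent."""
    try:
        installed = version(name)
    except PackageNotFoundError:
        return False, None
    return installed == expected, installed


if __name__ == "__main__":
    ok, found = check_package_version("transformers", EXPECTED_TRANSFORMERS)
    if ok:
        print(f"transformers {found} is installed as expected.")
    elif found is None:
        print("transformers is not installed; re-run the pip install step.")
    else:
        print(f"Found transformers {found}, expected {EXPECTED_TRANSFORMERS}.")
```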
Download the MSCOCO 2014 dataset for the CHAIR and POPE evaluations, and extract it into your data path.
To ensure the dataset is organized correctly, follow the structure below:
COCO2014/
├── annotations/
│ ├── captions_val2014.json
│ ├── captions_train2014.json
│ ├── instances_train2014.json
│ ├── instances_val2014.json
│ ├── person_keypoints_train2014.json
│ └── person_keypoints_val2014.json
├── train2014/
│ ├── COCO_train2014_000000000001.jpg
│ ├── COCO_train2014_000000000002.jpg
│ └── ...
├── val2014/
│ ├── COCO_val2014_000000000042.jpg
│ ├── COCO_val2014_000000000073.jpg
│ └── ...
└── ...
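Before running CHAIR or POPE, you may want to check that your copy matches the tree above. The helper below is a hypothetical check of our own and is not part of the repository. It only looks for the folders and annotation files shown in the tree, and it does not validate image contents.

```python
from __future__ import annotations

from pathlib import Path

# Entries taken from the directory tree above.
REQUIRED_DIRS = ["annotations", "train2014", "val2014"]
REQUIRED_ANNOTATIONS = [
    "captions_val2014.json",
    "captions_train2014.json",
    "instances_train2014.json",
    "instances_val2014.json",
    "person_keypoints_train2014.json",
    "person_keypoints_val2014.json",
]


def missing_coco_entries(root: str | Path) -> list[str]:
    """Return the relative paths from the expected layout that are missing under root."""
    root = Path(root)
    missing = [d for d in REQUIRED_DIRS if not (root / d).is_dir()]
    missing += [
        f"annotations/{name}"
        for name in REQUIRED_ANNOTATIONS
        if not (root / "annotations" / name).is_file()
    ]
    return missing


if __name__ == "__main__":
    import sys

    coco_root = sys.argv[1] if len(sys.argv) > 1 else "COCO2014"
    problems = missing_coco_entries(coco_root)
    print("COCO2014 layout looks complete." if not problems else f"Missing: {problems}")
```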
First, clone the original HallusionBench repository.
git clone https://github.com/tianyi-lab/HallusionBench.git
Next, download the dataset from the official HallusionBench dataset link and place it under the HallusionBench directory, keeping the following structure:
HallusionBench/
└── hallusion_bench/
Please follow the instructions in the official MME benchmark repository to download the dataset.
To download LLaVA-Bench (In-the-Wild), run the following commands:
apt-get install git-lfs && \
git lfs install && \
git clone https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild && \
cd llava-bench-in-the-wild && \
wget https://github.com/haotian-liu/LLaVA/raw/main/llava/eval/table/rule.json
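If Git LFS is not active when you clone, some files in the dataset repository may arrive as small text pointer files instead of their real contents. The sketch below is our own helper and is not part of the repository. It checks that the files used by the GPT-4 review command later in this README are present and are not unresolved LFS pointers. The file list is taken from that command and from the wget above.

```python
from __future__ import annotations

import json
from pathlib import Path

# File names from the wget above and the GPT-4 review command later in this README.
EXPECTED_FILES = ["questions.jsonl", "context.jsonl", "answers_gpt4.jsonl", "rule.json"]
LFS_POINTER_PREFIX = "version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: str | Path) -> bool:
    """True if the file is an unresolved Git LFS pointer rather than real content."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.readline().startswith(LFS_POINTER_PREFIX)


def load_jsonl(path: str | Path) -> list[dict]:
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def check_llava_bench(root: str | Path = "llava-bench-in-the-wild") -> dict[str, str]:
    """Report 'ok', 'missing', or 'lfs-pointer' for each expected file."""
    root = Path(root)
    status = {}
    for name in EXPECTED_FILES:
        path = root / name
        if not path.is_file():
            status[name] = "missing"
        elif is_lfs_pointer(path):
            status[name] = "lfs-pointer"
        else:
            status[name] = "ok"
    return status
```

If a file reports "lfs-pointer", running git lfs pull inside the cloned folder should fetch the real contents.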
Our method requires text-to-image (T2I) generated images during decoding. We provide the captions used to generate the images in our experiments.
- Download the Captions and Images
Download the .zip file containing the T2I-generated images from the following link.
- Unzip the File
Unzip the downloaded .zip file under the root directory of this repository.
- Generate Images
Ensure you are in the root directory of this project. Then, run the following command to generate the images:
bash image_generation.sh
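Once the images are in place, a quick count per benchmark can catch an incomplete download or generation run. This is a hypothetical helper of our own. The folder names are inferred from the --images_path values in the evaluation commands of this README. We have not confirmed that image_generation.sh writes to exactly these folders, so adjust the list if yours differ.

```python
from __future__ import annotations

from pathlib import Path

# Assumed sub-folders, inferred from the --images_path values used in this README.
BENCHMARK_DIRS = ["CHAIR", "POPE", "MME", "HallusionBench", "LLaVA-Bench"]
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def count_generated_images(root: str | Path = "./generated_images") -> dict[str, int]:
    """Count image files (recursively) per benchmark folder; -1 marks a missing folder."""
    root = Path(root)
    counts = {}
    for name in BENCHMARK_DIRS:
        folder = root / name
        if not folder.is_dir():
            counts[name] = -1
            continue
        counts[name] = sum(
            1 for p in folder.rglob("*")
            if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


if __name__ == "__main__":
    for bench, n in count_generated_images().items():
        print(f"{bench}: {'missing folder' if n < 0 else f'{n} images'}")
```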
All other models download their checkpoints from Hugging Face automatically when the code runs. The MiniGPT-4 weights, however, must be downloaded separately from the official MiniGPT-4 7B pretrained weights for LLaMA-2.
Please follow these steps:
- Download the checkpoint from the link above.
- Create a directory named model_checkpoints in the current working directory. You can use the following command in your terminal:
mkdir -p ./model_checkpoints
- Move the downloaded checkpoint into the model_checkpoints directory. You can use the following command:
mv path/to/downloaded/checkpoint ./model_checkpoints/
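If you script your setup in Python, the two shell commands above can be reproduced with pathlib and shutil. The place_checkpoint helper below is hypothetical and is not part of the repository. It mirrors mkdir -p followed by mv, and it does not verify that the file is the correct MiniGPT-4 checkpoint.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def place_checkpoint(downloaded: str | Path, dest_dir: str | Path = "./model_checkpoints") -> Path:
    """Create dest_dir if needed (like `mkdir -p`), then move the checkpoint into it (like `mv`)."""
    downloaded = Path(downloaded)
    if not downloaded.exists():
        raise FileNotFoundError(f"Checkpoint not found: {downloaded}")
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / downloaded.name
    shutil.move(str(downloaded), str(target))
    return target
```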
For brevity, this README only covers how to run our method. To learn how the other decoding methods work, and for their additional arguments, please refer to the HALC README.
Evaluation on the CHAIR benchmark has three steps.
- Generate captions with the MLLM using our decoding method.
Argument | Example | Description |
---|---|---|
--model | llava-1.5 | Specify the MLLM model. This codebase supports minigpt4, llava-1.5, and mPLUG-Owl2. |
--decoder | convis | Decoding strategy to use. The default is ours, convis. |
--data_path | /path/to/dataset | Path to the dataset file or folder, e.g., COCO2014/val2014. |
--annotation_path | /path/to/dataset/annotation | Path to the annotation folder, e.g., COCO2014/annotations. |
--output_dir | ./generated_captions | Directory to save the generated captions. |
--images_path | ./generated_images/CHAIR | Path where the T2I-generated images are stored. |
- Example:
python run_scripts/caption_generation.py --model llava-1.5 --decoder convis --data_path COCO2014/val2014 --annotation_path COCO2014/annotations --output_dir ./generated_captions --images_path ./generated_images/CHAIR
- Convert the captions into a CHAIR JSON file.
Argument | Example | Description |
---|---|---|
-c | path/to/caption | Path to the caption JSON file. |
--annotation_path | /path/to/dataset/annotation | Path to the annotation folder, e.g., COCO2014/annotations. |
- Example:
python eval/caption_to_chair.py -c ./generated_captions/llava-1.5/convis_generated_captions.json --annotation_path COCO2014/annotations
- The converted CHAIR file is saved in the same folder as the caption file.
- Converting captions into the CHAIR JSON file can be time consuming. To convert more captions at once, consider modifying the value at the following location:
- File: eval/caption_to_chair.py
- Line: 141
- Evaluate the CHAIR JSON file.
- Example:
python eval/eval_hallucination.py --metric chair --chair_input_path ./generated_captions/llava-1.5/convis_chair.json --data_dir COCO2014
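For reference, CHAIR as commonly defined reports two ratios. CHAIR_s is the fraction of captions that mention at least one object absent from the image's ground truth. CHAIR_i is the fraction of all mentioned objects that are hallucinated. The sketch below computes both from per-caption object sets. It is an illustration and is not the code in eval/eval_hallucination.py. It treats each caption's objects as a set of unique names, and it assumes caption words have already been mapped to COCO object categories, a step the real evaluation has to perform itself.

```python
from __future__ import annotations


def chair_scores(mentioned: list[set[str]], ground_truth: list[set[str]]) -> tuple[float, float]:
    """Return (CHAIR_s, CHAIR_i) from per-caption mentioned and ground-truth object sets."""
    if len(mentioned) != len(ground_truth):
        raise ValueError("need exactly one ground-truth set per caption")
    total_objects = 0
    hallucinated_objects = 0
    hallucinated_captions = 0
    for objs, gt in zip(mentioned, ground_truth):
        hallucinated = objs - gt  # mentioned but not in the image
        total_objects += len(objs)
        hallucinated_objects += len(hallucinated)
        hallucinated_captions += bool(hallucinated)
    chair_s = hallucinated_captions / len(mentioned) if mentioned else 0.0
    chair_i = hallucinated_objects / total_objects if total_objects else 0.0
    return chair_s, chair_i
```

For example, if one of two captions mentions a "lamp" that is not in the image, and five objects are mentioned in total, CHAIR_s is 1/2 and CHAIR_i is 1/5.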
Evaluation on HallusionBench has two steps.
- Generate the responses for HallusionBench.
The arguments are mostly the same as for CHAIR.
- Example:
python run_scripts/hallusion_eval.py --model llava-1.5 --decoder convis --data_path ./HallusionBench --output_dir ./generated_captions --images_path ./generated_images/HallusionBench
- Run the GPT-4V evaluation.
Note that you need an OpenAI API key to evaluate with GPT-4V. Please write your OpenAI API key in:
- File: eval/utils.py
- Line: 15
- Example:
python eval/hallusion_evaluation.py --model llava-1.5 --decoder convis
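Hard-coding the key in eval/utils.py works, but it is easy to commit by accident. If you prefer, you could change that line to read the key from an environment variable instead. The helper and the OPENAI_API_KEY variable name below are our own assumptions. We have not inspected how eval/utils.py passes the key to the OpenAI client, so adapt the assignment on line 15 accordingly.

```python
import os


def get_openai_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Hypothetical helper: read the API key from an environment variable instead of source code."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the GPT evaluation.")
    return key
```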
To evaluate the POPE score, run the following command:
- Example:
python run_scripts/pope_eval.py --model llava-1.5 --decoder convis --pope_type random --data_path COCO2014/val2014/ --images_path ./generated_images/POPE
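pope_eval.py reports the POPE score directly. To make those numbers concrete, here is a minimal sketch of the standard POPE metrics: accuracy, precision, recall, F1, and the ratio of "yes" answers, with "yes" as the positive class. The answer parsing here, where a response counts as "yes" if it starts with "yes", is a simplifying assumption. The repository's script may parse answers differently.

```python
from __future__ import annotations


def pope_metrics(predictions: list[str], labels: list[str]) -> dict[str, float]:
    """Compute POPE-style metrics, treating "yes" as the positive class."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    # Simplified parsing: a free-form answer is "yes" if it starts with "yes".
    pred = [p.strip().lower().startswith("yes") for p in predictions]
    gold = [lab.strip().lower() == "yes" for lab in labels]
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(not p and g for p, g in zip(pred, gold))
    tn = sum(not p and not g for p, g in zip(pred, gold))
    n = len(pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / n if n else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": sum(pred) / n if n else 0.0,
    }
```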
Follow these steps to obtain the MME evaluation results.
- Generate responses for every mme_type with the script below.
- Example:
#!/bin/bash
mme_type_list=("existence" "count" "position" "color" "posters" "celebrity" "scene" "landmark"
"artwork" "OCR" "commonsense_reasoning" "numerical_calculation"
"text_translation" "code_reasoning")
for mme_type in "${mme_type_list[@]}"
do
python run_scripts/mme_eval.py --mme_type "$mme_type" --data_path ./MME_benchmark --output_dir ./generated_captions --images_path ./generated_images/MME
done
- Evaluate the responses.
- Example:
python eval/MME_score.py --results_dir generated_captions/mme/llava-1.5/convis
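eval/MME_score.py computes the scores for you. For reference, the sketch below reproduces the scoring rule as MME defines it. Each image has two yes/no questions. A subtask's score is accuracy (per question) plus accuracy+ (per image, where both questions must be answered correctly), scaled to the range 0 to 200. The perception and cognition totals sum the subtasks listed in the script above. This is an independent illustration and not the code in eval/MME_score.py. It assumes each response has already been judged correct or incorrect.

```python
from __future__ import annotations

from collections import defaultdict

# MME's grouping of the mme_type values used in the script above.
PERCEPTION = ["existence", "count", "position", "color", "posters", "celebrity",
              "scene", "landmark", "artwork", "OCR"]
COGNITION = ["commonsense_reasoning", "numerical_calculation", "text_translation", "code_reasoning"]


def subtask_score(records: list[tuple[str, bool]]) -> float:
    """records: (image_id, answered_correctly) pairs for one subtask; returns 100 * (acc + acc+)."""
    if not records:
        return 0.0
    per_image = defaultdict(list)
    for image_id, correct in records:
        per_image[image_id].append(correct)
    acc = sum(correct for _, correct in records) / len(records)
    acc_plus = sum(all(answers) for answers in per_image.values()) / len(per_image)
    return 100 * (acc + acc_plus)


def total_score(scores: dict[str, float], subtasks: list[str]) -> float:
    """Sum per-subtask scores over a group (PERCEPTION or COGNITION); missing subtasks count as 0."""
    return sum(scores.get(name, 0.0) for name in subtasks)
```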
- Generate responses for the LLaVA-Bench evaluation.
- Example:
python run_scripts/llava_bench_eval.py --model llava-1.5 --decoder convis --data_path ./LLaVA-Bench --gpu-id 0 --output_dir ./generated_captions --images_path ./generated_images/LLaVA-Bench
- Evaluate with GPT-4.
Note that you need an OpenAI API key, configured the same way as for the HallusionBench evaluation.
- Example:
python eval/eval_gpt_review_bench.py --question llava-bench-in-the-wild/questions.jsonl --context llava-bench-in-the-wild/context.jsonl --rule llava-bench-in-the-wild/rule.json --answer-list llava-bench-in-the-wild/answers_gpt4.jsonl generated_captions/llava-bench/llava-1.5/convis_generated_captions.json --output generated_captions/llava-bench/llava-1.5/convis_gpt_review.jsonl && \
python eval/summarize_gpt_review.py -f generated_captions/llava-bench/llava-1.5/convis_gpt_review.jsonl
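summarize_gpt_review.py aggregates the GPT-4 reviews. As we understand LLaVA-Bench, the headline number is a relative score. GPT-4 rates both the reference answer and the model's answer, and the model's mean rating is reported as a percentage of the reference's mean rating. The sketch below shows that arithmetic on already-extracted (category, reference score, model score) triples. It does not parse the review file, whose exact format we do not assume, and the "overall" key is our own naming.

```python
from __future__ import annotations

from collections import defaultdict
from statistics import mean


def relative_scores(reviews: list[tuple[str, float, float]]) -> dict[str, float]:
    """reviews: (category, reference_score, model_score); returns 100 * model_mean / reference_mean."""
    grouped = defaultdict(list)
    for category, ref, model in reviews:
        grouped[category].append((ref, model))
        grouped["overall"].append((ref, model))  # pooled across all categories
    result = {}
    for category, pairs in grouped.items():
        ref_mean = mean(p[0] for p in pairs)
        model_mean = mean(p[1] for p in pairs)
        result[category] = 100 * model_mean / ref_mean if ref_mean else 0.0
    return result
```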
Try your own image for fun!
- First, generate a caption without our decoding method (using greedy decoding).
- Example:
python run_scripts/demo_inference.py --data_path ./playground --output_dir ./playground -d greedy
- Generate an image from the caption.
- Example:
python run_scripts/image_generation.py --benchmark_name demo --caption_path ./playground/greedy_generated_caption.json --output_path ./playground
- Make sure the generated image is located in the same folder as your own image.
- Rename the generated image by appending _t2i to the file name of your original image. For example, if your image is
your/file/directory/your_filename.jpg
then the generated image should be named
your/file/directory/your_filename_t2i.png
- Now you can run our decoding method on your own images and their generated counterparts:
python run_scripts/demo_inference.py --data_path [path/to/image/dir] --output_dir [path/to/output] -d convis
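The renaming step in the demo can be automated with a small helper. t2i_name and attach_generated_image below are hypothetical names of our own. They implement only the naming convention described above (your_filename.jpg becomes your_filename_t2i.png in the same folder). They assume the generated image is a PNG, as in the example, and they rename the file without converting its format.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def t2i_name(original: str | Path) -> Path:
    """Map your_filename.jpg to your_filename_t2i.png in the same folder."""
    original = Path(original)
    return original.with_name(f"{original.stem}_t2i.png")


def attach_generated_image(original: str | Path, generated: str | Path) -> Path:
    """Move the generated image next to the original, following the _t2i naming convention."""
    target = t2i_name(original)
    shutil.move(str(generated), str(target))
    return target
```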
This repository is released under the MIT License.