This repository contains the implementation of the paper:
Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models
Junjie Wu*, Tsz Ting Chung*, Kai Chen*, Dit-Yan Yeung
*Equal Contribution
The Tri-HE benchmark contains 300 GQA images, each equipped with several carefully crafted questions, along with reference triplets and objects, enabling unified assessment of LVLM hallucination.
- Data Structure: Each entry uses a unique image ID as its key (e.g., `"2374892"`) and contains the following main elements:
  - `instance`: A list of question-answer pairs. Each item in this list contains:
    - `question`: A question related to the given image.
    - `answer`: The ground-truth answer to the question.
    - `triplet`: Triplets that support the ground-truth answer.
  - `triplets`: All triplets related to the given image.
  - `all_object`: All objects identified in the image.
  - `object`: A subset of key objects or entities directly related to the question-answer pairs and triplet relations (e.g., `["laptop", "illuminating room", "pen"]`).
An example data entry for a single instance is as follows:
```json
{
    "2374892": {
        "instance": [
            {
                "question": "Why is the room possibly illuminated?",
                "answer": "The room is possibly illuminated because there is a lamp turned on that is lighting the area.",
                "triplet": ["(Lamp, turned on, illuminating room)"]
            },
            ...
        ],
        "triplets": [
            "(laptop, is, black)",
            "(shirt, is, white)",
            "(Lamp, turned on, illuminating room)",
            ...
        ],
        "all_object": [
            "laptop", "screen", "hand", "pen", "lamp", "desk", "man", ...
        ],
        "object": [
            "laptop", "illuminating room", "pen", "desk", "man", ...
        ]
    }
}
```
All 300 images are stored in `data/usage_figures`.
The first step of the experiment is to obtain the LVLM's responses to all the questions. We suggest storing each LVLM answer as an item right after the corresponding `"question"` item in the `tri-he.json` file; this makes it easier to align your results with the code below. For example, you can store it under the key `xxx_triplets`, where `xxx` is the name of the evaluated LVLM.
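As a concrete illustration, the sketch below loads the benchmark, queries a model, and writes the responses back into `tri-he.json`. The `query_lvlm` function is a hypothetical stand-in for your model's inference call, and the `<image_id>.jpg` file naming under `data/usage_figures` is an assumption.

```python
import json

def query_lvlm(image_path, question):
    """Hypothetical placeholder for the evaluated LVLM's inference call."""
    raise NotImplementedError

model_name = "llava15"  # the "xxx" name of the evaluated LVLM

with open("tri-he.json", "r") as f:
    benchmark = json.load(f)

for image_id, entry in benchmark.items():
    image_path = f"data/usage_figures/{image_id}.jpg"  # file naming assumed
    for qa in entry["instance"]:
        # Store the response next to its question, under the model-specific
        # key suggested above.
        qa[f"{model_name}_triplets"] = query_lvlm(image_path, qa["question"])

with open("tri-he.json", "w") as f:
    json.dump(benchmark, f, indent=4)
```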
Next, we extract triplets from the LVLM's responses using GPT-4, which can be done by running:

```
python extract_triplet.py
```

Note that you need access to the OpenAI API with your own OpenAI key, which should be added at line 63 of `bschecker/utils.py`. The extracted triplets will be stored in `xxx_with_triplets.json` under the `xxx_triplets` key.
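For intuition, the following sketches the kind of GPT-4 call that triplet extraction involves. The prompt wording, `gpt-4` model name, and output parsing are illustrative assumptions rather than the actual code in `extract_triplet.py`, and the `openai` v1 Python client is assumed.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_triplets(question, response):
    """Ask GPT-4 to decompose an LVLM response into (subject, relation, object) triplets."""
    prompt = (
        "Extract the factual claims in the answer as triplets of the form "
        "(subject, relation, object), one per line.\n"
        f"Question: {question}\nAnswer: {response}"
    )
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    lines = completion.choices[0].message.content.splitlines()
    return [line.strip() for line in lines if line.strip()]
```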
Finally, we can use both the GPT-4 judge and the NLI judge to obtain hallucination judgements for the extracted triplets.

To use the GPT-4 judge, run

```
python gpt4_judge.py
```

where an OpenAI key is also needed.
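Conceptually, the GPT-4 judge asks whether each extracted triplet is supported by the reference triplets of the corresponding image. The sketch below only illustrates this idea; the prompt wording, model name, and answer parsing are assumptions, not the code in `gpt4_judge.py`.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_triplet(extracted_triplet, reference_triplets):
    """Return True if GPT-4 deems the triplet unsupported by the references (i.e., hallucinated)."""
    prompt = (
        "Reference triplets describing the image:\n"
        + "\n".join(reference_triplets)
        + "\n\nCan the following triplet be inferred from the references? "
        + f"Answer Yes or No.\nTriplet: {extracted_triplet}"
    )
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip().lower().startswith("no")
```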
To use the NLI judge, run

```
python NLI_judge.py --model xxx
```
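The NLI judge replaces the GPT-4 call with a natural language inference model that checks whether the reference triplets entail an extracted triplet. Below is a minimal sketch of that idea using an off-the-shelf MNLI model from Hugging Face; the `roberta-large-mnli` checkpoint and the entailment thresholding are arbitrary choices, not necessarily what `NLI_judge.py` uses.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nli_name = "roberta-large-mnli"  # example checkpoint; labels: CONTRADICTION / NEUTRAL / ENTAILMENT
tokenizer = AutoTokenizer.from_pretrained(nli_name)
model = AutoModelForSequenceClassification.from_pretrained(nli_name)

def nli_judge(extracted_triplet, reference_triplets):
    """Flag the triplet as hallucinated if the references do not entail it."""
    premise = " ".join(t.strip("()") for t in reference_triplets)
    hypothesis = extracted_triplet.strip("()")
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    label = model.config.id2label[logits.argmax(dim=-1).item()]
    return label != "ENTAILMENT"
```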
To calculate the Hallu-I and Hallu-Q scores of different types of hallucinations, run

```
cd evaluation
python main_evaluation.py
```
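As a rough mental model only (the exact metric definitions are in the paper and `main_evaluation.py`), Hallu-Q can be read as a question-level hallucination rate and Hallu-I as an image-level one; the sketch below assumes those aggregation rules.

```python
def hallucination_rates(judgements):
    """
    judgements: {image_id: [[bool, ...], ...]} mapping each image to its
    questions, and each question to per-triplet hallucination flags.
    Assumed definitions (check the paper / main_evaluation.py for the real ones):
      Hallu-Q: fraction of questions with at least one hallucinated triplet.
      Hallu-I: fraction of images with at least one hallucinated question.
    """
    total_q = hallu_q = hallu_i = 0
    for questions in judgements.values():
        image_hallucinated = False
        for flags in questions:
            total_q += 1
            if any(flags):
                hallu_q += 1
                image_hallucinated = True
        hallu_i += int(image_hallucinated)
    return hallu_q / total_q, hallu_i / len(judgements)
```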
To obtain results with Triplet Description on the LLaVA-1.5 model, run

```
python mitigate_llava15.py
```

To obtain results with Triplet Description on the MiniGPT-4 model, run

```
python mitigate_minigpt4.py
```
```bibtex
@article{wu2024unified,
  title={Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models},
  author={Wu, Junjie and Chung, Tsz Ting and Chen, Kai and Yeung, Dit-Yan},
  journal={arXiv preprint arXiv:2410.23114},
  year={2024}
}
```