Hi there 👋
Thank you for the amazing project. I was looking at the dataset and I've found some instances in which the text description is heavily hallucinated. For instance, take the image with image_id=52.
The description says:
```json
{
  "image_id": "52",
  "caption": "The man in the image is wearing a yellow shirt and brown pants. He is holding a trophy in his left hand and smiling at the camera. There is a red carpet on the ground in front of him. Behind him, there is a wall with a banner that reads 'Indian Film Academy Awards' in white letters. There are several people in the background, some of whom are clapping and others are standing around. The overall mood of the image is celebratory and joyful."
}
```
All the **bold** text is not something that we can see from the image:

- **smiling**
- **There is a red carpet on the ground in front of him**
- **Behind him, there is a wall with a banner that reads 'Indian Film Academy Awards' in white letters**
- **There are several people in the background, some of whom are clapping and others are standing around. The overall mood of the image is celebratory and joyful.**
So roughly 80% of the caption is wrong, and this holds for most images in the dataset.
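For anyone who wants to reproduce the check, here is a minimal snippet that prints the caption for a given image id (I am assuming the released `cc_sbu_align` layout with a `filter_cap.json` file holding an `annotations` list; adjust the path and field names to your copy):

```python
import json

# Path to the stage-2 alignment annotations; adjust to your local copy.
DATA_PATH = "cc_sbu_align/filter_cap.json"

with open(DATA_PATH) as f:
    data = json.load(f)

# Each entry pairs an image_id with its polished caption.
captions = {ann["image_id"]: ann["caption"] for ann in data["annotations"]}

print(captions["52"])  # the heavily hallucinated example quoted above
```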
What is going on here? 😅
Am I missing something?
Thanks a lot
Fra
Thank you for pointing this out! The description contents come from the stage-1 model and contain some hallucinations. The role of this small dataset is not to teach the model to describe the image correctly, but to teach it how to speak in a human-preferred way. Therefore, although the dataset contains some hallucinations, it still successfully makes the stage-2 model good at speaking. I think this is a good point, and we will update our paper in the next version to make this clear.
Hi @TsuTikgiau, thanks for the reply. (Consider that I am not a researcher.) But I am not sure this makes sense. The goal is to "translate" image features into something the text encoder can process, and the proposed approach works because half of each caption is always true (while the other half is hallucinated).
In my personal experiments, when asking the model e.g. to read something in an image or to describe a specific location ("tell me in detail what you see on the top left", etc.), the response may contain hallucinations. I do believe this is made worse by the dataset itself.
I think this could be a good opportunity to try to fix the captions. A smaller number of correct captions may work just as well, e.g. 1k.
Would it be possible to publish the dataset from before it went through the ChatGPT step? Maybe with a better prompt we could prevent the hallucinations; see the sketch below for what that could look like.
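For what it's worth, here is a rough sketch of a more conservative regeneration step. The prompt wording, the model name, and the `polish_caption` helper are all my own assumptions for illustration, not the authors' actual pipeline:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical stricter system prompt: forbid details that are not
# present in the raw stage-1 caption.
SYSTEM_PROMPT = (
    "Rewrite the following image caption into fluent, detailed English. "
    "Do NOT add any object, text, color, or emotion that is not explicitly "
    "mentioned in the original caption. If unsure, omit it."
)

def polish_caption(raw_caption: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": raw_caption},
        ],
        temperature=0,  # keep the rewrite deterministic and conservative
    )
    return response.choices[0].message.content
```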