Hi there 👋
Thank you for the amazing project. I was looking at the dataset and I've found some instances in which the text description is heavily hallucinated. For instance, take the image with image_id=52.
The description says:
```json
{
  "image_id": "52",
  "caption": "The man in the image is wearing a yellow shirt and brown pants. He is holding a trophy in his left hand and smiling at the camera. There is a red carpet on the ground in front of him. Behind him, there is a wall with a banner that reads 'Indian Film Academy Awards' in white letters. There are several people in the background, some of whom are clapping and others are standing around. The overall mood of the image is celebratory and joyful."
}
```
All the **bold** text is not something that we can see from the image:

- **smiling**
- **There is a red carpet on the ground in front of him**
- **Behind him, there is a wall with a banner that reads 'Indian Film Academy Awards' in white letters**
- **There are several people in the background, some of whom are clapping and others are standing around. The overall mood of the image is celebratory and joyful.**
So roughly 80% of the caption is wrong, and this holds for most images in the dataset.
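For anyone who wants to reproduce the check, here is a minimal snippet that prints the caption for a given image id (I am assuming the released `cc_sbu_align` layout with a `filter_cap.json` file holding an `annotations` list; adjust the path and field names to your copy):

```python
import json

# Path to the stage-2 alignment annotations; adjust to your local copy.
DATA_PATH = "cc_sbu_align/filter_cap.json"

with open(DATA_PATH) as f:
    data = json.load(f)

# Each entry pairs an image_id with its polished caption.
captions = {ann["image_id"]: ann["caption"] for ann in data["annotations"]}

print(captions["52"])  # the heavily hallucinated example quoted above
```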
What is going on here? 😅
Am I missing something?
Thanks a lot
Fra
Thank you for pointing this out! The description contents come from the stage-1 model and contain some hallucinations. The role of this small dataset is not to teach the model to describe the image correctly, but to teach it how to speak in a human-preferred way. Therefore, although the dataset contains some hallucinations, it still successfully makes the stage-2 model good at speaking. I think this is a good point, and we will update our paper in the next version to make this clear.
Hi @TsuTikgiau, thanks for the reply. (Consider that I am not a researcher.) But I am not sure this makes sense. The goal is to "translate" image features into something the text encoder can process, and the proposed approach works because half of each caption is always true (while the other half is hallucinated).
In my personal experiments, when asking the model e.g. to read something in an image or to describe a specific location ("tell me in detail what you see on the top left", etc.), the response may contain hallucinations. I do believe this is made worse by the dataset itself.
I think this could be a good opportunity to try to fix the captions. A smaller number of correct captions may work just as well, e.g. 1k.
Would it be possible to publish the dataset from before it went through the ChatGPT step? Maybe with a better prompt we could prevent the hallucinations; see the sketch below for what that could look like.
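For what it's worth, here is a rough sketch of a more conservative regeneration step. The prompt wording, the model name, and the `polish_caption` helper are all my own assumptions for illustration, not the authors' actual pipeline:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical stricter system prompt: forbid details that are not
# present in the raw stage-1 caption.
SYSTEM_PROMPT = (
    "Rewrite the following image caption into fluent, detailed English. "
    "Do NOT add any object, text, color, or emotion that is not explicitly "
    "mentioned in the original caption. If unsure, omit it."
)

def polish_caption(raw_caption: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": raw_caption},
        ],
        temperature=0,  # keep the rewrite deterministic and conservative
    )
    return response.choices[0].message.content
```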