
Weird descriptions in the dataset #61

Open
FrancescoSaverioZuppichini opened this issue Apr 19, 2023 · 2 comments
FrancescoSaverioZuppichini commented Apr 19, 2023

Hi there 👋

Thank you for the amazing project. I was looking at the dataset and I've found some instances in which the text description is heavily hallucinated. For instance, take the image with image_id=52:

[image: dataset photo for image_id=52]

The description says

```json
{
  "image_id": "52",
  "caption": "The man in the image is wearing a yellow shirt and brown pants. He is holding a trophy in his left hand and smiling at the camera. There is a red carpet on the ground in front of him. Behind him, there is a wall with a banner that reads 'Indian Film Academy Awards' in white letters. There are several people in the background, some of whom are clapping and others are standing around. The overall mood of the image is celebratory and joyful."
},
```
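To spot-check entries like this one, a small helper can pull the caption for a given image_id. This is only a sketch: the record layout (a list of `{"image_id", "caption"}` objects) is assumed from the snippet above, and the real annotation file would be loaded in place of the inline string.

```python
import json

# Sketch for spot-checking captions by image_id. The record layout is
# assumed from the snippet above; load the real annotation file in
# place of this inline JSON string.
data = json.loads("""
[
  {"image_id": "52", "caption": "The man in the image is wearing a yellow shirt and brown pants."}
]
""")

def caption_for(records, image_id):
    """Return the caption for the given image_id, or None if absent."""
    for rec in records:
        if rec["image_id"] == str(image_id):
            return rec["caption"]
    return None

print(caption_for(data, 52))
```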

None of the following parts of the caption can actually be seen in the image:

  • smiling
  • There is a red carpet on the ground in front of him
  • Behind him, there is a wall with a banner that reads 'Indian Film Academy Awards' in white letters
  • There are several people in the background, some of whom are clapping and others are standing around. The overall mood of the image is celebratory and joyful.

So roughly 80% of the caption is wrong, and this is true for most images in the dataset.

What is going on here? 😅

Am I missing something?

Thanks a lot

Fra

TsuTikgiau (Collaborator) commented

Thank you for pointing this out! The description contents come from the stage-1 model and contain some hallucinations. The role of this small dataset is not to teach the model to describe the image correctly, but to teach the model how to speak in a human-preferred way. Therefore, although the dataset contains some hallucinations, it still successfully makes the stage-2 model good at speaking. I think this is a good point, and we will update our paper in the next version to make this clear.

FrancescoSaverioZuppichini (Author) commented

Hi @TsuTikgiau, thanks for the reply. (Consider that I am not a researcher.) But I am not sure this makes sense. The goal is to "translate" image features into something the text encoder can process, yet the proposed approach works only because half of each caption is true (while the other half is hallucinated).

In my personal experiments, when asking the model e.g. to read something in an image or to describe a specific location (e.g. "tell me in detail what you see on the top left"), the response may contain hallucinations. I do believe this is reinforced by the dataset itself.

I think this could be a good opportunity to try to fix the captions. A smaller number of correct captions, e.g. 1k, might work just as well.
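The subsetting idea could start by sampling 1k records for manual verification. The 1k figure comes from the discussion above; everything else in this sketch (the record layout, the seed) is hypothetical.

```python
import random

# Hypothetical sketch: sample 1k records for manual verification, as a
# first step toward a smaller, fully correct caption set.
random.seed(0)  # fixed seed so the sample is reproducible
records = [{"image_id": str(i), "caption": f"caption {i}"} for i in range(3500)]
to_verify = random.sample(records, k=min(1000, len(records)))
print(len(to_verify))  # 1000
```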

Would it be possible to publish the dataset from before it went through the ChatGPT process? With a better prompt, maybe we can prevent the hallucinated parts.

Curious to know what you think :)

Thanks,

Fra
