Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is it better to breakdown one image's different boxes into multiple questions?[Question] #23

Open
james20141606 opened this issue Jul 17, 2024 · 3 comments

Comments

@james20141606
Copy link

Question

Hi, When I try to look into the json file for finetuning, I find things like:

 {
        "id": "VG_100K_2/2412354",
        "image": "vg/VG_100K_2/2412354.jpg",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nPlease provide a short description for this region: [0.32, 0.8, 0.59, 1.0]."
            },
            {
                "from": "gpt",
                "value": "The cupboards are white."
            },
            {
                "from": "human",
                "value": "Please provide the bounding box coordinate of the region this sentence describes: Bathroom floor is dark tiles."
            },
            {
                "from": "gpt",
                "value": "[0.21, 0.92, 0.36, 1.0]"
            },
            {
                "from": "human",
                "value": "Please provide the bounding box coordinate of the region this sentence describes: Lights of bathroom."
            },
            {
                "from": "gpt",
                "value": "[0.47, 0.19, 0.68, 0.27]"
            },
            {
                "from": "human",
                "value": "Please provide the bounding box coordinate of the region this sentence describes: small hole in bathroom door."
            },
            {
                "from": "gpt",
                "value": "[0.7, 0.86, 0.74, 0.91]"
            },
            {
                "from": "human",
                "value": "Please provide a short description for this region: [0.29, 0.62, 0.72, 0.94]."
            },
            {
                "from": "gpt",
                "value": "Bathroom counter top is green."
            },
            {
                "from": "human",
                "value": "Please provide the bounding box coordinate of the region this sentence describes: Two handles on the vanity."
            },
            {
                "from": "gpt",
                "value": "[0.43, 0.87, 0.54, 0.96]"
            },
            {
                "from": "human",
                "value": "Please provide a short description for this region: [0.32, 0.65, 0.71, 0.92]."
            },
            {
                "from": "gpt",
                "value": "Counter top is made of granite."
            },
            {
                "from": "human",
                "value": "Please provide the bounding box coordinate of the region this sentence describes: Man has dark colored hair."
            },
            {
                "from": "gpt",
                "value": "[0.19, 0.49, 0.32, 0.65]"
            },
            {
                "from": "human",
                "value": "Please provide the bounding box coordinate of the region this sentence describes: Bathroom lights are on."
            },
            {
                "from": "gpt",
                "value": "[0.49, 0.22, 0.74, 0.29]"
            },
            {
                "from": "human",
                "value": "Please provide the bounding box coordinate of the region this sentence describes: Man is wearing eyeglasses."
            },
            {
                "from": "gpt",
                "value": "[0.18, 0.48, 0.32, 0.64]"
            }
        ]
    },

I wonder if it is better to split this question to multiple questions so each question contain only one bounding box to make model less puzzled? Although it will increase the training load a lot, could it somehow make the model understand the bounding box better?

@mu-cai
Copy link
Collaborator

mu-cai commented Jul 17, 2024

Great question! This could be the reason!
For REC tasks like this item, the data is directly borrowed from LLaVA-1.5.
I think the reason for LLaVA to do this concatenation is to save training time.
But when breaking down the conversations, I guess it will definitely bring benefits for REC task!

@james20141606
Copy link
Author

Thanks for your comment! I agree it might be benificial but it is indeed too time consuming. BTW i guess change epoch_num from 1 to larger number might have the chance to increase the performance a little bit? Not sure if it is worthwhile to do so.

@mu-cai
Copy link
Collaborator

mu-cai commented Jul 29, 2024

Yeah You can increase epoch. But there is a chance of overfitting to the model. What I mean is that the generalization issue may arise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants