Issue related to BLIP2 CaptionDataset implementation or blip2_qformer.py for custom dataset pre-training stage 1 #772

Open
abdel-habib opened this issue Dec 4, 2024 · 1 comment

Comments


abdel-habib commented Dec 4, 2024

While pre-training on a custom image-text dataset, I ran into some concerns about how the CaptionDataset class and blip2_qformer.py handle captioning datasets.

In blip2_qformer.py, the if statement at line 159 checks whether "image_id" is in the sample keys, and its comment says this branch is meant only for retrieval fine-tuning. The same check appears at line 180.

if "image_id" in samples.keys():  # coco retrieval finetuning
    image_ids = samples["image_id"].view(-1, 1)
    ...
    loss_itc = ...
else:
    loss_itc = ...

These two if statements trigger errors with a custom image-text captioning dataset. I don't know why they don't trigger an error with coco_caption_dataset.py, since COCOCapDataset uses the CaptionDataset class implementation, which returns the image_id when getting an item.

After commenting out the true branches of the if statements at lines 159 and 180, stage 1 pre-training on the custom dataset runs without errors. Is this expected behaviour, or am I missing something?

Also, samples["image_id"] seems to be a list of strings, even with the COCO file naming pattern. The custom dataset implementation returns a string id for each item, so anything inside the true branches mentioned above will raise an error (i.e. samples["image_id"].view(-1,1) is called on a list of strings, not a tensor of ints).
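To make the failure concrete, here is a minimal sketch that reproduces it without the model or PyTorch. The id values are made up for illustration, and the sketch assumes the batch arrives as a plain Python list of strings, as described above. A list has no .view method, so the call fails before any loss is computed.

```python
# Hypothetical batch of ids, assuming string ids are batched as a plain list.
batch_image_ids = ["000000391895", "000000522418"]

try:
    # Mirrors the call in the "image_id" branch; only tensors provide .view().
    batch_image_ids.view(-1, 1)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

If the dataset returned integer ids instead, PyTorch's default collation would be expected to stack them into a tensor, and the .view(-1, 1) call would then be valid.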


abdel-habib commented Dec 4, 2024

Following up on the second issue: a minor modification worked for me when creating a custom captioning dataset. It returns a unique numerical id, based on the self.img_ids = {} loop in the original implementation.

        return {
            "image": image,
            "text_input": caption,
            "image_id": self.img_ids[ann["image_id"]] # this is the main difference, return the unique numerical id
        }
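For context, the snippet above could sit in a dataset class roughly like the following sketch. It is not the library's code: CustomCaptionDataset, load_image, and the annotation fields "image" and "caption" are hypothetical names, and the mapping loop is assumed to assign consecutive integers in first-seen order. The sketch avoids torch and PIL so it runs on its own.

```python
class CustomCaptionDataset:
    """Hypothetical stand-in for a custom captioning dataset (no torch dependency)."""

    def __init__(self, annotation, load_image):
        self.annotation = annotation
        self.load_image = load_image  # hypothetical callable: path -> image

        # Map each distinct raw image_id (possibly a string) to a consecutive
        # integer, in first-seen order.
        self.img_ids = {}
        for ann in self.annotation:
            raw_id = ann["image_id"]
            if raw_id not in self.img_ids:
                self.img_ids[raw_id] = len(self.img_ids)

    def __len__(self):
        return len(self.annotation)

    def __getitem__(self, index):
        ann = self.annotation[index]
        return {
            "image": self.load_image(ann["image"]),
            "text_input": ann["caption"],
            # Return the integer id rather than the raw string id.
            "image_id": self.img_ids[ann["image_id"]],
        }


# Made-up annotations: two captions share one image, a third uses another.
annotation = [
    {"image": "a.jpg", "caption": "a cat", "image_id": "img_a"},
    {"image": "a.jpg", "caption": "a sleeping cat", "image_id": "img_a"},
    {"image": "b.jpg", "caption": "a dog", "image_id": "img_b"},
]
dataset = CustomCaptionDataset(annotation, load_image=lambda path: path)
ids = [dataset[i]["image_id"] for i in range(len(dataset))]
print(ids)  # [0, 0, 1]
```

Captions of the same image share one integer id, which is presumably what the image_id branch relies on when it compares ids within a batch.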
