While pre-training on a custom image-text dataset, I had some concerns with the implementation of both the CaptionDataset class and blip2_qformer.py file for handling the captioning datasets.
If you look at the blip2_qformer.py implementation, the if statement on line 159 carries a comment saying image_id is used only for retrieval fine-tuning, and it decides this by checking whether "image_id" is among the sample keys. The same applies to line 180.
if "image_id" in samples.keys():  # coco retrieval finetuning
    image_ids = samples["image_id"].view(-1,1)
    ...
    loss_itc = ...
else:
    loss_itc = ...
These two if statements trigger errors with a custom image-text captioning dataset. I don't know how they avoid triggering an error with coco_caption_dataset.py, since COCOCapDataset uses the CaptionDataset class implementation, which returns the image_id when getting an item.
After commenting out the True branches of the if statements on lines 159 and 180, stage 1 pre-training with custom datasets runs without errors. Is this expected behaviour, or am I missing something?
Also, samples["image_id"] seems to be a list of strings, even with the COCO file naming pattern. When getting an item with the custom dataset implementation, it returns a string as the id, so anything inside the True branches mentioned above raises an error (i.e. samples["image_id"].view(-1,1) is called on a list of strings, not on a tensor of ints).
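To make the failure mode concrete, here is a minimal sketch of a stricter check than `"image_id" in samples.keys()`. The helper name `has_tensor_image_ids` is hypothetical and not part of the library; it only assumes that `samples` is a dict and that a usable id batch exposes a `.view` method, which a plain list of strings does not.

```python
def has_tensor_image_ids(samples):
    """Return True only if samples["image_id"] exists and supports .view().

    Hypothetical helper for illustration: a list of strings has no .view
    attribute, so it is rejected here instead of failing later inside
    the retrieval branch.
    """
    ids = samples.get("image_id")
    return ids is not None and hasattr(ids, "view")


# A batch whose ids are strings, as described above.
string_batch = {"image_id": ["img_a", "img_b"], "text_input": ["a dog", "a cat"]}
# A batch with no ids at all.
plain_batch = {"text_input": ["a dog", "a cat"]}

uses_retrieval_branch = has_tensor_image_ids(string_batch)
```

With a check like this, both batches above would fall through to the else branch rather than raising on `.view(-1,1)`. Whether that is the intended behaviour for pre-training is exactly the open question here.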
Following up on the second issue: a minor modification worked for me when creating a custom captioning dataset, namely returning a unique numerical id based on the original self.img_ids = {} loop.
return {
    "image": image,
    "text_input": caption,
    "image_id": self.img_ids[ann["image_id"]],  # this is the main difference, return the unique numerical id
}
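For anyone reproducing this, the mapping behind self.img_ids can be sketched in isolation. This is only an illustration of the idea, not the library's code: the function name `build_image_id_index` and the sample annotations are made up, and the only assumption is that each annotation is a dict with an "image_id" key.

```python
def build_image_id_index(annotations):
    """Map each distinct ann["image_id"] to a consecutive integer.

    Ids are numbered in first-seen order, so string ids become
    integers that can later be batched as numbers.
    """
    img_ids = {}
    for ann in annotations:
        img_id = ann["image_id"]
        if img_id not in img_ids:
            img_ids[img_id] = len(img_ids)
    return img_ids


# Hypothetical annotations with string ids; two captions share one image.
annotations = [
    {"image_id": "img_a", "caption": "a dog"},
    {"image_id": "img_b", "caption": "a cat"},
    {"image_id": "img_a", "caption": "a brown dog"},
]

img_ids = build_image_id_index(annotations)
# The value each item would return as "image_id" after the change above.
numeric_ids = [img_ids[ann["image_id"]] for ann in annotations]
```

Captions that belong to the same image end up with the same integer, so the ids can still be used to tell which samples in a batch share an image.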