Pseudo-RIS: Distinctive Pseudo-supervision Generation for Referring Image Segmentation
Seonghoon Yu, +Paul Hongsuck Seo, +Jeany Son (+ corresponding authors)
AI graduate school, GIST and Korea University
ECCV 2024
Abstract
We propose a new framework that automatically generates high-quality segmentation masks with their referring expressions as pseudo supervisions for referring image segmentation (RIS). These pseudo supervisions allow the training of any supervised RIS methods without the cost of manual labeling. To achieve this, we incorporate existing segmentation and image captioning foundation models, leveraging their broad generalization capabilities. However, the naive incorporation of these models may generate non-distinctive expressions that do not distinctively refer to the target masks. To address this challenge, we propose two-fold strategies that generate distinctive captions: 1) 'distinctive caption sampling', a new decoding method for the captioning model, to generate multiple expression candidates with detailed words focusing on the target. 2) 'distinctiveness-based text filtering' to further validate the candidates and filter out those with a low level of distinctiveness. These two strategies ensure that the generated text supervisions can distinguish the target from other objects, making them appropriate for the RIS annotations. Our method significantly outperforms both weakly and zero-shot SoTA methods on the RIS benchmark datasets. It also surpasses fully supervised methods in unseen domains, proving its capability to tackle the open-world challenge within RIS. Furthermore, integrating our method with human annotations yields further improvements, highlighting its potential in semi-supervised learning applications.
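To make the idea of distinctiveness-based text filtering concrete, here is a minimal sketch, not the implementation in this repository. It assumes each caption candidate already has one image-text similarity score per mask (for example from a model such as CLIP), and it uses a hypothetical distinctiveness measure: the target's score minus the best score among the other masks. The function names, the `margin` threshold, and the scoring rule are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def distinctiveness(scores: Sequence[float], target_idx: int) -> float:
    """Hypothetical measure: target score minus the best non-target score."""
    others = [s for i, s in enumerate(scores) if i != target_idx]
    if not others:
        # Only one mask in the image, so no other object can be confused with it.
        return float("inf")
    return scores[target_idx] - max(others)


def filter_captions(
    candidates: Dict[str, Sequence[float]],
    target_idx: int,
    margin: float = 0.05,  # assumed threshold, not a value from the paper
) -> List[str]:
    """Keep the captions whose target score beats every other mask by `margin`."""
    return [
        caption
        for caption, scores in candidates.items()
        if distinctiveness(scores, target_idx) >= margin
    ]
```

With scores such as `{"a dog": [0.30, 0.29, 0.10], "a brown dog on the left": [0.35, 0.15, 0.08]}` and `target_idx=0`, only the second caption is kept, because the first one matches another mask almost as well as the target.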
# create conda env
conda create -n pseudo_ris python=3.9
# activate the environment
conda activate pseudo_ris
# Install PyTorch
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
# Install required packages
pip install pydantic==1.10.11 --upgrade
conda install -c conda-forge spacy
python -m spacy download en_core_web_lg
conda install -c anaconda pandas
pip install opencv-python
pip install lmdb
pip install pyarrow==11.0.0
pip install colored
pip install pycocotools
pip install transformers==4.31
# Run the following from the repository root; each step returns to the root afterwards.
# Install CoCa in dev mode, where distinctive caption sampling is implemented.
cd third_party/open_clip
pip install -e .
cd ../..
# Install detectron2 for CutLER
cd third_party/detectron2
pip install -e .
cd ../..
# Install CLIP
cd third_party/CLIP
pip install -e .
cd ../..
# Install SAM in dev mode
cd segment-anything
pip install -e .
cd ..
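After installation, a quick check that every package can be located may save a failed run later. The sketch below is not part of the repository; the helper name `missing_modules` is hypothetical, and the list uses the usual import names of the packages installed above (for example, `opencv-python` imports as `cv2` and `segment-anything` as `segment_anything`).

```python
import importlib.util
from typing import Iterable, List

# Import names for the packages installed above.
REQUIRED_MODULES = [
    "torch", "torchvision", "spacy", "pandas", "cv2", "lmdb", "pyarrow",
    "colored", "pycocotools", "transformers",
    "open_clip", "detectron2", "clip", "segment_anything",
]


def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the module names that cannot be located, without importing them."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing
```

Calling `missing_modules()` inside the `pseudo_ris` environment should return an empty list if everything installed correctly.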
We use pre-trained weights for (1) CoCa, (2) SAM, and (3) CutLER.
Note that the official CoCa repository provides a model pre-trained on LAION-2B.
We fine-tune this model on the CC3M dataset.
We provide the CoCa weights pre-trained on LAION-2B and fine-tuned on CC3M at this URL.
Put them in ./third_party/open_clip/src/logs/laion_cc3m/checkpoints/
We use the SAM ViT-H model.
# Download SAM ViT-H model.
cd segment-anything
mkdir checkpoints
cd checkpoints
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
We use CutLER to cut down the excessive and over-segmented SAM masks, which prevents out-of-memory (OOM) issues, as described in our supplementary material and implementation details.
cd third_party/CuTLER/cutler/
mkdir checkpoints
cd checkpoints
wget http://dl.fbaipublicfiles.com/cutler/checkpoints/cutler_cascade_final.pth
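Before generating masks, it can help to confirm that all three sets of weights are where the steps above put them. The following is a minimal sketch, not a script shipped with this repository; the helper name `missing_checkpoints` is hypothetical. Since the CoCa checkpoint file name is not stated here, the sketch only checks that its directory is non-empty.

```python
from pathlib import Path
from typing import List

# File names taken from the download commands above.
CHECKPOINT_FILES = [
    "segment-anything/checkpoints/sam_vit_h_4b8939.pth",
    "third_party/CuTLER/cutler/checkpoints/cutler_cascade_final.pth",
]
# Directory for the CoCa weights; the file name inside it is not assumed.
COCA_DIR = "third_party/open_clip/src/logs/laion_cc3m/checkpoints"


def missing_checkpoints(repo_root: str = ".") -> List[str]:
    """Return the relative paths of checkpoints that appear to be missing."""
    root = Path(repo_root)
    problems = [rel for rel in CHECKPOINT_FILES if not (root / rel).is_file()]
    coca_dir = root / COCA_DIR
    if not coca_dir.is_dir() or not any(coca_dir.iterdir()):
        problems.append(COCA_DIR + "/")
    return problems
```

Run from the repository root, `missing_checkpoints()` returns an empty list once all weights are in place.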
We follow the dataset setup of ETRIS to obtain the unlabeled images in the RefCOCO+ training set.
├── datasets
│   ├── images
│   │   └── train2014
│   │       ├── COCO_train2014_000000000009.jpg
│   │       └── ...
│   └── lmdb
│       └── refcoco+
│           ├── train.lmdb
│           └── ...
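A small check of this layout can catch a misplaced folder before mask generation starts. The sketch below is an illustration only; the helper name `summarize_dataset` is hypothetical, and it inspects just the two entries named explicitly in the tree above.

```python
from pathlib import Path
from typing import Dict, Union


def summarize_dataset(repo_root: str = ".") -> Dict[str, Union[int, bool]]:
    """Count the train2014 images and check that the RefCOCO+ train LMDB exists."""
    datasets = Path(repo_root) / "datasets"
    image_dir = datasets / "images" / "train2014"
    lmdb_path = datasets / "lmdb" / "refcoco+" / "train.lmdb"
    # Count only .jpg files directly inside train2014.
    num_images = sum(1 for _ in image_dir.glob("*.jpg")) if image_dir.is_dir() else 0
    return {"num_images": num_images, "has_train_lmdb": lmdb_path.exists()}
```

A result with `num_images` equal to 0 or `has_train_lmdb` equal to `False` suggests the dataset is not yet in the expected place.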
We produce pseudo masks using SAM and CutLER, as described in our implementation details and supplementary material.
Pseudo masks are saved in the './datasets/pseudo_masks/cutler_sam' directory.
python generate_masks/cutler_sam_masks.py
Pseudo referring texts are saved in './pseudo_supervision/cutler_sam/distinctive_captions_cc3m.csv'.
python generate_pseudo_supervision/distinctive_caption_generation.py
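To inspect the generated pseudo referring texts, the CSV can be previewed with the standard library. This is a minimal sketch, not a tool from this repository; the helper name `preview_csv` is hypothetical. The column names are not documented here, so the code assumes only that the file starts with a header row and reports whatever columns it finds.

```python
import csv
import itertools
from typing import Dict, List, Tuple


def preview_csv(path: str, n: int = 3) -> Tuple[List[str], List[Dict[str, str]]]:
    """Return the column names and the first `n` rows of a CSV with a header row."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(itertools.islice(reader, n))
        columns = list(reader.fieldnames or [])
    return columns, rows
```

For example, `preview_csv("./pseudo_supervision/cutler_sam/distinctive_captions_cc3m.csv")` returns the header and the first three rows. If the file turns out to have no header row, its first data row would be reported as the column names.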
@inproceedings{yu2024pseudoris,
title={Pseudo-RIS: Distinctive Pseudo-supervision Generation for Referring Image Segmentation},
author={Seonghoon Yu and Paul Hongsuck Seo and Jeany Son},
booktitle={Proceedings of the European Conference on Computer Vision},
year={2024}
}
We thank the authors of the open-source foundation models (CoCa, SAM, CLIP).