This is an open source implementation of Pic2Word. This is not an officially supported Google product.
We use Conceptual Captions URLs to train the model. See open_clip for how to obtain the dataset.
The training data directories must be placed in the root of this repo and structured as follows.
cc_data
├── train ## training image directories.
└── val ## validation image directories.
cc
├── Train_GCC-training_output.csv ## training data list
└── Validation_GCC-1.1.0-Validation_output.csv ## validation data list
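Before starting a long training run, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of this repo: the helper name `missing_paths` is hypothetical, and the only paths it checks are the ones listed in the tree above.

```python
from pathlib import Path

# Paths taken from the layout above, relative to the repository root.
EXPECTED_DIRS = ["cc_data/train", "cc_data/val"]
EXPECTED_FILES = [
    "cc/Train_GCC-training_output.csv",
    "cc/Validation_GCC-1.1.0-Validation_output.csv",
]


def missing_paths(repo_root="."):
    """Return the expected directories and files that are absent under repo_root.

    Hypothetical helper for illustration; it only checks existence,
    not the contents of the directories or CSV files.
    """
    root = Path(repo_root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    for path in missing_paths():
        print(f"missing: {path}")
```

Run from the repository root, the script prints nothing when every expected directory and CSV file is present.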
See README for how to prepare the test datasets.
See open_clip for installation details. The same environment should work for this repo. setenv.sh is the script we used to set up the environment in virtualenv.
Also run the commands below to activate the environment and add the src directory to PYTHONPATH:
. env3/bin/activate
export PYTHONPATH="$PYTHONPATH:$PWD/src"
export PYTHONWARNINGS='ignore:semaphore_tracker:UserWarning'
The pre-trained model is available on Google Drive.
To train the model on the Conceptual Captions data, run:
python -u src/main.py \
--save-frequency 1 \
--train-data="cc/Train_GCC-training_output.csv" \
--warmup 10000 \
--batch-size=128 \
--lr=1e-4 \
--wd=0.1 \
--epochs=30 \
--workers=8 \
--openai-pretrained \
--model ViT-L/14
Evaluation on COCO, ImageNet, or CIRR.
## Set data_name to coco, imgnet, or cirr.
python src/eval_retrieval.py \
--openai-pretrained \
--resume /path/to/checkpoints \
--eval-mode $data_name \
--gpu $gpu_id \
--model ViT-L/14
Evaluation on Fashion-IQ (shirt, dress, or toptee).
## Set cloth_type to shirt, dress, or toptee.
python src/eval_retrieval.py \
--openai-pretrained \
--resume /path/to/checkpoints \
--eval-mode fashion \
--source $cloth_type \
--gpu $gpu_id \
--model ViT-L/14
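Since the Fashion-IQ evaluation is run once per category, the three invocations can be generated rather than typed by hand. The sketch below is an illustration only: `fashion_eval_commands` is a hypothetical helper, the default GPU id of 0 is an assumption, and the flags are copied from the command above without adding any others.

```python
import shlex

# The three Fashion-IQ categories named in this README.
CLOTH_TYPES = ["shirt", "dress", "toptee"]


def fashion_eval_commands(checkpoint, gpu_id=0, model="ViT-L/14"):
    """Build one eval_retrieval.py argument list per Fashion-IQ category.

    Hypothetical helper: it only assembles the arguments shown in this
    README and does not run anything.
    """
    return [
        [
            "python", "src/eval_retrieval.py",
            "--openai-pretrained",
            "--resume", checkpoint,
            "--eval-mode", "fashion",
            "--source", cloth_type,
            "--gpu", str(gpu_id),
            "--model", model,
        ]
        for cloth_type in CLOTH_TYPES
    ]


if __name__ == "__main__":
    # Print the commands for inspection; nothing is executed here.
    for argv in fashion_eval_commands("/path/to/checkpoints"):
        print(shlex.join(argv))
```

To actually run the evaluations, each argument list could be passed to `subprocess.run(argv, check=True)` from the repository root.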
Demo: retrieval with your own query images and prompts on COCO, ImageNet, CIRR, or Fashion-IQ (dress, shirt, toptee).
## --retrieval-data: choose from coco, imgnet, cirr, dress, shirt, toptee.
## --query_file: comma-separated paths of the query images.
## --prompts: comma-separated prompts. Use * to indicate the token to be replaced with an image token, e.g., "a sketch of *".
## --demo-out: directory in which the html file and image directory are generated.
python src/demo.py \
--openai-pretrained \
--resume /path/to/checkpoints \
--retrieval-data $data_name \
--query_file "path_img1,path_img2,path_img3..." \
--prompts "prompt1,prompt2,..." \
--demo-out $path_demo \
--gpu $gpu_id \
--model ViT-L/14
This demo generates a directory that contains an html file and an image directory. Download the directory and open the html file to see the results.
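Because the demo takes its query images and prompts as comma-separated strings, it may be convenient to assemble those strings from lists and catch obvious mistakes first. The following is a sketch under stated assumptions: `demo_arguments` and `join_values` are hypothetical helpers, the rule that individual values must not contain commas is inferred from the comma-separated format rather than from demo.py itself, and the check for `*` follows the note that `*` marks the token replaced by an image token.

```python
def join_values(values):
    """Join values with commas, rejecting any value that itself contains a comma.

    Assumption: a comma inside a path or prompt would be read as a separator.
    """
    for value in values:
        if "," in value:
            raise ValueError(f"value contains a comma: {value!r}")
    return ",".join(values)


def demo_arguments(image_paths, prompts):
    """Build the --query_file and --prompts arguments for src/demo.py.

    Hypothetical helper: it only formats the two arguments and checks
    that every prompt contains the * placeholder.
    """
    for prompt in prompts:
        if "*" not in prompt:
            raise ValueError(f"prompt has no * placeholder: {prompt!r}")
    return [
        "--query_file", join_values(image_paths),
        "--prompts", join_values(prompts),
    ]


if __name__ == "__main__":
    # Example file names and prompts are placeholders, not files in this repo.
    print(demo_arguments(["img1.jpg", "img2.jpg"], ["a sketch of *"]))
```

The returned list can be appended to the remaining demo arguments shown above when building the full command.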
If you find this repository useful, please consider citing:
@article{saito2023pic2word,
title={Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval},
author={Saito, Kuniaki and Sohn, Kihyuk and Zhang, Xiang and Li, Chun-Liang and Lee, Chen-Yu and Saenko, Kate and Pfister, Tomas},
journal={CVPR},
year={2023}
}