End-of-studies internship project
This repository is a fork of the mlfoundations implementation for training and using the CLIP model; please check their README as well as this one!
This project aims to deliver a way to use and fine-tune the CLIP model.
```bash
docker build . -t open-clip
docker run --rm -ti -v ${PWD}:/home/open-clip -v /my_dataset:/home/open-clip/my_dataset open-clip:latest
```
- Add `--gpus '"device=0,1,2,3"'` to use GPUs.
- Add `-v /dev/shm:/dev/shm` if you are training on multiple GPUs (this gives the container access to shared memory).
⚠ This project uses Python 3.7.
Each of the following metrics is available both from image to text and from text to image:
- Median rank
- Mean rank
- R@X accuracy: accuracy on the question 'is the ground truth in the top-X ranked answers?'
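
As an illustration only (not the repository's implementation), these retrieval metrics can be derived from an image-to-text similarity matrix in which row i's ground truth is column i; applying the same function to the transposed matrix gives the text-to-image direction:
```python
import torch

def retrieval_metrics(similarity: torch.Tensor, tops=(1, 2, 3, 5, 10)):
    """similarity[i, j] = score of query i against candidate j; ground truth on the diagonal."""
    n = similarity.size(0)
    # Position (1-based) of the ground-truth candidate in each query's ranking.
    order = similarity.argsort(dim=-1, descending=True)
    ranks = (order == torch.arange(n, device=similarity.device).unsqueeze(1)).nonzero()[:, 1] + 1
    metrics = {
        "median_rank": ranks.float().median().item(),
        "mean_rank": ranks.float().mean().item(),
    }
    for k in tops:
        metrics[f"R@{k}"] = (ranks <= k).float().mean().item()  # ground truth in top-k?
    return metrics
```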
The OpenCLIP project tries to reach the same metrics as presented in OpenAI's paper. You can choose between OpenAI's pre-trained weights and open-clip pre-trained weights, for each available architecture and several pre-training datasets.
Use the following commands to list the available models with their pre-trained weights and to load a model.
```python
import open_clip

open_clip.list_pretrained()
model, train_transform, eval_transform = open_clip.create_model_and_transforms('ViT-B-32', pretrained='laion2b_e16')
```
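
For example, `list_pretrained()` returns `(architecture, pretrained_tag)` pairs, so the tags available for one architecture can be filtered out like this (the exact output depends on the installed open_clip version):
```python
import open_clip

# Keep only the pretrained tags published for the ViT-B-32 architecture.
vit_b_32_tags = [tag for name, tag in open_clip.list_pretrained() if name == 'ViT-B-32']
print(vit_b_32_tags)  # e.g. ['openai', 'laion400m_e31', 'laion400m_e32', 'laion2b_e16', ...]
```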
To load other pre-trained image weights, use `pretrained_image`:
```python
import open_clip

model, train_transform, eval_transform = open_clip.create_model_and_transforms('ViT-B-32', pretrained_image='my_checkpoint_path')
```
Simple inference
```python
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32-quickgelu', pretrained='laion400m_e32')

image = preprocess(Image.open("CLIP.png")).unsqueeze(0)
text = open_clip.tokenize(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[1., 0., 0.]]
```
Evaluation

Use the `clip_test_acc.py` script:
- Compute simple metrics on the whole dataset (it may not fit on the GPU!):
```bash
python -m retrieval.main \
    --ground_truth_csv_path GROUND_TRUTH_CSV_PATH \
    --csv_img_key CSV_IMG_KEY \
    --csv_caption_key CSV_CAPTION_KEY \
    --input_dir INPUT_DIR \
    --csv_separator CSV_SEPARATOR \
    [--network NETWORK] \
    [--checkpoint CHECKPOINT] \
    [--workers WORKERS] \
    [--device DEVICE] \
    [--pretrained PRETRAINED] \
    [--log_rate LOG_RATE] \
    [--tops TOPS]
```
where
- `ground_truth_csv_path` is the csv where you store the image filename, its label and its shooting id
- `csv_img_key` is the name of the column where the filenames are
- `csv_caption_key` is the name of the column where the labels are
- `input_dir` is the folder where the images are stored
- `csv_separator` is the separator character of your csv file
- `network` is the name of the network, see Available Models for more details
- `checkpoint` is the filename of the checkpoint
- `workers` is the number of workers
- `device` is `cpu` or `cuda`, default is `cpu`
- `pretrained` is the source of the pre-trained model, see Available Models for more details
- `log_rate` is the rate for printing the metrics, default is 10
- `tops` is the list of accuracy tops to compute, entered separated by spaces (e.g. `1 2 4 9`), default `1 2 3 5 10`
- To compute average metrics per shooting, add `--per_shooting --csv_shooting_key CSV_SHOOTING_KEY` to the retrieval command (a sketch of the expected csv follows this list).
- From the training main, from a local checkpoint:
```bash
python -m training.main \
    --val-data="/path/to/validation_data.csv" \
    --model RN101 \
    --pretrained /path/to/checkpoints/epoch_K.pt
```
or from a hosted pre-trained checkpoint:
```bash
python -m training.main \
    --imagenet-val /path/to/imagenet/validation \
    --model ViT-B-32-quickgelu \
    --pretrained laion400m_e32
```
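
For clarity, here is a minimal sketch of the ground-truth csv expected by the retrieval command; the column names used below (`filename`, `label`, `shooting_id`) are only placeholders, pass whichever names you actually use via `--csv_img_key`, `--csv_caption_key` and `--csv_shooting_key`:
```python
import pandas as pd

# Placeholder column names and values; adapt them to your dataset.
ground_truth = pd.DataFrame(
    {
        "filename": ["img_0001.jpg", "img_0002.jpg", "img_0003.jpg"],
        "label": ["red dress", "blue jeans", "red dress"],
        "shooting_id": [0, 0, 1],  # only needed with --per_shooting
    }
)
ground_truth.to_csv("ground_truth.csv", sep=",", index=False)  # match --csv_separator
```
With `--per_shooting`, the metrics described above are averaged per shooting group rather than reported over the whole csv at once.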
All the parameters can be found in `training/params.py`.
Single GPU (example)
```bash
python -m training.main \
    --save-frequency 1 \
    --zeroshot-frequency 1 \
    --report-to tensorboard \
    --train-data="path to train data csv" \
    --val-data="path to validation csv" \
    --csv-img-key filepath \
    --csv-caption-key title \
    --imagenet-val=/path/to/imagenet/root/val/ \
    --warmup 10000 \
    --batch-size=128 \
    --lr=1e-3 \
    --wd=0.1 \
    --epochs=30 \
    --workers=8 \
    --model RN50
```
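
For reference, a minimal sketch of the kind of csv `--train-data`/`--val-data` point to, with the two columns named by `--csv-img-key` and `--csv-caption-key` in the example above (the paths and captions below are made up):
```python
import pandas as pd

# Column names match the flags above: --csv-img-key filepath, --csv-caption-key title.
train = pd.DataFrame(
    {
        "filepath": ["/my_dataset/images/0001.jpg", "/my_dataset/images/0002.jpg"],
        "title": ["a photo of a red dress", "a photo of blue jeans"],
    }
)
# Use the same separator you pass to --csv-separator (upstream open_clip defaults to a tab).
train.to_csv("train.csv", sep="\t", index=False)
```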
Multi GPUs (example)
```bash
torchrun --nproc_per_node 4 -m training.main \
    --train-data="path to train data csv" \
    --val-data="path to validation csv" \
    --csv-img-key "new filename" \
    --csv-caption-key "food label" \
    --csv-separator ',' \
    --batch-size 128 \
    --precision amp \
    --workers 4 \
    --model ViT-B-32 \
    --epochs=40 \
    --save-frequency 15 \
    --pretrained 'openai' \
    --warmup 100 \
    --lr 5.0e-5 \
    --val-frequency 2
```
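
Once training has written checkpoints (`--save-frequency`), upstream open_clip also accepts a local checkpoint path as the `pretrained` argument, so a trained model can be loaded back for inference roughly like this (the path below is hypothetical):
```python
import open_clip

# Hypothetical path to a checkpoint written by training.main.
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32',
    pretrained='logs/my_run/checkpoints/epoch_30.pt',
)
model.eval()
```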
🔒 LiT
LiT consists in locking the image tower while leaving the text tower trainable. open-clip offers parameters to apply this technique when fine-tuning CLIP. Use the following parameters:
- `--lock-image` to lock the full image tower by disabling gradients.
- `--lock-image-unlocked-groups n` to leave the last n image tower layer groups unlocked.
- `--lock-image-freeze-bn-stats` to freeze BatchNorm running stats in the image tower for any locked layers.
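
For intuition only, a rough sketch of what locking the image tower amounts to (not the repository's exact code; in open_clip the image tower lives under `model.visual`):
```python
import open_clip

model, _, _ = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')

# Freeze every image-tower parameter so only the text tower (and logit scale)
# receives gradients during fine-tuning.
for param in model.visual.parameters():
    param.requires_grad = False

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} trainable parameter tensors remain")
```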
Weights and Biases
- Log in to Weights and Biases with `wandb login`
- Add `--report-to 'wandb'` to the script parameters
- Open your WandB dashboard, you're all set!
Some scripts are available inside `src/data` for dataset management.
`gather_cc.py` is an open-clip tool to download the Conceptual Captions dataset.
Articles
- CLIP, article, original code
- LiT: Zero-Shot Transfer with Locked-image text Tuning, article, code
Repositories