Quan Sun1, Yuxin Fang2,1, Ledell Wu1, Xinlong Wang1, Yue Cao1
We launch EVA-CLIP, a series of models that significantly improve the efficiency and effectiveness of CLIP training. Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters but significantly smaller training costs.
Notably, using exclusively publicly accessible training data, our large-sized EVA-02 CLIP-L/14 reaches up to 80.4 zero-shot top-1 accuracy on ImageNet-1K, outperforming the previous largest & best open-sourced CLIP with only ~1/6 of the parameters and ~1/6 of the image-text training data. Our largest 5.0B-parameter EVA-02 CLIP-E/14, with only 9 billion seen samples, achieves 82.0 zero-shot top-1 accuracy on ImageNet-1K.
Table of Contents
- Summary of EVA-CLIP performance
- Model Card
- Setup
- Usage
- Evaluation of Zero-shot Image Classification Performance
- Pre-training
- Extracting image and text features
- BibTeX & Citation
- Acknowledgement
Summary of CLIP models' ImageNet-1K zero-shot classification performance. The diameter of each circle corresponds to forward GFLOPs x the number of training samples.
- `model name`: the checkpoint name of EVA-CLIP.
- `image enc. init. ckpt`: the checkpoint used to initialize the image encoder of EVA-CLIP.
- `text enc. init. ckpt`: the checkpoint used to initialize the text encoder of EVA-CLIP.
- `weight`: the link to download the EVA-CLIP checkpoint.
Image encoder MIM teacher: OpenAI CLIP-Large.
| model name | image enc. init. ckpt | text enc. init. ckpt | total #params | training precision | training data | training batch size | gpus for training | IN-1K zero-shot top-1 | MSCOCO T2I R@5 | weight |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EVA01_CLIP_g_14_psz14_s11B | EVA01_g_psz14 | openai/clip-vit-large-patch14 | 1.1B | fp16 | LAION-400M | 41K | 256 A100(40GB) | 78.5 | 68.5 | 🤗 HF link (2.2GB) |
| EVA01_CLIP_g_14_plus_psz14_s11B | EVA01_g_psz14 | laion/CLIP-ViT-H-14-laion2B-s32B-b79K | 1.3B | fp16 | Merged-2B | 114K | 112 A100(40GB) | 79.3 | 74.0 | 🤗 HF link (2.7GB) |
Image encoder MIM teacher: EVA01_CLIP_g_14_psz14_s11B.
| model name | image enc. init. ckpt | text enc. init. ckpt | total #params | training precision | training data | training batch size | gpus for training | IN-1K zero-shot top-1 | MSCOCO T2I R@5 | weight |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EVA02_CLIP_B_psz16_s8B | EVA02_B_psz14to16 | openai/clip-vit-base-patch16 | 149M | fp16 | Merged-2B | 131K | 64 A100(40GB) | 74.7 | 66.9 | 🤗 HF link (300MB) |
| EVA02_CLIP_L_psz14_s4B | EVA02_L_psz14 | openai/clip-vit-large-patch14 | 428M | fp16 | Merged-2B | 131K | 128 A100(40GB) | 79.8 | 71.2 | 🤗 HF link (856MB) |
| EVA02_CLIP_L_336_psz14_s6B | EVA02_CLIP_L_psz14_224to336 | EVA02_CLIP_L_psz14_224to336 | 428M | fp16 | Merged-2B | 61K | 128 A100(40GB) | 80.4 | 71.7 | 🤗 HF link (856MB) |
| EVA02_CLIP_E_psz14_s4B | EVA02_E_psz14 | laion/CLIP-ViT-H-14-laion2B-s32B-b79K | 4.7B | fp16 | LAION-2B | 115K | 144 A100(80GB) | 81.9 | 74.7 | 🤗 HF link (9.4GB) |
| EVA02_CLIP_E_psz14_plus_s9B | EVA02_E_psz14 | laion/CLIP-ViT-bigG-14-laion2B-39B-b160k | 5.0B | bf16 | LAION-2B | 144K | 144 A100(80GB) | 82.0 | 75.0 | 🤗 HF link (10.1GB) |
- The download links of `image enc. init. ckpt` and `text enc. init. ckpt` are summarized here.
- To construct Merged-2B, we merged 1.6 billion samples from the LAION-2B dataset with 0.4 billion samples from COYO-700M.
- To our knowledge, the EVA-CLIP series are the most performant open-sourced CLIP models at all scales, evaluated via zero-shot classification performance, especially on mainstream classification benchmarks such as ImageNet and its variants. For more details about EVA-CLIP, please refer to our paper.
First, clone the repo and install required packages:
conda create --name rei python=3.8 -y
conda activate rei
git clone git@github.com:baaivision/EVA.git
cd EVA/EVA-CLIP
pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
pip install -r requirements.txt
Then, install Apex and xFormer following the official instructions.
Core packages:
- PyTorch version 1.12.1
- torchvision version 0.13.1
- timm version 0.5.4
- DeepSpeed version 0.6.5 (`fp16` training and ZeRO optimizer)
- Apex (fused layer norm)
- xFormer (fast and memory-efficient MHSA)
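As a quick check that the environment matches the list above, the following snippet (an optional sanity check, not part of the repository) prints the installed versions:

```python
# Optional sanity check for the environment described above.
import torch
import torchvision
import timm
import deepspeed

print("torch:", torch.__version__)            # expect 1.12.1
print("torchvision:", torchvision.__version__)
print("timm:", timm.__version__)              # expect 0.5.4
print("deepspeed:", deepspeed.__version__)    # expect 0.6.5
print("CUDA available:", torch.cuda.is_available())
```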
import torch
from eva_clip import create_model_and_transforms, get_tokenizer
from PIL import Image
model_name = "EVA02-CLIP-B-16"
pretrained = "eva_clip" # or "/path/to/EVA02_CLIP_B_psz16_s8B.pt"
image_path = "CLIP.png"
caption = ["a diagram", "a dog", "a cat"]
device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = create_model_and_transforms(model_name, pretrained, force_custom_clip=True)
tokenizer = get_tokenizer(model_name)
model = model.to(device)
image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
text = tokenizer(caption).to(device)
with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print("Label probs:", text_probs) # prints: [[0.8275, 0.1372, 0.0352]]
We use the standard IN-1K dataset (1.2M images). Download it from http://image-net.org. Then, move and extract the training and validation images into labeled subfolders using this shell script.
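Before evaluating, it can help to confirm the layout. The snippet below is a minimal sanity check assuming the standard ImageNet-1K layout (1,000 class subfolders containing `*.JPEG` files); the path is a placeholder:

```python
# Sanity-check the ImageNet-1K val layout: 1,000 class subfolders, 50 images each.
from pathlib import Path

val_dir = Path("/path/to/IN-1K/val")  # placeholder, same as DATA_PATH in the scripts below
class_dirs = [d for d in val_dir.iterdir() if d.is_dir()]
num_images = sum(1 for d in class_dirs for _ in d.glob("*.JPEG"))

print(f"{len(class_dirs)} classes, {num_images} images")  # expect 1000 classes, 50000 images
```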
Evaluate the EVA01_CLIP_g_14_psz14_s11B on IN-1K val using a single node with 1 GPU (click to expand).
MODEL_NAME=EVA01-CLIP-g-14
PRETRAINED=/path/to/EVA01_CLIP_g_14_psz14_s11B.pt
# can set PRETRAINED=eva to automatically download and load weights; please check details in pretrained.py
# PRETRAINED=eva_clip
DATA_PATH=/path/to/IN-1K/val
cd rei
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=$WORLD_SIZE --node_rank=$RANK \
--master_addr=$MASTER_ADDR --master_port=12355 --use_env training/main.py \
--imagenet-val ${DATA_PATH} \
--model ${MODEL_NAME} \
--pretrained ${PRETRAINED} \
--force-custom-clip \
--enable_deepspeed
Evaluate the EVA01_CLIP_g_14_plus_psz14_s11B on IN-1K val using a single node with 1 GPU (click to expand).
MODEL_NAME=EVA01-CLIP-g-14-plus
PRETRAINED=/path/to/EVA01_CLIP_g_14_plus_psz14_s11B.pt
# can set PRETRAINED=eva to automatically download and load weights; please check details in pretrained.py
# PRETRAINED=eva_clip
DATA_PATH=/path/to/IN-1K/val
cd rei
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=$WORLD_SIZE --node_rank=$RANK \
--master_addr=$MASTER_ADDR --master_port=12355 --use_env training/main.py \
--imagenet-val ${DATA_PATH} \
--model ${MODEL_NAME} \
--pretrained ${PRETRAINED} \
--force-custom-clip \
--enable_deepspeed
Evaluate the EVA02_CLIP_B_psz16_s8B on IN-1K val using a single node with 1 GPU (click to expand).
MODEL_NAME=EVA02-CLIP-B-16
PRETRAINED=/path/to/EVA02_CLIP_B_psz16_s8B.pt
# can set PRETRAINED=eva to automatically download and load weights; please check details in pretrained.py
# PRETRAINED=eva_clip
DATA_PATH=/path/to/IN-1K/val
cd rei
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=$WORLD_SIZE --node_rank=$RANK \
--master_addr=$MASTER_ADDR --master_port=12355 --use_env training/main.py \
--imagenet-val ${DATA_PATH} \
--model ${MODEL_NAME} \
--pretrained ${PRETRAINED} \
--force-custom-clip \
--enable_deepspeed
Evaluate the EVA02_CLIP_L_psz14_s4B on IN-1K val using a single node with 1 GPU (click to expand).
MODEL_NAME=EVA02-CLIP-L-14
PRETRAINED=/path/to/EVA02_CLIP_L_psz14_s4B.pt
# can set PRETRAINED=eva to automatically download and load weights; please check details in pretrained.py
# PRETRAINED=eva_clip
DATA_PATH=/path/to/IN-1K/val
cd rei
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=$WORLD_SIZE --node_rank=$RANK \
--master_addr=$MASTER_ADDR --master_port=12355 --use_env training/main.py \
--imagenet-val ${DATA_PATH} \
--model ${MODEL_NAME} \
--pretrained ${PRETRAINED} \
--force-custom-clip \
--enable_deepspeed
Evaluate the EVA02_CLIP_L_336_psz14_s6B on IN-1K val using a single node with 1 GPU (click to expand).
MODEL_NAME=EVA02-CLIP-L-14-336
PRETRAINED=/path/to/EVA02_CLIP_L_336_psz14_s6B.pt
# can set PRETRAINED=eva to automatically download and load weights; please check details in pretrained.py
# PRETRAINED=eva_clip
DATA_PATH=/path/to/IN-1K/val
cd rei
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=$WORLD_SIZE --node_rank=$RANK \
--master_addr=$MASTER_ADDR --master_port=12355 --use_env training/main.py \
--imagenet-val ${DATA_PATH} \
--model ${MODEL_NAME} \
--pretrained ${PRETRAINED} \
--force-custom-clip \
--enable_deepspeed
Evaluate the EVA02_CLIP_E_psz14_s4B on IN-1K val using a single node with 1 GPU (click to expand).
MODEL_NAME=EVA02-CLIP-bigE-14
PRETRAINED=/path/to/EVA02_CLIP_E_psz14_s4B.pt
# can set PRETRAINED=eva to automatically download and load weights; please check details in pretrained.py
# PRETRAINED=eva_clip
DATA_PATH=/path/to/IN-1K/val
cd rei
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=$WORLD_SIZE --node_rank=$RANK \
--master_addr=$MASTER_ADDR --master_port=12355 --use_env training/main.py \
--imagenet-val ${DATA_PATH} \
--model ${MODEL_NAME} \
--pretrained ${PRETRAINED} \
--force-custom-clip \
--enable_deepspeed
Evaluate the EVA02_CLIP_E_psz14_plus_s9B on IN-1K val using a single node with 1 GPU (click to expand).
MODEL_NAME=EVA02-CLIP-bigE-14-plus
PRETRAINED=/path/to/EVA02_CLIP_E_psz14_plus_s9B.pt
# can set PRETRAINED=eva to automatically download and load weights; please check details in pretrained.py
# PRETRAINED=eva_clip
DATA_PATH=/path/to/IN-1K/val
cd rei
python -m torch.distributed.launch --nproc_per_node=1 --nnodes=$WORLD_SIZE --node_rank=$RANK \
--master_addr=$MASTER_ADDR --master_port=12355 --use_env training/main.py \
--imagenet-val ${DATA_PATH} \
--model ${MODEL_NAME} \
--pretrained ${PRETRAINED} \
--force-custom-clip \
--enable_deepspeed
We provide instructions for pre-training EVA-CLIP on the LAION-2B dataset and the Merged-2B dataset (coming very soon).
Please prepare the LAION-2B and COYO-700M datasets.
- To construct Merged-2B, merge 1.6 billion random samples from the LAION-2B dataset with 0.4 billion random samples from COYO-700M.
Please prepare the EVA-01, EVA-02, OpenAI CLIP, and OpenCLIP models.
| model name | total #params | training precision | download link |
| --- | --- | --- | --- |
| EVA01_g_psz14 | 1.0B | fp16 | 🤗 HF link (2.0GB) |
| EVA02_B_psz14to16 | 86M | fp16 | 🤗 HF link (176MB) |
| EVA02_L_psz14 | 304M | fp16 | 🤗 HF link (609MB) |
| EVA02_CLIP_L_psz14_224to336 | 428M | fp16 | 🤗 HF link (857MB) |
| EVA02_E_psz14 | 4.4B | fp16 | 🤗 HF link (8.7GB) |
| openai/clip-vit-base-patch16 | 149M | fp16 | 🤗 HF link (599MB) |
| openai/clip-vit-large-patch14 | 428M | fp16 | 🤗 HF link (1.7GB) |
| laion/CLIP-ViT-H-14-laion2B-s32B-b79K | 1.0B | bf16 | 🤗 HF link (3.9GB) |
| laion/CLIP-ViT-bigG-14-laion2B-39B-b160k | 1.8B | bf16 | 🤗 HF link: part1, part2 (9.9GB + 169M) |
- EVA02_B_psz14to16 interpolates the patch_embed kernel from 14x14 to 16x16 and the pos_embed from a 16x16 grid to a 14x14 grid (see the sketch after this list).
- EVA02_CLIP_L_psz14_224to336 interpolates the pos_embed from a 16x16 grid to a 24x24 grid for training EVA02_CLIP_L_336_psz14_s6B.
- laion/CLIP-ViT-bigG-14-laion2B-39B-b160k consists of 2 parts of weights: part1 and part2.
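For illustration, this is roughly how such a resize can be done with torch.nn.functional.interpolate; it is a sketch with placeholder dimensions, not the exact conversion script used to produce the released checkpoints:

```python
# Sketch: resize a 14x14 patch_embed kernel to 16x16 and a positional-embedding
# grid from 16x16 to 14x14 tokens, as described in the notes above.
# Illustrative only; shapes (embed_dim=768) are placeholders.
import torch
import torch.nn.functional as F

# patch_embed.weight: (embed_dim, 3, 14, 14) -> (embed_dim, 3, 16, 16)
patch_weight = torch.randn(768, 3, 14, 14)
patch_weight_16 = F.interpolate(patch_weight, size=(16, 16), mode="bicubic", align_corners=False)

# pos_embed: (1, 1 + 16*16, embed_dim) -> (1, 1 + 14*14, embed_dim), keeping the cls token
pos_embed = torch.randn(1, 1 + 16 * 16, 768)
cls_pos, grid_pos = pos_embed[:, :1], pos_embed[:, 1:]
grid_pos = grid_pos.reshape(1, 16, 16, 768).permute(0, 3, 1, 2)            # (1, C, 16, 16)
grid_pos = F.interpolate(grid_pos, size=(14, 14), mode="bicubic", align_corners=False)
grid_pos = grid_pos.permute(0, 2, 3, 1).reshape(1, 14 * 14, 768)
new_pos_embed = torch.cat([cls_pos, grid_pos], dim=1)                      # (1, 1 + 14*14, C)
```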
Pre-train EVA01_CLIP_g_14_plus_psz14_s11B on Merged-2B with 14 nodes (click to expand).
MODEL=EVA01-CLIP-g-14-plus
PRETRAINED_IMAGE=/path/to/EVA01_g_psz14.pt
PRETRAINED_TEXT=/path/to/laion/CLIP-ViT-H-14-laion2B-s32B-b79K/pytorch_model.bin
PRETRAINED_VISUAL_MODEL=EVA01-g-14-plus
PRETRAINED_TEXT_MODEL=OpenCLIP-H-14
# can automatically download and load pretrained models with the following 4 lines; please check details in pretrained.py
# PRETRAINED_IMAGE=eva
# PRETRAINED_TEXT=laion2b_s32b_b79k
# PRETRAINED_VISUAL_MODEL=EVA01-g-14-plus
# PRETRAINED_TEXT_MODEL=OpenCLIP-H-14
# Following OpenCLIP, we preprocess data by webdataset. We concat paths of LAION-2B and COYO-700M with `;`.
MERGE_2B_DATA_PATH="/path/to/laion2b_en_data/img_data/{000000..164090}.tar;/path/to/coyo700m_en_data/img_data/{000000..047435}.tar"
# LAION_2B_DATA_PATH="/path/to/laion2b_en_data/img_data/{000000..164090}.tar"
VAL_DATA_PATH=/path/to/IN-1K/val
cd rei
python -m torch.distributed.launch --nproc_per_node=8 \
--nnodes=$WORLD_SIZE --node_rank=$RANK \
--master_addr=$MASTER_ADDR --master_port=12355 --use_env \
training/main.py \
--save-frequency 1 \
--zeroshot-frequency 1 \
--report-to="wandb, tensorboard" \
--wandb-project-name="eva-clip" \
--wandb-notes="eva01_clip_g_plus_14" \
--train-num-samples 40000000 \
--dataset-resampled \
--train-data-list=${MERGE_2B_DATA_PATH} \
--dataset-type-list="webdataset;webdataset" \
--imagenet-val=${VAL_DATA_PATH} \
--warmup 2000 \
--batch-size=1024 \
--epochs=100 \
--lr=5e-4 \
--visual-lr=4e-4 \
--text-lr=4e-5 \
--wd=0.05 \
--visual-wd=0.05 \
--text-wd=0.05 \
--ld=1.0 \
--visual-ld=0.85 \
--text-ld=0.75 \
--grad-clip-norm=5.0 \
--smoothing=0. \
--workers=8 \
--model=${MODEL} \
--pretrained-image=${PRETRAINED_IMAGE} \
--pretrained-text=${PRETRAINED_TEXT} \
--pretrained-visual-model=${PRETRAINED_VISUAL_MODEL} \
--pretrained-text-model=${PRETRAINED_TEXT_MODEL} \
--skip-list head.weight head.bias lm_head.weight lm_head.bias mask_token text_projection logit_scale \
--seed 4096 \
--gather-with-grad \
--grad-checkpointing \
--local-loss \
--force-custom-clip \
--force-patch-dropout=0.5 \
--optimizer="lamb" \
--zero-stage=1 \
--enable-deepspeed
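The script above sets per-tower learning rates and layer-wise learning-rate decay via `--lr`/`--visual-lr`/`--text-lr` and `--ld`/`--visual-ld`/`--text-ld`. As a rough illustration of what layer-wise decay means, here is a minimal sketch of building optimizer parameter groups; the layer-indexing convention is an assumption for illustration, not necessarily the exact logic in training/main.py:

```python
# Sketch: build optimizer parameter groups with layer-wise lr decay (ld).
# The layer-indexing scheme below is a common convention and an assumption here.
import torch

def param_groups_with_layer_decay(model, num_layers, base_lr, ld, weight_decay):
    groups = {}
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # embeddings get the strongest decay, later transformer blocks the weakest
        if "blocks." in name:
            layer_id = int(name.split("blocks.")[1].split(".")[0]) + 1
        elif "patch_embed" in name or "pos_embed" in name or "cls_token" in name:
            layer_id = 0
        else:
            layer_id = num_layers
        scale = ld ** (num_layers - layer_id)
        if layer_id not in groups:
            groups[layer_id] = {"params": [], "lr": base_lr * scale, "weight_decay": weight_decay}
        groups[layer_id]["params"].append(param)
    return list(groups.values())

# e.g. visual tower with base lr 4e-4 and ld 0.85 as in the script above;
# the resulting groups are handed to the optimizer selected by --optimizer="lamb".
# groups = param_groups_with_layer_decay(model.visual, num_layers=40, base_lr=4e-4, ld=0.85, weight_decay=0.05)
```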
Pre-train EVA02_CLIP_B_psz16_s8B on Merged-2B with 8 nodes (click to expand).
MODEL=EVA02-CLIP-B-16
PRETRAINED_IMAGE=/path/to/EVA02_B_psz14to16.pt
PRETRAINED_TEXT=/path/to/openai/clip-vit-base-patch16/pytorch_model.bin
PRETRAINED_VISUAL_MODEL=EVA02-B-16
PRETRAINED_TEXT_MODEL=OpenaiCLIP-B-16
# can automatically download and load pretrained models with the following 4 lines; please check details in pretrained.py
# PRETRAINED_IMAGE=eva
# PRETRAINED_TEXT=openai
# PRETRAINED_VISUAL_MODEL=EVA02-B-16
# PRETRAINED_TEXT_MODEL=OpenaiCLIP-B-16
# Following OpenCLIP, we preprocess data by webdataset. We concat paths of LAION-2B and COYO-700M with `;`.
MERGE_2B_DATA_PATH="/path/to/laion2b_en_data/img_data/{000000..164090}.tar;/path/to/coyo700m_en_data/img_data/{000000..047435}.tar"
# LAION_2B_DATA_PATH="/path/to/laion2b_en_data/img_data/{000000..164090}.tar"
VAL_DATA_PATH=/path/to/IN-1K/val
cd rei
python -m torch.distributed.launch --nproc_per_node=8 \
--nnodes=$WORLD_SIZE --node_rank=$RANK \
--master_addr=$MASTER_ADDR --master_port=12355 --use_env \
training/main.py \
--save-frequency 1 \
--zeroshot-frequency 1 \
--report-to="wandb, tensorboard" \
--wandb-project-name="eva-clip" \
--wandb-notes="eva02_clip_B_16" \
--train-num-samples 40000000 \
--dataset-resampled \
--train-data-list=${MERGE_2B_DATA_PATH} \
--dataset-type-list="webdataset;webdataset" \
--imagenet-val=${VAL_DATA_PATH} \
--warmup 2000 \
--batch-size=2048 \
--epochs=200 \
--lr=5e-4 \
--visual-lr=2e-4 \
--text-lr=2e-5 \
--wd=0.05 \
--visual-wd=0.05 \
--text-wd=0.05 \
--ld=1.0 \
--visual-ld=0.75 \
--text-ld=0.75 \
--grad-clip-norm=5.0 \
--smoothing=0. \
--workers=8 \
--model=${MODEL} \
--pretrained-image=${PRETRAINED_IMAGE} \
--pretrained-text=${PRETRAINED_TEXT} \
--pretrained-visual-model=${PRETRAINED_VISUAL_MODEL} \
--pretrained-text-model=${PRETRAINED_TEXT_MODEL} \
--skip-list head.weight head.bias lm_head.weight lm_head.bias mask_token text_projection logit_scale \
--seed 4096 \
--gather-with-grad \
--grad-checkpointing \
--local-loss \
--force-custom-clip \
--force-patch-dropout=0 \
--optimizer="lamb" \
--zero-stage=1 \
--enable-deepspeed
Pre-train EVA02_CLIP_L_psz14_s4B on Merged-2B with 16 nodes (click to expand).
MODEL=EVA02-CLIP-L-14
PRETRAINED_IMAGE=/path/to/EVA02_L_psz14.pt
PRETRAINED_TEXT=/path/to/openai/clip-vit-large-patch14/pytorch_model.bin
PRETRAINED_VISUAL_MODEL=EVA02-L-14
PRETRAINED_TEXT_MODEL=OpenaiCLIP-L-14
# can automatically download and load pretrained models with the following 4 lines; please check details in pretrained.py
# PRETRAINED_IMAGE=eva
# PRETRAINED_TEXT=openai
# PRETRAINED_VISUAL_MODEL=EVA02-L-14
# PRETRAINED_TEXT_MODEL=OpenaiCLIP-L-14
# Following OpenCLIP, we preprocess data by webdataset. We concat paths of LAION-2B and COYO-700M with `;`.
MERGE_2B_DATA_PATH="/path/to/laion2b_en_data/img_data/{000000..164090}.tar;/path/to/coyo700m_en_data/img_data/{000000..047435}.tar"
# LAION_2B_DATA_PATH="/path/to/laion2b_en_data/img_data/{000000..164090}.tar"
VAL_DATA_PATH=/path/to/IN-1K/val
cd rei
python -m torch.distributed.launch --nproc_per_node=8 \
--nnodes=$WORLD_SIZE --node_rank=$RANK \
--master_addr=$MASTER_ADDR --master_port=12355 --use_env \
training/main.py \
--save-frequency 1 \
--zeroshot-frequency 1 \
--report-to="wandb, tensorboard" \
--wandb-project-name="eva-clip" \
--wandb-notes="eva02_clip_L_14" \
--train-num-samples 40000000 \
--dataset-resampled \
--train-data-list=${MERGE_2B_DATA_PATH} \
--dataset-type-list="webdataset;webdataset" \
--imagenet-val=${VAL_DATA_PATH} \
--warmup 2000 \
--batch-size=1024 \
--epochs=100 \
--lr=5e-4 \
--visual-lr=4e-4 \
--text-lr=4e-5 \
--wd=0.05 \
--visual-wd=0.05 \
--text-wd=0.05 \
--ld=1.0 \
--visual-ld=0.85 \
--text-ld=0.75 \
--grad-clip-norm=5.0 \
--smoothing=0. \
--workers=8 \
--model=${MODEL} \
--pretrained-image=${PRETRAINED_IMAGE} \
--pretrained-text=${PRETRAINED_TEXT} \
--pretrained-visual-model=${PRETRAINED_VISUAL_MODEL} \
--pretrained-text-model=${PRETRAINED_TEXT_MODEL} \
--skip-list head.weight head.bias lm_head.weight lm_head.bias mask_token text_projection logit_scale \
--seed 4096 \
--gather-with-grad \
--grad-checkpointing \
--local-loss \
--force-custom-clip \
--force-patch-dropout=0 \
--optimizer="lamb" \
--zero-stage=1 \
--enable-deepspeed
Pre-train EVA02_CLIP_L_psz14_224to336 on Merged-2B with 16 nodes (click to expand).
MODEL=EVA02-CLIP-L-14-336
PRETRAINED=/path/to/EVA02_CLIP_L_psz14_224to336.pt
# can automatically download and load pretrained models with the following 2 lines; please check details in pretrained.py
# MODEL=EVA02-CLIP-L-14-336
# PRETRAINED=eva_clip_224to336
# Following OpenCLIP, we preprocess data by webdataset. We concat paths of LAION-2B and COYO-700M with `;`.
MERGE_2B_DATA_PATH="/path/to/laion2b_en_data/img_data/{000000..164090}.tar;/path/to/coyo700m_en_data/img_data/{000000..047435}.tar"
# LAION_2B_DATA_PATH="/path/to/laion2b_en_data/img_data/{000000..164090}.tar"
VAL_DATA_PATH=/path/to/IN-1K/val
cd rei
python -m torch.distributed.launch --nproc_per_node=8 \
--nnodes=$WORLD_SIZE --node_rank=$RANK \
--master_addr=$MASTER_ADDR --master_port=12355 --use_env \
training/main.py \
--save-frequency 1 \
--zeroshot-frequency 1 \
--report-to="wandb, tensorboard" \
--wandb-project-name="eva-clip" \
--wandb-notes="eva02_clip_L_14_336" \
--train-num-samples 40000000 \
--dataset-resampled \
--train-data-list=${MERGE_2B_DATA_PATH} \
--dataset-type-list="webdataset;webdataset" \
--imagenet-val=${VAL_DATA_PATH} \
--warmup 2000 \
--batch-size=480 \
--epochs=50 \
--lr=5e-4 \
--visual-lr=4e-4 \
--text-lr=4e-5 \
--wd=0.05 \
--visual-wd=0.05 \
--text-wd=0.05 \
--ld=1.0 \
--visual-ld=0.75 \
--text-ld=0.65 \
--grad-clip-norm=5.0 \
--smoothing=0. \
--workers=8 \
--model=${MODEL} \
--pretrained=${PRETRAINED} \
--skip-list head.weight head.bias lm_head.weight lm_head.bias mask_token text_projection logit_scale \
--seed 4096 \
--gather-with-grad \
--grad-checkpointing \
--local-loss \
--force-custom-clip \
--force-patch-dropout=0 \
--optimizer="lamb" \
--zero-stage=1 \
--enable-deepspeed
Pre-train EVA02_CLIP_E_psz14_s4B on LAION-2B with 18 nodes (click to expand).
MODEL=EVA02-CLIP-bigE-14
PRETRAINED_IMAGE=/path/to/EVA02_E_psz14.pt
PRETRAINED_TEXT=/path/to/laion/CLIP-ViT-H-14-laion2B-s32B-b79K/pytorch_model.bin
PRETRAINED_VISUAL_MODEL=EVA02-bigE-14
PRETRAINED_TEXT_MODEL=OpenCLIP-H-14
# can automatically download and load pretrained models with the following 4 lines; please check details in pretrained.py
# PRETRAINED_IMAGE=eva
# PRETRAINED_TEXT=laion2b_s32b_b79k
# PRETRAINED_VISUAL_MODEL=EVA02-bigE-14
# PRETRAINED_TEXT_MODEL=OpenCLIP-H-14
# Following OpenCLIP, we preprocess data by webdataset. We concat paths of LAION-2B and COYO-700M with `;`.
# MERGE_2B_DATA_PATH="/path/to/laion2b_en_data/img_data/{000000..164090}.tar;/path/to/coyo700m_en_data/img_data/{000000..047435}.tar"
LAION_2B_DATA_PATH="/path/to/laion2b_en_data/img_data/{000000..164090}.tar"
VAL_DATA_PATH=/path/to/IN-1K/val
cd rei
python -m torch.distributed.launch --nproc_per_node=8 \
--nnodes=$WORLD_SIZE --node_rank=$RANK \
--master_addr=$MASTER_ADDR --master_port=12355 --use_env \
training/main.py \
--save-frequency 1 \
--zeroshot-frequency 1 \
--report-to="wandb, tensorboard" \
--wandb-project-name="eva-clip" \
--wandb-notes="eva02_clip_E_14" \
--train-num-samples 40000000 \
--dataset-resampled \
--train-data=${LAION_2B_DATA_PATH} \
--dataset-type="webdataset" \
--imagenet-val=${VAL_DATA_PATH} \
--warmup 2000 \
--batch-size=800 \
--epochs=100 \
--lr=5e-4 \
--visual-lr=4e-4 \
--text-lr=4e-5 \
--wd=0.05 \
--visual-wd=0.05 \
--text-wd=0.05 \
--ld=1.0 \
--visual-ld=0.9 \
--text-ld=0.75 \
--grad-clip-norm=5.0 \
--smoothing=0. \
--workers=8 \
--model=${MODEL} \
--pretrained-image=${PRETRAINED_IMAGE} \
--pretrained-text=${PRETRAINED_TEXT} \
--pretrained-visual-model=${PRETRAINED_VISUAL_MODEL} \
--pretrained-text-model=${PRETRAINED_TEXT_MODEL} \
--skip-list head.weight head.bias lm_head.weight lm_head.bias mask_token text_projection logit_scale \
--seed 4096 \
--gather-with-grad \
--grad-checkpointing \
--local-loss \
--force-custom-clip \
--force-patch-dropout=0.5 \
--optimizer="lamb" \
--zero-stage=1 \
--enable-deepspeed
Pre-train EVA02_CLIP_E_psz14_plus_s9B on LAION-2B with 18 nodes (click to expand).
MODEL=EVA02-CLIP-bigE-14-plus
PRETRAINED_IMAGE=/path/to/EVA02_E_psz14.pt
PRETRAINED_TEXT=/path/to/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k/pytorch_model.bin # the ckpt is split into 2 parts; merge them first, then load.
PRETRAINED_VISUAL_MODEL=EVA02-bigE-14
PRETRAINED_TEXT_MODEL=OpenCLIP-bigG-14
# can automatically download and load pretrained models with the following 4 lines; please check details in pretrained.py
# PRETRAINED_IMAGE=eva
# PRETRAINED_TEXT=laion2b_s39b_b160k
# PRETRAINED_VISUAL_MODEL=EVA02-bigE-14
# PRETRAINED_TEXT_MODEL=OpenCLIP-bigG-14
# Following OpenCLIP, we preprocess data by webdataset. We concat paths of LAION-2B and COYO-700M with `;`.
# MERGE_2B_DATA_PATH="/path/to/laion2b_en_data/img_data/{000000..164090}.tar;/path/to/coyo700m_en_data/img_data/{000000..047435}.tar"
LAION_2B_DATA_PATH="/path/to/laion2b_en_data/img_data/{000000..164090}.tar"
VAL_DATA_PATH=/path/to/IN-1K/val
cd rei
python -m torch.distributed.launch --nproc_per_node=8 \
--nnodes=$WORLD_SIZE --node_rank=$RANK \
--master_addr=$MASTER_ADDR --master_port=12355 --use_env \
training/main.py \
--save-frequency 1 \
--zeroshot-frequency 1 \
--report-to="wandb, tensorboard" \
--wandb-project-name="eva-clip" \
--wandb-notes="eva02_clip_E_14" \
--train-num-samples 40000000 \
--dataset-resampled \
--train-data=${LAION_2B_DATA_PATH} \
--dataset-type="webdataset" \
--imagenet-val=${VAL_DATA_PATH} \
--warmup 2000 \
--batch-size=1000 \
--epochs=100 \
--lr=5e-4 \
--visual-lr=4e-4 \
--text-lr=4e-5 \
--wd=0.05 \
--visual-wd=0.05 \
--text-wd=0.05 \
--ld=1.0 \
--visual-ld=0.9 \
--text-ld=0.75 \
--grad-clip-norm=5.0 \
--smoothing=0. \
--workers=8 \
--model=${MODEL} \
--name='eva-vit-4b-14-text-bigG-x-lamb-patch_drop-18nodes-b144k-laion2b' \
--pretrained-image=${PRETRAINED_IMAGE} \
--pretrained-text=${PRETRAINED_TEXT} \
--pretrained-visual-model=${PRETRAINED_VISUAL_MODEL} \
--pretrained-text-model=${PRETRAINED_TEXT_MODEL} \
--skip-list head.weight head.bias lm_head.weight lm_head.bias mask_token text_projection logit_scale \
--seed 4096 \
--gather-with-grad \
--grad-checkpointing \
--local-loss \
--force-custom-clip \
--force-patch-dropout=0.5 \
--optimizer="lamb" \
--zero-stage=1 \
--enable-deepspeed
We also support extracting image and text features in a distributed manner and saving them in .npy format. Here is an example:
MODEL=EVA02-CLIP-B-16
PRETRAINED=eva_clip
LAION_2B_DATA_PATH="/path/to/laion2b_en_data/img_data/{000000..164090}.tar"
IMG_EMB_PATH="/path/to/store/output/image_embedding"
TEXT_EMB_PATH="/path/to/store/output/text_embedding"
cd rei
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=$WORLD_SIZE --node_rank=$RANK \
--master_addr=$MASTER_ADDR --master_port=12355 --use_env training/main.py \
--val-data=${LAION_2B_DATA_PATH} \
--val-num-samples 2000000000 \
--batch-size 1024 \
--model ${MODEL} \
--force-custom-clip \
--pretrained ${PRETRAINED} \
--extract-features \
--img-emb-path ${IMG_EMB_PATH} \
--text-emb-path ${TEXT_EMB_PATH} \
--save-interval 10 \
--enable_deepspeed
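The saved embeddings can then be loaded with NumPy. The file names below are placeholders (the actual naming is determined by training/main.py); this is only a sketch of reading the arrays back and computing cosine similarities:

```python
# Sketch: load saved embeddings and compute image-text cosine similarities.
# File names are placeholders; check the output directories for the real naming.
import numpy as np

img_emb = np.load("/path/to/store/output/image_embedding/img_emb_0.npy")
text_emb = np.load("/path/to/store/output/text_embedding/text_emb_0.npy")

# L2-normalize and compute cosine similarity between image and text features
img_emb = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
similarity = img_emb @ text_emb.T

print(similarity.shape)            # (num_images, num_texts)
print(similarity.diagonal()[:5])   # matched image-text pairs should score highest
```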
@article{EVA-CLIP,
title={EVA-CLIP: Improved Training Techniques for CLIP at Scale},
author={Sun, Quan and Fang, Yuxin and Wu, Ledell and Wang, Xinlong and Cao, Yue},
journal={arXiv preprint arXiv:2303.15389},
year={2023}
}
EVA-CLIP is built using the awesome OpenCLIP, EVA-01, CLIP, timm, DeepSpeed, Apex and xFormer.