
Geometric Parameterization (GmP) of MLP and fine-tuning on CoCo-40k-SPRIGHT (spatially-right long-labels) eliminates typographic attack vulnerability in Long-CLIP (but not short-CLIP 77 tokens) + improves ImageNet/ObjectNet accuracy. #29

zer0int opened this issue May 31, 2024 · 2 comments


zer0int commented May 31, 2024

Dear researchers,

I just wanted to let you know about some findings I made with your amazing Long-CLIP model: while ViT-L/14 (77 tokens) also shows partial mitigation of the typographic attack vulnerability when GmP-fine-tuned on CoCo-SPRIGHT, Long-CLIP appears to show full mitigation of the vulnerability when fine-tuned with GmP on the full long labels of CoCo-SPRIGHT.
Exception: residual vulnerability with non-English text, most likely because such text is underrepresented in the pre-training data.
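
For anyone wondering what I mean by GmP: below is a minimal, illustrative sketch of the idea, re-parameterizing each weight row of the MLP's linear layers into a radial magnitude r and an angular direction theta, so that W_row = r * theta / ||theta||. Module names and details here are simplified assumptions; the exact implementation is in my fork.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):
    """Illustrative GmP linear layer: each weight row is stored as a
    radial magnitude r and an angular direction theta, and recomposed
    as W_row = r * theta / ||theta|| on every forward pass, so magnitude
    and direction receive separate gradients during fine-tuning."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data                                        # (out_features, in_features)
        self.r = nn.Parameter(w.norm(dim=1, keepdim=True))            # radial component
        self.theta = nn.Parameter(w / w.norm(dim=1, keepdim=True))    # angular component (unit rows)
        self.bias = nn.Parameter(linear.bias.data.clone()) if linear.bias is not None else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = self.r * F.normalize(self.theta, dim=1)              # recompose the weight matrix
        return F.linear(x, weight, self.bias)

# e.g. wrap the MLP projections of a loaded CLIP vision block (OpenAI-CLIP layout assumed):
# block = model.visual.transformer.resblocks[i]
# block.mlp.c_fc = GeometricLinear(block.mlp.c_fc)
# block.mlp.c_proj = GeometricLinear(block.mlp.c_proj)
```

After fine-tuning, the layer can be collapsed back into an ordinary nn.Linear by recomposing the weight once, so inference code does not need to change.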

For the classic OpenAI examples of "apple ipod" and "piggy bank poodle", however, the attack is fully mitigated:

[Image: apple-ipod-demo]

[Image: poodle-demo]

Accuracy on ImageNet/ObjectNet MVT also improved due to the GmP fine-tune (the screenshot below shows accuracy scores for the original LongCLIP-L and for GmP-LongCLIP at epochs 1/2 and at the final epoch). Training took 10 epochs, about 3-4 hours on a single RTX 4090 at batch_size 34. I would assume results could be even better on a multi-GPU setup with a more typical batch size!

[Image: logit-good-clip-my-long]
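
For reference, the evaluation itself is plain zero-shot classification by cosine similarity; a rough sketch is below. The prompt template and dataset iteration are placeholders, and the actual ImageNet/ObjectNet MVT wrapper is in my fork.

```python
import torch
import clip  # Long-CLIP exposes the same encode_image / encode_text interface

@torch.no_grad()
def zero_shot_accuracy(model, preprocess, dataset, class_names, device="cuda"):
    # encode one prompt per class once, then classify each image by cosine similarity
    prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    text_feats = model.encode_text(prompts)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

    correct = total = 0
    for pil_image, label in dataset:                  # dataset yields (PIL image, class index)
        image = preprocess(pil_image).unsqueeze(0).to(device)
        img_feat = model.encode_image(image)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        pred = (img_feat @ text_feats.T).argmax(dim=-1).item()
        correct += int(pred == label)
        total += 1
    return correct / total
```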

You can find the full details and code for reproduction in my fork of your repo. Kind regards!

beichenzbc (Owner) commented

Thank you for your amazing findings! Your research passion and professionalism really impress us. We're glad that our work can contribute to the community.

zer0int commented Jun 20, 2024

Update: By scaling the activation value of an "adverb neuron" [1] in the ViT by a factor of 1000 during fine-tuning, LongCLIP adjusted and found a "better minimum", further boosting accuracy (same dataset as last time, CoCo-SPRIGHT-40k):

[Image: LongCLIP-eval-2]

Compare to OpenAI ViT-L/14, trained in the exact same manner (hyperparameters, epochs, activation manipulation), but with short labels:

[Image: eval-clip-gpt4-compare]

As anticipated, your model already outperformed the "short-CLIP" (77 tokens) for many classes, and for some of those classes my fine-tune could not improve it further, whereas I was able to boost the performance of OpenAI/CLIP for all classes.

[1] Adverb neuron: a feature (Layer 22, Feature 2432) that, when its activation value is scaled x1000, leads CLIP (via gradient ascent, i.e. optimizing text embeddings for cosine similarity with the image embedding) to describe images using "visually meaningless" adverbs. It is potentially associated with the model's "text obsession" (typographic attack vulnerability). Scaling this activation x1000 initially leads to exploding gradients; however, the model is surprisingly robust and compensates during later epochs, eventually finding a solution that appears superior.

[Image: activation-scaling-longclip-ft]
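
For concreteness, the activation scaling itself can be done with a simple forward hook on the chosen MLP sub-layer of the ViT; a minimal sketch follows, assuming OpenAI-CLIP module naming. Exactly which sub-layer of block 22 is hooked, and the tensor layout, are assumptions here; the real code is in my fork.

```python
import torch

LAYER_IDX = 22        # transformer block in the vision tower
FEATURE_IDX = 2432    # the "adverb neuron" feature index
SCALE = 1000.0

def scale_adverb_neuron(module, inputs, output):
    # scale a single feature of the activation; clone so autograd sees a new tensor
    out = output.clone()
    out[..., FEATURE_IDX] = out[..., FEATURE_IDX] * SCALE
    return out

# assumes `model` is a loaded (Long-)CLIP ViT-L/14 with OpenAI-style module names
hooked = model.visual.transformer.resblocks[LAYER_IDX].mlp.c_fc
handle = hooked.register_forward_hook(scale_adverb_neuron)
# ... run the fine-tuning loop with the hook active ...
handle.remove()
```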

Gradient ascent: erratic "CLIP opinion" about an image with the amplified "adverb neuron":

[Image: adverb-neuron-compared]
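
For those curious, the "CLIP opinion" texts come from the gradient ascent mentioned in footnote [1]; below is a minimal sketch of the idea, optimizing a soft text embedding against a fixed image embedding. It assumes OpenAI-CLIP attribute names, simplifies the EOT handling to the last position, omits the nearest-token decoding, and assumes the model is loaded in fp32 (e.g. model.float()).

```python
import torch
import torch.nn.functional as F

def encode_soft_text(model, soft_emb):
    # mirrors CLIP.encode_text, but starts from embeddings instead of token ids;
    # simplification: pools the last position instead of the EOT-token position
    x = soft_emb + model.positional_embedding
    x = x.permute(1, 0, 2)          # NLD -> LND
    x = model.transformer(x)
    x = x.permute(1, 0, 2)          # LND -> NLD
    x = model.ln_final(x)
    return x[:, -1, :] @ model.text_projection

def clip_opinion(model, image, steps=200, lr=0.05, device="cuda"):
    """Gradient ascent on a soft text embedding to maximize cosine similarity
    with the image embedding, i.e. "what CLIP thinks it sees".
    `image` is a preprocessed tensor, e.g. preprocess(pil).unsqueeze(0)."""
    with torch.no_grad():
        img_feat = F.normalize(model.encode_image(image.to(device)), dim=-1)

    n_ctx = model.positional_embedding.shape[0]       # 77 for CLIP, 248 for Long-CLIP
    emb_dim = model.token_embedding.embedding_dim
    soft = (0.01 * torch.randn(1, n_ctx, emb_dim, device=device)).requires_grad_(True)
    opt = torch.optim.Adam([soft], lr=lr)

    for _ in range(steps):
        txt_feat = F.normalize(encode_soft_text(model, soft), dim=-1)
        loss = -(img_feat * txt_feat).sum()           # maximize cosine similarity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return soft   # decoding these embeddings back to tokens is omitted here
```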

No need to respond, I know you're probably busy - just wanted to share a potentially interesting / relevant new result. =)
