
Geometric Parameterization (GmP) of MLP and fine-tuning on CoCo-40k-SPRIGHT (spatially-right long-labels) eliminates typographic attack vulnerability in Long-CLIP (but not short-CLIP 77 tokens) + improves ImageNet/ObjectNet accuracy. #29

zer0int opened this issue May 31, 2024 · 2 comments


zer0int commented May 31, 2024

Dear researchers,

I just wanted to let you know about some findings I made with your amazing Long-CLIP model: while ViT-L/14 (77 tokens) also shows partial mitigation of the typographic attack vulnerability when GmP-fine-tuned on CoCo-SPRIGHT, Long-CLIP appears to show full mitigation of the vulnerability when fine-tuned with GmP on the full long labels of CoCo-SPRIGHT.
Exception: residual vulnerability with non-English text, most likely because such text is underrepresented in the pre-training data.
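
For anyone wondering what I mean by GmP: below is a minimal, illustrative sketch of the idea, re-parameterizing each weight row of the MLP's linear layers into a radial magnitude r and an angular direction theta, so that W_row = r * theta / ||theta||. Module names and details here are simplified assumptions; the exact implementation is in my fork.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):
    """Illustrative GmP linear layer: each weight row is stored as a
    radial magnitude r and an angular direction theta, and recomposed
    as W_row = r * theta / ||theta|| on every forward pass, so magnitude
    and direction receive separate gradients during fine-tuning."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data                                        # (out_features, in_features)
        self.r = nn.Parameter(w.norm(dim=1, keepdim=True))            # radial component
        self.theta = nn.Parameter(w / w.norm(dim=1, keepdim=True))    # angular component (unit rows)
        self.bias = nn.Parameter(linear.bias.data.clone()) if linear.bias is not None else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = self.r * F.normalize(self.theta, dim=1)              # recompose the weight matrix
        return F.linear(x, weight, self.bias)

# e.g. wrap the MLP projections of a loaded CLIP vision block (OpenAI-CLIP layout assumed):
# block = model.visual.transformer.resblocks[i]
# block.mlp.c_fc = GeometricLinear(block.mlp.c_fc)
# block.mlp.c_proj = GeometricLinear(block.mlp.c_proj)
```

After fine-tuning, the layer can be collapsed back into an ordinary nn.Linear by recomposing the weight once, so inference code does not need to change.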

For the classic OpenAI examples of "apple ipod" and "piggy bank poodle", however, the attack is fully mitigated:

[Image: apple-ipod-demo]

[Image: poodle-demo]

Accuracy on ImageNet/ObjectNet MVT also improved due to the GmP fine-tune (the screenshot below shows accuracy scores for the original LongCLIP-L and for GmP-LongCLIP at epochs 1/2 and at the final epoch). Training took 10 epochs, about 3-4 hours on a single RTX 4090 at batch_size 34. I would assume results could be even better on a multi-GPU setup with a more typical batch size!

[Image: logit-good-clip-my-long]
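
For reference, the evaluation itself is plain zero-shot classification by cosine similarity; a rough sketch is below. The prompt template and dataset iteration are placeholders, and the actual ImageNet/ObjectNet MVT wrapper is in my fork.

```python
import torch
import clip  # Long-CLIP exposes the same encode_image / encode_text interface

@torch.no_grad()
def zero_shot_accuracy(model, preprocess, dataset, class_names, device="cuda"):
    # encode one prompt per class once, then classify each image by cosine similarity
    prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    text_feats = model.encode_text(prompts)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

    correct = total = 0
    for pil_image, label in dataset:                  # dataset yields (PIL image, class index)
        image = preprocess(pil_image).unsqueeze(0).to(device)
        img_feat = model.encode_image(image)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        pred = (img_feat @ text_feats.T).argmax(dim=-1).item()
        correct += int(pred == label)
        total += 1
    return correct / total
```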

You can find the full details and code for reproduction in my fork of your repo. Kind regards!

beichenzbc (Owner) commented

Thank you for your amazing findings! Your research passion and professionalism really impress us. We're glad that our work can contribute to the community.

zer0int commented Jun 20, 2024

Update: By scaling the activation value of an "adverb neuron" [1] in the ViT by a factor of 1000 during fine-tuning, LongCLIP adjusted and found a "better minimum", further boosting accuracy (same dataset as last time, CoCo-SPRIGHT-40k):

[Image: LongCLIP-eval-2]

Compare to OpenAI ViT-L/14, trained in the exact same manner (hyperparameters, epochs, activation manipulation), but with short labels:

[Image: eval-clip-gpt4-compare]

As anticipated, your model already outperformed the "short-CLIP" (77 tokens) for many classes, and for some of those classes my fine-tune could not improve it further, whereas I was able to boost the performance of OpenAI/CLIP for all classes.

[1] Adverb neuron: a feature (Layer 22, Feature 2432) that, when its activation value is scaled x1000, leads CLIP (via gradient ascent, i.e. optimizing text embeddings for cosine similarity with the image embedding) to describe images using "visually meaningless" adverbs. It is potentially associated with the model's "text obsession" (typographic attack vulnerability). Scaling this activation x1000 initially leads to exploding gradients; however, the model is surprisingly robust and compensates during later epochs, eventually finding a solution that appears superior.

[Image: activation-scaling-longclip-ft]
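
For concreteness, the activation scaling itself can be done with a simple forward hook on the chosen MLP sub-layer of the ViT; a minimal sketch follows, assuming OpenAI-CLIP module naming. Exactly which sub-layer of block 22 is hooked, and the tensor layout, are assumptions here; the real code is in my fork.

```python
import torch

LAYER_IDX = 22        # transformer block in the vision tower
FEATURE_IDX = 2432    # the "adverb neuron" feature index
SCALE = 1000.0

def scale_adverb_neuron(module, inputs, output):
    # scale a single feature of the activation; clone so autograd sees a new tensor
    out = output.clone()
    out[..., FEATURE_IDX] = out[..., FEATURE_IDX] * SCALE
    return out

# assumes `model` is a loaded (Long-)CLIP ViT-L/14 with OpenAI-style module names
hooked = model.visual.transformer.resblocks[LAYER_IDX].mlp.c_fc
handle = hooked.register_forward_hook(scale_adverb_neuron)
# ... run the fine-tuning loop with the hook active ...
handle.remove()
```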

Gradient ascent: erratic "CLIP opinion" about an image with the amplified "adverb neuron":

[Image: adverb-neuron-compared]
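
For those curious, the "CLIP opinion" texts come from the gradient ascent mentioned in footnote [1]; below is a minimal sketch of the idea, optimizing a soft text embedding against a fixed image embedding. It assumes OpenAI-CLIP attribute names, simplifies the EOT handling to the last position, omits the nearest-token decoding, and assumes the model is loaded in fp32 (e.g. model.float()).

```python
import torch
import torch.nn.functional as F

def encode_soft_text(model, soft_emb):
    # mirrors CLIP.encode_text, but starts from embeddings instead of token ids;
    # simplification: pools the last position instead of the EOT-token position
    x = soft_emb + model.positional_embedding
    x = x.permute(1, 0, 2)          # NLD -> LND
    x = model.transformer(x)
    x = x.permute(1, 0, 2)          # LND -> NLD
    x = model.ln_final(x)
    return x[:, -1, :] @ model.text_projection

def clip_opinion(model, image, steps=200, lr=0.05, device="cuda"):
    """Gradient ascent on a soft text embedding to maximize cosine similarity
    with the image embedding, i.e. "what CLIP thinks it sees".
    `image` is a preprocessed tensor, e.g. preprocess(pil).unsqueeze(0)."""
    with torch.no_grad():
        img_feat = F.normalize(model.encode_image(image.to(device)), dim=-1)

    n_ctx = model.positional_embedding.shape[0]       # 77 for CLIP, 248 for Long-CLIP
    emb_dim = model.token_embedding.embedding_dim
    soft = (0.01 * torch.randn(1, n_ctx, emb_dim, device=device)).requires_grad_(True)
    opt = torch.optim.Adam([soft], lr=lr)

    for _ in range(steps):
        txt_feat = F.normalize(encode_soft_text(model, soft), dim=-1)
        loss = -(img_feat * txt_feat).sum()           # maximize cosine similarity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return soft   # decoding these embeddings back to tokens is omitted here
```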

No need to respond, I know you're probably busy - just wanted to share a potentially interesting / relevant new result. =)
