# Investigation of Vision Models

Vision models learn "shortcuts": easily learnable features instead of the actual visual cues we care about. This hurts transferability, i.e. the models aren't practical to use in real-world scenarios. Some of these issues can be related to miscalibration, where a model's confidence in its predictions does not align with its actual accuracy. For example, when a model assigns 90% confidence to a set of predictions, are 90% of them actually correct?

This can be measured using Expected Calibration Error (ECE). It involves dividing the predicted probabilities into bins and calculating the accuracy and average predicted probability (confidence) within each bin. ECE is the weighted average of the absolute differences between these accuracies and confidences across all bins, reflecting how much predicted probabilities deviate from true likelihoods. For instance, if a model predicts a 70% chance of an event and the event happens 70% of the time, the model is well-calibrated. ECE helps identify models that give reliable probability estimates, which is crucial for applications where understanding uncertainty matters.
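Concretely, with B equal-width bins, ECE = Σ_b (n_b / N) · |acc(b) − conf(b)|, where n_b is the number of samples whose confidence falls in bin b. Below is a minimal NumPy sketch of this binned computation; the function and argument names are my own illustration, not code from this repo:

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """Binned ECE over max-softmax confidences.

    confidences: (N,) max predicted probability per sample
    predictions: (N,) predicted class indices
    labels:      (N,) ground-truth class indices
    """
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    correct = (predictions == labels).astype(float)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        prop_in_bin = in_bin.mean()  # n_b / N
        if prop_in_bin > 0:
            bin_acc = correct[in_bin].mean()       # accuracy within the bin
            bin_conf = confidences[in_bin].mean()  # average confidence within the bin
            ece += prop_in_bin * abs(bin_acc - bin_conf)
    return ece
```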

## Inspiration

This repo was inspired by ["ConvNet vs Transformer, Supervised vs CLIP: Beyond ImageNet Accuracy"](https://arxiv.org/pdf/2311.09215).

This paper deeply analyses vision models, contrasting conventional CNNs (ConvNeXt) with transformer-based architectures (ViT), and supervised training with CLIP (Contrastive Language-Image Pre-Training). The authors examine in depth which model is better at which particular task and which is more "transferable", i.e. has more real-world potential. They used ImageNet-1k and ImageNet-R to evaluate already pre-trained models. My goal with this repo is to recreate parts of this research paper while improving my understanding of ECE, ConvNeXt, ViT & CLIP. Interestingly, I got slightly different results compared to the paper.

## My Results

I used the ImageNet-1k validation set, as the authors would have, to get these results.
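A sketch of how such an evaluation might be wired up with timm and torchvision; the model name, dataset path, and batch size here are placeholder assumptions, not the exact checkpoints or settings from the paper:

```python
import timm
import torch
from timm.data import create_transform, resolve_data_config
from torchvision.datasets import ImageNet

# Placeholder checkpoint: any timm classification model works the same way.
model = timm.create_model("convnext_base", pretrained=True).eval()

# Use the preprocessing the checkpoint was trained with.
data_config = resolve_data_config({}, model=model)
transform = create_transform(**data_config)

val_set = ImageNet(root="/path/to/imagenet", split="val", transform=transform)
loader = torch.utils.data.DataLoader(val_set, batch_size=64, num_workers=4)

confs, preds, labels = [], [], []
with torch.no_grad():
    for x, y in loader:
        probs = model(x).softmax(dim=-1)
        conf, pred = probs.max(dim=-1)  # max-softmax confidence and prediction
        confs.append(conf)
        preds.append(pred)
        labels.append(y)

confs = torch.cat(confs).numpy()
preds = torch.cat(preds).numpy()
labels = torch.cat(labels).numpy()
# These arrays can then be passed to the ECE function sketched above.
```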

### ConvNeXt-sup

ECE = 0.09. This is vastly different from the paper, which reports an ECE of 0.0281; the accuracy in my test is also 1% lower. Why the difference? I used the exact architecture, training setup, and fine-tuned model as specified.


### ViT-sup

It is much better calibrated, outperforming what's reported in the reference, but its accuracy is lower.


### ConvNeXt-CLIP

Again, we see a large gap between my results and the paper's.


### ViT-CLIP

This time the ECE is an exact match, but the model's accuracy is substantially better.

