
Can't achieve the accuracy in the paper with cifar10 #3

Open
FiresWorker opened this issue Dec 2, 2020 · 37 comments

@FiresWorker

I use kNN classification as a monitor during training. As shown in Figure D.1 of the paper, the accuracy starts at about 60% and finally reaches about 90%. I can't reach this accuracy; I only get a very low accuracy with the parameters mentioned in the paper.

If anyone has achieved the results in the paper, I would very much appreciate you sharing some experimental details.

@PatrickHua
Owner

In appendix section D:
We do not use blur augmentation. The backbone is the CIFAR variant of ResNet-18 [19], followed by a 2-layer projection MLP. The outputs are 2048-d.
I already removed Gaussian blur for image sizes equal to or less than 32 (CIFAR-10). They also seem to have removed one layer of the projection MLP. You could try commenting out the second layer in projection_MLP.

I'm working on it. Also, could you show me the way you use KNN to monitor the training?
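For concreteness, here is one possible shape of that projection-head edit in PyTorch. The class layout and names below are illustrative assumptions, not the repo's exact projection_MLP code:

```python
import torch
import torch.nn as nn

class ProjectionMLP(nn.Module):
    """Illustrative SimSiam-style projection head. The ImageNet setup uses
    3 layers; the CIFAR appendix describes a 2-layer head, which amounts to
    dropping the middle block below."""
    def __init__(self, in_dim=512, hidden_dim=2048, out_dim=2048):
        super().__init__()
        self.layer1 = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(inplace=True),
        )
        # For CIFAR-10, "commenting out the second layer" means removing this:
        # self.layer2 = nn.Sequential(
        #     nn.Linear(hidden_dim, hidden_dim),
        #     nn.BatchNorm1d(hidden_dim),
        #     nn.ReLU(inplace=True),
        # )
        self.layer3 = nn.Sequential(
            nn.Linear(hidden_dim, out_dim),
            nn.BatchNorm1d(out_dim),
        )

    def forward(self, x):
        return self.layer3(self.layer1(x))
```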

@matthiasware

Hi, I reached 72% acc within 100 epochs, starting from 40% after the first epoch. I was using sklearn.neighbors.KNeighborsClassifier, but this does not really scale to larger datasets ;)

@PatrickHua
Owner

72% in 100 epochs? That's almost the same as the paper. I only got 80% accuracy when training for 800 epochs (see configs/cifar10_experiment.sh). What's your training configuration?

@FiresWorker
Author

Thank you for replying to my question.

https://colab.research.google.com/github/facebookresearch/moco/blob/colab-notebook/colab/moco_cifar10_demo.ipynb

I use the kNN classification from this notebook.
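For readers without the notebook at hand, the weighted kNN monitor there works roughly like this NumPy sketch. This is a simplified CPU version under assumed parameter names; the notebook itself operates on GPU tensors and a memory bank of frozen training-set features:

```python
import numpy as np

def knn_predict(feature, bank, labels, num_classes, k=200, t=0.1):
    """Weighted kNN in the style of the MoCo CIFAR-10 demo:
    cosine similarity between L2-normalised features, top-k neighbours,
    votes weighted by exp(similarity / temperature)."""
    feature = feature / np.linalg.norm(feature, axis=1, keepdims=True)
    bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sim = feature @ bank.T                        # (n_test, n_bank) cosine sims
    idx = np.argsort(-sim, axis=1)[:, :k]         # indices of top-k neighbours
    w = np.exp(np.take_along_axis(sim, idx, axis=1) / t)  # temperature-softened weights
    votes = np.zeros((feature.shape[0], num_classes))
    for c in range(num_classes):
        votes[:, c] = (w * (labels[idx] == c)).sum(axis=1)
    return votes.argmax(axis=1)
```

The monitor is then just: embed the train and test sets with the frozen backbone, use the train embeddings as `bank`, and report the accuracy of `knn_predict` on the test embeddings each epoch.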

@matthiasware

matthiasware commented Dec 7, 2020

batch_size=512, lr=0.03, backbone=resnet18, optimizer=SGD with cosine decay as in the paper, and 2 layers in the projection head.

What bothers me more is that I cannot get an accuracy higher than ~30% (±5%) on the linear evaluation, whereas in the paper they achieve 91.8%. I am really unsure how they achieved it! Different training setups and multiple runs do not seem to improve this result.

Does anyone have similar issues?

Also, the std is extremely unstable, unlike the results in the paper!


@FiresWorker
Author

I use the Gaussian blur augmentation and changed the projection head.

Running 200 epochs, I get 72% accuracy for kNN classification (up from 30%) and 74% accuracy for linear evaluation (up from 67%). With more epochs the results may improve further.

But in kNN classification, the accuracy first dropped from 30% to 28%, and then increased to 72%.

Maybe you can try a large base learning rate, like 30.0, and no weight decay.

@matthiasware

Thanks, it works!

@codergan

codergan commented Dec 9, 2020

Hi bro, how is your result now? Did you achieve 91%?

@PatrickHua
Owner

I fixed a small problem in the linear evaluation and it eventually gives 85%.

@matthiasware

My results for the following run on CIFAR10 with the parameters from the paper:

  • ACC (train set): 85%
  • ACC (test set): 83%

However, the average std over all channels is unstable! In 2 out of 10 runs it completely collapsed, so I am unsure about their claim that they successfully prevent collapsing!

@Asamisora

Hi, I ran cifar_experiment.sh and got a training loss of about -0.882, but an evaluation accuracy of only ~40% (the evaluation epochs were set to 100). Could you share the evaluation parameters?

@ahmdtaha

Thanks PatrickHua for your implementation.
I followed this issue because I wasn't able to achieve the 90+% performance reported on CIFAR10.
I think I figured out the core reason: the paper does not use the standard ResNet-18 for the CIFAR10 experiment. The paper states that "The backbone is the CIFAR variant of ResNet-18". Accordingly, a plain ResNet with [2, 2, 2, 2] blocks is not enough.

I am currently using this ResNet CIFAR variant [1]. Note that conv1 has 3x3, not 7x7, kernels. I also commented out the maxpool layer. Now my kNN accuracy reaches 89%.
I am training with batch size = 512 on a single GPU, so I use lr = 0.06 because the base lr = 0.03 (scaled linearly with the batch size).

[1] https://github.com/huyvnphan/PyTorch_CIFAR10/blob/master/cifar10_models/resnet.py

@Xiatian-Zhu

Great spot, pal. Could you please clarify which maxpool layer you commented out, and why? Thanks a lot.

@ahmdtaha

ahmdtaha commented Jan 5, 2021

There is a single maxpool layer :)
https://github.com/huyvnphan/PyTorch_CIFAR10/blob/24ac04fe10874b6d36116a83c8d42778df9ad65a/cifar10_models/resnet.py#L130

I commented out the maxpool layer because He et al. [1] state in section 4.2 that "The subsampling is performed by convolutions with a stride of 2".

[1] Deep Residual Learning for Image Recognition

@Xiatian-Zhu

Xiatian-Zhu commented Jan 5, 2021 via email

@ahmdtaha

ahmdtaha commented Jan 5, 2021

I didn't find an official version. I wish FB would share one.
I also tried the ResNet variant you mentioned, but I don't remember why I didn't end up using it -- I made a lot of changes while resolving this issue :)

@Xiatian-Zhu

Xiatian-Zhu commented Jan 5, 2021 via email

@Xiatian-Zhu

Xiatian-Zhu commented Jan 7, 2021 via email

@ahmdtaha

ahmdtaha commented Jan 7, 2021

My implementation is mostly inspired by PatrickHua's, so I felt bad about uploading it to my GitHub. But I guess PatrickHua's repository is already well recognized, so my implementation would not make much difference. I will clean up my version and upload it tomorrow.

@Xiatian-Zhu

Xiatian-Zhu commented Jan 7, 2021 via email

@PatrickHua
Owner

Don't feel bad lol. It's open source so you can do anything with my code! I'm also quite curious about your implementation haha.

@yaoweilee

Hey guys, I achieved 90.6% kNN acc on the CIFAR10 validation set. Basically, I tried everything mentioned above, including the cosine-similarity kNN and changing the model to the ResNet-18 CIFAR variant. In my experiments, the cosine-similarity-based kNN usually performs better than the L2-based kNN, with about a 2% boost. As for the network structure, the exact ResNet-18 CIFAR10 variant mentioned in the paper is too simple, with its 64-d feature output, so I simply did what @ahmdtaha did and it worked well.

Some of my implementation details are as follows:
optimizer: SGD, lr: 0.06, weight decay: 5e-4, momentum: 0.9, batch size: 512, max epochs: 800, warmup epochs: 2
kNN parameters: knn_k: 25, knn_t: 0.1
In addition, I used the cosine learning-rate schedule implemented in SwAV.

Hope it can help, cheers
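The schedule mentioned there (cosine decay with a short linear warmup, as in SwAV) can be sketched as follows; the function name and final_lr=0 are assumptions for illustration:

```python
import math

def lr_at(step, total_steps, warmup_steps, base_lr, final_lr=0.0):
    """Cosine learning-rate schedule with linear warmup: ramp linearly to
    base_lr over warmup_steps, then cosine-decay down to final_lr."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # linear warmup
    # cosine decay over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * progress))
```

In training code, the per-step value would typically be written into `optimizer.param_groups` before each iteration.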

@Xiatian-Zhu

Xiatian-Zhu commented Jan 8, 2021 via email

@ahmdtaha

ahmdtaha commented Jan 8, 2021

I uploaded my implementation here
It supports DistributedDataParallel. The pretrain_main.py should deliver 89.xx KNN accuracy out of the box.
I will keep an eye on the Github issues in case something is missing.

Thanks again PatrickHua for your implementation.

@Xiatian-Zhu

Xiatian-Zhu commented Jan 8, 2021 via email

@Xiatian-Zhu

Xiatian-Zhu commented Jan 8, 2021 via email

@Xiatian-Zhu

Xiatian-Zhu commented Jan 8, 2021 via email

@ahmdtaha

ahmdtaha commented Jan 8, 2021

SimSiam has two phases (see [1], Figure D.1):

  1. Pretraining a model in a self-supervised manner (no labels)
  2. Training a classifier in a supervised manner while freezing the model's weights (resnet weights)

In the first phase, we evaluate using a kNN classifier (a non-linear classifier). This is already uploaded and should give 89.xx accuracy (Figure D.1, left).
In the second phase, we evaluate the self-supervised representation (the frozen ResNet weights) using a linear classifier. This is not uploaded/cleaned up yet; it should give 91.xx accuracy (Figure D.1, right).

[1] Exploring Simple Siamese Representation Learning
P.S. I think it is better to move this discussion to my repository.
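A minimal sketch of that second phase in PyTorch. The helper name is hypothetical; the large learning rate and zero weight decay follow the suggestion earlier in this thread:

```python
import torch
import torch.nn as nn

def build_linear_eval(backbone, feat_dim=512, num_classes=10):
    """Linear-evaluation setup: freeze the pretrained backbone and train
    only a linear classifier on top of its frozen features."""
    for p in backbone.parameters():
        p.requires_grad = False
    backbone.eval()  # keep BatchNorm running statistics frozen too
    classifier = nn.Linear(feat_dim, num_classes)
    # Large base lr and no weight decay, as suggested above for linear eval
    optimizer = torch.optim.SGD(classifier.parameters(), lr=30.0,
                                momentum=0.9, weight_decay=0)
    return classifier, optimizer
```

The training loop then runs standard cross-entropy over `classifier(backbone(x))`, with gradients flowing only into the linear layer.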

@Xiatian-Zhu

Xiatian-Zhu commented Jan 8, 2021 via email

@PatrickHua
Owner

PatrickHua commented Jan 11, 2021

I changed the backbone to the CIFAR variant ahmdtaha proposed (#3 (comment)). The code now gives 90.8% linear evaluation accuracy out of the box.

@ahmdtaha

Great news @PatrickHua. Please keep this issue open; we still lag by 1%. Maybe someone can figure it out.

@taoyang1122

@PatrickHua @ahmdtaha Hi guys, I reproduced the results on ImageNet (67.8% Top-1 accuracy). You can take a look if interested. Link is here.

@Hzzone

Hzzone commented Apr 10, 2021

How many GPUs do you use for cifar10? I have tried 8 and 4 2080Ti GPUs and found that 4 GPUs performed more stably than 8. I cannot fit the training onto 1 GPU due to limited memory. This suggests that SimSiam and BYOL, which use no negative samples, benefit from BN with a large batch size: even though the authors say SimSiam has no need for a large batch size, one still has to fit a large batch on a single GPU. This is my guess, and it may not be right.

@ahmdtaha

@Hzzone what are your batch size and network architecture? I am not sure why you can't fit the training on a single GPU, especially for cifar10 (32x32).

BTW, when the FB guys talk about a large batch size, they mean 4096.

@Hzzone

Hzzone commented Apr 10, 2021

@ahmdtaha Sorry, my mistake. I used the same resnet18 as yours. I can train on a single 2080Ti GPU, though it runs much slower than with multiple GPUs (2 iter/s vs 7 iter/s). I will run more experiments to find out why SimSiam works less well in my case, with less training stability than BYOL.

@ahmdtaha

@Hzzone That makes more sense. BTW, keep an eye on the learning rate: it depends only on the global batch size, not on the number of GPUs. In my code, I use lr=0.06 with a batch size of 512, regardless of how many GPUs are used.

@Hzzone

Hzzone commented Apr 10, 2021

Thanks for your advice. I set my lr as suggested by the paper, i.e., lr = base_lr (0.03) × batch_size / 256.
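That scaling rule in code, for clarity (trivial, but easy to get wrong when multiple GPUs split the batch):

```python
def scaled_lr(base_lr=0.03, batch_size=512):
    """Linear lr scaling rule from the paper: lr = base_lr * batch_size / 256.
    Note it depends on the global batch size only, not the number of GPUs."""
    return base_lr * batch_size / 256
```

With the defaults above this gives the lr = 0.06 used earlier in the thread.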
