Implement the model that won the classification task of ImageNet 2013 #33
Conversation
Thanks for the model definition! Have you trained and evaluated this?
It took Caffe more than 9 days to train on the ImageNet dataset with an NVIDIA Tesla K20 GPU, which is quite a high-end device. Currently I only have access to a GTX 560 Ti, whose memory is not enough to hold the model parameters. Hopefully someone who is interested in reproducing the winning results and has the hardware resources will solve this problem. It would be even better to share the trained model on the Caffe website, as the author of Caffe has done.
I have done some initial tests, but the new net is 40% slower to train.

Sergio
Thank you for your effort, @sguada! What are the results of your initial tests? If it takes too much time, you may want to first train and test on a smaller dataset, such as a portion of the ImageNet classification task's dataset, to verify that the model works as Zeiler described. If its performance is worse than expected, some debugging will be necessary, such as checking whether or not conv2 performs group convolution. Maybe not all of the exact implementation details can be derived from the paper.
A couple of considerations (based on http://www.matthewzeiler.com/pubs/arxive2013/arxive2013.pdf):

- The ImageNet preprocessing proposed by the Caffe tutorial resizes to 256x256 without preserving aspect ratio; Zeiler rescales so the minimum dimension is 256 and then takes a 256x256 center crop.
- The conv2, conv4, and conv5 layers in the original model definition have 'group: 2' turned on. If I'm not mistaken, this is the sparse architecture Krizhevsky used because he split training over 2 GPUs. Zeiler mentions that he uses dense connections instead.
- Zeiler mentions the use of 224x224 crops, but that results in one fewer first-layer filter per dimension than reported in his paper.
- Zeiler also initializes all biases to 0, instead of the alternating 0 and 1 in different layers that Krizhevsky used.
- I haven't checked, but the padding for different layers may also need to be changed to match the layer dimensions reported in the paper.
- Most importantly, Zeiler mentions that they 'renormalize each filter in the convolutional layers whose RMS value exceeds a fixed radius of 10^-1 to this fixed radius' as key to preventing individual filters from dominating the first layer. I don't think this is implemented currently (in my understanding this is different from local contrast normalization layers, and instead normalizes the convolutional filter weights so they do not exceed some variance).
- Finally, the numbers reported in that paper are still significantly behind the actual winning system (and he mentions that the performance in the paper has been surpassed in the ILSVRC 2013 competition), so there are probably more tweaks that he made which are unpublished.

I have been training a network based on my interpretation of the above (without the convolutional filter RMS normalization), but it is very slow on my Tesla M2090. After about 3 weeks and 60 epochs, top-1 error for validation is about 41.5%, which is still higher than Krizhevsky's result; we will see if that improves much further.
Does anyone else have insight into the details of the conv RMS normalization?
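The preprocessing described above (rescale so the minimum dimension is 256, then take a 256x256 center crop) can be sketched in NumPy. `zeiler_preprocess` is a hypothetical helper using nearest-neighbor resizing for simplicity, not Caffe's actual data pipeline:

```python
import numpy as np

def zeiler_preprocess(img, out_size=256):
    """Resize so the shortest side equals out_size (nearest-neighbor
    for simplicity), then take a center crop of out_size x out_size."""
    h, w = img.shape[:2]
    scale = out_size / min(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    # nearest-neighbor resize via index mapping
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = img[rows][:, cols]
    # center crop to out_size x out_size
    top = (new_h - out_size) // 2
    left = (new_w - out_size) // 2
    return resized[top:top + out_size, left:left + out_size]
```

Unlike a direct 256x256 resize, this preserves aspect ratio, which is the difference being pointed out.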
@kloudkl I have been doing some small-scale training, I mean training for a few epochs and comparing the validation performance to the log we have of our training of the Krizhevsky network.

@SWu To fix the misalignment between the 224x224 crop and the first-layer filters reported by Zeiler in his paper, one could either pad 1 pixel (which is done automatically in cuda-convnet but not in Caffe) or just do a crop of 225x225 as @kloudkl suggested.

@SWu it seems to me that the top1 validation you are getting after 60 epochs is very low; as you can see in the figure below, the Krizhevsky network can achieve top1 validation error of 0.4058 in 20 epochs, 0.5529 in 40 epochs and 0.574 in 60 epochs. So @SWu I don't think your network is going to be able to improve much further after 60 epochs.

It is true that at this point Caffe doesn't have a layer to renormalize the filters as Zeiler described in his paper, so that could be the reason the performance is worse. We could try to add it, so if you want to work on this let me know.

What I have mostly been doing is adjusting the base_lr, gamma, weight_decay and stepsize of the solver to account for the change in batch size from 256 to 128 and to improve the speed of training. So far I'm not able to match the speed of training, but my second attempt is getting closer. See figure below.
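For reference, the mismatch discussed here falls out of the standard convolution output-size formula, output = floor((input + 2*pad - kernel) / stride) + 1. A quick check of the three options (224 crop, 225 crop, or padding 1) for the 7x7 stride-2 first layer:

```python
def conv_out(input_size, kernel, stride, pad=0):
    # standard convolution output-size formula
    return (input_size + 2 * pad - kernel) // stride + 1

# 7x7 stride-2 conv1, as in Zeiler's net:
print(conv_out(224, 7, 2))         # 109: one filter short per dimension
print(conv_out(225, 7, 2))         # 110: matches the paper's first-layer size
print(conv_out(224, 7, 2, pad=1))  # 110: padding 1 pixel also works
```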
@sguada When I say top1 validation error, I mean (1 - accuracy) reported by the accuracy layer. So it is actually fluctuating at around 58.5% accuracy, which is ~1% higher than your log of the Krizhevsky network at that point.
@SWu sorry, I misread your previous post. That is not bad then; according to Zeiler's paper his network got 38.4% top-1 validation error after training for 70 epochs, so it seems you are getting close. But if you haven't changed the learning rates, stepsize, and gamma that we have for the AlexNet in Caffe, I would not expect it to improve much after 60 epochs. Do you have a log file of your training? If you don't mind, could you share it? Or your prototxt files?
The diff of the prototxt: http://pastebin.com/M49MTupT Changes to the solver prototxt: By the way, am I correct in thinking that the 4,500,000 max_iter in imagenet_solver.prototxt is a typo and it should actually be 450,000? ImageNet has ~1,280,000 images, so with a batch size of 256, every 5000 iterations is an epoch, and 90 epochs would be 450,000. I don't have a log file since I'm printing directly to stderr, but I am very close to your numbers for 20 and 40 epochs, and ~1% higher for 60 epochs (actually, I haven't quite reached 60 yet, it's 75% through the 59th epoch).
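The epoch arithmetic above can be double-checked in a couple of lines:

```python
num_images = 1_280_000              # approximate ImageNet training set size
batch_size = 256
iters_per_epoch = num_images // batch_size   # 5000 iterations per epoch
max_iter = 90 * iters_per_epoch              # 90 epochs
print(iters_per_epoch, max_iter)             # 5000 450000
```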
@SWu yeah, there was a typo in imagenet_solver.prototxt; it should be 450,000, which represents 90 epochs. Before, I was also printing to stderr, but now I always redirect it to a file by adding "2> log.txt" so I can look at it and analyze it later. If you have been saving snapshots during training, you could test them later and see how the performance was changing. It is interesting that at 20 and 40 epochs the performance was close but you needed to wait until 60 epochs to see an improvement; I would have thought the improvement would be there from the beginning. The other thing I did to adjust for the batch size being halved was reducing the weight_decay by half, but I'm not sure how much that will affect the final performance.
max iteration no. is 450,000 (= 90 epochs) caught by @SWu #33 (comment)
@SWu did you finish training the network? Could you share your results?
Finally got my hands on a better GPU (a K40 :) ) and was able to retry some things, including fixing the LRN and MaxPool ordering and tweaking the padding. This gave ~59.95% validation accuracy after about 2 weeks. One observation with these new changes is that I start seeing better validation accuracy immediately, even in the first few epochs, compared to your logs. See the prototxt and validation log here: http://pastebin.com/hb2Tp3rd This still does not have the convolution re-normalization described in Zeiler's paper. Is someone working on that currently?
@SWu I cannot see anything in your link. Could you share your prototxt file again?
For filter renormalization, is this just a matter of dividing the coefficients by a term such that the L2 norm of each filter is constant across all filters at each point during training?
Could anyone update the status of this PR implementing the 2013 winning model?
Caffe is missing a single operation in the ZF net [1] for filter regularization:
If you add this filter RMS cap layer to Caffe, the model can be trained as described in the paper for public reference. [1] M. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. arXiv:1311.2901v3, 2013.
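As a rough NumPy illustration of the missing operation (not a Caffe layer, and `rms_cap` is a hypothetical helper name): rescale each convolutional filter whose RMS value exceeds the fixed radius of 0.1 back to that radius, per the sentence quoted from Zeiler's paper earlier in the thread. Note this caps the per-filter RMS rather than holding a fixed L2 norm:

```python
import numpy as np

def rms_cap(weights, radius=0.1):
    """weights: (num_filters, ...) array. Rescale each filter whose
    RMS value exceeds `radius` so that its RMS equals `radius`."""
    w = weights.reshape(weights.shape[0], -1)
    rms = np.sqrt((w ** 2).mean(axis=1))          # per-filter RMS value
    # scale factor: radius/rms for offending filters, 1 otherwise
    scale = np.where(rms > radius, radius / np.maximum(rms, 1e-12), 1.0)
    return weights * scale.reshape(-1, *([1] * (weights.ndim - 1)))
```

Applied to the weight blob after each solver update, this would emulate the constraint without touching filters already inside the radius.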
@shelhamer Thanks much Evan! I really appreciate the pointer.
Probably @Yangqing has some news from ImageNet 2014.
This is superseded by VGG's devil models from BMVC14, now in the model zoo and readied for use by #1138. Thanks @ksimonyan and VGG for sharing the models!
Changes relative to imagenet(_val/_deploy).prototxt: data cropsize 225; conv1 kernelsize 7, stride 2; conv2 group 1, stride 2.
This fixes #32.