0.94 AUC not reproducible #12
Comments
When it comes to AUC, we also experienced fluctuating results across our runs. The original paper proposes evaluating the linear average of predictions from an ensemble of 10 trained models. To create such an ensemble of trained models from the code of this repo, use the -lm parameter. To specify an ensemble, the model paths should be comma-separated or satisfy a regular expression. For example: -lm=./tmp/model-1,./tmp/model-2,./tmp/model-3
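To illustrate how such a model list could be expanded into concrete paths, here is a minimal Python sketch. The function name and the glob-style fallback are illustrative assumptions for the example, not the repo's actual argument handling in evaluate.py.

```python
import glob

def expand_model_paths(lm_value):
    """Expand an -lm style value into a list of model paths.

    Accepts either a comma-separated list of paths or a single
    glob-like pattern (hypothetical; the repo may use a regex instead).
    """
    if "," in lm_value:
        return [p.strip() for p in lm_value.split(",") if p.strip()]
    matches = sorted(glob.glob(lm_value))
    return matches if matches else [lm_value]

# Example: the comma-separated form from the comment above.
print(expand_model_paths("./tmp/model-1,./tmp/model-2,./tmp/model-3"))
```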
Do you believe running the image distribution again could lead to better results? Thank you for your answer!
Yes, in combination with applying a different seed for the distribution. In our study we did not redistribute though: we only distributed the images once with the default seed in the script, and all our models and the ensemble were created from training on that image distribution.
Sorry for the slow reply, but the maximum AUC I got in my tests was 0.90, using evaluate.py on EyePACS with the default seed. I suppose I could eventually reach 0.94 or higher, but I think it will be better to move to an ensemble model to get better results more quickly.
Can you please explain how you achieved similar results?
I trained about 10 models with the default seed and tried evaluate.py on the ones that got the highest AUC during the training cross-validation. One of the models, which had 0.77 AUC during cross-validation, gave me 0.90 AUC when run through evaluate.py. So I do not have any specific tip, just run more models until you get something good :P
@fbmoreira You are experiencing exactly what we experienced when we trained our models. Some are bad (around 0.70 AUC), most of them are OK-ish (~0.85 AUC), and some are better and exceed 0.90 AUC on evaluation. What we learned is that using all of these models together (both the bad and the better ones) in an ensemble always yields a better result.
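As a rough illustration of the linear-average ensemble mentioned above, the sketch below averages the per-image probabilities of several models and computes the AUC of the averaged scores. The array layout, the number of models, and the use of scikit-learn are assumptions made for the example; this is not the repo's actual evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ensemble_auc(per_model_probs, labels):
    """Compute the AUC of a linear-average ensemble.

    per_model_probs: array of shape (n_models, n_images) with each
    model's predicted probability of referable DR (hypothetical layout).
    labels: array of shape (n_images,) with the binary ground truth.
    """
    averaged = np.mean(per_model_probs, axis=0)  # linear average over models
    return roc_auc_score(labels, averaged)

# Toy example with 3 "models" and 4 images.
probs = np.array([[0.9, 0.2, 0.7, 0.4],
                  [0.8, 0.3, 0.6, 0.5],
                  [0.7, 0.1, 0.8, 0.3]])
labels = np.array([1, 0, 1, 0])
print(ensemble_auc(probs, labels))
```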
@fbmoreira Apologies for being dumb (still new to deep learning), but can you please explain what you mean by training different models? Did you use different neural network architectures or something else? Please explain.
@Sayyam-Jain When you run the training multiple times you get different models, even with the same architecture, because the weight initialization and augmentation are random. @fbmoreira NB: the random seed only controls how the images are distributed over the sets; it does not fix the network initialization.
I didn't say a thing about fixed initialization o_O I knew it was only for the dataset partition, since that is where the seed is used. Reading your code, it was clear to me that you initialized the Inception v3 model with ImageNet weights, and I assume the only (small) random part is the initialization of the top layer. I think your results omitting --only_gradable were better because the introduced noise might have helped the network generalize better, hence your higher AUC. Another thing that might help in the future is introducing Gaussian noise or salt-and-pepper noise as a form of augmentation, although hemorrhages and microaneurysms might be indistinguishable from the noise at that size, so I am not sure.
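To make the noise-augmentation suggestion concrete, here is a minimal sketch assuming NumPy arrays with pixel values scaled to [0, 1]; the noise levels are arbitrary illustrative values, not something the repo implements.

```python
import numpy as np

def add_gaussian_noise(image, stddev=0.02):
    """Add Gaussian noise to an image array scaled to [0, 1]."""
    noise = np.random.normal(loc=0.0, scale=stddev, size=image.shape)
    return np.clip(image + noise, 0.0, 1.0)

def add_salt_and_pepper(image, amount=0.01):
    """Set a small random fraction of pixels to 0 or 1 (salt-and-pepper)."""
    noisy = image.copy()
    mask = np.random.random(image.shape) < amount
    noisy[mask] = np.random.randint(0, 2, size=int(mask.sum())).astype(image.dtype)
    return noisy

# Toy usage on a random 512x512 grayscale "fundus" image in [0, 1].
img = np.random.random((512, 512)).astype(np.float32)
augmented = add_salt_and_pepper(add_gaussian_noise(img))
```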
Ah OK, excuse me for my misunderstanding! Regarding vertical flips in data augmentation: the objective of this project was to replicate the model and reproduce the results reported in this paper. Since the team in that paper did not vertically flip images in their augmentation, we did not either. Regarding your last point: it does seem likely that the noise in non-gradable images improves generalization and reduces the chance of overfitting. I am however not sure how large the effect is of training the network with wrong labels for those non-gradable images. I still lean towards using only gradable images and applying random data augmentation to them, but during our project we did not test whether this actually leads to better results.
Using the ensemble of pretrained models, I get an AUC of 0.91 on the test dataset rather than 0.95. I followed the instructions for downloading the dataset. Should I be getting 0.95? Does something need to be changed?
Hey @slala2121, just to confirm: did you download the models from https://figshare.com/articles/dataset/Trained_neural_network_models/8312183? Also, which TensorFlow and Python versions did you run with?
Following your README step by step and creating several models, only one model has achieved 0.76 AUC for EyePACS so far. It's not clear to me whether the reported AUC of 0.94 used a single model or an ensemble. I'll try an ensemble, but most of the models I am running end with ~0.52 AUC, which means they are likely not contributing much.
Are there any undocumented reasons for the code not to reproduce the paper results?
Maybe a different seed for the distribution of images into the folders?
I used the --only_gradable flag; it's also not clear whether your paper used all images or only the gradable ones. Thank you!