
Validation while training doesn't seem to load current model's weights #120

Closed
afm215 opened this issue Aug 23, 2023 · 4 comments
afm215 commented Aug 23, 2023

I noticed this issue while training an iresnet. At the end of training, the accuracies reported on the 5 validation datasets are all close to 0.5. I therefore added a print call at line 175 of train_val.py: print(f"{k} : {v}", flush=True). This showed that the model's performance is close to random at every epoch during the validation step. However, when running the evaluation after training (i.e. running main with --evaluate), the accuracies are much more coherent (i.e. much better).
I therefore think the current model's weights are not loaded correctly at the validation steps during training.


afm215 commented Sep 20, 2023

OK, problem solved. This was a direct consequence of #125; it was fixed by invoking srun correctly.
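Since the fix came down to invoking srun correctly, here is a minimal sketch of a SLURM batch script that launches one task per GPU through srun. The node/GPU counts and the main.py entry point are my assumptions, not taken from this repository; the key point is that the training script is started by srun rather than by calling python directly, so that SLURM exports the task-related environment variables to every process:

```shell
#!/bin/bash
# Hypothetical SLURM batch script; partition, counts, and the entry point
# are placeholders. Launching through srun ensures SLURM_NTASKS,
# SLURM_PROCID, and SLURM_LOCALID are set for each spawned process,
# instead of every process believing it is a single-task job.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4

srun python main.py
```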

afm215 closed this as completed Sep 20, 2023
@too-sheep

Hi, I'm having the same problem and would like to ask exactly how to use srun correctly.


afm215 commented Jun 20, 2024

Sure! Have you updated the get_world_size and get_local_rank functions as shown in #125?
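For anyone hitting this later, here is a minimal sketch of what such helpers might look like, reading the environment variables that srun exports. This is my assumption of the shape of the #125 fix, not the repository's actual code:

```python
import os


def get_world_size() -> int:
    # Under srun, SLURM_NTASKS holds the total number of launched tasks;
    # fall back to the generic WORLD_SIZE variable, then to 1 for
    # single-process runs.
    return int(os.environ.get("SLURM_NTASKS", os.environ.get("WORLD_SIZE", "1")))


def get_local_rank() -> int:
    # SLURM_LOCALID is the task's index on its own node; fall back to the
    # generic LOCAL_RANK variable, then to 0.
    return int(os.environ.get("SLURM_LOCALID", os.environ.get("LOCAL_RANK", "0")))
```

If the job is launched without srun, these variables are absent and every process reports world size 1 and rank 0, which would explain validation running on a replica that never received the trained weights.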

@too-sheep

Thanks for the quick reply, the problem is solved!
