
Validation while training doesn't seem to load current model's weights #120

Closed
afm215 opened this issue Aug 23, 2023 · 4 comments
afm215 commented Aug 23, 2023

I noticed this issue while training an iresnet. At the end of training, the accuracies reported on the 5 validation datasets are all close to 0.5. I therefore added a print call at line 175 of train_val.py: print(f"{k} : {v}", flush=True). This showed that the model's performance is close to random at every epoch during the validation step. However, when running the evaluation after training (i.e. running main with --evaluate), the accuracies are much more coherent (i.e. much better).
I therefore think the current model's weights are not loaded correctly at the validation steps during training.


afm215 commented Sep 20, 2023

OK, problem solved. This was a direct consequence of #125; it was fixed by invoking srun correctly.
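Since the fix came down to invoking srun correctly, here is a minimal sketch of a SLURM batch script that launches one task per GPU through srun. The node/GPU counts and the main.py entry point are my assumptions, not taken from this repository; the key point is that the training script is started by srun rather than by calling python directly, so that SLURM exports the task-related environment variables to every process:

```shell
#!/bin/bash
# Hypothetical SLURM batch script; partition, counts, and the entry point
# are placeholders. Launching through srun ensures SLURM_NTASKS,
# SLURM_PROCID, and SLURM_LOCALID are set for each spawned process,
# instead of every process believing it is a single-task job.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4

srun python main.py
```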

afm215 closed this as completed Sep 20, 2023
@too-sheep

Hi, I'm having the same problem and would like to ask exactly how to use srun correctly.


afm215 commented Jun 20, 2024

Sure! Have you updated the get_world_size and get_local_rank functions as shown in #125?
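For anyone hitting this later, here is a minimal sketch of what such helpers might look like, reading the environment variables that srun exports. This is my assumption of the shape of the #125 fix, not the repository's actual code:

```python
import os


def get_world_size() -> int:
    # Under srun, SLURM_NTASKS holds the total number of launched tasks;
    # fall back to the generic WORLD_SIZE variable, then to 1 for
    # single-process runs.
    return int(os.environ.get("SLURM_NTASKS", os.environ.get("WORLD_SIZE", "1")))


def get_local_rank() -> int:
    # SLURM_LOCALID is the task's index on its own node; fall back to the
    # generic LOCAL_RANK variable, then to 0.
    return int(os.environ.get("SLURM_LOCALID", os.environ.get("LOCAL_RANK", "0")))
```

If the job is launched without srun, these variables are absent and every process reports world size 1 and rank 0, which would explain validation running on a replica that never received the trained weights.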

@too-sheep

Thanks for the quick reply, the problem is solved!
