Alternating 1 bad / 1 good validation loss during training #37
Comments
Hi! Does this behavior occur across different random seeds too? E.g. if you restarted training with a different random seed, would you notice the same loss pattern again? Also, are the actual loss values (not just the pattern) different every time you reshuffle the data, or are they the same? Also, what does the train/val loss curve look like across all iterations? Would you mind sharing some loss plots from TensorBoard?
About TensorBoard: unfortunately I always trash the logs folder for archived trainings, but I can provide the CSV below for the latest training I did on my PC. Whenever there is a duplicate value for the same iteration, it's because there was a resume. Lastly, just for testing, I resumed the above training after changing the batch size from 4 to 2, and the issue is now much more evident. Here's the log (there's a resume at iteration 98922):
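A minimal sketch for eyeballing the alternation from such a CSV export; the file name and the column names `iteration` / `val_loss` are assumptions, adjust them to the actual export:

```python
# Minimal sketch: plot the exported validation losses to see the alternating pattern.
# "val_losses.csv" and the column names "iteration" / "val_loss" are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("val_losses.csv")
# Rows duplicated at the same iteration come from resumes; keep the last one.
df = df.drop_duplicates(subset="iteration", keep="last")

plt.plot(df["iteration"], df["val_loss"], marker="o")
plt.xlabel("iteration")
plt.ylabel("validation loss")
plt.title("Validation loss per cycle")
plt.show()
```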
Another strange thing: if I redo a validation on the checkpoint that reported a loss of 5.52, I get one around 5.8.
About the validation loss, I did some more tests: basically, the values reported during training can't be used as an exact metric to compare results.
So, looking at the values while training, the loss just appears to go up and down. Lastly, this behavior affects both the train and val loops, since I always see almost the same delta between train and val losses.
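One way to make checkpoint comparisons reproducible is to re-score each checkpoint in a standalone pass with a fixed seed and an unshuffled loader. The sketch below is generic PyTorch, not the VampNet training code; `build_model`, `build_val_dataset`, and `compute_loss` are placeholders for whatever the repo actually uses:

```python
# Generic sketch (not the VampNet code): re-score a checkpoint deterministically
# so validation numbers are comparable between runs.
import torch
from torch.utils.data import DataLoader

torch.manual_seed(0)                       # fix any stochastic parts (e.g. random masking)

model = build_model()                      # placeholder: construct the model
model.load_state_dict(torch.load("checkpoint.pth", map_location="cpu"))
model.eval()

loader = DataLoader(build_val_dataset(), batch_size=4, shuffle=False)

total, n_batches = 0.0, 0
with torch.no_grad():
    for batch in loader:
        loss = compute_loss(model, batch)  # placeholder: same loss used in training
        total += loss.item()
        n_batches += 1

print(f"mean val loss: {total / n_batches:.4f}")
```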
Hi Hugo,
I'm encountering a very strange behavior during training: it cycles through validations, giving one higher loss followed by a lower loss, and so on.
Below are the validation losses from the latest training, for example:
5.92
5.89
5.93
5.88
5.92
5.88
5.91
5.86
5.90
The next one will probably be "good".
Consider that the learning rate is fixed, as I'm using the rlrop (ReduceLROnPlateau) scheduler.
At first I thought there was something wrong with the AudioDataset shuffle from audiotools, so I disabled shuffle for the validation set and forced a reshuffle after each validation cycle, using the timestamp as the seed to make sure it is different for each cycle, but I still get this behavior alternating one good and one bad.
The dataset/train loss also follows this behavior no matter whether I reshuffle, so I'm wondering if there is something else I'm not aware of, or if there's something that doesn't work as expected during the shuffle.
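For reference, the reseed-per-cycle idea described above can be expressed in plain PyTorch like the sketch below; this is not the audiotools AudioLoader/AudioDataset API, and the names are placeholders:

```python
# Generic sketch of reshuffling the validation set each cycle with a timestamp seed.
# Plain PyTorch only; not the audiotools AudioLoader/AudioDataset API.
import time
import torch
from torch.utils.data import DataLoader, RandomSampler

def make_val_loader(val_dataset, batch_size=4):
    gen = torch.Generator()
    gen.manual_seed(int(time.time()))     # timestamp seed -> different order every cycle
    sampler = RandomSampler(val_dataset, generator=gen)
    return DataLoader(val_dataset, batch_size=batch_size, sampler=sampler)

# Rebuild the loader right before each validation pass so the order changes:
# val_loader = make_val_loader(val_dataset)
```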
In vampnet.yml I have the settings below:
AudioDataset.without_replacement: true
AudioLoader.shuffle: true
val/AudioLoader.shuffle: false
One training cycle is exactly 1 epoch (90k+ chunks).
What I have noticed from the console: the output says shuffle is True on the AudioLoader but False on the AudioDataset.
I don't know if that is related.
What else could I look for? It shouldn't behave like this, assuming the randomness of the provided training data.
thanks