-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Recovering from crashed run #74
Comments
Hey @versae! Glad to hear you're enjoying using these scripts! Hmmm, that's very interesting! The closest thing I've seen to that is when I greatly reduced the Do you have an example script I could use to replicate? I have a v3-8 sitting idle that I could use to emulate this behaviour! Utils for properly loading model weights and optimiser states from saved checkpoints are definitely two things that needs to be added! We can probably look to Dalle-mini for help on this: https://github.com/borisdayma/dalle-mini/blob/fc83bc9280772e475946a1b258fe10eba3e5ab8f/tools/train/train.py#L1131 |
Thanks for the quick reply (much quicker than mine now). I use the default Boris' restore state is probably the way to go. In my modified training script, I thought I could achieve the same by using |
Ah that's really frustrating! Sorry to hear it happened so late into training :/ What happens when you try to correct for the LR? Does the loss explode? Saving the optimiser states could help here (rather than re-initialising the momentum terms). |
Hi, thanks for these collection of scripts!
I've been trying to run your
run_flax_speech_recognition_ctc.py
on a single TPUv3-8 but after a few epochs I tend to always run out of memory (not sure if caused by memory leak or something). I also tried to recover from the last checkpoint by skipping the number of steps the model was last saved at, and setting the learning rate appropriately. I also tried modifyingMixedPrecisionTrainState.create()
to it starts at the last saved checkpoint step too. Nothing worked. As soon as it starts training, it runs out of memory again. Any idea of what could be happening?Thanks!
The text was updated successfully, but these errors were encountered: