KeyError: 'step' when resume from checkpoint #2923
Comments
For anyone who has a similar issue: I also found this loading of the internal step to be problematic. Specifically, after adding a try/except, I found that it succeeds on the master rank but not on the other ranks. This causes the ranks to fall out of sync, in my case with different amounts of gradient accumulation in the first step. Ultimately, this can result in a hang later on. |
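To make the rank mismatch described above easier to spot, here is a minimal sketch of a consistency check. It assumes each rank's restored step has already been collected into a dict keyed by rank (for example through a collective gather, which is not shown here). The helper name `out_of_sync_ranks` is hypothetical and not part of accelerate.

```python
def out_of_sync_ranks(steps_by_rank):
    """Return the ranks whose restored step differs from rank 0's step.

    Hypothetical helper for illustration; `steps_by_rank` maps
    rank -> step value restored on that rank.
    """
    reference = steps_by_rank[0]  # treat the master rank as the reference
    return sorted(rank for rank, step in steps_by_rank.items() if step != reference)


# Example: ranks 1 and 2 failed to restore the step and stayed at 0.
restored = {0: 120, 1: 0, 2: 0, 3: 120}
mismatched = out_of_sync_ranks(restored)
if mismatched:
    print(f"ranks out of sync with rank 0: {mismatched}")
```

Failing fast on a non-empty result right after loading is likely easier to debug than a hang several steps later.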
I'm having the same problem. I created a whole new python environment, used |
Thanks for reporting. As correctly stated, downgrading accelerate is the correct workaround. This was most likely caused by #2765. IMO it would be best if checkpoints were compatible between accelerate versions, so ideally there is a fix that makes the |
I had just built my environment, so I was running the newest accelerate version, 0.33.0. I saved a checkpoint with this version, but when trying to load it, it throws the KeyError for "step". I downgraded to 0.31 and it's totally fine now, but I thought I'd mention that even within the same version of accelerate there may be a slight issue. |
May I ask whether there is any plan to fix this issue? |
This worked for me, same scenario. |
System Info
Information
Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
My training script works fine with accelerate==0.23.0. With 0.32.0, resuming from a checkpoint (also saved with 0.32.0) raises this error:
"accelerate/accelerator.py", line 3147, in load_state
self.step = override_attributes["step"]
KeyError: 'step'
Expected behavior
I believe this line causes the error. In accelerate==0.23.0, there is no "step".
I hope to get suggestions for avoiding this bug, or to get it fixed.
I downgraded accelerate to 0.31.0 and it works.
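For illustration, the failing line in the traceback indexes the dict directly, which raises KeyError when the checkpoint has no "step" entry. A minimal sketch of a tolerant lookup is below. It is not the fix adopted by accelerate; the helper name `restore_step` and the `fallback_step` parameter are hypothetical, and the dicts stand in for whatever the checkpoint actually stores.

```python
def restore_step(override_attributes, fallback_step=0):
    """Return the saved step, or `fallback_step` if the checkpoint has none.

    Hypothetical helper, not part of accelerate: it only shows the
    difference between dict indexing and dict.get with a default.
    """
    return override_attributes.get("step", fallback_step)


checkpoint_without_step = {}            # stand-in for a checkpoint lacking "step"
checkpoint_with_step = {"step": 120}    # stand-in for a checkpoint that saved it

print(restore_step(checkpoint_without_step))
print(restore_step(checkpoint_with_step))
```

Whether falling back to a default is safe depends on the training loop. If only some ranks hit the fallback, they could still end up with different step values.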