KeyError: 'step' when resume from checkpoint #2923

Closed
2 of 4 tasks
kxhit opened this issue Jul 7, 2024 · 7 comments · Fixed by #2992

Comments

@kxhit

kxhit commented Jul 7, 2024

System Info

accelerate 0.32.0

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

My training script works fine with accelerate==0.23.0. With 0.32.0, resuming from a checkpoint (also saved by 0.32.0) fails with this error:

"accelerate/accelerator.py", line 3147, in load_state
self.step = override_attributes["step"]
KeyError: 'step'

Expected behavior

I believe this line causes the error; in accelerate==0.23.0 there is no "step".

I'd appreciate any suggestions for avoiding this bug, or a fix.

I downgraded my accelerate to 0.31.0 and it works.

@alexanderswerdlow

For anyone who has a similar issue: I also found loading the internal step to be problematic. Specifically, after wrapping the load in a try/except, I found that it succeeds on the master rank but not on the other ranks. This causes the ranks to fall out of sync, in my case with different amounts of gradient accumulation in the first step. Ultimately, this can result in a hang later on.
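To illustrate the desync described here, a minimal sketch in plain Python, with no distributed backend involved. The per-rank dictionaries and the helper names `load_step` and `ranks_in_sync` are hypothetical and only stand in for what each rank would see; this is not Accelerate's code.

```python
def load_step(override_attributes, fallback=0):
    """Per-rank load that swallows a missing "step" key (a try/except workaround)."""
    try:
        return override_attributes["step"]
    except KeyError:
        # The rank silently keeps its fallback value instead of failing.
        return fallback


def ranks_in_sync(steps_by_rank):
    """True when every rank restored the same step value."""
    return len(set(steps_by_rank)) == 1


# Hypothetical per-rank inputs: only rank 0 sees the "step" entry.
attributes_by_rank = [{"step": 7}, {}, {}, {}]
steps = [load_step(attrs) for attrs in attributes_by_rank]

print(steps)                 # [7, 0, 0, 0]
print(ranks_in_sync(steps))  # False
```

In a real multi-process run the per-rank values would have to be gathered first, for example with `torch.distributed.all_gather_object`, before a check like this could be made. Failing loudly when the ranks disagree is likely easier to debug than a hang later in training.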

@rahji

rahji commented Jul 24, 2024

I'm having the same problem. I created a whole new Python environment and used pip3 install --force-reinstall -v "accelerate==0.31.0" to install the older version (followed by datasets, torchvision, diffusers, and tensorboard, in my case). I was able to resume from a checkpoint at that point.
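A small sketch for checking, before a long run, whether the installed accelerate is one of the versions reported as failing in this thread. The set `REPORTED_FAILING` is an assumption based only on the reports here (0.32.0 and 0.33.0), and `parse_release` and `is_reported_failing` are hypothetical helpers, not part of any library.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string):
    """Turn '0.32.0' into (0, 32, 0); stops at the first non-numeric part."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


# Assumption: only the versions reported as failing in this thread.
REPORTED_FAILING = {(0, 32, 0), (0, 33, 0)}


def is_reported_failing(version_string):
    """True if the version matches one reported in this thread as failing."""
    return parse_release(version_string) in REPORTED_FAILING


try:
    installed = version("accelerate")
except PackageNotFoundError:
    installed = None  # accelerate is not installed in this environment

if installed is not None and is_reported_failing(installed):
    print(f"accelerate {installed} was reported in this thread to fail on resume")
```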

@BenjaminBossan
Member

Thanks for reporting. As correctly stated, downgrading accelerate is the correct workaround.

This was most likely caused by #2765. IMO it would be best if checkpoints were compatible between accelerate versions, so ideally the fix would make the step key optional. Let's see what @muellerzr thinks about this when he's back in the office.
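A minimal sketch of what an optional step key could look like, using a plain dict in place of the loaded checkpoint attributes. `TrainerState` and `apply_overrides` are hypothetical names for illustration, not Accelerate's API, and this is not necessarily how an actual fix would be implemented.

```python
class TrainerState:
    """Hypothetical stand-in for the object whose attributes are restored."""

    def __init__(self):
        self.step = 0


def apply_overrides(state, override_attributes):
    """Restore `step` if the checkpoint has it; otherwise keep the current value."""
    # dict.get avoids the KeyError raised by override_attributes["step"]
    # when the checkpoint carries no "step" entry.
    step = override_attributes.get("step")
    if step is not None:
        state.step = step
    return state


checkpoint_without_step = {}           # no "step" key
checkpoint_with_step = {"step": 120}   # has a "step" key

print(apply_overrides(TrainerState(), checkpoint_without_step).step)  # 0
print(apply_overrides(TrainerState(), checkpoint_with_step).step)     # 120
```

Falling back to the existing value keeps checkpoints without the key loadable, at the cost of silently resuming with a default step when the key is missing.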

@priyammaz

I just built my environment, so I was running the newest accelerate version, 0.33.0. I saved a checkpoint with this version, and when trying to load it, it throws the key error for "step". I downgraded to 0.31 and it's totally fine now, but I just thought I'd mention that even within the same version of accelerate there may be a slight issue.

@rbli-john

May I ask whether there is any plan to fix this issue?

@Cuberick-Orion

Cuberick-Orion commented Aug 12, 2024

> I just built my environment, so I was running the newest accelerate version, 0.33.0. I saved a checkpoint with this version, and when trying to load it, it throws the key error for "step". I downgraded to 0.31 and it's totally fine now, but I just thought I'd mention that even within the same version of accelerate there may be a slight issue.

This worked for me, same scenario.

@simonhessner

PRs #2992 and #2765 seem to deal with this issue and they have already been merged. As far as I can see they haven't been released in a new version yet.

Does anyone know when the next release will be published?
