KeyError: 'step' when resume from checkpoint #2923

kxhit · 2024-07-07T14:42:17Z

System Info

accelerate 0.32.0

Information

The official example scripts
My own modified scripts

Tasks

One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
My own task or dataset (give details below)

Reproduction

My training script works fine with accelerate==0.23.0. With 0.32.0, resuming from a checkpoint (also saved by 0.32.0) raises an error:

"accelerate/accelerator.py", line 3147, in load_state
    self.step = override_attributes["step"]
KeyError: 'step'

Expected behavior

I believe this line causes the error; in accelerate==0.23.0 there is no "step" key.

I would appreciate any suggestions for avoiding this bug, or a fix.

I downgraded my accelerate to 0.31.0 and it works.

alexanderswerdlow · 2024-07-16T18:22:36Z

For anyone who has a similar issue: I also found this loading of the internal step to be problematic. Specifically, after wrapping it in a try/catch, I found that it succeeds on the master rank but fails on the other ranks. This causes the ranks to fall out of sync, in my case with different amounts of gradient accumulation in the first step. Ultimately, this can result in a hang later on.
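
A rough sketch of how such a divergence could be detected after resuming. It assumes the per-rank step values have already been collected into a plain list indexed by rank (the collection itself is not shown), and `out_of_sync_ranks` is a hypothetical helper, not part of accelerate:

```python
def out_of_sync_ranks(steps_by_rank):
    """Return the ranks whose restored step differs from rank 0's step.

    `steps_by_rank` is assumed to be a list where index i holds the step
    value that rank i ended up with after loading the checkpoint.
    """
    if not steps_by_rank:
        return []
    reference = steps_by_rank[0]
    return [rank for rank, step in enumerate(steps_by_rank) if step != reference]


# Example: rank 0 restored step 4, the other ranks silently stayed at 0.
mismatched = out_of_sync_ranks([4, 0, 0])
if mismatched:
    print(f"ranks out of sync with rank 0: {mismatched}")
```

Failing fast on a non-empty result is probably preferable to continuing, since mismatched gradient accumulation is what leads to the later hang described above.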

rahji · 2024-07-24T21:19:11Z

I'm having the same problem. I created a whole new Python environment and used pip3 install --force-reinstall -v "accelerate==0.31.0" to install the older version (followed by datasets, torchvision, diffusers, and tensorboard, in my case). After that, I was able to resume from a checkpoint.

BenjaminBossan · 2024-07-25T09:29:51Z

Thanks for reporting. As correctly stated, downgrading accelerate is the correct workaround.

This was most likely caused by #2765. IMO it would be best if checkpoints were compatible between accelerate versions, so ideally there is a fix that makes the step key optional. Let's see what @muellerzr thinks about this when he's back in the office.
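
The idea of an optional step key can be illustrated with a minimal sketch. This is not accelerate's actual code; `apply_overrides` and the plain-dict `state` are hypothetical stand-ins for whatever `load_state` does with `override_attributes`:

```python
def apply_overrides(state, override_attributes):
    """Apply checkpoint overrides to `state`, tolerating a missing "step".

    Both arguments are plain dicts here for illustration only.
    """
    # Checkpoints that carry no "step" entry leave the current step
    # untouched instead of raising KeyError.
    if "step" in override_attributes:
        state["step"] = override_attributes["step"]
    return state


# A checkpoint without "step" loads cleanly and keeps the existing value.
print(apply_overrides({"step": 0}, {}))
# A checkpoint with "step" restores it.
print(apply_overrides({"step": 0}, {"step": 5}))
```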

priyammaz · 2024-07-29T17:29:06Z

I just built my environment, so I was running the newest accelerate version, 0.33.0. I saved a checkpoint with this version, and when trying to load it, it throws the KeyError for "step". I downgraded to 0.31 and it's totally fine now, but I just thought I'd mention that even within the same version of accelerate there may be a slight issue.
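
For scripts that need to resume reliably, a small guard can flag the versions reported as failing in this thread. This is only a sketch: the set below lists just the versions commenters named (0.32.0 and 0.33.0), other releases may or may not be affected, and the helper names are made up for illustration:

```python
from importlib.metadata import PackageNotFoundError, version

# Versions that commenters in this thread reported as failing on resume.
# This is not an authoritative list of affected releases.
REPORTED_AFFECTED = {"0.32.0", "0.33.0"}


def is_reported_affected(installed):
    """True if `installed` is one of the versions reported in this thread."""
    return installed in REPORTED_AFFECTED


def check_accelerate():
    """Return True/False for the installed accelerate, or None if absent."""
    try:
        installed = version("accelerate")
    except PackageNotFoundError:
        return None
    return is_reported_affected(installed)


if check_accelerate():
    print("Installed accelerate was reported to hit KeyError: 'step' on resume.")
```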

rbli-john · 2024-08-06T07:28:21Z

May I ask whether there is any plan to fix this issue?

Cuberick-Orion · 2024-08-12T07:35:46Z

> I just built my environment, so I was running the newest accelerate version, 0.33.0. I saved a checkpoint with this version, and when trying to load it, it throws the KeyError for "step". I downgraded to 0.31 and it's totally fine now, but I just thought I'd mention that even within the same version of accelerate there may be a slight issue.

This worked for me, same scenario.

simonhessner · 2024-08-22T17:23:59Z

PRs #2992 and #2765 seem to deal with this issue, and they have already been merged. As far as I can see, they haven't been released in a new version yet.

Does anyone know when the next release will be published?

breengles mentioned this issue Jul 9, 2024

State saved with previous version of accelerate does not have such key in overrides #2924

Closed

tolgacangoz mentioned this issue Jul 10, 2024

SD3 16GB training error. huggingface/diffusers#8828

Closed

muellerzr mentioned this issue Aug 6, 2024

Explicit check for step when loading the state #2992

Merged

muellerzr closed this as completed in #2992 Aug 6, 2024

tristanwqy mentioned this issue Aug 30, 2024

Failed to resume from state kohya-ss/sd-scripts#1524

Open
