Failed to resume from state #1524

Open
tristanwqy opened this issue Aug 29, 2024 · 3 comments

@tristanwqy

load train state from /home/ubuntu/MyFiles/xiaodi/training/output/flux/sd-scripts/flux_lora_200k/flux_lora_200k-step00002250-state/train_state.json: {'current_epoch': 1, 'current_step': 2250}
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ubuntu/sd-scripts/flux_train_network.py", line 411, in <module>
[rank0]:     trainer.train(args)
[rank0]:   File "/home/ubuntu/sd-scripts/train_network.py", line 663, in train
[rank0]:     train_util.resume_from_local_or_hf_if_specified(accelerator, args)
[rank0]:   File "/home/ubuntu/sd-scripts/library/train_util.py", line 4260, in resume_from_local_or_hf_if_specified
[rank0]:     accelerator.load_state(args.resume)
[rank0]:   File "/home/ubuntu/miniconda3/envs/sd-scripts/lib/python3.11/site-packages/accelerate/accelerator.py", line 3145, in load_state
[rank0]:     override_attributes = load_accelerator_state(
[rank0]:                           ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ubuntu/miniconda3/envs/sd-scripts/lib/python3.11/site-packages/accelerate/checkpointing.py", line 208, in load_accelerator_state
[rank0]:     load_model(model, input_model_file, device=str(map_location), **load_model_func_kwargs)
[rank0]: TypeError: load_model() got an unexpected keyword argument 'device'

This is caused by the outdated pin safetensors==0.4.2 in requirements.txt. safetensors added the device argument to load_model() in commit huggingface/safetensors@ff643a8, so the accelerate call above fails against the older release. Upgrading to safetensors==0.4.4 solved the problem.
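To confirm this diagnosis without running a full resume, one option is to inspect the signature of the installed load_model. The following is a minimal sketch, not part of sd-scripts or accelerate: the helper names accepts_keyword and safetensors_load_model_accepts_device are hypothetical, and it assumes load_model is importable from safetensors.torch, as the traceback suggests.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if func can be called with keyword argument `name`."""
    params = inspect.signature(func).parameters
    param = params.get(name)
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    if param is not None and param.kind in keyword_kinds:
        return True
    # A **kwargs catch-all also accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def safetensors_load_model_accepts_device():
    """Check the installed safetensors; None means it could not be imported."""
    try:
        # Assumption: load_model lives in safetensors.torch (hypothetical check).
        from safetensors.torch import load_model
    except ImportError:
        return None
    return accepts_keyword(load_model, "device")


if __name__ == "__main__":
    print("load_model accepts device:", safetensors_load_model_accepts_device())
```

If this prints False, the installed safetensors most likely predates the commit mentioned above, and the TypeError in the first traceback would be expected.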

Another error that occurred is:

[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/ubuntu/sd-scripts/flux_train_network.py", line 411, in <module>
[rank2]:     trainer.train(args)
[rank2]:   File "/home/ubuntu/sd-scripts/train_network.py", line 663, in train
[rank2]:     train_util.resume_from_local_or_hf_if_specified(accelerator, args)
[rank2]:   File "/home/ubuntu/sd-scripts/library/train_util.py", line 4260, in resume_from_local_or_hf_if_specified
[rank2]:     accelerator.load_state(args.resume)
[rank2]:   File "/home/ubuntu/miniconda3/envs/sd-scripts/lib/python3.11/site-packages/accelerate/accelerator.py", line 3156, in load_state
[rank2]:     self.step = override_attributes["step"]

Downgrading accelerate to accelerate==0.31.0 solved this problem.
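For anyone comparing their environment against the versions mentioned in this report, a small check along these lines may help. It is only a sketch using the standard library: the REPORTED_WORKING table just restates the two versions named in this thread (they are not verified here), and the helper names version_tuple, installed_versions, and report are hypothetical.

```python
import re
from importlib import metadata

# Versions the reporter said worked; copied from this thread, not verified.
REPORTED_WORKING = {"safetensors": "0.4.4", "accelerate": "0.31.0"}


def version_tuple(text):
    """Turn '0.31.0' into (0, 31, 0), padding short versions with zeros."""
    nums = [int(n) for n in re.findall(r"\d+", text)[:3]]
    return tuple(nums + [0] * (3 - len(nums)))


def installed_versions(names):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def report(found):
    """Return one human-readable line per package in REPORTED_WORKING."""
    lines = []
    for name, wanted in REPORTED_WORKING.items():
        have = found.get(name)
        if have is None:
            lines.append(f"{name}: not installed")
        elif version_tuple(have) == version_tuple(wanted):
            lines.append(f"{name}: {have} (matches the version reported to work)")
        else:
            lines.append(f"{name}: {have} (thread reports {wanted} working)")
    return lines


if __name__ == "__main__":
    for line in report(installed_versions(REPORTED_WORKING)):
        print(line)
```

A mismatch does not by itself mean the resume will fail; it only shows where the environment differs from the one in which the workaround was observed.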

@kohya-ss
Owner

kohya-ss commented Aug 29, 2024

Thank you! I've updated requirements.txt for the first issue. I will investigate the 2nd issue.

@kohya-ss kohya-ss reopened this Aug 29, 2024
@kohya-ss
Owner

The second issue doesn't seem to happen to me, any idea what could be causing it?

@tristanwqy
Author

tristanwqy commented Aug 30, 2024

> The second issue doesn't seem to happen to me, any idea what could be causing it?

It is caused by the accelerate version. I fixed it by downgrading to accelerate==0.31.0, following an issue in the accelerate library: huggingface/accelerate#2923
