Failed to resume from state #1524

tristanwqy · 2024-08-29T02:59:01Z

load train state from /home/ubuntu/MyFiles/xiaodi/training/output/flux/sd-scripts/flux_lora_200k/flux_lora_200k-step00002250-state/train_state.json: {'current_epoch': 1, 'current_step': 2250}
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ubuntu/sd-scripts/flux_train_network.py", line 411, in <module>
[rank0]:     trainer.train(args)
[rank0]:   File "/home/ubuntu/sd-scripts/train_network.py", line 663, in train
[rank0]:     train_util.resume_from_local_or_hf_if_specified(accelerator, args)
[rank0]:   File "/home/ubuntu/sd-scripts/library/train_util.py", line 4260, in resume_from_local_or_hf_if_specified
[rank0]:     accelerator.load_state(args.resume)
[rank0]:   File "/home/ubuntu/miniconda3/envs/sd-scripts/lib/python3.11/site-packages/accelerate/accelerator.py", line 3145, in load_state
[rank0]:     override_attributes = load_accelerator_state(
[rank0]:                           ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/ubuntu/miniconda3/envs/sd-scripts/lib/python3.11/site-packages/accelerate/checkpointing.py", line 208, in load_accelerator_state
[rank0]:     load_model(model, input_model_file, device=str(map_location), **load_model_func_kwargs)
[rank0]: TypeError: load_model() got an unexpected keyword argument 'device'

This is caused by the outdated pin safetensors==0.4.2 in requirements.txt. safetensors added this argument (device) to load_model() in commit huggingface/safetensors@ff643a8, so upgrading to safetensors==0.4.4 solved the problem.

Another error that occurred is:
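For anyone who wants to confirm this before resuming, here is a minimal sketch that inspects whether the installed load_model() accepts a device keyword. The helper name accepts_keyword is hypothetical and made up for illustration; it is not part of safetensors or accelerate. It assumes safetensors.torch.load_model is importable in your environment and simply reports if it is not.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with the keyword argument `name`.

    Hypothetical helper for illustration; not part of any library.
    """
    params = inspect.signature(func).parameters
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    # An explicitly named parameter that may be passed by keyword.
    if name in params and params[name].kind in keyword_kinds:
        return True
    # A **kwargs catch-all also accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


if __name__ == "__main__":
    try:
        from safetensors.torch import load_model
    except ImportError:
        print("safetensors (or torch) is not installed in this environment")
    else:
        print("load_model accepts 'device':", accepts_keyword(load_model, "device"))
```

If this reports False, the installed safetensors is probably older than the release that added the argument, which would match the TypeError above.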

[rank2]: Traceback (most recent call last):
[rank2]:   File "/home/ubuntu/sd-scripts/flux_train_network.py", line 411, in <module>
[rank2]:     trainer.train(args)
[rank2]:   File "/home/ubuntu/sd-scripts/train_network.py", line 663, in train
[rank2]:     train_util.resume_from_local_or_hf_if_specified(accelerator, args)
[rank2]:   File "/home/ubuntu/sd-scripts/library/train_util.py", line 4260, in resume_from_local_or_hf_if_specified
[rank2]:     accelerator.load_state(args.resume)
[rank2]:   File "/home/ubuntu/miniconda3/envs/sd-scripts/lib/python3.11/site-packages/accelerate/accelerator.py", line 3156, in load_state
[rank2]:     self.step = override_attributes["step"]

Downgrading accelerate to accelerate==0.31.0 solved this problem.

kohya-ss · 2024-08-29T13:26:57Z

Thank you! I've updated requirements.txt for the first issue. I will investigate the 2nd issue.

kohya-ss · 2024-08-29T13:30:22Z

The second issue doesn't seem to happen to me, any idea what could be causing it?

tristanwqy · 2024-08-30T01:54:31Z

> The second issue doesn't seem to happen to me, any idea what could be causing it?

It is the version of accelerate. I fixed it by downgrading to accelerate==0.31.0, following an issue in the accelerate library: huggingface/accelerate#2923
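As a rough way to check an environment against the two versions reported in this thread, the sketch below reads the installed versions with importlib.metadata and lists any mismatch. The function names (version_tuple, installed_version, report) are hypothetical. The thresholds only encode what was reported here (safetensors 0.4.4 fixed the first error, accelerate 0.31.0 avoided the second); other accelerate versions may also work, since the second error was not reproduced by everyone.

```python
import re
from importlib import metadata


def version_tuple(text):
    """Parse the leading numeric release part, e.g. '0.4.4.dev0' -> (0, 4, 4)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"cannot parse version: {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(installed):
    """Compare a {package: version-or-None} mapping with the versions from this thread."""
    notes = []
    st = installed.get("safetensors")
    if st is not None and version_tuple(st) < (0, 4, 4):
        notes.append(f"safetensors {st} is older than 0.4.4 (first error reported)")
    acc = installed.get("accelerate")
    # Assumption: only 0.31.0 was reported as working in this thread.
    if acc is not None and version_tuple(acc) != (0, 31, 0):
        notes.append(f"accelerate {acc} differs from 0.31.0 (second error reported)")
    return notes


if __name__ == "__main__":
    found = {name: installed_version(name) for name in ("safetensors", "accelerate")}
    for line in report(found) or ["no mismatch with the versions from this thread"]:
        print(line)
```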

kohya-ss added a commit that referenced this issue Aug 29, 2024

Update safetensors to version 0.4.4 in requirements.txt #1524

8fdfd8c

kohya-ss closed this as completed Aug 29, 2024

kohya-ss reopened this Aug 29, 2024
