SD3 16GB training error. #8828

paoloski97 · 2024-07-10T13:44:29Z

Describe the bug

When I run the script "train_dreambooth_lora_sd3_miniature.py" with the "--resume_from_checkpoint" argument, it fails with the following error:

Traceback (most recent call last):
  File "/kaggle/working/./diffusers/examples/research_projects/sd3_lora_colab/train_dreambooth_lora_sd3_miniature.py", line 1150, in <module>
    main(args)
  File "/kaggle/working/./diffusers/examples/research_projects/sd3_lora_colab/train_dreambooth_lora_sd3_miniature.py", line 934, in main
    accelerator.load_state(os.path.join(args.output_dir, path))
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 3147, in load_state
    self.step = override_attributes["step"]
KeyError: 'step'

Reproduction

`!accelerate launch ./diffusers/examples/research_projects/sd3_lora_colab/train_dreambooth_lora_sd3_miniature.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-3-medium-diffusers" \
  --instance_data_dir="dataset" \
  --data_df_path="./metadata/parquet/sample_embeddings.parquet" \
  --output_dir="Output" \
  --mixed_precision="fp16" \
  --instance_prompt="the_instace_prompt" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 --gradient_checkpointing \
  --checkpointing_steps=150 \
  --max_train_steps=1200 \
  --use_8bit_adam \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --resume_from_checkpoint='checkpoint-900' \
  --seed="2" \
  --rank=64 \
  --report_to='wandb'`

Logs

Traceback (most recent call last):
  File "/kaggle/working/./diffusers/examples/research_projects/sd3_lora_colab/train_dreambooth_lora_sd3_miniature.py", line 1150, in <module>
    main(args)
  File "/kaggle/working/./diffusers/examples/research_projects/sd3_lora_colab/train_dreambooth_lora_sd3_miniature.py", line 934, in main
    accelerator.load_state(os.path.join(args.output_dir, path))
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 3147, in load_state
    self.step = override_attributes["step"]
KeyError: 'step'

System Info

🤗 Diffusers version: 0.30.0.dev0
Platform: Linux-5.15.154+-x86_64-with-glibc2.31
Running on a notebook?: Yes
Running on Google Colab?: No
Python version: 3.10.13
PyTorch version (GPU?): 2.1.2 (True)
Flax version (CPU?/GPU?/TPU?): 0.8.4 (gpu)
Jax version: 0.4.26
JaxLib version: 0.4.26.dev20240504
Huggingface_hub version: 0.23.2
Transformers version: 4.42.3
Accelerate version: 0.32.1
PEFT version: 0.11.1
Bitsandbytes version: 0.43.1
Safetensors version: 0.4.3
xFormers version: not installed
Accelerator: Tesla T4, 15360 MiB
Tesla T4, 15360 MiB VRAM
Using GPU in script?: 2 GPUs
Using distributed or parallel set-up in script?:

Who can help?

No response


tolgacangoz · 2024-07-10T13:56:50Z

It seems that this is an accelerate-related issue. Could you try downgrading accelerate for now?

paoloski97 · 2024-07-10T18:49:23Z

Sure, do I downgrade to the previous version?

tolgacangoz · 2024-07-10T19:34:37Z

The link I shared says 0.31.0 seems OK.

paoloski97 · 2024-07-11T06:05:17Z

I tried and now it works, thank you.
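
For anyone landing here with the same traceback, a quick way to see which accelerate version a notebook is actually running is sketched below. It only encodes the two versions mentioned in this thread (0.32.1 from the System Info, 0.31.0 from the reply above); the helper names `thread_status` and `installed_accelerate_version` are made up for illustration and are not part of accelerate or diffusers.

```python
from importlib.metadata import PackageNotFoundError, version

# Only the two versions discussed in this thread; any other version is
# simply unknown here, not implied to be good or bad.
REPORTED = {
    "0.32.1": "reported failing (KeyError: 'step' in load_state)",
    "0.31.0": "reported working after downgrade",
}


def thread_status(installed: str) -> str:
    """Return what this thread reports for a given accelerate version."""
    return REPORTED.get(installed, "not covered by this thread")


def installed_accelerate_version():
    """Return the installed accelerate version string, or None if absent."""
    try:
        return version("accelerate")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_accelerate_version()
    if found is None:
        print("accelerate is not installed in this environment")
    else:
        print(f"accelerate {found}: {thread_status(found)}")
```

If this prints 0.32.1, pinning with `pip install accelerate==0.31.0` (prefixed with `!` in a notebook cell) and restarting the kernel before relaunching is the downgrade that worked in this thread.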

paoloski97 added the bug label Jul 10, 2024

paoloski97 closed this as completed Jul 11, 2024
