W&B error when running with 2 GPUs #304
Comments
I've added a tentative fix which only calls that function on the main process.
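For reference, a minimal sketch of that kind of guard under Hugging Face Accelerate; the function and variable names are illustrative, not the actual SimpleTuner code:

# Minimal sketch of the guard described above (illustrative names, not the
# actual SimpleTuner code): only the main process has called wandb.init(),
# so the scatter-plot logging is skipped on every other rank.
import wandb
from accelerate import Accelerator

accelerator = Accelerator(log_with="wandb")
accelerator.init_trackers("simpletuner-demo")  # wandb.init() runs on the main process only

def log_timestep_scatter(logs, timesteps, losses):
    # Guard W&B-specific calls so non-main ranks never touch the wandb API.
    if accelerator.is_main_process:
        table = wandb.Table(data=[[t, l] for t, l in zip(timesteps, losses)],
                            columns=["timestep", "loss"])
        logs["timesteps_scatter"] = wandb.plot.scatter(table, "timestep", "loss")
    return logs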
There is another error right after the first validation: "...alidation... /root/SimpleTuner/train_sdxl.py FAILED ... Failures: ... Root Cause (first observed failure): ..."
I have never seen that error before; it's new to me. Unfortunately, I'm not able to test multi-GPU issues or reproduce them readily. You could try disabling validations and see if it gets further.
OK, understood.
Closing it for now.
Crashes at the first step; it looks like the second process throws this error:
Traceback (most recent call last):
File "/root/SimpleTuner/train_sdxl.py", line 1523, in
main()
File "/root/SimpleTuner/train_sdxl.py", line 1261, in main
logs["timesteps_scatter"] = wandb.plot.scatter(
File "/root/miniconda3/lib/python3.10/site-packages/wandb/plot/scatter.py", line 26, in scatter
return wandb.plot_table(
File "/root/miniconda3/lib/python3.10/site-packages/wandb/sdk/lib/preinit.py", line 36, in preinit_wrapper
raise wandb.Error(f"You must call wandb.init() before {name}()")
wandb.errors.Error: You must call wandb.init() before wandb.plot_table()
Epoch 1/462, Steps: 0%| | 1/6000 [00:11<15:45:28, 9.46s/it, lr=1.67e-8, step_loss=0.126][2024-02-08 09:58:40,980] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1314 closing signal SIGTERM
[2024-02-08 09:58:41,845] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 1315) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
File "/root/miniconda3/bin/accelerate", line 8, in
sys.exit(main())
File "/root/miniconda3/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/root/miniconda3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 970, in launch_command
multi_gpu_launcher(args)
File "/root/miniconda3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
distrib_run.run(args)
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/root/SimpleTuner/train_sdxl.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-02-08_09:58:40
host : autodl-container-824b40a131-cc439d6f
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1315)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
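For context, this is the usual multi-process pattern behind the error: Accelerate's init_trackers() calls wandb.init() only on the main process, so an unguarded wandb.plot call on rank 1 hits the pre-init check. A rough sketch of that failure mode (illustrative only, not the actual training script):

# Rough reproduction sketch, assuming the standard Accelerate + W&B setup
# (illustrative only, not SimpleTuner's code).
import wandb
from accelerate import Accelerator

accelerator = Accelerator(log_with="wandb")
accelerator.init_trackers("repro")  # wandb.init() happens on the main process only

table = wandb.Table(data=[[0, 0.1], [500, 0.2]], columns=["timestep", "loss"])
# On any non-main rank this raises:
#   wandb.errors.Error: You must call wandb.init() before wandb.plot_table()
scatter = wandb.plot.scatter(table, "timestep", "loss")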