
w&b error when running with 2 GPUs #304

Closed
zhuliyi0 opened this issue Feb 8, 2024 · 5 comments
Labels
bug (Something isn't working), pending (This issue has a fix that is awaiting test results)

Comments

zhuliyi0 commented Feb 8, 2024

Crashes at the first step; it looks like the second process throws this error:

Traceback (most recent call last):
File "/root/SimpleTuner/train_sdxl.py", line 1523, in
main()
File "/root/SimpleTuner/train_sdxl.py", line 1261, in main
logs["timesteps_scatter"] = wandb.plot.scatter(
File "/root/miniconda3/lib/python3.10/site-packages/wandb/plot/scatter.py", line 26, in scatter
return wandb.plot_table(
File "/root/miniconda3/lib/python3.10/site-packages/wandb/sdk/lib/preinit.py", line 36, in preinit_wrapper
raise wandb.Error(f"You must call wandb.init() before {name}()")
wandb.errors.Error: You must call wandb.init() before wandb.plot_table()
Epoch 1/462, Steps: 0%| | 1/6000 [00:11<15:45:28, 9.46s/it, lr=1.67e-8, step_loss=0.126][2024-02-08 09:58:40,980] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1314 closing signal SIGTERM
[2024-02-08 09:58:41,845] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 1315) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
File "/root/miniconda3/bin/accelerate", line 8, in
sys.exit(main())
File "/root/miniconda3/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/root/miniconda3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 970, in launch_command
multi_gpu_launcher(args)
File "/root/miniconda3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
distrib_run.run(args)
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/root/SimpleTuner/train_sdxl.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-02-08_09:58:40
host : autodl-container-824b40a131-cc439d6f
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1315)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

bghira commented Feb 8, 2024

I've added a tentative fix which only calls that function on the master process.
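Something along these lines, i.e. gating the tracker call behind Accelerate's main-process check (sketch only; the wandb.Table construction and the timesteps/step_losses names are illustrative placeholders, not the exact code in the repo):

```python
import wandb

# Sketch of the guard: wandb.init() only runs on the main process (e.g. via
# accelerator.init_trackers), so direct wandb.* calls must be gated the same way.
# `accelerator`, `timesteps`, and `step_losses` are placeholder names.
if accelerator.is_main_process:
    table = wandb.Table(
        data=[[t, l] for t, l in zip(timesteps, step_losses)],
        columns=["timestep", "loss"],
    )
    logs["timesteps_scatter"] = wandb.plot.scatter(table, "timestep", "loss")
```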

bghira added the bug (Something isn't working) and pending (This issue has a fix that is awaiting test results) labels on Feb 8, 2024

zhuliyi0 commented Feb 8, 2024

There is another error right after the first validation:

Validation...
Generating 7 images.
Loaded scheduler as EulerDiscreteScheduler from scheduler subfolder of /root/autodl-tmp/sd_xl_base_1.0. | 0/3 [00:00<?, ?it/s]
Loading pipeline components...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1161.21it/s]
Epoch 1/261, Steps: 0%| | 1/6000 [01:18<13:32:14, 8.12s/it, lr=1.67e-8, step_loss=0.125]/root/miniconda3/lib/python3.10/site-packages/torch/autograd/__init__.py:266: UserWarning: Error detected in MulBackward0. Traceback of forward call that caused the error:
File "/root/SimpleTuner/train_sdxl.py", line 1527, in
main()
File "/root/SimpleTuner/train_sdxl.py", line 1151, in main
model_pred = unet(
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1523, in forward
else self._run_ddp_forward(*inputs, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/accelerate/utils/operations.py", line 581, in forward
return model_forward(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/accelerate/utils/operations.py", line 569, in call
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/root/miniconda3/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
return func(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/diffusers/models/unets/unet_2d_condition.py", line 1186, in forward
sample = upsample_block(
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/diffusers/models/unets/unet_2d_blocks.py", line 2413, in forward
hidden_states = attn(
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/diffusers/models/transformers/transformer_2d.py", line 379, in forward
hidden_states = torch.utils.checkpoint.checkpoint(
File "/root/miniconda3/lib/python3.10/site-packages/torch/_compile.py", line 24, in inner
return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
return fn(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
return fn(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 489, in checkpoint
ret = function(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/diffusers/models/transformers/transformer_2d.py", line 374, in custom_forward
return module(*inputs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/diffusers/models/attention.py", line 400, in forward
ff_output = self.ff(norm_hidden_states, scale=lora_scale)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/diffusers/models/attention.py", line 672, in forward
hidden_states = module(hidden_states, scale)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/diffusers/models/activations.py", line 103, in forward
return hidden_states * self.gelu(gate)
(Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:113.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
File "/root/SimpleTuner/train_sdxl.py", line 1527, in
main()
File "/root/SimpleTuner/train_sdxl.py", line 1215, in main
accelerator.backward(loss)
File "/root/miniconda3/lib/python3.10/site-packages/accelerate/accelerator.py", line 1853, in backward
loss.backward(**kwargs)
File "/root/miniconda3/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
torch.autograd.backward(
File "/root/miniconda3/lib/python3.10/site-packages/torch/autograd/init.py", line 266, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'MulBackward0' returned nan values in its 0th output.
[2024-02-08 10:19:37,299] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2701 closing signal SIGTERM
[2024-02-08 10:19:38,064] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 2702) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
File "/root/miniconda3/bin/accelerate", line 8, in
sys.exit(main())
File "/root/miniconda3/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/root/miniconda3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 970, in launch_command
multi_gpu_launcher(args)
File "/root/miniconda3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
distrib_run.run(args)
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/root/SimpleTuner/train_sdxl.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-02-08_10:19:37
host : autodl-container-824b40a131-cc439d6f
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2702)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
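
For context: the RuntimeError is raised by PyTorch's anomaly detection, which is evidently enabled in this run. It re-traces the forward op that later produced a NaN gradient (here the GEGLU multiply hidden_states * self.gelu(gate)) and aborts the backward pass as soon as the NaN is detected. As a generic illustration only (this is not SimpleTuner's actual code, and model, optimizer, accelerator, and loss are placeholders), one common way to survive an occasional bad batch, once anomaly detection is switched back off, is to check gradients for finiteness before stepping:

```python
import torch

# Generic sketch, not SimpleTuner's code. Only applies when
# torch.autograd.set_detect_anomaly(True) is NOT active, since anomaly
# detection raises inside backward() before gradients can be inspected.
accelerator.backward(loss)

grads_finite = all(
    torch.isfinite(p.grad).all()
    for p in model.parameters()
    if p.grad is not None
)

if grads_finite:
    optimizer.step()
# If a NaN/Inf gradient appears, skip the update so it never reaches the
# weights; either way, clear gradients for the next step.
optimizer.zero_grad(set_to_none=True)
```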

bghira commented Feb 8, 2024

I have never seen that error before; it's new to me! Unfortunately, I'm not able to test multi-GPU issues or reproduce them readily. You could try disabling validations and see if it gets further.

zhuliyi0 commented Feb 8, 2024

OK, understood.

zhuliyi0 commented Feb 8, 2024

Closing it for now.

zhuliyi0 closed this as completed on Feb 8, 2024