
AttributeError: 'DistributedDataParallel' object has no attribute 'dtype' #676

Closed
magicwang1111 opened this issue Aug 8, 2024 · 12 comments
Labels
bug (Something isn't working) · pending (This issue has a fix that is awaiting test results) · regression (This bug has regressed behaviour that previously worked)

Comments

@magicwang1111

  • Total optimization steps = 3000
  • Total optimization steps remaining = 3000
Epoch 1/3, Steps: 0%| | 0/3000 [00:00<?, ?it/s]'DistributedDataParallel' object has no attribute 'dtype'
Traceback (most recent call last):
  File "/mnt/data1/wangxi/SimpleTuner/train.py", line 2756, in <module>
    main()
  File "/mnt/data1/wangxi/SimpleTuner/train.py", line 2060, in main
    dtype=transformer.dtype, device=accelerator.device
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1729, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'dtype'

Epoch 1/3, Steps: 0%| | 0/3000 [00:00<?, ?it/s]

(flux) [wangxi@v100-4 SimpleTuner]$
(flux) [wangxi@v100-4 SimpleTuner]$ git pull
Already up to date.
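For context on the traceback above: after accelerator.prepare(), the transformer is wrapped in DistributedDataParallel, and the DDP wrapper does not forward attributes such as dtype to the underlying module, which is why transformer.dtype raises. A minimal sketch of the usual workaround, using a hypothetical TinyTransformer stand-in rather than SimpleTuner's actual model:

```python
# Minimal sketch (hypothetical model): DistributedDataParallel does not forward
# attributes such as `dtype`, so read them from the unwrapped module instead.
import torch
from accelerate import Accelerator


class TinyTransformer(torch.nn.Module):
    """Hypothetical stand-in for a diffusers-style transformer that exposes `dtype`."""

    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(8, 8, dtype=torch.bfloat16)

    @property
    def dtype(self):
        return self.proj.weight.dtype


accelerator = Accelerator()
model = accelerator.prepare(TinyTransformer())  # DDP-wrapped when launched on multiple GPUs

# `model.dtype` raises AttributeError on the DDP wrapper; unwrap it first:
dtype = accelerator.unwrap_model(model).dtype  # torch.bfloat16
```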

@magicwang1111
Author

(flux) [wangxi@v100-4 SimpleTuner]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
(flux) [wangxi@v100-4 SimpleTuner]$

@magicwang1111
Author

magicwang1111 commented Aug 8, 2024

(flux) [wangxi@v100-4 SimpleTuner]$ pip show bitsandbytes
Name: bitsandbytes
Version: 0.35.0
Summary: 8-bit optimizers and matrix multiplication routines.
Home-page: https://github.com/TimDettmers/bitsandbytes
Author: Tim Dettmers
Author-email: dettmers@cs.washington.edu
License: MIT
Location: /mnt/data/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages
Requires:
Required-by:
(flux) [wangxi@v100-4 SimpleTuner]$
Due to my CUDA version, I can't install the bitsandbytes 0.42 you recommended.

@bghira
Owner

bghira commented Aug 8, 2024

Please try the fix I just pushed to use base_weight_dtype instead.

@bghira added the regression and pending labels Aug 8, 2024
@bghira
Owner

bghira commented Aug 8, 2024

Should be PyTorch 2.4, CUDA 12.4, and bitsandbytes 0.43.2.
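A quick way to confirm what the environment actually provides (a minimal sketch; it assumes bitsandbytes exposes __version__, which recent releases do):

```python
# Minimal sketch: print the versions the recommendation above refers to.
import torch
import bitsandbytes

print("torch:", torch.__version__)                # recommended: 2.4.x
print("cuda:", torch.version.cuda)                # recommended: 12.4
print("bitsandbytes:", bitsandbytes.__version__)  # recommended: 0.43.2
```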

@bghira
Owner

bghira commented Aug 8, 2024

How is it fitting here on 4x 32GB? DeepSpeed?

@magicwang1111
Author

How is it fitting here on 4x 32GB? DeepSpeed?

Running on a V100 server.

@magicwang1111
Author

Should be PyTorch 2.4, CUDA 12.4, and bitsandbytes 0.43.2.

PyTorch 2.4, CUDA 11.8, and bitsandbytes 0.35.

@magicwang1111
Author

magicwang1111 commented Aug 8, 2024

Please try the fix I just pushed to use base_weight_dtype instead.

Updated, but it's still not fixed:
2024-08-08 14:00:06,858 [INFO] (main) Moving the diffusion transformer to GPU in torch.bfloat16 precision.
2024-08-08 14:00:06,874 [INFO] (main)
***** Running training *****

  • Num batches = 1008
  • Num Epochs = 3
  • Current Epoch = 1
  • Total train batch size (w. parallel, distributed & accumulation) = 1
  • Instantaneous batch size per device = 1
  • Gradient Accumulation steps = 1
  • Total optimization steps = 3000
  • Total optimization steps remaining = 3000
Epoch 1/3, Steps: 0%| | 0/3000 [00:00<?, ?it/s]'DistributedDataParallel' object has no attribute 'dtype'
Traceback (most recent call last):
  File "/mnt/data1/wangxi/SimpleTuner/train.py", line 2756, in <module>
    main()
  File "/mnt/data1/wangxi/SimpleTuner/train.py", line 2060, in main
    dtype=transformer.dtype, device=accelerator.device
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1729, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'dtype'

Epoch 1/3, Steps: 0%| | 0/3000 [00:00<?, ?it/s]

[rank0]:[W808 14:00:21.439799120 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
(flux) [wangxi@v100-4 SimpleTuner]$

@mhirki
Contributor

mhirki commented Aug 8, 2024

You seem to be running the release branch. The fix is on the main branch.

@magicwang1111
Author

You seem to be running the release branch. The fix is on the main branch.

Already switched to main:
(flux) [wangxi@v100-4 SimpleTuner]$ git branch
* main
  release
(flux) [wangxi@v100-4 SimpleTuner]$

@bghira
Owner

bghira commented Aug 8, 2024

See what 'git log' says at the top?

@bghira added the bug label Aug 8, 2024
@bghira closed this as completed Aug 8, 2024
@magicwang1111
Author

See what 'git log' says at the top?

I successfully installed bitsandbytes 0.43.3, but now there's a new error.

2024-08-09 10:56:17,298 [INFO] (main) Loading our accelerator...
Modules with uninitialized parameters can't be used with DistributedDataParallel. Run a dummy forward pass to correctly initialize the modules
Traceback (most recent call last):
  File "/mnt/data1/wangxi/SimpleTuner/train.py", line 2762, in <module>
    main()
  File "/mnt/data1/wangxi/SimpleTuner/train.py", line 1341, in main
    results = accelerator.prepare(
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/accelerate/accelerator.py", line 1299, in prepare
    result = tuple(
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/accelerate/accelerator.py", line 1300, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/accelerate/accelerator.py", line 1176, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/accelerate/accelerator.py", line 1435, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 784, in __init__
    self._log_and_throw(
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1127, in _log_and_throw
    raise err_type(err_msg)
RuntimeError: Modules with uninitialized parameters can't be used with DistributedDataParallel. Run a dummy forward pass to correctly initialize the modules

[rank0]:[W809 10:56:25.219082534 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
(flux) [wangxi@v100-4 SimpleTuner]$ ^C
(flux) [wangxi@v100-4 SimpleTuner]$
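For reference, this RuntimeError means DistributedDataParallel refused to wrap a module that still has uninitialized (lazy or meta) parameters; as the message suggests, materializing them with a dummy forward pass before accelerator.prepare() avoids it. A minimal sketch with a hypothetical lazy module, not SimpleTuner's actual transformer:

```python
# Minimal sketch (hypothetical module): DDP rejects uninitialized parameters,
# so run a dummy forward pass to materialize them before accelerator.prepare().
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.LazyLinear(8)  # parameters stay uninitialized until the first forward

with torch.no_grad():
    model(torch.zeros(1, 8))    # dummy forward pass initializes the weight and bias

model = accelerator.prepare(model)  # wrapping in DistributedDataParallel now succeeds
```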
