
AttributeError: 'DistributedDataParallel' object has no attribute 'dtype' #676

Closed
magicwang1111 opened this issue Aug 8, 2024 · 12 comments
Labels
bug (Something isn't working) · pending (This issue has a fix that is awaiting test results) · regression (This bug has regressed behaviour that previously worked)

Comments

@magicwang1111

  • Total optimization steps = 3000
  • Total optimization steps remaining = 3000
Epoch 1/3, Steps: 0%| | 0/3000 [00:00<?, ?it/s]'DistributedDataParallel' object has no attribute 'dtype'
Traceback (most recent call last):
  File "/mnt/data1/wangxi/SimpleTuner/train.py", line 2756, in <module>
    main()
  File "/mnt/data1/wangxi/SimpleTuner/train.py", line 2060, in main
    dtype=transformer.dtype, device=accelerator.device
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1729, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'dtype'

Epoch 1/3, Steps: 0%| | 0/3000 [00:00<?, ?it/s]

(flux) [wangxi@v100-4 SimpleTuner]$
(flux) [wangxi@v100-4 SimpleTuner]$ git pull
Already up to date.
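For context on the traceback above: after accelerator.prepare(), the transformer is wrapped in DistributedDataParallel, and the DDP wrapper does not forward attributes such as dtype to the underlying module, which is why transformer.dtype raises. A minimal sketch of the usual workaround, using a hypothetical TinyTransformer stand-in rather than SimpleTuner's actual model:

```python
# Minimal sketch (hypothetical model): DistributedDataParallel does not forward
# attributes such as `dtype`, so read them from the unwrapped module instead.
import torch
from accelerate import Accelerator


class TinyTransformer(torch.nn.Module):
    """Hypothetical stand-in for a diffusers-style transformer that exposes `dtype`."""

    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(8, 8, dtype=torch.bfloat16)

    @property
    def dtype(self):
        return self.proj.weight.dtype


accelerator = Accelerator()
model = accelerator.prepare(TinyTransformer())  # DDP-wrapped when launched on multiple GPUs

# `model.dtype` raises AttributeError on the DDP wrapper; unwrap it first:
dtype = accelerator.unwrap_model(model).dtype  # torch.bfloat16
```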

@magicwang1111
Author

(flux) [wangxi@v100-4 SimpleTuner]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
(flux) [wangxi@v100-4 SimpleTuner]$

@magicwang1111
Author

magicwang1111 commented Aug 8, 2024

(flux) [wangxi@v100-4 SimpleTuner]$ pip show bitsandbytes
Name: bitsandbytes
Version: 0.35.0
Summary: 8-bit optimizers and matrix multiplication routines.
Home-page: https://github.com/TimDettmers/bitsandbytes
Author: Tim Dettmers
Author-email: dettmers@cs.washington.edu
License: MIT
Location: /mnt/data/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages
Requires:
Required-by:
(flux) [wangxi@v100-4 SimpleTuner]$
Due to my CUDA version, I can't install the bitsandbytes 0.42 you recommended.

@bghira
Owner

bghira commented Aug 8, 2024

Please try the fix I just pushed to use base_weight_dtype instead.

@bghira added the regression and pending labels Aug 8, 2024
@bghira
Owner

bghira commented Aug 8, 2024

Should be PyTorch 2.4, CUDA 12.4, and bitsandbytes 0.43.2.
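A quick way to confirm what the environment actually provides (a minimal sketch; it assumes bitsandbytes exposes __version__, which recent releases do):

```python
# Minimal sketch: print the versions the recommendation above refers to.
import torch
import bitsandbytes

print("torch:", torch.__version__)                # recommended: 2.4.x
print("cuda:", torch.version.cuda)                # recommended: 12.4
print("bitsandbytes:", bitsandbytes.__version__)  # recommended: 0.43.2
```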

@bghira
Owner

bghira commented Aug 8, 2024

How is it fitting here on 4x 32GB? DeepSpeed?

@magicwang1111
Author

How is it fitting here on 4x 32GB? DeepSpeed?

Running on a V100 server.

@magicwang1111
Author

Should be PyTorch 2.4, CUDA 12.4, and bitsandbytes 0.43.2.

PyTorch 2.4, CUDA 11.8, and bitsandbytes 0.35.

@magicwang1111
Author

magicwang1111 commented Aug 8, 2024

Please try the fix I just pushed to use base_weight_dtype instead.

Updated, but it's still not fixed:
2024-08-08 14:00:06,858 [INFO] (main) Moving the diffusion transformer to GPU in torch.bfloat16 precision.
2024-08-08 14:00:06,874 [INFO] (main)
***** Running training *****

  • Num batches = 1008
  • Num Epochs = 3
  • Current Epoch = 1
  • Total train batch size (w. parallel, distributed & accumulation) = 1
  • Instantaneous batch size per device = 1
  • Gradient Accumulation steps = 1
  • Total optimization steps = 3000
  • Total optimization steps remaining = 3000
Epoch 1/3, Steps: 0%| | 0/3000 [00:00<?, ?it/s]'DistributedDataParallel' object has no attribute 'dtype'
Traceback (most recent call last):
  File "/mnt/data1/wangxi/SimpleTuner/train.py", line 2756, in <module>
    main()
  File "/mnt/data1/wangxi/SimpleTuner/train.py", line 2060, in main
    dtype=transformer.dtype, device=accelerator.device
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1729, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'dtype'

Epoch 1/3, Steps: 0%| | 0/3000 [00:00<?, ?it/s]

[rank0]:[W808 14:00:21.439799120 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
(flux) [wangxi@v100-4 SimpleTuner]$

@mhirki
Contributor

mhirki commented Aug 8, 2024

You seem to be running the release branch. The fix is on the main branch.

@magicwang1111
Author

You seem to be running the release branch. The fix is on the main branch.

Already switched to main:
(flux) [wangxi@v100-4 SimpleTuner]$ git branch
* main
  release
(flux) [wangxi@v100-4 SimpleTuner]$

@bghira
Owner

bghira commented Aug 8, 2024

See what 'git log' says at the top?

@bghira added the bug label Aug 8, 2024
@bghira closed this as completed Aug 8, 2024
@magicwang1111
Author

See what 'git log' says at the top?

I successfully installed bitsandbytes 0.43.3, but now there's a new error.

2024-08-09 10:56:17,298 [INFO] (main) Loading our accelerator...
Modules with uninitialized parameters can't be used with DistributedDataParallel. Run a dummy forward pass to correctly initialize the modules
Traceback (most recent call last):
  File "/mnt/data1/wangxi/SimpleTuner/train.py", line 2762, in <module>
    main()
  File "/mnt/data1/wangxi/SimpleTuner/train.py", line 1341, in main
    results = accelerator.prepare(
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/accelerate/accelerator.py", line 1299, in prepare
    result = tuple(
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/accelerate/accelerator.py", line 1300, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/accelerate/accelerator.py", line 1176, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/accelerate/accelerator.py", line 1435, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 784, in __init__
    self._log_and_throw(
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1127, in _log_and_throw
    raise err_type(err_msg)
RuntimeError: Modules with uninitialized parameters can't be used with DistributedDataParallel. Run a dummy forward pass to correctly initialize the modules

[rank0]:[W809 10:56:25.219082534 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
(flux) [wangxi@v100-4 SimpleTuner]$ ^C
(flux) [wangxi@v100-4 SimpleTuner]$
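For reference, this RuntimeError means DistributedDataParallel refused to wrap a module that still has uninitialized (lazy or meta) parameters; as the message suggests, materializing them with a dummy forward pass before accelerator.prepare() avoids it. A minimal sketch with a hypothetical lazy module, not SimpleTuner's actual transformer:

```python
# Minimal sketch (hypothetical module): DDP rejects uninitialized parameters,
# so run a dummy forward pass to materialize them before accelerator.prepare().
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.LazyLinear(8)  # parameters stay uninitialized until the first forward

with torch.no_grad():
    model(torch.zeros(1, 8))    # dummy forward pass initializes the weight and bias

model = accelerator.prepare(model)  # wrapping in DistributedDataParallel now succeeds
```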
