AttributeError: 'DistributedDataParallel' object has no attribute 'dtype' #676
Comments
(flux) [wangxi@v100-4 SimpleTuner]$ nvcc --version
(flux) [wangxi@v100-4 SimpleTuner]$ pip show bitsandbytes
Please try the fix I just pushed, which uses base_weight_dtype instead.
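For context, the gist of that fix is to stop reading transformer.dtype after the model has been wrapped by DistributedDataParallel and to use the dtype of the underlying weights instead. A minimal sketch of the idea, not SimpleTuner's exact code; the helper name and the accelerator.prepare() timing are illustrative assumptions:

```python
import torch

def get_base_weight_dtype(model: torch.nn.Module) -> torch.dtype:
    """Illustrative helper (not from SimpleTuner): return the dtype of the
    model's weights, unwrapping DistributedDataParallel if necessary."""
    if isinstance(model, torch.nn.parallel.DistributedDataParallel):
        model = model.module  # DDP keeps the original module here
    return next(model.parameters()).dtype

# Recorded once, e.g. around the point where accelerator.prepare() wraps the
# transformer, and reused wherever transformer.dtype was read before:
# base_weight_dtype = get_base_weight_dtype(transformer)
# latents = latents.to(dtype=base_weight_dtype, device=accelerator.device)
```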
It should be PyTorch 2.4, CUDA 12.4, and bitsandbytes 0.43.2.
How is it fitting here on 4x 32GB? DeepSpeed?
Running on a V100 server.
PyTorch 2.4, CUDA 11.8, and bitsandbytes 0.35.
Updated, but it is still not fixed:
Epoch 1/3, Steps: 0%| | 0/3000 [00:00<?, ?it/s] [rank0]:[W808 14:00:21.439799120 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
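For reference, the ProcessGroupNCCL warning in that log is separate from the crash; it only notes that the script exited without tearing down the process group. The teardown call it refers to is:

```python
import torch.distributed as dist

# Run on every rank at the end of the script to silence the warning.
if dist.is_initialized():
    dist.destroy_process_group()
```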
You seem to be running the release branch. The fix is on the main branch. |
I've already switched to main.
See what 'git log' says at the top?
I have successfully installed bitsandbytes 0.43.3, but now I get a new error.
2024-08-09 10:56:17,298 [INFO] (main) Loading our accelerator...
[rank0]:[W809 10:56:25.219082534 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
Epoch 1/3, Steps: 0%| | 0/3000 [00:00<?, ?it/s]'DistributedDataParallel' object has no attribute 'dtype'
Traceback (most recent call last):
  File "/mnt/data1/wangxi/SimpleTuner/train.py", line 2756, in <module>
    main()
  File "/mnt/data1/wangxi/SimpleTuner/train.py", line 2060, in main
    dtype=transformer.dtype, device=accelerator.device
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1729, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'dtype'
Epoch 1/3, Steps: 0%| | 0/3000 [00:00<?, ?it/s]
(flux) [wangxi@v100-4 SimpleTuner]$
(flux) [wangxi@v100-4 SimpleTuner]$ git pull
Already up to date.
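To make the traceback above concrete: DistributedDataParallel does not forward arbitrary attributes of the wrapped model, so transformer.dtype raises AttributeError once the transformer has been wrapped, while the same value is still reachable through .module or through the parameters. A small self-contained repro sketch, single process with the gloo backend; TinyModel is made up for illustration and only stands in for the real transformer:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

class TinyModel(torch.nn.Module):
    """Stand-in for the transformer; defines .dtype the way many model classes do."""
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    @property
    def dtype(self):
        return next(self.parameters()).dtype

wrapped = DDP(TinyModel())
print(wrapped.module.dtype)              # torch.float32 -- unwrap first
print(next(wrapped.parameters()).dtype)  # torch.float32 -- or ask a parameter
try:
    wrapped.dtype                        # same AttributeError as in the issue
except AttributeError as exc:
    print(exc)

dist.destroy_process_group()
```

With Hugging Face Accelerate, accelerator.unwrap_model(transformer) returns the underlying module in the same way as .module does here.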