
DDP expects same model across all ranks, but Rank 0 has 128 params, while rank 1 has inconsistent 0 params. #49

Open
xukefaker opened this issue Sep 10, 2023 · 1 comment


@xukefaker

Hi, I ran into a problem where DDP complains that the ranks have different models. Details follow.


```
./train_lora.sh
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /root/anaconda3/envs/chat-doctor/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
bin /root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda118.so

Finetuning model with params:
base_model: /disk2/data/xk/retr-llm/files/model/llama-7b/
data_path: /disk2/data/xk/retr-llm/files/datasets/mental_health_chatbot_dataset.json
output_dir: ./lora-chatDoctor_bs192_Mbs24_ep3_len512_lr3e-5_fromAlpacaLora
batch_size: 192
micro_batch_size: 24
num_epochs: 3
learning_rate: 3e-05
cutoff_len: 256
val_set_size: 120
use_gradient_checkpointing: False
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: None
bottleneck_size: 256
non_linearity: tanh
adapter_dropout: 0.0
use_parallel_adapter: False
use_adapterp: False
train_on_inputs: True
scaling: 1.0
adapter_name: lora
target_modules: None
group_by_length: False
wandb_project:
wandb_run_name:
wandb_watch:
wandb_log_model:
resume_from_checkpoint: None
Loading checkpoint shards: 100%|##########| 33/33 [00:12<00:00, 2.58it/s]
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
Map: 100%|##########| 52/52 [00:00<00:00, 687.22 examples/s]
Map: 100%|##########| 120/120 [00:00<00:00, 765.56 examples/s]
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1807082 milliseconds before timing out.
Traceback (most recent call last):
  File "train_lora.py", line 353, in <module>
    fire.Fire(train)
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "train_lora.py", line 299, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/transformers/trainer.py", line 1749, in _inner_training_loop
    model = self._wrap_model(self.model_wrapped)
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/transformers/trainer.py", line 1569, in _wrap_model
    model = nn.parallel.DistributedDataParallel(
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 674, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: DDP expects same model across all ranks, but Rank 0 has 128 params, while rank 1 has inconsistent 0 params.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1807082 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1807414 milliseconds before timing out.
Traceback (most recent call last):
  File "train_lora.py", line 353, in <module>
    fire.Fire(train)
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "train_lora.py", line 299, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/transformers/trainer.py", line 1749, in _inner_training_loop
    model = self._wrap_model(self.model_wrapped)
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/transformers/trainer.py", line 1569, in _wrap_model
    model = nn.parallel.DistributedDataParallel(
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 674, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/root/anaconda3/envs/chat-doctor/lib/python3.8/site-packages/torch/distributed/utils.py", line 118, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: DDP expects same model across all ranks, but Rank 3 has 128 params, while rank 0 has inconsistent 0 params.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1807414 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1807716 milliseconds before timing out.
```


My environment:
GPU: 8 × A100 80GB
PyTorch version: 2.0.1

How can I solve this bug? Thanks!
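For debugging, a minimal per-rank sanity check could be dropped in right before `trainer.train()` (the DDP wrap happens inside it, per the traceback). This is only a sketch: `report_param_count` is a hypothetical helper, not part of train_lora.py.

```python
import os

import torch
import torch.distributed as dist


def report_param_count(model: torch.nn.Module) -> None:
    """Print how many parameter tensors each rank sees before DDP wraps the model."""
    # Hypothetical helper, not part of train_lora.py. The error above says rank 0
    # registered 128 parameter tensors while another rank registered 0, so printing
    # the count per rank shows which processes never actually built the model.
    if dist.is_available() and dist.is_initialized():
        rank = dist.get_rank()
    else:
        rank = int(os.environ.get("LOCAL_RANK", 0))
    n_tensors = sum(1 for _ in model.parameters())
    n_elements = sum(p.numel() for p in model.parameters())
    print(f"[rank {rank}] parameter tensors: {n_tensors}, elements: {n_elements}", flush=True)
```

If the count really is 0 on the non-zero ranks, one thing worth checking (an assumption on my side, since I cannot see how train_lora.py loads the base model) is whether the model is loaded with `device_map="auto"` under torchrun; pinning it to the local GPU instead, e.g. `device_map={"": int(os.environ.get("LOCAL_RANK", 0))}`, makes each process load a full copy of the model on its own device, which is what DDP expects.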

@AstroCIEL

Same problem. Do you have any solutions now?
