
Multi-GPU Training+Single-GPU Eval runs into time out #223

Closed
cronoik opened this issue Dec 31, 2021 · 5 comments · Fixed by #228

Comments


cronoik commented Dec 31, 2021

Hi everyone,

We run into a timeout when we evaluate for more than 30 minutes on a single GPU. Is there a way to tell the other GPUs to wait until the main GPU completes the evaluation?

[E ProcessGroupNCCL.cpp:587] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1802157 milliseconds before timing out.
Traceback (most recent call last):
  File "scripts/training_test.py", line 172, in <module>
    main(args)
  File "scripts/training_test.py", line 167, in main
    train(args.config)
  File "scripts/training_test.py", line 148, in train
    trainer.train_pipeline()
  File "/home/azureuser/mytrainer.py", line 182, in train_pipeline
    for step, batch in enumerate(pbar):
  File "/anaconda/envs/cronoik_test/lib/python3.8/site-packages/tqdm/std.py", line 1168, in __iter__
    for obj in iterable:
  File "/anaconda/envs/cronoik_test/lib/python3.8/site-packages/accelerate/data_loader.py", line 301, in __iter__
    synchronize_rng_states(self.rng_types, self.generator)
  File "/anaconda/envs/cronoik_test/lib/python3.8/site-packages/accelerate/utils.py", line 110, in synchronize_rng_states
    synchronize_rng_state(RNGType(rng_type), generator=generator)
  File "/anaconda/envs/cronoik_test/lib/python3.8/site-packages/accelerate/utils.py", line 105, in synchronize_rng_state
    generator.set_state(rng_state)
RuntimeError: Invalid mt19937 state

Evaluating ... : 47%|███████████████████████████▎                             | 2550/5411 [30:02<29:18, 1.63it/s][E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1802593 milliseconds before timing out.
Evaluating ... : 47%|███████████████████████████▎                             | 2551/5411 [30:02<29:01, 1.64it/s][E ProcessGroupNCCL.cpp:587] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1802971 milliseconds before timing out.

@sgugger Can you please have a look?

sgugger (Collaborator) commented Jan 10, 2022

The timeout at 30 minutes comes from PyTorch, but you can adjust it when initializing the distributed process group. Accelerate does this automatically, but only if you haven't done it yourself in the script. I'll expose that argument this week or the next, but in the meantime, you can use this line as a workaround:

torch.distributed.init_process_group(backend="nccl", timeout=datetime.timedelta(seconds=xxx))

The default is 3600 seconds.
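
For reference, here is a minimal sketch of where that workaround sits relative to Accelerate (the 7200-second value is only an illustrative choice; newer Accelerate releases also expose the same setting through the InitProcessGroupKwargs handler):

import datetime

import torch.distributed as dist
from accelerate import Accelerator, InitProcessGroupKwargs

# Option 1: create the process group yourself before Accelerate does, with a
# timeout long enough to cover the single-GPU evaluation phase.
# dist.init_process_group(backend="nccl", timeout=datetime.timedelta(seconds=7200))

# Option 2: let Accelerate create it and pass the timeout via a kwargs handler.
ipg_kwargs = InitProcessGroupKwargs(timeout=datetime.timedelta(seconds=7200))
accelerator = Accelerator(kwargs_handlers=[ipg_kwargs])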

@santarabantoosoo

I didn't know how to set this as an argument when using python -m torch.distributed.launch.

I found the ddp_timeout argument in TrainingArguments and used it.

I am commenting on this for:
1. making sure I am correct
2. helping others if they face the same issue

Here is my example terminal command:

python -m torch.distributed.launch \
    --nproc_per_node 8 run_mlm.py \
    --ddp_timeout 7200 \
    --fp16 \
    --model_name_or_path bert-base-cased \
    --train_file data/msgs_train_online_text-all_disch-all.txt \
    --validation_file msgs_valid.txt \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --do_train \
    --do_eval \
    --output_dir /models/msgs_online_disch_MLM \
    --overwrite_output_dir 
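
For anyone setting this up from Python rather than the command line, the same configuration looks roughly like the sketch below (ddp_timeout is in seconds; the other values simply mirror the command above):

from transformers import TrainingArguments

# ddp_timeout is forwarded to torch.distributed.init_process_group as the
# collective timeout (in seconds), so a long evaluation on a single process
# no longer trips the NCCL watchdog on the idle ranks.
training_args = TrainingArguments(
    output_dir="/models/msgs_online_disch_MLM",
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    fp16=True,
    ddp_timeout=7200,
)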


yawzhe commented Mar 18, 2024

I set up DDP, but when running llama_factory the tokenizer has to be loaded twice; the first time the 1.6 million entries load fine, but the second load fails with an error.
Uploading 微信图片_20240318191811.jpg…



Neo9061 commented Jun 25, 2024

The timeout at 30 minutes comes from PyTorch, but you can adjust it when initializing the distributed process group. Accelerate does this automatically, but only if you haven't done it yourself in the script. I'll expose that argument this week or the next, but in the meantime, you can use this line as a workaround:

torch.distributed.init_process_group(backend="nccl", timeout=datetime.timedelta(seconds=xxx))

The default is 3600 seconds.

@sgugger if I use FSDP distributed fine-tuning, is this timeout also controlled by ddp_timeout? I modified that argument but still see a timeout. See my second issue in huggingface/transformers#31577.
