error of multi-GPU: torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 #162

Open
xiaoweiweixiao opened this issue Mar 30, 2023 · 32 comments

Comments

@xiaoweiweixiao

xiaoweiweixiao commented Mar 30, 2023

When I use four GPUs to train the model, I get this error. Can anybody help me solve it? Thank you very much.

WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 77807 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 77808 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 77809 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 77806) of binary: /home/la/anaconda3/envs/alpaca_torch/bin/python
Traceback (most recent call last):
  File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
train.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-30_20:18:47
  host      : guest-server
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 77806)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 77806
======================================================

@xiaoweiweixiao
Author

When I use multiple GPUs to run other code, I also get this error. Can anyone help me?

@kasakh

kasakh commented Mar 30, 2023

Can you show the command you used to train in the multi-GPU environment?

@xiaoweiweixiao
Author

python -m torch.distributed.run --nproc_per_node=4 --master_port=11110 train.py \
--model_name_or_path ./output/path \
--data_path ./alpaca_data.json \
--fp16 True \
--output_dir ./pretrained \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 False

@ZiweiWangTHU

Same problem, have you found the solution?

@FinalFlowers

Same problem, hoping for an answer.

@optimist-lsc

optimist-lsc commented Apr 4, 2023

Please attempt to install the specified version of transformers:
pip install git+https://github.com/zphang/transformers.git
cd transformers
git --reset hard 68d640f7c368bcaaaecfc678f11908ebbd3d6176
python setup.py install
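(As written, those steps will probably not work: "pip install git+..." builds the package from a temporary clone and leaves no local checkout to "cd" into, and the git flag is spelled "reset --hard", not "--reset hard". A corrected sequence would presumably look like this:)

git clone https://github.com/zphang/transformers.git
cd transformers
git reset --hard 68d640f7c368bcaaaecfc678f11908ebbd3d6176
pip install .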

@xiaoweiweixiao
Author

git --reset hard 68d640f7c368bcaaaecfc678f11908ebbd3d6176

Thank you for your advice. There is no setup.py in the transformers repository, only a README.md, so I cannot install transformers.

@xv994

xv994 commented Apr 7, 2023

pip install git+https://github.com/zphang/transformers.git
cd transformers
git --reset hard 68d640f7c368bcaaaecfc678f11908ebbd3d6176
pip install .

You can try it; it solved the problem for me.

@xiaoweiweixiao
Author

pip install git+https://github.com/zphang/transformers.git cd transformers git --reset hard 68d640f7c368bcaaaecfc678f11908ebbd3d6176 pip install .

You can try it; it solved the problem for me.

Thank you for your advice. I cannot run "pip install git+https://github.com/zphang/transformers.git"; I get this error:

Collecting git+https://github.com/zphang/transformers.git
  Cloning https://github.com/zphang/transformers.git to /tmp/pip-req-build-8bfk9e3m
  Running command git clone --quiet https://github.com/zphang/transformers.git /tmp/pip-req-build-8bfk9e3m
  Resolved https://github.com/zphang/transformers.git to commit 63a9d6745f679b2eb882e0f147828380981111fa
ERROR: git+https://github.com/zphang/transformers.git does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.

I downloaded the transformers repository from "https://github.com/zphang/transformers", ran "cd transformers" and then "git --reset hard 68d640f7c368bcaaaecfc678f11908ebbd3d6176", and got this error:

Unknown option: --reset
usage: git [--version] [--help] [-C <path>] [-c name=value]
           [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
           [-p | --paginate | --no-pager] [--no-replace-objects] [--bare]
           [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
           <command> [<args>]

Did I do something wrong?

@xv994

xv994 commented Apr 11, 2023

Sorry, maybe my suggestion last time was wrong.
Your transformers repository is the wrong one. Please try the following steps; this is what I did:
git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout 0041be5
pip install .
If you read Chinese, you can see this link: https://zhuanlan.zhihu.com/p/618321077 . I followed its steps and succeeded.

@xiaoweiweixiao
Author

pip install .

Thank you very much. My problem was solved by following your suggestion.

@xiaoweiweixiao
Author

0041be5

I have another question. I get the same error when running other models, and this method does not solve it.
I guess the commit "0041be5" should be different when running other models (such as GLM130B).
How do I find the right commit to check out instead of "0041be5"?

@xv994

xv994 commented Apr 14, 2023

I think you may need a different Python virtual environment to train each different model. And I don't know which version of transformers GLM130B needs, so you had better ask its developers or read its guide.
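For example, something like this (a rough sketch assuming conda; the environment name is a placeholder, and the exact package versions should come from the target model's own repository):

conda create -n glm130b python=3.10 -y
conda activate glm130b
# install the exact torch / transformers versions pinned by that model's repository
pip install -r requirements.txt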

@zhihui-shao

I followed this, but it still doesn't work. Is there any other solution?

@codemaster17611

Can you tell me the machine configuration on which you successfully ran train.py? I have the same problem, but no idea what's wrong.

@xv994

xv994 commented Apr 26, 2023

Can you tell me the machine configuration on which you successfully ran train.py? I have the same problem, but no idea what's wrong.

Can you show the error you got?

@xv994

xv994 commented Apr 26, 2023

I followed this, but it still doesn't work. Is there any other solution?

What exactly is the problem?

@codemaster17611

[screenshot of the error output]

@codemaster17611

Can you tell me the machine configuration on which you successfully ran train.py? I have the same problem, but no idea what's wrong.

Can you show the error you got?

It always shows exit code -9. My config: 3x V100 16 GB GPUs, 128 GB CPU RAM. Is the RAM not enough? Thanks for your reply.

@codemaster17611

Can you tell me the machine configuration on which you successfully ran train.py? I have the same problem, but no idea what's wrong.

Can you show the error you got?

It always shows exit code -9. My config: 3x V100 16 GB GPUs, 128 GB CPU RAM. Is the RAM not enough? Thanks for your reply.

But I have also been monitoring it, and I only see about 70% of the RAM being used in the background.

@xv994

xv994 commented Apr 26, 2023

[screenshot of the error output]

Oh, my friend, this is not the main cause; you need to show me the exception above this part. And your RAM is enough; my machine has less than that.

@xv994

xv994 commented Apr 26, 2023

Have you ever tried this:
git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout 0041be5
pip install .

maybe it works.

@Difang233

Have you ever tried this: git clone https://github.com/huggingface/transformers.git cd transformers git checkout 0041be5 pip install .

maybe it works.

Hi, I have tried this method, but still got this problem, do you have any idea about this?
The version of transformers I used is 4.29.0.dev0.
Thanks in advance!

2023-04-26 07:19:29.474990: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[2023-04-26 07:19:35,696] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-04-26 07:19:53,218] [INFO] [partition_parameters.py:454:__exit__] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100% 33/33 [01:08<00:00,  2.09s/it]
Using pad_token, but it is not set yet.
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
Using /root/.cache/torch_extensions/py39_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.060922384262085 seconds
Using /root/.cache/torch_extensions/py39_cu118 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py39_cu118/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.49018430709838867 seconds
Parameter Offload: Total persistent parameters: 0 in 0 params
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 9574) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=====================================================
/content/drive/MyDrive/codealpaca/train.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-26_07:22:29
  host      : 56de1ccd4f0e
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 9574)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 9574
=====================================================

@codemaster17611

codemaster17611 commented Apr 26, 2023

Have you ever tried this: git clone https://github.com/huggingface/transformers.git cd transformers git checkout 0041be5 pip install .

maybe it works.

Thanks for your reply. I have followed your zhihu post step by step.

My transformers version:

(llmenv3) [xlwu@mochinelearning transformers]$ git checkout 0041be5
HEAD is now at 0041be5b3 LLaMA Implementation (#21955)

Then I run the training script:

TORCH_CPP_LOG_LEVEL=INFO NCCL_DEBUG=INFO LOGLEVEL=INFO CUDA_VISIBLE_DEVICES=0,1,2 torchrun \
--nproc_per_node=3 \
--master_port=25001 train.py \
--model_name_or_path /DATA/cdisk/xlwu_workspace/pretrain_model/hf-llama-model/llama-7b \
--data_path /DATA/cdisk/xlwu_workspace/data/test.json \
--output_dir /DATA/cdisk/xlwu_workspace/output/alpaca/sft_7b \
--per_device_eval_batch_size 1 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "tensorboard" \
--gradient_checkpointing True \
--fp16 True \
--deepspeed ds_config.json 

The error shows:

    WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : train.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 3
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:25001
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 0
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_6we5ifkl/none_yi1eb8vm
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
[I socket.cpp:522] [c10d] The server socket has started to listen on [::]:25001.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:22302.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:22304.
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=25001
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2]
  role_ranks=[0, 1, 2]
  global_ranks=[0, 1, 2]
  role_world_sizes=[3, 3, 3]
  global_world_sizes=[3, 3, 3]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_6we5ifkl/none_yi1eb8vm/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_6we5ifkl/none_yi1eb8vm/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_6we5ifkl/none_yi1eb8vm/attempt_0/2/error.json
[2023-04-26 16:16:46,603] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29566.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29568.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29570.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29572.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29574.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29576.
[I ProcessGroupNCCL.cpp:751] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:587] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:587] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 1] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:587] [Rank 2] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 2] NCCL watchdog thread started!
mochinelearning:3293412:3293412 [0] NCCL INFO Bootstrap : Using eth0:10.1.118.59<0>
mochinelearning:3293412:3293412 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
mochinelearning:3293412:3293412 [0] NCCL INFO NET/IB : No device found.
mochinelearning:3293412:3293412 [0] NCCL INFO NET/Socket : Using [0]eth0:10.1.118.59<0> [1]br-e7526d7e228d:192.168.16.1<0> [2]br-1e30049ed87b:192.168.32.1<0> [3]br-100d386c1c63:192.168.48.1<0>
mochinelearning:3293412:3293412 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
mochinelearning:3293413:3293413 [1] NCCL INFO Bootstrap : Using eth0:10.1.118.59<0>
mochinelearning:3293414:3293414 [2] NCCL INFO Bootstrap : Using eth0:10.1.118.59<0>
mochinelearning:3293414:3293414 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
mochinelearning:3293413:3293413 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
mochinelearning:3293414:3293414 [2] NCCL INFO NET/IB : No device found.
mochinelearning:3293413:3293413 [1] NCCL INFO NET/IB : No device found.
mochinelearning:3293414:3293414 [2] NCCL INFO NET/Socket : Using [0]eth0:10.1.118.59<0> [1]br-e7526d7e228d:192.168.16.1<0> [2]br-1e30049ed87b:192.168.32.1<0> [3]br-100d386c1c63:192.168.48.1<0>
mochinelearning:3293413:3293413 [1] NCCL INFO NET/Socket : Using [0]eth0:10.1.118.59<0> [1]br-e7526d7e228d:192.168.16.1<0> [2]br-1e30049ed87b:192.168.32.1<0> [3]br-100d386c1c63:192.168.48.1<0>
mochinelearning:3293414:3293414 [2] NCCL INFO Using network Socket
mochinelearning:3293413:3293413 [1] NCCL INFO Using network Socket
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 00/04 :    0   1   2
mochinelearning:3293414:3293480 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] 1/-1/-1->2->-1 [2] -1/-1/-1->2->1 [3] 1/-1/-1->2->-1
mochinelearning:3293413:3293481 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 0/-1/-1->1->2 [2] 2/-1/-1->1->0 [3] 0/-1/-1->1->2
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 01/04 :    0   2   1
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 02/04 :    0   1   2
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 03/04 :    0   2   1
mochinelearning:3293412:3293479 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 00 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 00 : 2[a0] -> 0[80] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 00 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 02 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 02 : 2[a0] -> 0[80] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 02 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 01 : 0[80] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 01 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 01 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 03 : 0[80] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 03 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 03 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293412:3293479 [0] NCCL INFO Connected all rings
mochinelearning:3293414:3293480 [2] NCCL INFO Connected all rings
mochinelearning:3293413:3293481 [1] NCCL INFO Connected all rings
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 01 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 03 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 01 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 03 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 00 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 02 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 00 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 02 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Connected all trees
mochinelearning:3293414:3293480 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293414:3293480 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293412:3293479 [0] NCCL INFO Connected all trees
mochinelearning:3293412:3293479 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293412:3293479 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293413:3293481 [1] NCCL INFO Connected all trees
mochinelearning:3293413:3293481 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293413:3293481 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293414:3293480 [2] NCCL INFO comm 0x7fe1ec002fb0 rank 2 nranks 3 cudaDev 2 busId a0 - Init COMPLETE
mochinelearning:3293413:3293481 [1] NCCL INFO comm 0x7fda68002fb0 rank 1 nranks 3 cudaDev 1 busId 90 - Init COMPLETE
mochinelearning:3293412:3293479 [0] NCCL INFO comm 0x7fe718002fb0 rank 0 nranks 3 cudaDev 0 busId 80 - Init COMPLETE
[I ProcessGroupNCCL.cpp:1196] NCCL_DEBUG: INFO
mochinelearning:3293412:3293412 [0] NCCL INFO Launch mode Parallel
[2023-04-26 16:16:56,205] [INFO] [partition_parameters.py:413:__exit__] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 33/33 [01:38<00:00,  2.98s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 33/33 [01:38<00:00,  2.98s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 33/33 [01:38<00:00,  2.98s/it]
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Loading data...
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
[I ProcessGroupNCCL.cpp:587] [Rank 2] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 2] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:587] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 1] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:587] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 0] NCCL watchdog thread started!
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 00/04 :    0   1   2
mochinelearning:3293413:3293877 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 0/-1/-1->1->2 [2] 2/-1/-1->1->0 [3] 0/-1/-1->1->2
mochinelearning:3293414:3293876 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] 1/-1/-1->2->-1 [2] -1/-1/-1->2->1 [3] 1/-1/-1->2->-1
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 01/04 :    0   2   1
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 02/04 :    0   1   2
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 03/04 :    0   2   1
mochinelearning:3293412:3293875 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 00 : 2[a0] -> 0[80] via P2P/IPC
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 00 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 00 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 02 : 2[a0] -> 0[80] via P2P/IPC
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 02 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 02 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 01 : 0[80] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 01 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 01 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 03 : 0[80] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 03 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 03 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Connected all rings
mochinelearning:3293412:3293875 [0] NCCL INFO Connected all rings
mochinelearning:3293413:3293877 [1] NCCL INFO Connected all rings
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 01 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 03 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 01 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 03 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 00 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 02 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 00 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 02 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Connected all trees
mochinelearning:3293414:3293876 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293414:3293876 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293412:3293875 [0] NCCL INFO Connected all trees
mochinelearning:3293412:3293875 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293412:3293875 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293413:3293877 [1] NCCL INFO Connected all trees
mochinelearning:3293413:3293877 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293413:3293877 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293412:3293875 [0] NCCL INFO comm 0x7fe59c002fb0 rank 0 nranks 3 cudaDev 0 busId 80 - Init COMPLETE
mochinelearning:3293414:3293876 [2] NCCL INFO comm 0x7fe064002fb0 rank 2 nranks 3 cudaDev 2 busId a0 - Init COMPLETE
mochinelearning:3293413:3293877 [1] NCCL INFO comm 0x7fd8e0002fb0 rank 1 nranks 3 cudaDev 1 busId 90 - Init COMPLETE
[I ProcessGroupNCCL.cpp:1196] NCCL_DEBUG: INFO
mochinelearning:3293412:3293412 [0] NCCL INFO Launch mode Parallel
Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Creating extension directory /home/xlwu/.cache/torch_extensions/py310_cu113/cpu_adam...
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/xlwu/.cache/torch_extensions/py310_cu113/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/TH -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/TH -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -c /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 30.544737577438354 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 30.598078966140747 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 30.598520278930664 seconds
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Creating extension directory /home/xlwu/.cache/torch_extensions/py310_cu113/utils...
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Emitting ninja build file /home/xlwu/.cache/torch_extensions/py310_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/TH -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/THC -isystem /DATA/xlwu/anconda3/envs/llmenv3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o 
[2/2] c++ flatten_unflatten.o -shared -L/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 15.1830472946167 seconds
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 15.121537685394287 seconds
Time to load utils op: 15.222201108932495 seconds
Parameter Offload: Total persistent parameters: 0 in 0 params
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3293412 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3293414 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 3293413) of binary: /DATA/xlwu/anconda3/envs/llmenv3/bin/python
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0019960403442382812 seconds
INFO:torch.distributed.elastic.multiprocessing.errors:local_rank 1 FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html

Traceback (most recent call last):
  File "/DATA/xlwu/anconda3/envs/llmenv3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
train.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-26_16:19:53
  host      : mochinelearning
  rank      : 1 (local_rank: 1)
  exitcode  : -9 (pid: 3293413)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 3293413
========================================================

@xv994

xv994 commented Apr 27, 2023

Your specific error is: "Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination".
You should make sure that your installed CUDA version and the version torch was compiled with match.
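A quick way to see which versions are actually in play (standard commands; adjust for your environment):

nvcc --version                 # CUDA toolkit used to compile the torch/DeepSpeed extensions
python -c "import torch; print(torch.__version__, torch.version.cuda)"   # CUDA version torch was built against
nvidia-smi                     # driver-side CUDA version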

@xv994

xv994 commented Apr 27, 2023

Have you ever tried this: git clone https://github.com/huggingface/transformers.git cd transformers git checkout 0041be5 pip install .
maybe it works.

Hi, I have tried this method, but still got this problem, do you have any idea about this? The version of transformers I used is 4.29.0.dev0. Thanks in advance!


My transformers version is 4.28.0.dev0.
Maybe something is wrong on your side; please check your steps.
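To double-check which transformers build is actually installed in the environment that torchrun uses, something like this should work:

python -c "import transformers; print(transformers.__version__, transformers.__file__)"
pip show transformers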

@xv994

xv994 commented Apr 27, 2023

If everyone could communicate in Chinese, it might reduce some of the communication overhead...

@Lvzhh

Lvzhh commented May 3, 2023

This error might occur when there is not enough RAM available. It can happen when using FSDP with multiple processes together with the transformers "from_pretrained" method, where each process loads the checkpoint. As a result, the memory usage becomes num_processes * (model_size + size_of_largest_shard), leading to process crashes.

To tackle this issue, we can use DeepSpeed instead of FSDP. DeepSpeed optimizes CPU memory usage during initialization, so it only uses num_processes * size_of_largest_shard of RAM.
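For example, with the Hugging Face Trainer you can drop the two --fsdp flags and pass a ZeRO-3 config instead. A minimal sketch (the config keys follow the DeepSpeed/Trainer integration docs as I recall them, so double-check them against your DeepSpeed version; the paths reuse the ones from the command earlier in this thread):

cat > ds_config.json <<'EOF'
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" }
  },
  "fp16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
EOF

# same training flags as before, just --deepspeed instead of the two --fsdp options
torchrun --nproc_per_node=4 --master_port=11110 train.py \
--model_name_or_path ./output/path \
--data_path ./alpaca_data.json \
--output_dir ./pretrained \
--fp16 True \
--deepspeed ds_config.json

If you prefer to stay on FSDP, passing low_cpu_mem_usage=True to from_pretrained may also lower the per-process peak while the shards load, but I have not verified that in this exact setup.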

@newstronger

The error shows:
Traceback (most recent call last):
File "tools/train.py", line 194, in
main()
File "tools/train.py", line 183, in main
train_detector(
File "/home/wangzhang/mmrotate/mmrotate/apis/train.py", line 144, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 53, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 31, in run_iter
outputs = self.model.train_step(data_batch, self.optimizer,
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 63, in train_step
output = self.module.train_step(*inputs[0], **kwargs[0])
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 248, in train_step
losses = self(**data)
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 119, in new_func
return old_func(*args, **kwargs)
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 172, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/home/wangzhang/mmrotate/mmrotate/models/detectors/single_stage.py", line 81, in forward_train
losses = self.bbox_head.forward_train(x, img_metas, gt_bboxes,
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmdet/models/dense_heads/base_dense_head.py", line 335, in forward_train
losses = self.loss(loss_inputs, gt_bboxes_ignore=gt_bboxes_ignore)
File "/home/wangzhang/mmrotate/mmrotate/models/dense_heads/oriented_reppoints_head.py", line 952, in loss
quality_assess_list, = multi_apply(
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmdet/core/utils/misc.py", line 30, in multi_apply
return tuple(map(list, zip(*map_results)))
File "/home/wangzhang/mmrotate/mmrotate/models/dense_heads/oriented_reppoints_head.py", line 480, in pointsets_quality_assessment
sampling_pts_pred_init = self.sampling_points(
File "/home/wangzhang/mmrotate/mmrotate/models/dense_heads/oriented_reppoints_head.py", line 342, in sampling_points
ratio = torch.linspace(0, 1, points_num).to(device).repeat(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7fdfc57631ee in /home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x23a21 (0x7fdfede06a21 in /home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x257 (0x7fdfede0b977 in /home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x463418 (0x7fe017356418 in /home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7fdfc574a7a5 in /home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #5: + 0x35f2f5 (0x7fe0172522f5 in /home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x679288 (0x7fe01756c288 in /home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7fe01756c655 in /home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4d38df]
frame #9: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4e55ab]
frame #10: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4e55ab]
frame #11: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4e07e0]
frame #12: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4f1908]
frame #13: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4f18f1]
frame #14: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4f18f1]
frame #15: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4f18f1]
frame #16: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4f18f1]
frame #17: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4f18f1]
frame #18: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4f18f1]
frame #19: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4c91b0]
frame #20: PyDict_SetItemString + 0x52 (0x5819d2 in /home/wangzhang/anaconda3/envs/open-mmlab/bin/python)
frame #21: PyImport_Cleanup + 0x93 (0x5a6b73 in /home/wangzhang/anaconda3/envs/open-mmlab/bin/python)
frame #22: Py_FinalizeEx + 0x71 (0x5a5ca1 in /home/wangzhang/anaconda3/envs/open-mmlab/bin/python)
frame #23: Py_RunMain + 0x112 (0x5a1972 in /home/wangzhang/anaconda3/envs/open-mmlab/bin/python)
frame #24: Py_BytesMain + 0x39 (0x579dd9 in /home/wangzhang/anaconda3/envs/open-mmlab/bin/python)
frame #25: __libc_start_main + 0xe7 (0x7fe02f882c87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #26: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x579c8d]

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17013 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17014 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17015 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 3 (pid: 17016) of binary: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python
Traceback (most recent call last):
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
tools/train.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-14_16:36:57
  host      : user-SYS-7049GP-TRT
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 17016)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 17016
======================================================

@Kangzf1996

If everyone could communicate in Chinese, it might reduce some of the communication overhead...

Hello, I am now getting this error. Do you know what might be causing it? My transformers version is 4.28.0.dev0.

[2023-06-01 09:44:26,442] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-06-01 09:44:38,504] [INFO] [partition_parameters.py:454:__exit__] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:50<00:00,  1.53s/it]
Using pad_token, but it is not set yet.
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.2360477447509766 seconds
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.3748812675476074 seconds
Parameter Offload: Total persistent parameters: 0 in 0 params
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 308887) of binary: /opt/conda/bin/python3.8
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
train.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-01_09:47:02
  host      : alpaca-6655dbbbc6-btc9j
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 308887)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 308887
=======================================================

@rbareja25

Has anyone been able to fix this? I have increased the RAM, reduced the batch size, and downgraded the torchvision version, but nothing works.

@TX-Yeager

When you train the model, it takes a long time, and the process might be killed by SIGHUP (for example, if your terminal session disconnects).
So don't just run python or torchrun directly in the foreground; try this:
nohup python xxx.py &
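A slightly fuller version of the same idea (a sketch; append your actual training flags), so the run survives the terminal closing and the output still ends up somewhere readable:

# append the same training flags used in the commands above
nohup torchrun --nproc_per_node=4 --master_port=11110 train.py > train.log 2>&1 &
tail -f train.log   # the run keeps going even if the ssh session drops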
