error of multi-GPU: torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 #162

Open
xiaoweiweixiao opened this issue Mar 30, 2023 · 32 comments

Comments

@xiaoweiweixiao

xiaoweiweixiao commented Mar 30, 2023

When I use four GPUs to train the model, I get this error. Can anybody help me solve it? Thank you very much.

WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/transformers/training_args.py:1356: FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead 
  warnings.warn(
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 77807 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 77808 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 77809 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 77806) of binary: /home/la/anaconda3/envs/alpaca_torch/bin/python
Traceback (most recent call last):
  File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/la/anaconda3/envs/alpaca_torch/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
train.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-30_20:18:47
  host      : guest-server
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 77806)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 77806
======================================================

@xiaoweiweixiao
Author

When I use multiple GPUs to run other code, I also get this error. Can anyone help me?

@kasakh

kasakh commented Mar 30, 2023

Can you show the command you used to train in the multi-GPU environment?

@xiaoweiweixiao
Author

python -m torch.distributed.run --nproc_per_node=4 --master_port=11110 train.py \
--model_name_or_path ./output/path \
--data_path ./alpaca_data.json \
--fp16 True \
--output_dir ./pretrained \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 False

@ZiweiWangTHU

Same problem, have you found the solution?

@FinalFlowers

Same problem, hoping for an answer.

@optimist-lsc

optimist-lsc commented Apr 4, 2023

Please attempt to install the specified version of transformers:
pip install git+https://github.com/zphang/transformers.git
cd transformers
git --reset hard 68d640f7c368bcaaaecfc678f11908ebbd3d6176
python setup.py install
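(As written, those steps will probably not work: "pip install git+..." builds the package from a temporary clone and leaves no local checkout to "cd" into, and the git flag is spelled "reset --hard", not "--reset hard". A corrected sequence would presumably look like this:)

git clone https://github.com/zphang/transformers.git
cd transformers
git reset --hard 68d640f7c368bcaaaecfc678f11908ebbd3d6176
pip install .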

@xiaoweiweixiao
Author

git --reset hard 68d640f7c368bcaaaecfc678f11908ebbd3d6176

Thank you for your advice. There is no setup.py in the transformers repository, only a README.md, so I cannot install transformers.

@xv994

xv994 commented Apr 7, 2023

pip install git+https://github.com/zphang/transformers.git
cd transformers
git --reset hard 68d640f7c368bcaaaecfc678f11908ebbd3d6176
pip install .

You can try it; it solved the problem for me.

@xiaoweiweixiao
Author

pip install git+https://github.com/zphang/transformers.git cd transformers git --reset hard 68d640f7c368bcaaaecfc678f11908ebbd3d6176 pip install .

You can try it; it solved the problem for me.

Thank you for your advice. I cannot run "pip install git+https://github.com/zphang/transformers.git"; I get this error:

Collecting git+https://github.com/zphang/transformers.git
  Cloning https://github.com/zphang/transformers.git to /tmp/pip-req-build-8bfk9e3m
  Running command git clone --quiet https://github.com/zphang/transformers.git /tmp/pip-req-build-8bfk9e3m
  Resolved https://github.com/zphang/transformers.git to commit 63a9d6745f679b2eb882e0f147828380981111fa
ERROR: git+https://github.com/zphang/transformers.git does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.

I downloaded the transformers repository from "https://github.com/zphang/transformers", ran "cd transformers" and then "git --reset hard 68d640f7c368bcaaaecfc678f11908ebbd3d6176", and got this error:

Unknown option: --reset
usage: git [--version] [--help] [-C <path>] [-c name=value]
           [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
           [-p | --paginate | --no-pager] [--no-replace-objects] [--bare]
           [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
           <command> [<args>]

Did I do something wrong?

@xv994

xv994 commented Apr 11, 2023

Sorry, maybe my suggestion last time was wrong.
Your transformers repository is the wrong one. Please try the following steps; this is what I did:
git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout 0041be5
pip install .
If you read Chinese, you can see this link: https://zhuanlan.zhihu.com/p/618321077 . I followed its steps and succeeded.

@xiaoweiweixiao
Author

pip install .

Thank you very much. My problem was solved by following your suggestion.

@xiaoweiweixiao
Author

0041be5

I have another question. I get the same error when running other models, and this method does not solve it.
I guess the commit "0041be5" should be different when running other models (such as GLM130B).
How do I find the right commit to check out instead of "0041be5"?

@xv994

xv994 commented Apr 14, 2023

I think you may need a different Python virtual environment to train each different model. And I don't know which version of transformers GLM130B needs, so you had better ask its developers or read its guide.
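For example, something like this (a rough sketch assuming conda; the environment name is a placeholder, and the exact package versions should come from the target model's own repository):

conda create -n glm130b python=3.10 -y
conda activate glm130b
# install the exact torch / transformers versions pinned by that model's repository
pip install -r requirements.txt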

@zhihui-shao

I followed this, but it still doesn't work. Is there any other solution?

@codemaster17611

Can you tell me the machine configuration on which you successfully ran train.py? I have the same problem, but no idea what's wrong.

@xv994

xv994 commented Apr 26, 2023

Can you tell me the machine configuration on which you successfully ran train.py? I have the same problem, but no idea what's wrong.

Can you show the error you got?

@xv994

xv994 commented Apr 26, 2023

I followed this, but it still doesn't work. Is there any other solution?

What exactly is the problem?

@codemaster17611

[screenshot of the error output]

@codemaster17611

Can you tell me the machine configuration on which you successfully ran train.py? I have the same problem, but no idea what's wrong.

Can you show the error you got?

It always shows exit code -9. My config: 3x V100 16 GB GPUs, 128 GB CPU RAM. Is the RAM not enough? Thanks for your reply.

@codemaster17611

Can you tell me the machine configuration on which you successfully ran train.py? I have the same problem, but no idea what's wrong.

Can you show the error you got?

It always shows exit code -9. My config: 3x V100 16 GB GPUs, 128 GB CPU RAM. Is the RAM not enough? Thanks for your reply.

But I have also been monitoring it, and I only see about 70% of the RAM being used in the background.

@xv994

xv994 commented Apr 26, 2023

[screenshot of the error output]

Oh, my friend, this is not the main cause; you need to show me the exception above this part. And your RAM is enough; my machine has less than that.

@xv994

xv994 commented Apr 26, 2023

Have you ever tried this:
git clone https://github.com/huggingface/transformers.git
cd transformers
git checkout 0041be5
pip install .

maybe it works.

@Difang233

Have you ever tried this: git clone https://github.com/huggingface/transformers.git cd transformers git checkout 0041be5 pip install .

maybe it works.

Hi, I have tried this method, but still got this problem, do you have any idea about this?
The version of transformers I used is 4.29.0.dev0.
Thanks in advance!

2023-04-26 07:19:29.474990: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[2023-04-26 07:19:35,696] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-04-26 07:19:53,218] [INFO] [partition_parameters.py:454:__exit__] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100% 33/33 [01:08<00:00,  2.09s/it]
Using pad_token, but it is not set yet.
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
Using /root/.cache/torch_extensions/py39_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py39_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.060922384262085 seconds
Using /root/.cache/torch_extensions/py39_cu118 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py39_cu118/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.49018430709838867 seconds
Parameter Offload: Total persistent parameters: 0 in 0 params
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 9574) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=====================================================
/content/drive/MyDrive/codealpaca/train.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-26_07:22:29
  host      : 56de1ccd4f0e
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 9574)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 9574
=====================================================

@codemaster17611

codemaster17611 commented Apr 26, 2023

Have you ever tried this: git clone https://github.com/huggingface/transformers.git cd transformers git checkout 0041be5 pip install .

maybe it works.

Thanks for your reply. I have followed your zhihu post step by step.

My transformers version:

(llmenv3) [xlwu@mochinelearning transformers]$ git checkout 0041be5
HEAD is now at 0041be5b3 LLaMA Implementation (#21955)

Then I run the training script:

TORCH_CPP_LOG_LEVEL=INFO NCCL_DEBUG=INFO LOGLEVEL=INFO CUDA_VISIBLE_DEVICES=0,1,2 torchrun \
--nproc_per_node=3 \
--master_port=25001 train.py \
--model_name_or_path /DATA/cdisk/xlwu_workspace/pretrain_model/hf-llama-model/llama-7b \
--data_path /DATA/cdisk/xlwu_workspace/data/test.json \
--output_dir /DATA/cdisk/xlwu_workspace/output/alpaca/sft_7b \
--per_device_eval_batch_size 1 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "tensorboard" \
--gradient_checkpointing True \
--fp16 True \
--deepspeed ds_config.json 

The error shows:

    WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : train.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 3
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:25001
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 0
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_6we5ifkl/none_yi1eb8vm
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
[I socket.cpp:522] [c10d] The server socket has started to listen on [::]:25001.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:22302.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:22304.
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=25001
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2]
  role_ranks=[0, 1, 2]
  global_ranks=[0, 1, 2]
  role_world_sizes=[3, 3, 3]
  global_world_sizes=[3, 3, 3]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_6we5ifkl/none_yi1eb8vm/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_6we5ifkl/none_yi1eb8vm/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_6we5ifkl/none_yi1eb8vm/attempt_0/2/error.json
[2023-04-26 16:16:46,603] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29566.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29568.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29570.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29572.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29574.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29576.
[I ProcessGroupNCCL.cpp:751] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:587] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:587] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 1] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:587] [Rank 2] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 2] NCCL watchdog thread started!
mochinelearning:3293412:3293412 [0] NCCL INFO Bootstrap : Using eth0:10.1.118.59<0>
mochinelearning:3293412:3293412 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
mochinelearning:3293412:3293412 [0] NCCL INFO NET/IB : No device found.
mochinelearning:3293412:3293412 [0] NCCL INFO NET/Socket : Using [0]eth0:10.1.118.59<0> [1]br-e7526d7e228d:192.168.16.1<0> [2]br-1e30049ed87b:192.168.32.1<0> [3]br-100d386c1c63:192.168.48.1<0>
mochinelearning:3293412:3293412 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
mochinelearning:3293413:3293413 [1] NCCL INFO Bootstrap : Using eth0:10.1.118.59<0>
mochinelearning:3293414:3293414 [2] NCCL INFO Bootstrap : Using eth0:10.1.118.59<0>
mochinelearning:3293414:3293414 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
mochinelearning:3293413:3293413 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
mochinelearning:3293414:3293414 [2] NCCL INFO NET/IB : No device found.
mochinelearning:3293413:3293413 [1] NCCL INFO NET/IB : No device found.
mochinelearning:3293414:3293414 [2] NCCL INFO NET/Socket : Using [0]eth0:10.1.118.59<0> [1]br-e7526d7e228d:192.168.16.1<0> [2]br-1e30049ed87b:192.168.32.1<0> [3]br-100d386c1c63:192.168.48.1<0>
mochinelearning:3293413:3293413 [1] NCCL INFO NET/Socket : Using [0]eth0:10.1.118.59<0> [1]br-e7526d7e228d:192.168.16.1<0> [2]br-1e30049ed87b:192.168.32.1<0> [3]br-100d386c1c63:192.168.48.1<0>
mochinelearning:3293414:3293414 [2] NCCL INFO Using network Socket
mochinelearning:3293413:3293413 [1] NCCL INFO Using network Socket
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 00/04 :    0   1   2
mochinelearning:3293414:3293480 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] 1/-1/-1->2->-1 [2] -1/-1/-1->2->1 [3] 1/-1/-1->2->-1
mochinelearning:3293413:3293481 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 0/-1/-1->1->2 [2] 2/-1/-1->1->0 [3] 0/-1/-1->1->2
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 01/04 :    0   2   1
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 02/04 :    0   1   2
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 03/04 :    0   2   1
mochinelearning:3293412:3293479 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 00 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 00 : 2[a0] -> 0[80] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 00 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 02 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 02 : 2[a0] -> 0[80] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 02 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 01 : 0[80] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 01 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 01 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 03 : 0[80] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 03 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 03 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293412:3293479 [0] NCCL INFO Connected all rings
mochinelearning:3293414:3293480 [2] NCCL INFO Connected all rings
mochinelearning:3293413:3293481 [1] NCCL INFO Connected all rings
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 01 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 03 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 01 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 03 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 00 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 02 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 00 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 02 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Connected all trees
mochinelearning:3293414:3293480 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293414:3293480 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293412:3293479 [0] NCCL INFO Connected all trees
mochinelearning:3293412:3293479 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293412:3293479 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293413:3293481 [1] NCCL INFO Connected all trees
mochinelearning:3293413:3293481 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293413:3293481 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293414:3293480 [2] NCCL INFO comm 0x7fe1ec002fb0 rank 2 nranks 3 cudaDev 2 busId a0 - Init COMPLETE
mochinelearning:3293413:3293481 [1] NCCL INFO comm 0x7fda68002fb0 rank 1 nranks 3 cudaDev 1 busId 90 - Init COMPLETE
mochinelearning:3293412:3293479 [0] NCCL INFO comm 0x7fe718002fb0 rank 0 nranks 3 cudaDev 0 busId 80 - Init COMPLETE
[I ProcessGroupNCCL.cpp:1196] NCCL_DEBUG: INFO
mochinelearning:3293412:3293412 [0] NCCL INFO Launch mode Parallel
[2023-04-26 16:16:56,205] [INFO] [partition_parameters.py:413:__exit__] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 33/33 [01:38<00:00,  2.98s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 33/33 [01:38<00:00,  2.98s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 33/33 [01:38<00:00,  2.98s/it]
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Loading data...
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
[I ProcessGroupNCCL.cpp:587] [Rank 2] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 2] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:587] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 1] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:587] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 0] NCCL watchdog thread started!
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 00/04 :    0   1   2
mochinelearning:3293413:3293877 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 0/-1/-1->1->2 [2] 2/-1/-1->1->0 [3] 0/-1/-1->1->2
mochinelearning:3293414:3293876 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] 1/-1/-1->2->-1 [2] -1/-1/-1->2->1 [3] 1/-1/-1->2->-1
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 01/04 :    0   2   1
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 02/04 :    0   1   2
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 03/04 :    0   2   1
mochinelearning:3293412:3293875 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 00 : 2[a0] -> 0[80] via P2P/IPC
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 00 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 00 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 02 : 2[a0] -> 0[80] via P2P/IPC
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 02 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 02 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 01 : 0[80] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 01 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 01 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 03 : 0[80] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 03 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 03 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Connected all rings
mochinelearning:3293412:3293875 [0] NCCL INFO Connected all rings
mochinelearning:3293413:3293877 [1] NCCL INFO Connected all rings
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 01 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 03 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 01 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 03 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 00 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 02 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 00 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 02 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Connected all trees
mochinelearning:3293414:3293876 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293414:3293876 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293412:3293875 [0] NCCL INFO Connected all trees
mochinelearning:3293412:3293875 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293412:3293875 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293413:3293877 [1] NCCL INFO Connected all trees
mochinelearning:3293413:3293877 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293413:3293877 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293412:3293875 [0] NCCL INFO comm 0x7fe59c002fb0 rank 0 nranks 3 cudaDev 0 busId 80 - Init COMPLETE
mochinelearning:3293414:3293876 [2] NCCL INFO comm 0x7fe064002fb0 rank 2 nranks 3 cudaDev 2 busId a0 - Init COMPLETE
mochinelearning:3293413:3293877 [1] NCCL INFO comm 0x7fd8e0002fb0 rank 1 nranks 3 cudaDev 1 busId 90 - Init COMPLETE
[I ProcessGroupNCCL.cpp:1196] NCCL_DEBUG: INFO
mochinelearning:3293412:3293412 [0] NCCL INFO Launch mode Parallel
Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Creating extension directory /home/xlwu/.cache/torch_extensions/py310_cu113/cpu_adam...
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/xlwu/.cache/torch_extensions/py310_cu113/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/TH -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/TH -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -c /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 30.544737577438354 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 30.598078966140747 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 30.598520278930664 seconds
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Creating extension directory /home/xlwu/.cache/torch_extensions/py310_cu113/utils...
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Emitting ninja build file /home/xlwu/.cache/torch_extensions/py310_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/TH -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/THC -isystem /DATA/xlwu/anconda3/envs/llmenv3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o 
[2/2] c++ flatten_unflatten.o -shared -L/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 15.1830472946167 seconds
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 15.121537685394287 seconds
Time to load utils op: 15.222201108932495 seconds
Parameter Offload: Total persistent parameters: 0 in 0 params
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3293412 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3293414 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 3293413) of binary: /DATA/xlwu/anconda3/envs/llmenv3/bin/python
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0019960403442382812 seconds
INFO:torch.distributed.elastic.multiprocessing.errors:local_rank 1 FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html

Traceback (most recent call last):
  File "/DATA/xlwu/anconda3/envs/llmenv3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
========================================================
train.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-26_16:19:53
  host      : mochinelearning
  rank      : 1 (local_rank: 1)
  exitcode  : -9 (pid: 3293413)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 3293413
========================================================

@xv994

xv994 commented Apr 27, 2023

Your specific error is: "Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination".
You should make sure that your installed CUDA version and the version torch was compiled with match.
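A quick way to see which versions are actually in play (standard commands; adjust for your environment):

nvcc --version                 # CUDA toolkit used to compile the torch/DeepSpeed extensions
python -c "import torch; print(torch.__version__, torch.version.cuda)"   # CUDA version torch was built against
nvidia-smi                     # driver-side CUDA version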

@xv994

xv994 commented Apr 27, 2023

Have you ever tried this: git clone https://github.com/huggingface/transformers.git cd transformers git checkout 0041be5 pip install .
maybe it works.

Hi, I have tried this method, but still got this problem, do you have any idea about this? The version of transformers I used is 4.29.0.dev0. Thanks in advance!


My transformers version is 4.28.0.dev0.
Maybe something is wrong on your side; please check your steps.
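To double-check which transformers build is actually installed in the environment that torchrun uses, something like this should work:

python -c "import transformers; print(transformers.__version__, transformers.__file__)"
pip show transformers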

@xv994

xv994 commented Apr 27, 2023

If everyone could communicate in Chinese, it might reduce some of the communication overhead...

@Lvzhh

Lvzhh commented May 3, 2023

This error might occur when there is not enough RAM available. It can happen when using FSDP with multiple processes together with the transformers "from_pretrained" method, where each process loads the checkpoint. As a result, the memory usage becomes num_processes * (model_size + size_of_largest_shard), leading to process crashes.

To tackle this issue, we can use DeepSpeed instead of FSDP. DeepSpeed optimizes CPU memory usage during initialization, so it only uses num_processes * size_of_largest_shard of RAM.
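For example, with the Hugging Face Trainer you can drop the two --fsdp flags and pass a ZeRO-3 config instead. A minimal sketch (the config keys follow the DeepSpeed/Trainer integration docs as I recall them, so double-check them against your DeepSpeed version; the paths reuse the ones from the command earlier in this thread):

cat > ds_config.json <<'EOF'
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" }
  },
  "fp16": { "enabled": "auto" },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
EOF

# same training flags as before, just --deepspeed instead of the two --fsdp options
torchrun --nproc_per_node=4 --master_port=11110 train.py \
--model_name_or_path ./output/path \
--data_path ./alpaca_data.json \
--output_dir ./pretrained \
--fp16 True \
--deepspeed ds_config.json

If you prefer to stay on FSDP, passing low_cpu_mem_usage=True to from_pretrained may also lower the per-process peak while the shards load, but I have not verified that in this exact setup.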

@newstronger

The error shows:
Traceback (most recent call last):
File "tools/train.py", line 194, in
main()
File "tools/train.py", line 183, in main
train_detector(
File "/home/wangzhang/mmrotate/mmrotate/apis/train.py", line 144, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 53, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 31, in run_iter
outputs = self.model.train_step(data_batch, self.optimizer,
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 63, in train_step
output = self.module.train_step(*inputs[0], **kwargs[0])
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 248, in train_step
losses = self(**data)
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 119, in new_func
return old_func(*args, **kwargs)
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 172, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/home/wangzhang/mmrotate/mmrotate/models/detectors/single_stage.py", line 81, in forward_train
losses = self.bbox_head.forward_train(x, img_metas, gt_bboxes,
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmdet/models/dense_heads/base_dense_head.py", line 335, in forward_train
losses = self.loss(loss_inputs, gt_bboxes_ignore=gt_bboxes_ignore)
File "/home/wangzhang/mmrotate/mmrotate/models/dense_heads/oriented_reppoints_head.py", line 952, in loss
quality_assess_list, = multi_apply(
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/mmdet/core/utils/misc.py", line 30, in multi_apply
return tuple(map(list, zip(*map_results)))
File "/home/wangzhang/mmrotate/mmrotate/models/dense_heads/oriented_reppoints_head.py", line 480, in pointsets_quality_assessment
sampling_pts_pred_init = self.sampling_points(
File "/home/wangzhang/mmrotate/mmrotate/models/dense_heads/oriented_reppoints_head.py", line 342, in sampling_points
ratio = torch.linspace(0, 1, points_num).to(device).repeat(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7fdfc57631ee in /home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x23a21 (0x7fdfede06a21 in /home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x257 (0x7fdfede0b977 in /home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x463418 (0x7fe017356418 in /home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7fdfc574a7a5 in /home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #5: + 0x35f2f5 (0x7fe0172522f5 in /home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x679288 (0x7fe01756c288 in /home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7fe01756c655 in /home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4d38df]
frame #9: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4e55ab]
frame #10: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4e55ab]
frame #11: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4e07e0]
frame #12: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4f1908]
frame #13: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4f18f1]
frame #14: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4f18f1]
frame #15: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4f18f1]
frame #16: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4f18f1]
frame #17: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4f18f1]
frame #18: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4f18f1]
frame #19: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x4c91b0]
frame #20: PyDict_SetItemString + 0x52 (0x5819d2 in /home/wangzhang/anaconda3/envs/open-mmlab/bin/python)
frame #21: PyImport_Cleanup + 0x93 (0x5a6b73 in /home/wangzhang/anaconda3/envs/open-mmlab/bin/python)
frame #22: Py_FinalizeEx + 0x71 (0x5a5ca1 in /home/wangzhang/anaconda3/envs/open-mmlab/bin/python)
frame #23: Py_RunMain + 0x112 (0x5a1972 in /home/wangzhang/anaconda3/envs/open-mmlab/bin/python)
frame #24: Py_BytesMain + 0x39 (0x579dd9 in /home/wangzhang/anaconda3/envs/open-mmlab/bin/python)
frame #25: __libc_start_main + 0xe7 (0x7fe02f882c87 in /lib/x86_64-linux-gnu/libc.so.6)
frame #26: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python() [0x579c8d]

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17013 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17014 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17015 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 3 (pid: 17016) of binary: /home/wangzhang/anaconda3/envs/open-mmlab/bin/python
Traceback (most recent call last):
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/wangzhang/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
tools/train.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-14_16:36:57
  host      : user-SYS-7049GP-TRT
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 17016)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 17016
======================================================

@Kangzf1996

If everyone could communicate in Chinese, it might reduce some of the communication overhead...

Hello, I am now getting this error. Do you know what might be causing it? My transformers version is 4.28.0.dev0.

[2023-06-01 09:44:26,442] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-06-01 09:44:38,504] [INFO] [partition_parameters.py:454:__exit__] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:50<00:00,  1.53s/it]
Using pad_token, but it is not set yet.
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 3.2360477447509766 seconds
Using /root/.cache/torch_extensions/py38_cu117 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.3748812675476074 seconds
Parameter Offload: Total persistent parameters: 0 in 0 params
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 308887) of binary: /opt/conda/bin/python3.8
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
train.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-01_09:47:02
  host      : alpaca-6655dbbbc6-btc9j
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 308887)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 308887
=======================================================

@rbareja25

Has anyone been able to fix this? I have increased the RAM, reduced the batch size, and downgraded the torchvision version, but nothing works.

@TX-Yeager

When you train the model, it takes a long time, and the process might be killed by SIGHUP (for example, if your terminal session disconnects).
So don't just run python or torchrun directly in the foreground; try this:
nohup python xxx.py &
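A slightly fuller version of the same idea (a sketch; append your actual training flags), so the run survives the terminal closing and the output still ends up somewhere readable:

# append the same training flags used in the commands above
nohup torchrun --nproc_per_node=4 --master_port=11110 train.py > train.log 2>&1 &
tail -f train.log   # the run keeps going even if the ssh session drops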
