error of multi-GPU: torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 #162
Comments
When I use multiple GPUs to run other code, I also hit this error. Can anyone help me? |
Can you show the command you used to train in the multi-GPU environment? |
python -m torch.distributed.run --nproc_per_node=4 --master_port=11110 train.py |
Same problem, have you found the solution? |
Same problem, hoping for an answer. |
Please try installing the specified version of transformers: |
Thank you for your advice. There is no setup.py in the transformers repo, only a README.md, so I cannot install transformers. |
You can try "pip install git+https://github.com/zphang/transformers.git"; I used it to solve the problem. |
Thank you for your advice. I cannot run "pip install git+https://github.com/zphang/transformers.git"; I get this error:
I downloaded transformers from "https://github.com/zphang/transformers" and ran "cd transformers".
Is there something wrong with what I did? |
Sorry, maybe my earlier suggestion was wrong. |
Thank you very much. My problem was solved by following your suggestion. |
I have another question: I get the same error when running other models, and this method does not fix it. |
I think you may need a different Python virtual environment to train each model. I don't know which version of transformers GLM130B needs, so you'd better ask its developers or read their guide. |
I followed this, but it still does not work. Is there any other solution? |
Can you tell me the machine configuration on which you successfully ran train.py? I have the same problem, but I have no idea what is wrong. |
Can you show the error you got? |
What exactly is the problem? |
It always shows exitcode -9. My config: 3x V100 16G GPUs, 128G CPU RAM. Is the RAM not enough? Thanks for your reply. |
But I also monitored it, and only about 70% of the RAM is being used in the background. |
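Exit code -9 is a SIGKILL, which usually comes from the kernel OOM killer rather than from the training code itself (checking dmesg right after the crash for oom-kill messages confirms it), so it can help to watch host RAM while the model loads and training starts. A rough sketch, assuming psutil is installed (it is not part of this repo):

```python
# Rough sketch: print host RAM usage every few seconds while training starts up,
# to see whether available memory collapses right before the -9 (SIGKILL).
# Assumes psutil is installed (pip install psutil).
import time

import psutil


def watch_ram(interval_s: float = 5.0) -> None:
    while True:
        vm = psutil.virtual_memory()
        print(
            f"used={vm.used / 1e9:.1f} GB, "
            f"available={vm.available / 1e9:.1f} GB "
            f"({vm.percent:.0f}% used)"
        )
        time.sleep(interval_s)


if __name__ == "__main__":
    watch_ram()
```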
Have you tried this? Maybe it works. |
Hi, I have tried this method but still get this problem. Do you have any idea about it?
|
Thanks for your reply, I have followed your Zhihu post step by step. My transformers version:
Then I run the training script:

TORCH_CPP_LOG_LEVEL=INFO NCCL_DEBUG=INFO LOGLEVEL=INFO CUDA_VISIBLE_DEVICES=0,1,2 torchrun \
--nproc_per_node=3 \
--master_port=25001 train.py \
--model_name_or_path /DATA/cdisk/xlwu_workspace/pretrain_model/hf-llama-model/llama-7b \
--data_path /DATA/cdisk/xlwu_workspace/data/test.json \
--output_dir /DATA/cdisk/xlwu_workspace/output/alpaca/sft_7b \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1000 \
--save_total_limit 1 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--report_to "tensorboard" \
--gradient_checkpointing True \
--fp16 True \
--deepspeed ds_config.json

The error shows:

WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : train.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 3
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:25001
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 0
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_6we5ifkl/none_yi1eb8vm
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
[I socket.cpp:522] [c10d] The server socket has started to listen on [::]:25001.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:22302.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:22304.
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=25001
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2]
role_ranks=[0, 1, 2]
global_ranks=[0, 1, 2]
role_world_sizes=[3, 3, 3]
global_world_sizes=[3, 3, 3]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_6we5ifkl/none_yi1eb8vm/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_6we5ifkl/none_yi1eb8vm/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_6we5ifkl/none_yi1eb8vm/attempt_0/2/error.json
[2023-04-26 16:16:46,603] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29566.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29568.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29570.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29572.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29574.
[I socket.cpp:725] [c10d] The client socket has connected to [localhost]:25001 on [localhost]:29576.
[I ProcessGroupNCCL.cpp:751] [Rank 0] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:587] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:587] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 1] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:587] [Rank 2] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 2] NCCL watchdog thread started!
mochinelearning:3293412:3293412 [0] NCCL INFO Bootstrap : Using eth0:10.1.118.59<0>
mochinelearning:3293412:3293412 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
mochinelearning:3293412:3293412 [0] NCCL INFO NET/IB : No device found.
mochinelearning:3293412:3293412 [0] NCCL INFO NET/Socket : Using [0]eth0:10.1.118.59<0> [1]br-e7526d7e228d:192.168.16.1<0> [2]br-1e30049ed87b:192.168.32.1<0> [3]br-100d386c1c63:192.168.48.1<0>
mochinelearning:3293412:3293412 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
mochinelearning:3293413:3293413 [1] NCCL INFO Bootstrap : Using eth0:10.1.118.59<0>
mochinelearning:3293414:3293414 [2] NCCL INFO Bootstrap : Using eth0:10.1.118.59<0>
mochinelearning:3293414:3293414 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
mochinelearning:3293413:3293413 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
mochinelearning:3293414:3293414 [2] NCCL INFO NET/IB : No device found.
mochinelearning:3293413:3293413 [1] NCCL INFO NET/IB : No device found.
mochinelearning:3293414:3293414 [2] NCCL INFO NET/Socket : Using [0]eth0:10.1.118.59<0> [1]br-e7526d7e228d:192.168.16.1<0> [2]br-1e30049ed87b:192.168.32.1<0> [3]br-100d386c1c63:192.168.48.1<0>
mochinelearning:3293413:3293413 [1] NCCL INFO NET/Socket : Using [0]eth0:10.1.118.59<0> [1]br-e7526d7e228d:192.168.16.1<0> [2]br-1e30049ed87b:192.168.32.1<0> [3]br-100d386c1c63:192.168.48.1<0>
mochinelearning:3293414:3293414 [2] NCCL INFO Using network Socket
mochinelearning:3293413:3293413 [1] NCCL INFO Using network Socket
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 00/04 : 0 1 2
mochinelearning:3293414:3293480 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] 1/-1/-1->2->-1 [2] -1/-1/-1->2->1 [3] 1/-1/-1->2->-1
mochinelearning:3293413:3293481 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 0/-1/-1->1->2 [2] 2/-1/-1->1->0 [3] 0/-1/-1->1->2
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 01/04 : 0 2 1
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 02/04 : 0 1 2
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 03/04 : 0 2 1
mochinelearning:3293412:3293479 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 00 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 00 : 2[a0] -> 0[80] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 00 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 02 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 02 : 2[a0] -> 0[80] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 02 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 01 : 0[80] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 01 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 01 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 03 : 0[80] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 03 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 03 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293412:3293479 [0] NCCL INFO Connected all rings
mochinelearning:3293414:3293480 [2] NCCL INFO Connected all rings
mochinelearning:3293413:3293481 [1] NCCL INFO Connected all rings
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 01 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293412:3293479 [0] NCCL INFO Channel 03 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 01 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 03 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 00 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Channel 02 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 00 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293413:3293481 [1] NCCL INFO Channel 02 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293414:3293480 [2] NCCL INFO Connected all trees
mochinelearning:3293414:3293480 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293414:3293480 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293412:3293479 [0] NCCL INFO Connected all trees
mochinelearning:3293412:3293479 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293412:3293479 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293413:3293481 [1] NCCL INFO Connected all trees
mochinelearning:3293413:3293481 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293413:3293481 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293414:3293480 [2] NCCL INFO comm 0x7fe1ec002fb0 rank 2 nranks 3 cudaDev 2 busId a0 - Init COMPLETE
mochinelearning:3293413:3293481 [1] NCCL INFO comm 0x7fda68002fb0 rank 1 nranks 3 cudaDev 1 busId 90 - Init COMPLETE
mochinelearning:3293412:3293479 [0] NCCL INFO comm 0x7fe718002fb0 rank 0 nranks 3 cudaDev 0 busId 80 - Init COMPLETE
[I ProcessGroupNCCL.cpp:1196] NCCL_DEBUG: INFO
mochinelearning:3293412:3293412 [0] NCCL INFO Launch mode Parallel
[2023-04-26 16:16:56,205] [INFO] [partition_parameters.py:413:__exit__] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 33/33 [01:38<00:00, 2.98s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 33/33 [01:38<00:00, 2.98s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████| 33/33 [01:38<00:00, 2.98s/it]
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Loading data...
WARNING:root:Loading data...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
WARNING:root:Formatting inputs...
WARNING:root:Tokenizing inputs... This may take some time...
[I ProcessGroupNCCL.cpp:587] [Rank 2] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 2] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:587] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 1] NCCL watchdog thread started!
[I ProcessGroupNCCL.cpp:587] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 1
NCCL_DESYNC_DEBUG: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
[I ProcessGroupNCCL.cpp:751] [Rank 0] NCCL watchdog thread started!
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 00/04 : 0 1 2
mochinelearning:3293413:3293877 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 0/-1/-1->1->2 [2] 2/-1/-1->1->0 [3] 0/-1/-1->1->2
mochinelearning:3293414:3293876 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] 1/-1/-1->2->-1 [2] -1/-1/-1->2->1 [3] 1/-1/-1->2->-1
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 01/04 : 0 2 1
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 02/04 : 0 1 2
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 03/04 : 0 2 1
mochinelearning:3293412:3293875 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 00 : 2[a0] -> 0[80] via P2P/IPC
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 00 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 00 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 02 : 2[a0] -> 0[80] via P2P/IPC
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 02 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 02 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 01 : 0[80] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 01 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 01 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 03 : 0[80] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 03 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 03 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Connected all rings
mochinelearning:3293412:3293875 [0] NCCL INFO Connected all rings
mochinelearning:3293413:3293877 [1] NCCL INFO Connected all rings
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 01 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293412:3293875 [0] NCCL INFO Channel 03 : 0[80] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 01 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 03 : 1[90] -> 2[a0] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 00 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Channel 02 : 2[a0] -> 1[90] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 00 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293413:3293877 [1] NCCL INFO Channel 02 : 1[90] -> 0[80] via P2P/IPC
mochinelearning:3293414:3293876 [2] NCCL INFO Connected all trees
mochinelearning:3293414:3293876 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293414:3293876 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293412:3293875 [0] NCCL INFO Connected all trees
mochinelearning:3293412:3293875 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293412:3293875 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293413:3293877 [1] NCCL INFO Connected all trees
mochinelearning:3293413:3293877 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 8/8/512
mochinelearning:3293413:3293877 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mochinelearning:3293412:3293875 [0] NCCL INFO comm 0x7fe59c002fb0 rank 0 nranks 3 cudaDev 0 busId 80 - Init COMPLETE
mochinelearning:3293414:3293876 [2] NCCL INFO comm 0x7fe064002fb0 rank 2 nranks 3 cudaDev 2 busId a0 - Init COMPLETE
mochinelearning:3293413:3293877 [1] NCCL INFO comm 0x7fd8e0002fb0 rank 1 nranks 3 cudaDev 1 busId 90 - Init COMPLETE
[I ProcessGroupNCCL.cpp:1196] NCCL_DEBUG: INFO
mochinelearning:3293412:3293412 [0] NCCL INFO Launch mode Parallel
Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.2 does not match the version torch was compiled with 11.3 but since the APIs are compatible, accepting this combination
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Creating extension directory /home/xlwu/.cache/torch_extensions/py310_cu113/cpu_adam...
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/xlwu/.cache/torch_extensions/py310_cu113/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/TH -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/TH -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -c /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 30.544737577438354 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 30.598078966140747 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 30.598520278930664 seconds
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Creating extension directory /home/xlwu/.cache/torch_extensions/py310_cu113/utils...
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
Emitting ninja build file /home/xlwu/.cache/torch_extensions/py310_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Using /home/xlwu/.cache/torch_extensions/py310_cu113 as PyTorch extensions root...
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/TH -isystem /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/include/THC -isystem /DATA/xlwu/anconda3/envs/llmenv3/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o
[2/2] c++ flatten_unflatten.o -shared -L/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 15.1830472946167 seconds
Loading extension module utils...
Loading extension module utils...
Time to load utils op: 15.121537685394287 seconds
Time to load utils op: 15.222201108932495 seconds
Parameter Offload: Total persistent parameters: 0 in 0 params
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3293412 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3293414 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 3293413) of binary: /DATA/xlwu/anconda3/envs/llmenv3/bin/python
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0019960403442382812 seconds
INFO:torch.distributed.elastic.multiprocessing.errors:local_rank 1 FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html
Traceback (most recent call last):
File "/DATA/xlwu/anconda3/envs/llmenv3/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/DATA/xlwu/anconda3/envs/llmenv3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
train.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-04-26_16:19:53
host : mochinelearning
rank : 1 (local_rank: 1)
exitcode : -9 (pid: 3293413)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 3293413
======================================================== |
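As the launcher log above suggests, decorating the training entrypoint with @record from torch.distributed.elastic writes a traceback to the worker's error file when the process dies for a Python-level reason; it will not add much for a plain SIGKILL, but it helps rule out other causes. A minimal sketch, assuming train.py exposes a main() function:

```python
# Minimal sketch for train.py: wrap the entrypoint with @record so torchrun can
# report a Python traceback in the error file instead of "<N/A>". A hard SIGKILL
# from the OOM killer still leaves no traceback, but Python-level crashes will be captured.
from torch.distributed.elastic.multiprocessing.errors import record


@record
def main() -> None:
    ...  # existing training code from train.py goes here


if __name__ == "__main__":
    main()
```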
Your specific error is: |
My transformers version is 4.28.0.dev0. |
If everyone could communicate in Chinese, it might reduce some of the communication overhead... |
This error might occur when there is not enough RAM available. It can happen when using FSDP multiprocessing with the transformers "from_pretrained" method, where each process loads the full checkpoint; as a result, CPU memory usage becomes roughly the number of processes times the checkpoint size. To tackle this issue, we can use DeepSpeed instead of FSDP: DeepSpeed optimizes CPU memory usage during initialization, so it only needs about a single copy of the checkpoint. |
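A minimal sketch of that approach with the Hugging Face Trainer is below. It is illustrative only: the ZeRO-3 settings are not the ds_config.json actually used in this thread, and the model/output paths are simply reused from the command above. The behaviour it relies on (documented in the transformers DeepSpeed integration) is that when TrainingArguments carrying a ZeRO-3 config is created before from_pretrained, the model is initialized under deepspeed.zero.Init and its parameters are partitioned across ranks instead of each worker materializing a full copy.

```python
# Illustrative sketch, not the exact config from this thread: enable ZeRO-3
# through TrainingArguments so that from_pretrained does not keep one full copy
# of the checkpoint in CPU RAM per worker process.
import transformers
from transformers import Trainer, TrainingArguments

ds_config = {
    "zero_optimization": {"stage": 3},  # CPU offload is left off here; it adds host-RAM pressure
    "fp16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

# TrainingArguments must be created *before* from_pretrained so transformers
# knows ZeRO-3 is enabled and can construct the model under deepspeed.zero.Init.
training_args = TrainingArguments(
    output_dir="/DATA/cdisk/xlwu_workspace/output/alpaca/sft_7b",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,
    fp16=True,
    deepspeed=ds_config,  # or a path such as "ds_config.json"
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    "/DATA/cdisk/xlwu_workspace/pretrain_model/hf-llama-model/llama-7b"
)

# Datasets are omitted here; pass train_dataset/data_collator as in train.py.
trainer = Trainer(model=model, args=training_args)
```

One caveat for the 128 GB machine above: the cpu_adam build in the log suggests that config offloads optimizer state to the CPU, which itself consumes a large amount of host RAM for a 7B model, so it may be worth trying ZeRO-3 without CPU offload first.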
The error shows:

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17013 closing signal SIGTERM
|
Hello, I am getting this error now. Do you know what the cause is? My transformers version is 4.28.0.dev0.
|
Has anyone been able to fix this? I have increased RAM, reduced the batch size, and downgraded the torchvision version, but nothing works. |
When you train the model, it takes a long time, and the process might be killed by SIGHUP. |
When I use four GPUs to train the model, I get this error. Can anybody help me solve it? Thank you very much.