
[BUG/Help] <title> RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout #1232

zxy333666 opened this issue Jun 13, 2023 · 2 comments

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

[WARNING|modeling_utils.py:3192] 2023-06-12 14:17:57,899 >> Some weights of ChatGLMForConditionalGeneration were not initialized from the model checkpoint at /data/chatglm-6b-int4 and are newly initialized: ['transformer.prefix_encoder.embedding.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[WARNING|modeling_utils.py:3192] 2023-06-12 14:17:58,004 >> Some weights of ChatGLMForConditionalGeneration were not initialized from the model checkpoint at /data/chatglm-6b-int4 and are newly initialized: ['transformer.prefix_encoder.embedding.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|modeling_utils.py:2839] 2023-06-12 14:17:58,032 >> Generation config file not found, using a generation config created from the model config.
Map:  20%|████████████████████████▉                                                                                                 | 273000/1332406 [32:01<2:03:36, 142.84 examples/s]╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/chatglm/chatglm0523/ChatGLM-6B/ptuning/main.py:440 in                              │
│                                                                                                  │
│   437                                                                                            │
│   438                                                                                            │
│   439 if __name__ == "__main__":                                                                │
│ ❱ 440 │   main()                                                                                 │
│   441                                                                                            │
│                                                                                                  │
│ /data/chatglm/chatglm0523/ChatGLM-6B/ptuning/main.py:251 in main                                 │
│                                                                                                  │
│   248 │   │   if data_args.max_train_samples is not None:                                        │
│   249 │   │   │   max_train_samples = min(len(train_dataset), data_args.max_train_samples)       │
│   250 │   │   │   train_dataset = train_dataset.select(range(max_train_samples))                 │
│ ❱ 251 │   │   with training_args.main_process_first(desc="train dataset map pre-processing"):    │
│   252 │   │   │                                                                                  │
│   253 │   │   │   train_dataset = train_dataset.map(                                             │
│   254 │   │   │   │   preprocess_function_train,                                                 │
│                                                                                                  │
│ /opt/conda/lib/python3.10/contextlib.py:135 in __enter__                                         │
│                                                                                                  │
│   132 │   │   # they are only needed for recreation, which is not possible anymore               │
│   133 │   │   del self.args, self.kwds, self.func                                                │
│   134 │   │   try:                                                                               │
│ ❱ 135 │   │   │   return next(self.gen)                                                          │
│   136 │   │   except StopIteration:                                                              │
│   137 │   │   │   raise RuntimeError("generator didn't yield") from None                         │
│   138                                                                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/transformers/training_args.py:1888 in main_process_first │
│                                                                                                  │
│   1885 │   │   │   │   │   elif is_sagemaker_dp_enabled():                                       │
│   1886 │   │   │   │   │   │   dist.barrier()                                                    │
│   1887 │   │   │   │   │   else:                                                                 │
│ ❱ 1888 │   │   │   │   │   │   torch.distributed.barrier()                                       │
│   1889 │   │   │   │   yield                                                                     │
│   1890 │   │   │   finally:                                                                      │
│   1891 │   │   │   │   if is_main_process:                                                       │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:3145 in barrier    │
│                                                                                                  │
│   3142 │                                                                                         │
│   3143 │   if group is None:                                                                     │
│   3144 │   │   default_pg = _get_default_group()                                                 │
│ ❱ 3145 │   │   work = default_pg.barrier(opts=opts)                                              │
│   3146 │   else:                                                                                 │
│   3147 │   │   work = group.barrier(opts=opts)                                                   │
│   3148                                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯

RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from recvBytes at /opt/conda/conda-bld/pytorch_1670525552843/work/torch/csrc/distributed/c10d/Utils.hpp:594 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f799348f457 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7f79934594b5 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0xd8 (0x7f79ca652918 in
/opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::string const&) + 0x22 (0x7f79ca6535c2 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::string const&) + 0x59 (0x7f79ca653649 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f79ca623e21 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f79ca623e21 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xab (0x7f79d31f1edb in
/opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #8: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x202 (0x7f79d31f63a2 in
/opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #9: + 0x1be3a3 (0x7f79d31fd3a3 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #10: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x21 (0x7f79d31fe721 in
/opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cpp.so)
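The traceback shows rank [1] blocked in torch.distributed.barrier() while rank [0] runs the dataset .map() preprocessing inside main_process_first(). A back-of-the-envelope check (a sketch using only the numbers visible in the log above) shows the map step alone takes far longer than torch.distributed's default 30-minute store timeout, which is enough to explain the Socket Timeout:

```python
# Rough estimate from the progress bar in the log: 1,332,406 examples
# mapped at ~142.84 examples/s while the other rank waits at the barrier.
n_examples = 1_332_406
rate_examples_per_s = 142.84          # from "142.84 examples/s" above

map_seconds = n_examples / rate_examples_per_s
default_timeout_seconds = 1800        # torch.distributed default (30 min)

print(f"map would take ~{map_seconds / 3600:.1f} h")   # ~2.6 h
print(map_seconds > default_timeout_seconds)           # True: barrier times out first
```

So the waiting rank exhausts the c10d key-value store timeout long before rank [0] finishes preprocessing and reaches its own barrier.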

Expected Behavior

No response

Steps To Reproduce

1. P-tuning dataset of ~1.33 million examples
2. Two A100 GPUs

Environment

- OS:
- Python:3.10.8
- Transformers: 4.28.1
- PyTorch:1.13.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) : True



Anything else?

No response


lhy101 commented Jun 25, 2023

Same question here; I ran into the same problem.


zzoneee commented Jun 26, 2023

Same question here; I ran into the same problem.

In my case, the error appeared because max_source_length and max_target_length were set too large.
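For reference, another way to avoid the timeout (a sketch, not a confirmed fix for this repo) is to raise the distributed timeout so the waiting rank can outlast rank [0]'s preprocessing. Recent transformers versions expose this as the `ddp_timeout` training argument (in seconds), and raw torch.distributed accepts a `timedelta`:

```python
from datetime import timedelta

# Hypothetical sketch: both knobs below extend the c10d store timeout past the
# 1800 s default so a long main_process_first() map does not trip the barrier.

# Option 1: via transformers, pass --ddp_timeout on the training command line;
# the value below is 4 h, comfortably above the ~2.6 h map time in the log.
ddp_timeout_seconds = 14400

# Option 2: when calling torch.distributed directly:
# torch.distributed.init_process_group(backend="nccl",
#                                      timeout=timedelta(seconds=ddp_timeout_seconds))
print(timedelta(seconds=ddp_timeout_seconds))  # 4:00:00
```

Shrinking max_source_length/max_target_length works for the same reason: it shortens the per-example preprocessing so the map finishes before the default timeout expires.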
