
[BUG/Help] <title> RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout #1232

zxy333666 opened this issue Jun 13, 2023 · 2 comments

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

[WARNING|modeling_utils.py:3192] 2023-06-12 14:17:57,899 >> Some weights of ChatGLMForConditionalGeneration were not initialized from the model checkpoint at /data/chatglm-6b-int4 and are newly initialized: ['transformer.prefix_encoder.embedding.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[WARNING|modeling_utils.py:3192] 2023-06-12 14:17:58,004 >> Some weights of ChatGLMForConditionalGeneration were not initialized from the model checkpoint at /data/chatglm-6b-int4 and are newly initialized: ['transformer.prefix_encoder.embedding.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|modeling_utils.py:2839] 2023-06-12 14:17:58,032 >> Generation config file not found, using a generation config created from the model config.
Map:  20%|████████████████████████▉                                                                                                 | 273000/1332406 [32:01<2:03:36, 142.84 examples/s]╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/chatglm/chatglm0523/ChatGLM-6B/ptuning/main.py:440 in                              │
│                                                                                                  │
│   437                                                                                            │
│   438                                                                                            │
│   439 if __name__ == "__main__":                                                                │
│ ❱ 440 │   main()                                                                                 │
│   441                                                                                            │
│                                                                                                  │
│ /data/chatglm/chatglm0523/ChatGLM-6B/ptuning/main.py:251 in main                                 │
│                                                                                                  │
│   248 │   │   if data_args.max_train_samples is not None:                                        │
│   249 │   │   │   max_train_samples = min(len(train_dataset), data_args.max_train_samples)       │
│   250 │   │   │   train_dataset = train_dataset.select(range(max_train_samples))                 │
│ ❱ 251 │   │   with training_args.main_process_first(desc="train dataset map pre-processing"):    │
│   252 │   │   │                                                                                  │
│   253 │   │   │   train_dataset = train_dataset.map(                                             │
│   254 │   │   │   │   preprocess_function_train,                                                 │
│                                                                                                  │
│ /opt/conda/lib/python3.10/contextlib.py:135 in __enter__                                         │
│                                                                                                  │
│   132 │   │   # they are only needed for recreation, which is not possible anymore               │
│   133 │   │   del self.args, self.kwds, self.func                                                │
│   134 │   │   try:                                                                               │
│ ❱ 135 │   │   │   return next(self.gen)                                                          │
│   136 │   │   except StopIteration:                                                              │
│   137 │   │   │   raise RuntimeError("generator didn't yield") from None                         │
│   138                                                                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/transformers/training_args.py:1888 in main_process_first │
│                                                                                                  │
│   1885 │   │   │   │   │   elif is_sagemaker_dp_enabled():                                       │
│   1886 │   │   │   │   │   │   dist.barrier()                                                    │
│   1887 │   │   │   │   │   else:                                                                 │
│ ❱ 1888 │   │   │   │   │   │   torch.distributed.barrier()                                       │
│   1889 │   │   │   │   yield                                                                     │
│   1890 │   │   │   finally:                                                                      │
│   1891 │   │   │   │   if is_main_process:                                                       │
│                                                                                                  │
│ /opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:3145 in barrier    │
│                                                                                                  │
│   3142 │                                                                                         │
│   3143 │   if group is None:                                                                     │
│   3144 │   │   default_pg = _get_default_group()                                                 │
│ ❱ 3145 │   │   work = default_pg.barrier(opts=opts)                                              │
│   3146 │   else:                                                                                 │
│   3147 │   │   work = group.barrier(opts=opts)                                                   │
│   3148                                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯

RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from recvBytes at /opt/conda/conda-bld/pytorch_1670525552843/work/torch/csrc/distributed/c10d/Utils.hpp:594 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f799348f457 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7f79934594b5 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0xd8 (0x7f79ca652918 in
/opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::string const&) + 0x22 (0x7f79ca6535c2 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::string const&) + 0x59 (0x7f79ca653649 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f79ca623e21 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f79ca623e21 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xab (0x7f79d31f1edb in
/opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #8: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x202 (0x7f79d31f63a2 in
/opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #9: + 0x1be3a3 (0x7f79d31fd3a3 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #10: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x21 (0x7f79d31fe721 in
/opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cpp.so)
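The traceback shows rank [1] blocked in torch.distributed.barrier() while rank [0] runs the dataset .map() preprocessing inside main_process_first(). A back-of-the-envelope check (a sketch using only the numbers visible in the log above) shows the map step alone takes far longer than torch.distributed's default 30-minute store timeout, which is enough to explain the Socket Timeout:

```python
# Rough estimate from the progress bar in the log: 1,332,406 examples
# mapped at ~142.84 examples/s while the other rank waits at the barrier.
n_examples = 1_332_406
rate_examples_per_s = 142.84          # from "142.84 examples/s" above

map_seconds = n_examples / rate_examples_per_s
default_timeout_seconds = 1800        # torch.distributed default (30 min)

print(f"map would take ~{map_seconds / 3600:.1f} h")   # ~2.6 h
print(map_seconds > default_timeout_seconds)           # True: barrier times out first
```

So the waiting rank exhausts the c10d key-value store timeout long before rank [0] finishes preprocessing and reaches its own barrier.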

Expected Behavior

No response

Steps To Reproduce

1. P-tuning dataset of ~1.33 million examples
2. Two A100 GPUs

Environment

- OS:
- Python:3.10.8
- Transformers: 4.28.1
- PyTorch:1.13.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) : True



Anything else?

No response


lhy101 commented Jun 25, 2023

Same question here; I ran into the same problem.


zzoneee commented Jun 26, 2023

Same question here; I ran into the same problem.

In my case, the error appeared because max_source_length and max_target_length were set too large.
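For reference, another way to avoid the timeout (a sketch, not a confirmed fix for this repo) is to raise the distributed timeout so the waiting rank can outlast rank [0]'s preprocessing. Recent transformers versions expose this as the `ddp_timeout` training argument (in seconds), and raw torch.distributed accepts a `timedelta`:

```python
from datetime import timedelta

# Hypothetical sketch: both knobs below extend the c10d store timeout past the
# 1800 s default so a long main_process_first() map does not trip the barrier.

# Option 1: via transformers, pass --ddp_timeout on the training command line;
# the value below is 4 h, comfortably above the ~2.6 h map time in the log.
ddp_timeout_seconds = 14400

# Option 2: when calling torch.distributed directly:
# torch.distributed.init_process_group(backend="nccl",
#                                      timeout=timedelta(seconds=ddp_timeout_seconds))
print(timedelta(seconds=ddp_timeout_seconds))  # 4:00:00
```

Shrinking max_source_length/max_target_length works for the same reason: it shortens the per-example preprocessing so the map finishes before the default timeout expires.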
