[BUG/Help] RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
#1232 · Open · zxy333666 opened this issue on Jun 13, 2023 · 2 comments
Is there an existing issue for this?
Current Behavior
WARNING|modeling_utils.py:3192] 2023-06-12 14:17:57,899 >> Some weights of ChatGLMForConditionalGeneration were not initialized from the model checkpoint at /data/chatglm-6b-int4 and are newly initialized: ['transformer.prefix_encoder.embedding.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[WARNING|modeling_utils.py:3192] 2023-06-12 14:17:58,004 >> Some weights of ChatGLMForConditionalGeneration were not initialized from the model checkpoint at /data/chatglm-6b-int4 and are newly initialized: ['transformer.prefix_encoder.embedding.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|modeling_utils.py:2839] 2023-06-12 14:17:58,032 >> Generation config file not found, using a generation config created from the model config.
Map: 20%|████████████████████████▉ | 273000/1332406 [32:01<2:03:36, 142.84 examples/s]╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/chatglm/chatglm0523/ChatGLM-6B/ptuning/main.py:440 in <module> │
│ │
│ 437 │
│ 438 │
│ 439 if __name__ == "__main__": │
│ ❱ 440 │ main() │
│ 441 │
│ │
│ /data/chatglm/chatglm0523/ChatGLM-6B/ptuning/main.py:251 in main │
│ │
│ 248 │ │ if data_args.max_train_samples is not None: │
│ 249 │ │ │ max_train_samples = min(len(train_dataset), data_args.max_train_samples) │
│ 250 │ │ │ train_dataset = train_dataset.select(range(max_train_samples)) │
│ ❱ 251 │ │ with training_args.main_process_first(desc="train dataset map pre-processing"): │
│ 252 │ │ │ │
│ 253 │ │ │ train_dataset = train_dataset.map( │
│ 254 │ │ │ │ preprocess_function_train, │
│ │
│ /opt/conda/lib/python3.10/contextlib.py:135 in __enter__ │
│ │
│ 132 │ │ # they are only needed for recreation, which is not possible anymore │
│ 133 │ │ del self.args, self.kwds, self.func │
│ 134 │ │ try: │
│ ❱ 135 │ │ │ return next(self.gen) │
│ 136 │ │ except StopIteration: │
│ 137 │ │ │ raise RuntimeError("generator didn't yield") from None │
│ 138 │
│ │
│ /opt/conda/lib/python3.10/site-packages/transformers/training_args.py:1888 in main_process_first │
│ │
│ 1885 │ │ │ │ │ elif is_sagemaker_dp_enabled(): │
│ 1886 │ │ │ │ │ │ dist.barrier() │
│ 1887 │ │ │ │ │ else: │
│ ❱ 1888 │ │ │ │ │ │ torch.distributed.barrier() │
│ 1889 │ │ │ │ yield │
│ 1890 │ │ │ finally: │
│ 1891 │ │ │ │ if is_main_process: │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:3145 in barrier │
│ │
│ 3142 │ │
│ 3143 │ if group is None: │
│ 3144 │ │ default_pg = _get_default_group() │
│ ❱ 3145 │ │ work = default_pg.barrier(opts=opts) │
│ 3146 │ else: │
│ 3147 │ │ work = group.barrier(opts=opts) │
│ 3148 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from recvBytes at /opt/conda/conda-bld/pytorch_1670525552843/work/torch/csrc/distributed/c10d/Utils.hpp:594 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f799348f457 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7f79934594b5 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0xd8 (0x7f79ca652918 in
/opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::string const&) + 0x22 (0x7f79ca6535c2 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::string const&) + 0x59 (0x7f79ca653649 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f79ca623e21 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f79ca623e21 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xab (0x7f79d31f1edb in
/opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #8: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x202 (0x7f79d31f63a2 in
/opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #9: + 0x1be3a3 (0x7f79d31fd3a3 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cpp.so)
frame #10: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x21 (0x7f79d31fe721 in
/opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cpp.so)
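The trace suggests a timeout rather than a network fault: rank 0 is still inside `training_args.main_process_first(...)` running `train_dataset.map(...)` while rank 1 is already waiting in `torch.distributed.barrier()`, and the crash appears at roughly 32 minutes into the map, consistent with PyTorch's default 30-minute process-group timeout. A rough back-of-the-envelope check using the numbers from the progress bar above (a sketch, not output from this run):

```python
# Rough estimate of how long rank 0 spends in train_dataset.map(),
# using the numbers reported by the progress bar in the log above.
total_examples = 1_332_406      # dataset size from the progress bar
examples_per_second = 142.84    # map throughput from the progress bar
default_timeout_s = 30 * 60     # PyTorch's default process-group timeout (1800 s)

map_duration_s = total_examples / examples_per_second
print(f"map takes ~{map_duration_s / 3600:.1f} h")  # ~2.6 h
print(map_duration_s > default_timeout_s)           # True -> rank 1's barrier() times out
```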
Expected Behavior
No response
Steps To Reproduce
1. P-tuning dataset of about 1.33 million examples (see the workaround sketch below)
2. Two A100 GPUs
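A minimal sketch of one possible workaround: raise the process-group timeout so rank 1 can outlast the preprocessing on rank 0. It assumes a recent transformers release (which exposes a `ddp_timeout` field on `TrainingArguments`) or, alternatively, manual process-group initialization in a script launched with torchrun; the 7200-second value and the `output` directory are placeholders, not values from this issue:

```python
import datetime

import torch.distributed as dist
from transformers import TrainingArguments

# Option A (sketch): raise the DDP/c10d timeout through TrainingArguments.
# `ddp_timeout` is in seconds; 7200 s is an arbitrary value large enough
# to cover the ~2.6 h dataset map on rank 0.
training_args = TrainingArguments(
    output_dir="output",   # hypothetical output directory for illustration
    ddp_timeout=7200,
)

# Option B (sketch): if the process group is created by hand instead of by
# the Trainer, the same timeout can be passed to init_process_group directly
# (assumes the script was launched with torchrun so the rendezvous env vars exist).
if not dist.is_initialized():
    dist.init_process_group(
        backend="nccl",
        timeout=datetime.timedelta(seconds=7200),
    )
```

Since `main_process_first` exists so that rank 0 builds the datasets cache and the other ranks reuse it, another option would be to run the preprocessing once in a single-process job so the cached map result is already present when the two-GPU run starts.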
Environment
Anything else?
No response