multi-GPU offline inference
When I try to run multi-GPU offline inference, it returns an error: the actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
The actor never ran - it was cancelled before it started running.
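For context, this is plain vLLM offline inference with tensor parallelism. A minimal sketch of the kind of script that hits this (assuming the standard `LLM`/`SamplingParams` offline API; the model name below is a placeholder, not the actual one used here):

```python
# Minimal multi-GPU offline inference sketch (placeholder model name;
# tensor_parallel_size=2 matches a two-GPU setup).
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# tensor_parallel_size > 1 makes vLLM launch one Ray worker per GPU;
# those workers are where the NCCL all-reduce in the log below runs.
llm = LLM(model="facebook/opt-13b", tensor_parallel_size=2)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```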
The detailed error log:
```
cat /tmp/ray/session_latest/logs/python-core-worker-02e363d59eb469c6fe7c4719cfdd04158e231ad34605def092aacdb1_42067.log
[2023-07-11 15:20:05,144 I 42067 42067] core_worker_process.cc:107: Constructing CoreWorkerProcess. pid: 42067
[2023-07-11 15:20:05,145 I 42067 42067] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2023-07-11 15:20:05,147 I 42067 42067] grpc_server.cc:140: worker server started, listening on port 43292.
[2023-07-11 15:20:05,151 I 42067 42067] core_worker.cc:217: Initializing worker at address: 172.25.1.37:43292, worker ID 02e363d59eb469c6fe7c4719cfdd04158e231ad34605def092aacdb1, raylet 2ec03b7f681a8bef6ffa0e368c13890542e6f29eaa380cf40026fb26
[2023-07-11 15:20:05,152 I 42067 42067] task_event_buffer.cc:184: Reporting task events to GCS every 1000ms.
[2023-07-11 15:20:05,152 I 42067 42067] core_worker.cc:605: Adjusted worker niceness to 15
[2023-07-11 15:20:05,152 I 42067 42253] core_worker.cc:553: Event stats:
Global stats: 12 total (8 active)
Queueing time: mean = 5.740 us, max = 49.684 us, min = 6.046 us, total = 68.876 us
Execution time: mean = 11.391 us, total = 136.696 us
Event stats:
PeriodicalRunner.RunFnPeriodically - 6 total (4 active, 1 running), CPU time: mean = 4.235 us, total = 25.408 us
UNKNOWN - 2 total (2 active), CPU time: mean = 0.000 s, total = 0.000 s
WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 33.060 us, total = 33.060 us
InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), CPU time: mean = 78.228 us, total = 78.228 us
Task Event stats:
IO Service Stats:
Global stats: 2 total (1 active)
Queueing time: mean = 2.611 us, max = 5.223 us, min = 5.223 us, total = 5.223 us
Execution time: mean = 3.402 us, total = 6.805 us
Event stats:
CoreWorker.deadline_timer.flush_task_events - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
PeriodicalRunner.RunFnPeriodically - 1 total (0 active), CPU time: mean = 6.805 us, total = 6.805 us
Other Stats:
grpc_in_progress:0
current number of task events in buffer: 0
total task events sent: 0 MiB
total number of task events sent: 0
num status task events dropped: 0
num profile task events dropped: 0
[2023-07-11 15:20:05,152 I 42067 42067] event.cc:234: Set ray event level to warning
[2023-07-11 15:20:05,152 I 42067 42067] event.cc:342: Ray Event initialized for CORE_WORKER
[2023-07-11 15:20:05,152 I 42067 42253] accessor.cc:611: Received notification for node id = 2ec03b7f681a8bef6ffa0e368c13890542e6f29eaa380cf40026fb26, IsAlive = 1
[2023-07-11 15:20:05,152 I 42067 42253] core_worker.cc:4134: Number of alive nodes:1
[2023-07-11 15:20:05,159 I 42067 42067] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor b66dd417878bbd0cda7ea4e601000000
[2023-07-11 15:20:05,159 I 42067 42067] direct_actor_task_submitter.cc:237: Connecting to actor b66dd417878bbd0cda7ea4e601000000 at worker 02e363d59eb469c6fe7c4719cfdd04158e231ad34605def092aacdb1
[2023-07-11 15:20:05,159 I 42067 42067] core_worker.cc:2673: Creating actor: b66dd417878bbd0cda7ea4e601000000
[2023-07-11 15:20:15,496 W 42067 42248] metric_exporter.cc:209: [1] Export metrics to agent failed: GrpcUnknown: RPC Error message: Method not found!; RPC Error details: . This won't affect Ray, but you can lose metrics from the cluster.
[2023-07-11 15:20:41,680 E 42067 42067] logging.cc:97: Unhandled exception: St13runtime_error. what(): NCCL Error 5: invalid usage
[2023-07-11 15:20:41,788 E 42067 42067] logging.cc:104: Stack trace:
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0xdc551a) [0x7f7f16a1d51a] ray::operator<<()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0xdc7cd8) [0x7f7f16a1fcd8] ray::TerminateHandler()
/home/sysdocker/miniconda2/envs/fashchat_env/bin/../lib/libstdc++.so.6(+0xb135a) [0x7f7f158e035a] __cxxabiv1::__terminate()
/home/sysdocker/miniconda2/envs/fashchat_env/bin/../lib/libstdc++.so.6(+0xb03b9) [0x7f7f158df3b9]
/home/sysdocker/miniconda2/envs/fashchat_env/bin/../lib/libstdc++.so.6(__gxx_personality_v0+0x87) [0x7f7f158dfae7] __gxx_personality_v0
/home/sysdocker/miniconda2/envs/fashchat_env/bin/../lib/libgcc_s.so.1(+0x111e4) [0x7f7f158261e4] _Unwind_RaiseException_Phase2
/home/sysdocker/miniconda2/envs/fashchat_env/bin/../lib/libgcc_s.so.1(_Unwind_Resume+0x12e) [0x7f7f15826c1e] _Unwind_Resume
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so(+0xc6ecc3) [0x7f4f2788ccc3] torch::cuda::nccl::detail::throw_nccl_error()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so(+0xeb2077) [0x7f4f27ad0077] torch::cuda::nccl::AutoNcclGroup::~AutoNcclGroup()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so(+0xc72417) [0x7f4f27890417] c10d::ProcessGroupNCCL::collective<>()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so(ZN4c10d16ProcessGroupNCCL14allreduce_implERSt6vectorIN2at6TensorESaIS3_EERKNS_16AllreduceOptionsE+0x21) [0x7f4f27ae9e31] c10d::ProcessGroupNCCL::allreduce_impl()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so(ZN4c10d16ProcessGroupNCCL9allreduceERSt6vectorIN2at6TensorESaIS3_EERKNS_16AllreduceOptionsE+0x39d) [0x7f4f27aecaed] c10d::ProcessGroupNCCL::allreduce()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(+0x538ec64) [0x7f4f5206cc64] c10d::ops::allreduce_cuda()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(+0x53917cf) [0x7f4f5206f7cf] c10::impl::wrap_kernel_functor_unboxed<>::call()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(+0x53ab0e3) [0x7f4f520890e3] c10d::ProcessGroup::allreduce()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/torch/lib/libtorch_python.so(+0xb6cc21) [0x7f4f666fdc21] pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/torch/lib/libtorch_python.so(+0x3b7040) [0x7f4f65f48040] pybind11::cpp_function::dispatcher()
ray::Worker.init(PyCFunction_Call+0x52) [0x4f5652] PyCFunction_Call
ray::Worker.init(_PyObject_MakeTpCall+0x3bb) [0x4e0c8b] _PyObject_MakeTpCall
ray::Worker.init() [0x4f53fd] method_vectorcall
ray::Worker.init(_PyEval_EvalFrameDefault+0x49a9) [0x4dc999] _PyEval_EvalFrameDefault
ray::Worker.init(_PyEval_EvalCodeWithName+0x2f1) [0x4d6fb1] _PyEval_EvalCodeWithName
ray::Worker.init(_PyFunction_Vectorcall+0x19c) [0x4e807c] _PyFunction_Vectorcall
ray::Worker.init(PyObject_Call+0x34e) [0x4f778e] PyObject_Call
ray::Worker.init(_PyEval_EvalFrameDefault+0x1f7b) [0x4d9f6b] _PyEval_EvalFrameDefault
ray::Worker.init(_PyEval_EvalCodeWithName+0x2f1) [0x4d6fb1] _PyEval_EvalCodeWithName
ray::Worker.init(_PyFunction_Vectorcall+0x19c) [0x4e807c] _PyFunction_Vectorcall
ray::Worker.init(_PyEval_EvalFrameDefault+0x1150) [0x4d9140] _PyEval_EvalFrameDefault
ray::Worker.init(_PyFunction_Vectorcall+0x106) [0x4e7fe6] _PyFunction_Vectorcall
ray::Worker.init(_PyEval_EvalFrameDefault+0x6b2) [0x4d86a2] _PyEval_EvalFrameDefault
ray::Worker.init(_PyEval_EvalCodeWithName+0x2f1) [0x4d6fb1] _PyEval_EvalCodeWithName
ray::Worker.init(_PyFunction_Vectorcall+0x19c) [0x4e807c] _PyFunction_Vectorcall
ray::Worker.init(_PyObject_FastCallDict+0x282) [0x4e0442] _PyObject_FastCallDict
ray::Worker.init() [0x4f1cd3] slot_tp_init
ray::Worker.init(_PyObject_MakeTpCall+0x3d3) [0x4e0ca3] _PyObject_MakeTpCall
ray::Worker.init(_PyEval_EvalFrameDefault+0x4fa6) [0x4dcf96] _PyEval_EvalFrameDefault
ray::Worker.init(_PyEval_EvalCodeWithName+0x2f1) [0x4d6fb1] _PyEval_EvalCodeWithName
ray::Worker.init(_PyFunction_Vectorcall+0x19c) [0x4e807c] _PyFunction_Vectorcall
ray::Worker.init(_PyEval_EvalFrameDefault+0x399) [0x4d8389] _PyEval_EvalFrameDefault
ray::Worker.init(_PyFunction_Vectorcall+0x106) [0x4e7fe6] _PyFunction_Vectorcall
ray::Worker.init(PyObject_Call+0x24a) [0x4f768a] PyObject_Call
ray::Worker.init(_PyEval_EvalFrameDefault+0x1f7b) [0x4d9f6b] _PyEval_EvalFrameDefault
ray::Worker.init(_PyEval_EvalCodeWithName+0x2f1) [0x4d6fb1] _PyEval_EvalCodeWithName
ray::Worker.init(_PyFunction_Vectorcall+0x19c) [0x4e807c] _PyFunction_Vectorcall
ray::Worker.init(PyObject_Call+0x24a) [0x4f768a] PyObject_Call
ray::Worker.init(_PyEval_EvalFrameDefault+0x1f7b) [0x4d9f6b] _PyEval_EvalFrameDefault
ray::Worker.init(_PyEval_EvalCodeWithName+0x2f1) [0x4d6fb1] _PyEval_EvalCodeWithName
ray::Worker.init(_PyFunction_FastCallDict+0x1d9) [0x4a6682] _PyFunction_FastCallDict
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0x4f1f12) [0x7f7f16149f12] __Pyx_PyObject_Call()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0x53818b) [0x7f7f1619018b] __pyx_pw_3ray_7_raylet_12execute_task_3function_executor()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0x4f1f12) [0x7f7f16149f12] __Pyx_PyObject_Call()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0x5a4091) [0x7f7f161fc091] __pyx_f_3ray_7_raylet_task_execution_handler()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(ZNSt17_Function_handlerIFN3ray6StatusERKNS0_3rpc7AddressENS2_8TaskTypeESsRKNS0_4core11RayFunctionERKSt13unordered_mapISsdSt4hashISsESt8equal_toISsESaISt4pairIKSsdEEERKSt6vectorISt10shared_ptrINS0_9RayObjectEESaISQ_EERKSN_INS2_15ObjectReferenceESaISV_EERSH_S10_PSN_ISG_INS0_8ObjectIDESQ_ESaIS12_EES15_RSO_INS0_17LocalMemoryBufferEEPbPSsRKSN_INS0_16ConcurrencyGroupESaIS1B_EESsbbEPFS1_S5_S6_SsSA_SM_SU_SZ_SsSsS15_S15_S18_S19_S1A_S1F_SsbbEE9_M_invokeERKSt9_Any_dataS5_OS6_OSsSA_SM_SU_SZ_S10_S10_OS15_S1P_S18_OS19_OS1A_S1F_S1O_ObS1S+0x12b) [0x7f7f16150ecb] std::_Function_handler<>::_M_invoke()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker11ExecuteTaskERKNS_17TaskSpecificationERKSt10shared_ptrISt13unordered_mapISsSt6vectorISt4pairIldESaIS9_EESt4hashISsESt8equal_toISsESaIS8_IKSsSB_EEEEPS7_IS8_INS_8ObjectIDES5_INS_9RayObjectEEESaISQ_EEST_PN6google8protobuf16RepeatedPtrFieldINS_3rpc20ObjectReferenceCountEEEPbPSs+0xb7f) [0x7f7f1630da5f] ray::core::CoreWorker::ExecuteTask()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(ZNSt17_Function_handlerIFN3ray6StatusERKNS0_17TaskSpecificationESt10shared_ptrISt13unordered_mapISsSt6vectorISt4pairIldESaIS9_EESt4hashISsESt8equal_toISsESaIS8_IKSsSB_EEEEPS7_IS8_INS0_8ObjectIDES5_INS0_9RayObjectEEESaISO_EESR_PN6google8protobuf16RepeatedPtrFieldINS0_3rpc20ObjectReferenceCountEEEPbPSsESt5_BindIFMNS0_4core10CoreWorkerEFS1_S4_RKSK_SR_SR_SY_SZ_S10_EPS14_St12_PlaceholderILi1EES1A_ILi2EES1A_ILi3EES1A_ILi4EES1A_ILi5EES1A_ILi6EES1A_ILi7EEEEE9_M_invokeERKSt9_Any_dataS4_OSK_OSR_S1P_OSY_OSZ_OS10+0x54) [0x7f7f1624ee14] std::_Function_handler<>::_M_invoke()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0x6de67e) [0x7f7f1633667e] ray::core::CoreWorkerDirectTaskReceiver::HandleTask()::{lambda()#1}::operator()()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0x6df7da) [0x7f7f163377da] std::_Function_handler<>::_M_invoke()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0x6f25fe) [0x7f7f1634a5fe] ray::core::InboundRequest::Accept()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0x6c63d0) [0x7f7f1631e3d0] ray::core::NormalSchedulingQueue::ScheduleRequests()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0x97e876) [0x7f7f165d6876] EventTracker::RecordExecution()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0x91bcce) [0x7f7f16573cce] std::_Function_handler<>::_M_invoke()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0x91c226) [0x7f7f16574226] boost::asio::detail::completion_handler<>::do_complete()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0xdd706b) [0x7f7f16a2f06b] boost::asio::detail::scheduler::do_run_one()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0xdd8b39) [0x7f7f16a30b39] boost::asio::detail::scheduler::run()
[2023-07-11 15:20:41,793 E 42067 42067] logging.cc:361: *** SIGABRT received at time=1689060041 on cpu 34 ***
[2023-07-11 15:20:41,793 E 42067 42067] logging.cc:361: PC: @ 0x7f7f1d831387 (unknown) raise
[2023-07-11 15:20:41,795 E 42067 42067] logging.cc:361: @ 0x7f7f1e2e1630 (unknown) (unknown)
[2023-07-11 15:20:41,795 E 42067 42067] logging.cc:361: @ 0x7f7f158e035a 992 __cxxabiv1::__terminate()
[2023-07-11 15:20:41,797 E 42067 42067] logging.cc:361: @ 0x7ffca2d9acf0 248 (unknown)
[2023-07-11 15:20:41,800 E 42067 42067] logging.cc:361: @ 0x7f7f1dbc27b8 (unknown) (unknown)
[2023-07-11 15:20:41,800 E 42067 42067] logging.cc:361: @ ... and at least 2 more frames
```
Hi @murongweibo, thanks for reporting the bug. And thanks @wehos for sharing your experience!
@murongweibo, is your problem solved now? As @wehos mentioned, your error might be due to a mismatch between the CUDA version and the NVIDIA driver installed in your environment. We use PyTorch (i.e., torch.distributed) to invoke the NCCL ops, so this should not be a bug in vLLM.
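To check that independently of vLLM, a minimal torch.distributed NCCL all-reduce across two GPUs works as a sanity check (a sketch; the rendezvous address, port, and tensor contents below are arbitrary). If this also dies with "NCCL Error 5: invalid usage", the problem is in the CUDA/driver/NCCL stack rather than in vLLM:

```python
# Standalone NCCL sanity check, independent of vLLM and Ray.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Each rank contributes rank + 1; after the all-reduce every rank
    # should hold the sum 1 + 2 + ... + world_size.
    t = torch.full((4,), float(rank + 1), device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: {t.tolist()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2  # number of GPUs to test
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```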
Fix one_hot bug in torch compile mode
```
> block_mapping = torch.nn.functional.one_hot(metadata.block_mapping,
num_classes=batch_size)
E RuntimeError: Class values must be non-negative.
../../vllm/worker/hpu_model_runner.py:311: RuntimeError
```
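The error itself is easy to reproduce in isolation: `torch.nn.functional.one_hot` rejects any negative class index, so a negative `block_mapping` entry (for example a -1 padding value, which is an assumption about where the negative comes from here) raises exactly this `RuntimeError`. A minimal sketch:

```python
# Minimal repro of the failure mode (not the vLLM code path itself).
import torch

block_mapping = torch.tensor([0, 2, -1, 1])  # -1 stands in for a padded slot
torch.nn.functional.one_hot(block_mapping, num_classes=4)
# RuntimeError: Class values must be non-negative.
```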