NCCL Error 5: invalid usage #427

Closed
murongweibo opened this issue Jul 11, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@murongweibo

When I try to run multi-GPU offline inference, it fails with the following error:

The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) `ray stop --force` is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
The actor never ran - it was cancelled before it started running.

Unhandled exception: St13runtime_error. what(): NCCL Error 5: invalid usage

The detailed error log:

```
cat /tmp/ray/session_latest/logs/python-core-worker-02e363d59eb469c6fe7c4719cfdd04158e231ad34605def092aacdb1_42067.log
[2023-07-11 15:20:05,144 I 42067 42067] core_worker_process.cc:107: Constructing CoreWorkerProcess. pid: 42067
[2023-07-11 15:20:05,145 I 42067 42067] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2023-07-11 15:20:05,147 I 42067 42067] grpc_server.cc:140: worker server started, listening on port 43292.
[2023-07-11 15:20:05,151 I 42067 42067] core_worker.cc:217: Initializing worker at address: 172.25.1.37:43292, worker ID 02e363d59eb469c6fe7c4719cfdd04158e231ad34605def092aacdb1, raylet 2ec03b7f681a8bef6ffa0e368c13890542e6f29eaa380cf40026fb26
[2023-07-11 15:20:05,152 I 42067 42067] task_event_buffer.cc:184: Reporting task events to GCS every 1000ms.
[2023-07-11 15:20:05,152 I 42067 42067] core_worker.cc:605: Adjusted worker niceness to 15
[2023-07-11 15:20:05,152 I 42067 42253] core_worker.cc:553: Event stats:

Global stats: 12 total (8 active)
Queueing time: mean = 5.740 us, max = 49.684 us, min = 6.046 us, total = 68.876 us
Execution time: mean = 11.391 us, total = 136.696 us
Event stats:
PeriodicalRunner.RunFnPeriodically - 6 total (4 active, 1 running), CPU time: mean = 4.235 us, total = 25.408 us
UNKNOWN - 2 total (2 active), CPU time: mean = 0.000 s, total = 0.000 s
WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 33.060 us, total = 33.060 us
InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), CPU time: mean = 78.228 us, total = 78.228 us


Task Event stats:

IO Service Stats:

Global stats: 2 total (1 active)
Queueing time: mean = 2.611 us, max = 5.223 us, min = 5.223 us, total = 5.223 us
Execution time: mean = 3.402 us, total = 6.805 us
Event stats:
CoreWorker.deadline_timer.flush_task_events - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
PeriodicalRunner.RunFnPeriodically - 1 total (0 active), CPU time: mean = 6.805 us, total = 6.805 us
Other Stats:
grpc_in_progress:0
current number of task events in buffer: 0
total task events sent: 0 MiB
total number of task events sent: 0
num status task events dropped: 0
num profile task events dropped: 0

[2023-07-11 15:20:05,152 I 42067 42067] event.cc:234: Set ray event level to warning
[2023-07-11 15:20:05,152 I 42067 42067] event.cc:342: Ray Event initialized for CORE_WORKER
[2023-07-11 15:20:05,152 I 42067 42253] accessor.cc:611: Received notification for node id = 2ec03b7f681a8bef6ffa0e368c13890542e6f29eaa380cf40026fb26, IsAlive = 1
[2023-07-11 15:20:05,152 I 42067 42253] core_worker.cc:4134: Number of alive nodes:1
[2023-07-11 15:20:05,159 I 42067 42067] direct_actor_task_submitter.cc:36: Set max pending calls to -1 for actor b66dd417878bbd0cda7ea4e601000000
[2023-07-11 15:20:05,159 I 42067 42067] direct_actor_task_submitter.cc:237: Connecting to actor b66dd417878bbd0cda7ea4e601000000 at worker 02e363d59eb469c6fe7c4719cfdd04158e231ad34605def092aacdb1
[2023-07-11 15:20:05,159 I 42067 42067] core_worker.cc:2673: Creating actor: b66dd417878bbd0cda7ea4e601000000
[2023-07-11 15:20:15,496 W 42067 42248] metric_exporter.cc:209: [1] Export metrics to agent failed: GrpcUnknown: RPC Error message: Method not found!; RPC Error details: . This won't affect Ray, but you can lose metrics from the cluster.
[2023-07-11 15:20:41,680 E 42067 42067] logging.cc:97: Unhandled exception: St13runtime_error. what(): NCCL Error 5: invalid usage
[2023-07-11 15:20:41,788 E 42067 42067] logging.cc:104: Stack trace:
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0xdc551a) [0x7f7f16a1d51a] ray::operator<<()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0xdc7cd8) [0x7f7f16a1fcd8] ray::TerminateHandler()
/home/sysdocker/miniconda2/envs/fashchat_env/bin/../lib/libstdc++.so.6(+0xb135a) [0x7f7f158e035a] __cxxabiv1::__terminate()
/home/sysdocker/miniconda2/envs/fashchat_env/bin/../lib/libstdc++.so.6(+0xb03b9) [0x7f7f158df3b9]
/home/sysdocker/miniconda2/envs/fashchat_env/bin/../lib/libstdc++.so.6(__gxx_personality_v0+0x87) [0x7f7f158dfae7] __gxx_personality_v0
/home/sysdocker/miniconda2/envs/fashchat_env/bin/../lib/libgcc_s.so.1(+0x111e4) [0x7f7f158261e4] _Unwind_RaiseException_Phase2
/home/sysdocker/miniconda2/envs/fashchat_env/bin/../lib/libgcc_s.so.1(_Unwind_Resume+0x12e) [0x7f7f15826c1e] _Unwind_Resume
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so(+0xc6ecc3) [0x7f4f2788ccc3] torch::cuda::nccl::detail::throw_nccl_error()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so(+0xeb2077) [0x7f4f27ad0077] torch::cuda::nccl::AutoNcclGroup::~AutoNcclGroup()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so(+0xc72417) [0x7f4f27890417] c10d::ProcessGroupNCCL::collective<>()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so(ZN4c10d16ProcessGroupNCCL14allreduce_implERSt6vectorIN2at6TensorESaIS3_EERKNS_16AllreduceOptionsE+0x21) [0x7f4f27ae9e31] c10d::ProcessGroupNCCL::allreduce_impl()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so(ZN4c10d16ProcessGroupNCCL9allreduceERSt6vectorIN2at6TensorESaIS3_EERKNS_16AllreduceOptionsE+0x39d) [0x7f4f27aecaed] c10d::ProcessGroupNCCL::allreduce()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(+0x538ec64) [0x7f4f5206cc64] c10d::ops::allreduce_cuda
()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(+0x53917cf) [0x7f4f5206f7cf] c10::impl::wrap_kernel_functor_unboxed
<>::call()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(+0x53ab0e3) [0x7f4f520890e3] c10d::ProcessGroup::allreduce()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/torch/lib/libtorch_python.so(+0xb6cc21) [0x7f4f666fdc21] pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/torch/lib/libtorch_python.so(+0x3b7040) [0x7f4f65f48040] pybind11::cpp_function::dispatcher()
ray::Worker.init(PyCFunction_Call+0x52) [0x4f5652] PyCFunction_Call
ray::Worker.init(_PyObject_MakeTpCall+0x3bb) [0x4e0c8b] _PyObject_MakeTpCall
ray::Worker.init() [0x4f53fd] method_vectorcall
ray::Worker.init(_PyEval_EvalFrameDefault+0x49a9) [0x4dc999] _PyEval_EvalFrameDefault
ray::Worker.init(_PyEval_EvalCodeWithName+0x2f1) [0x4d6fb1] _PyEval_EvalCodeWithName
ray::Worker.init(_PyFunction_Vectorcall+0x19c) [0x4e807c] _PyFunction_Vectorcall
ray::Worker.init(PyObject_Call+0x34e) [0x4f778e] PyObject_Call
ray::Worker.init(_PyEval_EvalFrameDefault+0x1f7b) [0x4d9f6b] _PyEval_EvalFrameDefault
ray::Worker.init(_PyEval_EvalCodeWithName+0x2f1) [0x4d6fb1] _PyEval_EvalCodeWithName
ray::Worker.init(_PyFunction_Vectorcall+0x19c) [0x4e807c] _PyFunction_Vectorcall
ray::Worker.init(_PyEval_EvalFrameDefault+0x1150) [0x4d9140] _PyEval_EvalFrameDefault
ray::Worker.init(_PyFunction_Vectorcall+0x106) [0x4e7fe6] _PyFunction_Vectorcall
ray::Worker.init(_PyEval_EvalFrameDefault+0x6b2) [0x4d86a2] _PyEval_EvalFrameDefault
ray::Worker.init(_PyEval_EvalCodeWithName+0x2f1) [0x4d6fb1] _PyEval_EvalCodeWithName
ray::Worker.init(_PyFunction_Vectorcall+0x19c) [0x4e807c] _PyFunction_Vectorcall
ray::Worker.init(_PyObject_FastCallDict+0x282) [0x4e0442] _PyObject_FastCallDict
ray::Worker.init() [0x4f1cd3] slot_tp_init
ray::Worker.init(_PyObject_MakeTpCall+0x3d3) [0x4e0ca3] _PyObject_MakeTpCall
ray::Worker.init(_PyEval_EvalFrameDefault+0x4fa6) [0x4dcf96] _PyEval_EvalFrameDefault
ray::Worker.init(_PyEval_EvalCodeWithName+0x2f1) [0x4d6fb1] _PyEval_EvalCodeWithName
ray::Worker.init(_PyFunction_Vectorcall+0x19c) [0x4e807c] _PyFunction_Vectorcall
ray::Worker.init(_PyEval_EvalFrameDefault+0x399) [0x4d8389] _PyEval_EvalFrameDefault
ray::Worker.init(_PyFunction_Vectorcall+0x106) [0x4e7fe6] _PyFunction_Vectorcall
ray::Worker.init(PyObject_Call+0x24a) [0x4f768a] PyObject_Call
ray::Worker.init(_PyEval_EvalFrameDefault+0x1f7b) [0x4d9f6b] _PyEval_EvalFrameDefault
ray::Worker.init(_PyEval_EvalCodeWithName+0x2f1) [0x4d6fb1] _PyEval_EvalCodeWithName
ray::Worker.init(_PyFunction_Vectorcall+0x19c) [0x4e807c] _PyFunction_Vectorcall
ray::Worker.init(PyObject_Call+0x24a) [0x4f768a] PyObject_Call
ray::Worker.init(_PyEval_EvalFrameDefault+0x1f7b) [0x4d9f6b] _PyEval_EvalFrameDefault
ray::Worker.init(_PyEval_EvalCodeWithName+0x2f1) [0x4d6fb1] _PyEval_EvalCodeWithName
ray::Worker.init(_PyFunction_FastCallDict+0x1d9) [0x4a6682] _PyFunction_FastCallDict
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0x4f1f12) [0x7f7f16149f12] __Pyx_PyObject_Call()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0x53818b) [0x7f7f1619018b] __pyx_pw_3ray_7_raylet_12execute_task_3function_executor()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0x4f1f12) [0x7f7f16149f12] __Pyx_PyObject_Call()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0x5a4091) [0x7f7f161fc091] __pyx_f_3ray_7_raylet_task_execution_handler()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(ZNSt17_Function_handlerIFN3ray6StatusERKNS0_3rpc7AddressENS2_8TaskTypeESsRKNS0_4core11RayFunctionERKSt13unordered_mapISsdSt4hashISsESt8equal_toISsESaISt4pairIKSsdEEERKSt6vectorISt10shared_ptrINS0_9RayObjectEESaISQ_EERKSN_INS2_15ObjectReferenceESaISV_EERSH_S10_PSN_ISG_INS0_8ObjectIDESQ_ESaIS12_EES15_RSO_INS0_17LocalMemoryBufferEEPbPSsRKSN_INS0_16ConcurrencyGroupESaIS1B_EESsbbEPFS1_S5_S6_SsSA_SM_SU_SZ_SsSsS15_S15_S18_S19_S1A_S1F_SsbbEE9_M_invokeERKSt9_Any_dataS5_OS6_OSsSA_SM_SU_SZ_S10_S10_OS15_S1P_S18_OS19_OS1A_S1F_S1O_ObS1S+0x12b) [0x7f7f16150ecb] std::_Function_handler<>::_M_invoke()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker11ExecuteTaskERKNS_17TaskSpecificationERKSt10shared_ptrISt13unordered_mapISsSt6vectorISt4pairIldESaIS9_EESt4hashISsESt8equal_toISsESaIS8_IKSsSB_EEEEPS7_IS8_INS_8ObjectIDES5_INS_9RayObjectEEESaISQ_EEST_PN6google8protobuf16RepeatedPtrFieldINS_3rpc20ObjectReferenceCountEEEPbPSs+0xb7f) [0x7f7f1630da5f] ray::core::CoreWorker::ExecuteTask()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(ZNSt17_Function_handlerIFN3ray6StatusERKNS0_17TaskSpecificationESt10shared_ptrISt13unordered_mapISsSt6vectorISt4pairIldESaIS9_EESt4hashISsESt8equal_toISsESaIS8_IKSsSB_EEEEPS7_IS8_INS0_8ObjectIDES5_INS0_9RayObjectEEESaISO_EESR_PN6google8protobuf16RepeatedPtrFieldINS0_3rpc20ObjectReferenceCountEEEPbPSsESt5_BindIFMNS0_4core10CoreWorkerEFS1_S4_RKSK_SR_SR_SY_SZ_S10_EPS14_St12_PlaceholderILi1EES1A_ILi2EES1A_ILi3EES1A_ILi4EES1A_ILi5EES1A_ILi6EES1A_ILi7EEEEE9_M_invokeERKSt9_Any_dataS4_OSK_OSR_S1P_OSY_OSZ_OS10+0x54) [0x7f7f1624ee14] std::_Function_handler<>::_M_invoke()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0x6de67e) [0x7f7f1633667e] ray::core::CoreWorkerDirectTaskReceiver::HandleTask()::{lambda()#1}::operator()()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0x6df7da) [0x7f7f163377da] std::_Function_handler<>::_M_invoke()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0x6f25fe) [0x7f7f1634a5fe] ray::core::InboundRequest::Accept()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0x6c63d0) [0x7f7f1631e3d0] ray::core::NormalSchedulingQueue::ScheduleRequests()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0x97e876) [0x7f7f165d6876] EventTracker::RecordExecution()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0x91bcce) [0x7f7f16573cce] std::_Function_handler<>::_M_invoke()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0x91c226) [0x7f7f16574226] boost::asio::detail::completion_handler<>::do_complete()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0xdd706b) [0x7f7f16a2f06b] boost::asio::detail::scheduler::do_run_one()
/home/sysdocker/miniconda2/envs/fashchat_env/lib/python3.8/site-packages/ray/_raylet.so(+0xdd8b39) [0x7f7f16a30b39] boost::asio::detail::scheduler::run()

[2023-07-11 15:20:41,793 E 42067 42067] logging.cc:361: *** SIGABRT received at time=1689060041 on cpu 34 ***
[2023-07-11 15:20:41,793 E 42067 42067] logging.cc:361: PC: @ 0x7f7f1d831387 (unknown) raise
[2023-07-11 15:20:41,795 E 42067 42067] logging.cc:361: @ 0x7f7f1e2e1630 (unknown) (unknown)
[2023-07-11 15:20:41,795 E 42067 42067] logging.cc:361: @ 0x7f7f158e035a 992 __cxxabiv1::__terminate()
[2023-07-11 15:20:41,797 E 42067 42067] logging.cc:361: @ 0x7ffca2d9acf0 248 (unknown)
[2023-07-11 15:20:41,800 E 42067 42067] logging.cc:361: @ 0x7f7f1dbc27b8 (unknown) (unknown)
[2023-07-11 15:20:41,800 E 42067 42067] logging.cc:361: @ ... and at least 2 more frames
```

@wehos

wehos commented Jul 11, 2023

Same issue here. Reproduced by running the example code:

```python
from vllm import LLM

llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
output = llm.generate("San Francisco is a")
```

Tried two clean environments; both failed with the same error:
env1.txt
env2.txt

================
Update:

Problem solved by installing a compatible CUDA version and NVIDIA driver. In my case, CUDA 11.8 and NVIDIA driver 520.00.
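
For anyone hitting the same error, the sketch below (an assumption-free check in the sense that it uses only standard PyTorch and `nvidia-smi`; nothing vLLM-specific) prints the driver-side and PyTorch-side views of the CUDA stack so a mismatch is easy to spot. The exact versions to expect depend on your GPU and the PyTorch wheel you installed:

```python
import subprocess

import torch

# Driver-side view: driver version and the highest CUDA version it supports.
smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
print(next(line for line in smi.splitlines() if "Driver Version" in line))

# PyTorch-side view: the CUDA runtime the wheel was built against, plus NCCL.
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("visible GPUs:", torch.cuda.device_count())
print("NCCL version:", torch.cuda.nccl.version())
```

If the driver's supported CUDA version is older than the CUDA version the PyTorch wheel was built against, NCCL collectives can fail with errors like the one above.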

@WoosukKwon added the bug label on Jul 13, 2023
@WoosukKwon
Collaborator

Hi @murongweibo, thanks for reporting the bug. And thanks @wehos for sharing your experience!

@murongweibo is your problem solved now? As @wehos mentioned, your error might be due to a mismatch between CUDA and the NVIDIA driver installed in your environment. We use PyTorch (i.e., torch.distributed) to invoke NCCL ops, so this should not be a bug in vLLM.
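
One quick way to confirm this is a bare torch.distributed all-reduce across the same GPUs, outside of vLLM. Below is a minimal sketch (the master address, port, and world size are placeholder assumptions; set `world_size` to match your tensor_parallel_size). If it fails with the same "NCCL Error 5: invalid usage", the problem is in the CUDA/NCCL environment rather than in vLLM:

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    # Single-node rendezvous; address and port are arbitrary placeholders.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Exercises the same NCCL all-reduce path vLLM uses via torch.distributed.
    x = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(x)
    print(f"rank {rank}: all_reduce result = {x.item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 4  # e.g. match tensor_parallel_size / number of GPUs
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```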

jikunshang pushed a commit to jikunshang/vllm that referenced this issue Oct 31, 2024
Fix one_hot bug in torch compile mode
```
>           block_mapping = torch.nn.functional.one_hot(metadata.block_mapping,
                                                        num_classes=batch_size)
E           RuntimeError: Class values must be non-negative.

../../vllm/worker/hpu_model_runner.py:311: RuntimeError
```
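
(The one_hot failure above is a separate issue from the NCCL error in this thread: `torch.nn.functional.one_hot` raises "Class values must be non-negative." whenever its input contains negative indices, which typically come from padded entries. The sketch below only illustrates that failure mode and one generic way to mask padded slots; it is not the actual fix from the referenced commit.)

```python
import torch
import torch.nn.functional as F

batch_size = 4
block_mapping = torch.tensor([0, 2, -1, 1])  # -1 marks a padded/unused slot

# F.one_hot(block_mapping, num_classes=batch_size)
# -> RuntimeError: Class values must be non-negative.

valid = block_mapping >= 0
one_hot = F.one_hot(block_mapping.clamp(min=0), num_classes=batch_size)
one_hot = one_hot * valid.unsqueeze(-1)  # zero out the rows for padded slots
print(one_hot)
```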