Closed
Description
After updating DeepEP to commit 7f97f79bda051d991aa9681c04138007faa0366, it fails with the following error:
2025-06-19 16:18 [rank5]: File "/opt/Megatron-LM/megatron/core/transformer/moe/moe_layer.py", line 156, in forward
2025-06-19 16:18 [rank5]: output, mlp_bias = custom_forward(hidden_states)
2025-06-19 16:18 [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-06-19 16:18 [rank5]: File "/opt/Megatron-LM/megatron/core/transformer/moe/moe_layer.py", line 141, in custom_forward
2025-06-19 16:18 [rank5]: (dispatched_input, tokens_per_expert) = self.token_dispatcher.token_permutation(
2025-06-19 16:18 [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-06-19 16:18 [rank5]: File "/opt/Megatron-LM/megatron/core/transformer/moe/token_dispatcher.py", line 982, in token_permutation
2025-06-19 16:18 [rank5]: hidden_states = self._comm_manager.dispatch(hidden_states)
2025-06-19 16:18 [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-06-19 16:18 [rank5]: File "/opt/Megatron-LM/megatron/core/transformer/moe/token_dispatcher.py", line 826, in dispatch
2025-06-19 16:18 [rank5]: fused_dispatch(
2025-06-19 16:18 [rank5]: File "/opt/Megatron-LM/megatron/core/transformer/moe/fused_a2a.py", line 182, in fused_dispatch
2025-06-19 16:18 [rank5]: return FusedDispatch.apply(
2025-06-19 16:18 [rank5]: ^^^^^^^^^^^^^^^^^^^^
2025-06-19 16:18 [rank5]: File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 578, in apply
2025-06-19 16:18 [rank5]: return super().apply(*args, **kwargs) # type: ignore[misc]
2025-06-19 16:18 [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-06-19 16:18 [rank5]: File "/opt/Megatron-LM/megatron/core/transformer/moe/fused_a2a.py", line 98, in forward
2025-06-19 16:18 [rank5]: ) = buffer.dispatch(
2025-06-19 16:18 [rank5]: ^^^^^^^^^^^^^^^^
2025-06-19 16:18 [rank5]: File "/usr/local/lib/python3.12/dist-packages/deep_ep/buffer.py", line 311, in dispatch
2025-06-19 16:18 [rank5]: return self.internode_dispatch(x, handle, num_tokens_per_rank, num_tokens_per_rdma_rank, is_token_in_rank, num_tokens_per_expert,
2025-06-19 16:18 [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-06-19 16:18 [rank5]: File "/usr/local/lib/python3.12/dist-packages/deep_ep/buffer.py", line 421, in internode_dispatch
2025-06-19 16:18 [rank5]: recv_src_meta, send_rdma_head, send_nvl_head, event = self.runtime.internode_dispatch(
2025-06-19 16:18 [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-06-19 16:18 [rank5]: RuntimeError: Failed: CUDA error /opt/DeepEP/csrc/kernels/internode.cu:980 'unspecified launch failure'
2025-06-19 16:20 terminate called after throwing an instance of 'c10::Error'
2025-06-19 16:20 what(): CUDA error: unspecified launch failure
2025-06-19 16:20 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2025-06-19 16:20
2025-06-19 16:20 Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
2025-06-19 16:20 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f502a38b368 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
2025-06-19 16:20 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f502a3204a6 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
2025-06-19 16:20 frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f50359ae2a2 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
2025-06-19 16:20 frame #3: <unknown function> + 0x1e79f (0x7f503597679f in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
2025-06-19 16:20 frame #4: <unknown function> + 0x20060 (0x7f5035978060 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
2025-06-19 16:20 frame #5: <unknown function> + 0x2028c (0x7f503597828c in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
2025-06-19 16:20 frame #6: <unknown function> + 0x449af2 (0x7f502a849af2 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
2025-06-19 16:20 frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f502a365cb9 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
2025-06-19 16:20 frame #8: <unknown function> + 0x6ff638 (0x7f502aaff638 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
2025-06-19 16:20 frame #9: <unknown function> + 0x6ffa60 (0x7f502aaffa60 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
2025-06-19 16:20 frame #10: /usr/bin/python() [0x559121]
2025-06-19 16:20 frame #11: /usr/bin/python() [0x610bf5]
2025-06-19 16:20 frame #12: /usr/bin/python() [0x610c05]
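
CUDA kernel launches are asynchronous, so the location in the error above (`internode.cu:980`) may be where the failure was detected rather than the launch that actually faulted. One common way to narrow this down, which the report above does not say was tried, is to rerun with `CUDA_LAUNCH_BLOCKING=1` so that each launch is synchronized and errors surface closer to their source. A minimal sketch; the helper name `enable_sync_cuda_errors` is hypothetical and not part of DeepEP or Megatron-LM:

```python
import os


def enable_sync_cuda_errors(env=None):
    """Return a copy of `env` with CUDA_LAUNCH_BLOCKING=1 set.

    Hypothetical helper for illustration. If `env` is None, the current
    process environment is copied. The input mapping is not modified.
    """
    new_env = dict(os.environ if env is None else env)
    # CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, which slows
    # the run considerably but makes the reported error location meaningful.
    new_env["CUDA_LAUNCH_BLOCKING"] = "1"
    return new_env


# Build the environment to pass to the launcher process,
# e.g. subprocess.run(cmd, env=debug_env).
debug_env = enable_sync_cuda_errors({"PATH": "/usr/bin"})
```

The variable has to be in the environment before the CUDA context is created, so it is best set for the launcher process rather than inside the training script after `torch` has initialized CUDA. Expect the run to be much slower, so this is only suitable for reproducing the failure.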