
By default in Megatron, the condition `ibgda_get_state()->num_rc_per_pe == num_channels || ibgda_get_state()->num_rc_per_pe >= num_sms` fails #226

@kuozhang

Description

After updating DeepEP to 7f97f79bda051d991aa9681c04138007faa0366, it fails:

2025-06-19 16:18 [rank5]:   File "/opt/Megatron-LM/megatron/core/transformer/moe/moe_layer.py", line 156, in forward
2025-06-19 16:18 [rank5]:     output, mlp_bias = custom_forward(hidden_states)
2025-06-19 16:18 [rank5]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-06-19 16:18 [rank5]:   File "/opt/Megatron-LM/megatron/core/transformer/moe/moe_layer.py", line 141, in custom_forward
2025-06-19 16:18 [rank5]:     (dispatched_input, tokens_per_expert) = self.token_dispatcher.token_permutation(
2025-06-19 16:18 [rank5]:                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-06-19 16:18 [rank5]:   File "/opt/Megatron-LM/megatron/core/transformer/moe/token_dispatcher.py", line 982, in token_permutation
2025-06-19 16:18 [rank5]:     hidden_states = self._comm_manager.dispatch(hidden_states)
2025-06-19 16:18 [rank5]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-06-19 16:18 [rank5]:   File "/opt/Megatron-LM/megatron/core/transformer/moe/token_dispatcher.py", line 826, in dispatch
2025-06-19 16:18 [rank5]:     fused_dispatch(
2025-06-19 16:18 [rank5]:   File "/opt/Megatron-LM/megatron/core/transformer/moe/fused_a2a.py", line 182, in fused_dispatch
2025-06-19 16:18 [rank5]:     return FusedDispatch.apply(
2025-06-19 16:18 [rank5]:            ^^^^^^^^^^^^^^^^^^^^
2025-06-19 16:18 [rank5]:   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 578, in apply
2025-06-19 16:18 [rank5]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
2025-06-19 16:18 [rank5]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-06-19 16:18 [rank5]:   File "/opt/Megatron-LM/megatron/core/transformer/moe/fused_a2a.py", line 98, in forward
2025-06-19 16:18 [rank5]:     ) = buffer.dispatch(
2025-06-19 16:18 [rank5]:         ^^^^^^^^^^^^^^^^
2025-06-19 16:18 [rank5]:   File "/usr/local/lib/python3.12/dist-packages/deep_ep/buffer.py", line 311, in dispatch
2025-06-19 16:18 [rank5]:     return self.internode_dispatch(x, handle, num_tokens_per_rank, num_tokens_per_rdma_rank, is_token_in_rank, num_tokens_per_expert,
2025-06-19 16:18 [rank5]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-06-19 16:18 [rank5]:   File "/usr/local/lib/python3.12/dist-packages/deep_ep/buffer.py", line 421, in internode_dispatch
2025-06-19 16:18 [rank5]:     recv_src_meta, send_rdma_head, send_nvl_head, event = self.runtime.internode_dispatch(
2025-06-19 16:18 [rank5]:                                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-06-19 16:18 [rank5]: RuntimeError: Failed: CUDA error /opt/DeepEP/csrc/kernels/internode.cu:980 'unspecified launch failure'
2025-06-19 16:20 terminate called after throwing an instance of 'c10::Error'
2025-06-19 16:20   what():  CUDA error: unspecified launch failure
2025-06-19 16:20 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2025-06-19 16:20 
2025-06-19 16:20 Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
2025-06-19 16:20 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f502a38b368 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
2025-06-19 16:20 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f502a3204a6 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
2025-06-19 16:20 frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f50359ae2a2 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
2025-06-19 16:20 frame #3: <unknown function> + 0x1e79f (0x7f503597679f in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
2025-06-19 16:20 frame #4: <unknown function> + 0x20060 (0x7f5035978060 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
2025-06-19 16:20 frame #5: <unknown function> + 0x2028c (0x7f503597828c in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
2025-06-19 16:20 frame #6: <unknown function> + 0x449af2 (0x7f502a849af2 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
2025-06-19 16:20 frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f502a365cb9 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
2025-06-19 16:20 frame #8: <unknown function> + 0x6ff638 (0x7f502aaff638 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
2025-06-19 16:20 frame #9: <unknown function> + 0x6ffa60 (0x7f502aaffa60 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
2025-06-19 16:20 frame #10: /usr/bin/python() [0x559121]
2025-06-19 16:20 frame #11: /usr/bin/python() [0x610bf5]
2025-06-19 16:20 frame #12: /usr/bin/python() [0x610c05]
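
For context, the condition in the title compares the NVSHMEM IBGDA state's num_rc_per_pe (the number of RC QPs created per PE) against DeepEP's channel/SM count. Below is a minimal workaround sketch, assuming the QP count is driven by NVSHMEM's NVSHMEM_IBGDA_NUM_RC_PER_PE environment variable and that 24 is the SM count DeepEP is configured with (both are assumptions, not confirmed in this issue):

    import os

    # Assumption: DeepEP's host-side check compares NVSHMEM's per-PE RC QP count
    # against the number of SMs/channels it launches with, so align the two here.
    num_sms = 24  # hypothetical value; use whatever the DeepEP buffer is configured with

    # Must be exported before NVSHMEM (and therefore the DeepEP buffer) initializes,
    # e.g. in the launch script or before the first Buffer is constructed.
    os.environ.setdefault("NVSHMEM_IBGDA_NUM_RC_PER_PE", str(num_sms))

    # from deep_ep import Buffer
    # Buffer.set_num_sms(num_sms)     # hypothetical: keep the SM count consistent
    # buffer = Buffer(group, ...)     # construct the buffer only after the env is set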
