
By default in Megatron, the condition `ibgda_get_state()->num_rc_per_pe == num_channels || ibgda_get_state()->num_rc_per_pe >= num_sms` fails #226

@kuozhang

Description

After updating DeepEP to 7f97f79bda051d991aa9681c04138007faa0366, it fails:

2025-06-19 16:18 [rank5]:   File "/opt/Megatron-LM/megatron/core/transformer/moe/moe_layer.py", line 156, in forward
2025-06-19 16:18 [rank5]:     output, mlp_bias = custom_forward(hidden_states)
2025-06-19 16:18 [rank5]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-06-19 16:18 [rank5]:   File "/opt/Megatron-LM/megatron/core/transformer/moe/moe_layer.py", line 141, in custom_forward
2025-06-19 16:18 [rank5]:     (dispatched_input, tokens_per_expert) = self.token_dispatcher.token_permutation(
2025-06-19 16:18 [rank5]:                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-06-19 16:18 [rank5]:   File "/opt/Megatron-LM/megatron/core/transformer/moe/token_dispatcher.py", line 982, in token_permutation
2025-06-19 16:18 [rank5]:     hidden_states = self._comm_manager.dispatch(hidden_states)
2025-06-19 16:18 [rank5]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-06-19 16:18 [rank5]:   File "/opt/Megatron-LM/megatron/core/transformer/moe/token_dispatcher.py", line 826, in dispatch
2025-06-19 16:18 [rank5]:     fused_dispatch(
2025-06-19 16:18 [rank5]:   File "/opt/Megatron-LM/megatron/core/transformer/moe/fused_a2a.py", line 182, in fused_dispatch
2025-06-19 16:18 [rank5]:     return FusedDispatch.apply(
2025-06-19 16:18 [rank5]:            ^^^^^^^^^^^^^^^^^^^^
2025-06-19 16:18 [rank5]:   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 578, in apply
2025-06-19 16:18 [rank5]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
2025-06-19 16:18 [rank5]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-06-19 16:18 [rank5]:   File "/opt/Megatron-LM/megatron/core/transformer/moe/fused_a2a.py", line 98, in forward
2025-06-19 16:18 [rank5]:     ) = buffer.dispatch(
2025-06-19 16:18 [rank5]:         ^^^^^^^^^^^^^^^^
2025-06-19 16:18 [rank5]:   File "/usr/local/lib/python3.12/dist-packages/deep_ep/buffer.py", line 311, in dispatch
2025-06-19 16:18 [rank5]:     return self.internode_dispatch(x, handle, num_tokens_per_rank, num_tokens_per_rdma_rank, is_token_in_rank, num_tokens_per_expert,
2025-06-19 16:18 [rank5]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-06-19 16:18 [rank5]:   File "/usr/local/lib/python3.12/dist-packages/deep_ep/buffer.py", line 421, in internode_dispatch
2025-06-19 16:18 [rank5]:     recv_src_meta, send_rdma_head, send_nvl_head, event = self.runtime.internode_dispatch(
2025-06-19 16:18 [rank5]:                                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-06-19 16:18 [rank5]: RuntimeError: Failed: CUDA error /opt/DeepEP/csrc/kernels/internode.cu:980 'unspecified launch failure'
2025-06-19 16:20 terminate called after throwing an instance of 'c10::Error'
2025-06-19 16:20   what():  CUDA error: unspecified launch failure
2025-06-19 16:20 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2025-06-19 16:20 
2025-06-19 16:20 Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
2025-06-19 16:20 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f502a38b368 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
2025-06-19 16:20 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f502a3204a6 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
2025-06-19 16:20 frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f50359ae2a2 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
2025-06-19 16:20 frame #3: <unknown function> + 0x1e79f (0x7f503597679f in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
2025-06-19 16:20 frame #4: <unknown function> + 0x20060 (0x7f5035978060 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
2025-06-19 16:20 frame #5: <unknown function> + 0x2028c (0x7f503597828c in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
2025-06-19 16:20 frame #6: <unknown function> + 0x449af2 (0x7f502a849af2 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
2025-06-19 16:20 frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f502a365cb9 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
2025-06-19 16:20 frame #8: <unknown function> + 0x6ff638 (0x7f502aaff638 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
2025-06-19 16:20 frame #9: <unknown function> + 0x6ffa60 (0x7f502aaffa60 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
2025-06-19 16:20 frame #10: /usr/bin/python() [0x559121]
2025-06-19 16:20 frame #11: /usr/bin/python() [0x610bf5]
2025-06-19 16:20 frame #12: /usr/bin/python() [0x610c05]
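
For context, the condition in the title compares the NVSHMEM IBGDA state's num_rc_per_pe (the number of RC QPs created per PE) against DeepEP's channel/SM count. Below is a minimal workaround sketch, assuming the QP count is driven by NVSHMEM's NVSHMEM_IBGDA_NUM_RC_PER_PE environment variable and that 24 is the SM count DeepEP is configured with (both are assumptions, not confirmed in this issue):

    import os

    # Assumption: DeepEP's host-side check compares NVSHMEM's per-PE RC QP count
    # against the number of SMs/channels it launches with, so align the two here.
    num_sms = 24  # hypothetical value; use whatever the DeepEP buffer is configured with

    # Must be exported before NVSHMEM (and therefore the DeepEP buffer) initializes,
    # e.g. in the launch script or before the first Buffer is constructed.
    os.environ.setdefault("NVSHMEM_IBGDA_NUM_RC_PER_PE", str(num_sms))

    # from deep_ep import Buffer
    # Buffer.set_num_sms(num_sms)     # hypothetical: keep the SM count consistent
    # buffer = Buffer(group, ...)     # construct the buffer only after the env is set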
