How to use Megablocks in MoE training #236

Open
CSCYQJ opened this issue Jun 5, 2024 · 1 comment


CSCYQJ commented Jun 5, 2024

I noticed the release note "Tutel v0.3: Add Megablocks solution to improve decoder inference on single-GPU with num_local_expert >= 2", but when I use Megablocks in MoE training (dropless MoE), the following error occurs:
[screenshot of the error traceback]
I believe the cause is that torch.ops.tutel_ops.sparse_bmm_infer does not support the backward operation.

ghostplant (Contributor) commented Jun 6, 2024

Megablocks is disabled in training mode because the optimization isn't useful for models with a single expert per GPU, which is the typical setup in large-scale training. So in training mode, please set megablocks_size=0 if self.training.

Megablocks makes two assumptions: (1) there is more than one local expert per GPU; (2) the load across local experts is imbalanced. Unless you want to train an imbalanced model on purpose by disabling the balance loss, Megablocks won't help training performance.
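
A minimal sketch of how this advice could be applied, assuming (as the reply suggests) that `megablocks_size` is accepted as a keyword argument of the MoE layer's forward call in Tutel v0.3+; the surrounding module, gate settings, and dimensions are illustrative, so check them against your Tutel version:

```python
import torch
from tutel import moe as tutel_moe

class MoEBlock(torch.nn.Module):
    def __init__(self, model_dim=1024, num_local_experts=2):
        super().__init__()
        # Tutel MoE layer with >= 2 local experts (the case Megablocks targets).
        self.moe = tutel_moe.moe_layer(
            gate_type={'type': 'top', 'k': 2},
            model_dim=model_dim,
            experts={
                'type': 'ffn',
                'count_per_node': num_local_experts,
                'hidden_size_per_expert': 4 * model_dim,
            },
        )

    def forward(self, x):
        # Megablocks only helps single-GPU inference with >= 2 local experts;
        # torch.ops.tutel_ops.sparse_bmm_infer has no backward, so the path
        # must stay disabled while training.
        megablocks_size = 0 if self.training else 1
        return self.moe(x, megablocks_size=megablocks_size)
```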
