Question regarding the load importance loss calculation #240

wangyirui · 2024-06-15T14:46:16Z

Hi, while studying the load importance loss, I noticed that the arguments passed to load_importance_loss are the softmax-normalized scores and the raw logits with noise (see moe_layer.py L281). Why is the diff computed between the softmax-normalized scores and the raw noisy logits? Why not use the softmax output consistently for both? Thanks!

ghostplant · 2024-06-15T15:49:48Z

Hi, the standard GShard MoE follows the branch self.is_gshard_loss == True, while the loss option you pointed out was designed for, and is preferred by, Swin-Transformer MoE.

According to load_importance_loss, defined in https://github.com/microsoft/tutel/blob/main/tutel/impls/losses.py#L29, normalization has to be applied directly to the score tensor before any noise is added, which keeps the normalized results from being polluted by the noise. @zeliu98
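To make the two inputs concrete, here is a minimal pure-Python sketch of a load/importance-style loss, written from the description in this thread rather than copied from Tutel. Everything in it is an assumption for illustration: the helper names (`softmax`, `normal_cdf`, `cv_squared`, `load_importance_loss_sketch`), the choice of `gate_noise / num_experts` as the noise standard deviation, the averaging of the two terms, and the fixed "noise" values. It only shows how noise-free softmax scores and the top-k noisy logits can each play a separate role.

```python
import math

def softmax(row):
    # Numerically stable softmax over one token's expert logits.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def normal_cdf(x, std):
    # CDF of a zero-mean normal distribution with standard deviation `std`.
    return 0.5 * (1.0 + math.erf(x / (std * math.sqrt(2.0))))

def cv_squared(values):
    # Squared coefficient of variation: unbiased variance / mean^2.
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return var / (mean ** 2 + 1e-10)

def load_importance_loss_sketch(scores_wo_noise, topk_noisy_logits,
                                num_experts, gate_noise):
    # Assumption: the noise std is gate_noise / num_experts.
    std = gate_noise / num_experts

    # Importance: per-expert sum of the noise-free softmax scores.
    importance = [sum(row[e] for row in scores_wo_noise)
                  for e in range(num_experts)]

    # Load: per-expert sum of the probability that the noise-free score
    # exceeds the k-th largest noisy logit of the same token.
    load = [0.0] * num_experts
    for row, topk in zip(scores_wo_noise, topk_noisy_logits):
        threshold = topk[-1]  # assumes topk is sorted in descending order
        for e in range(num_experts):
            load[e] += normal_cdf(row[e] - threshold, std)

    # Assumption: the two balance terms are simply averaged.
    return (cv_squared(importance) + cv_squared(load)) / 2.0

# Hypothetical toy data: 2 tokens, 4 experts, top-2 routing.
logits = [[2.0, 1.0, 0.5, 0.1], [0.3, 1.5, 0.2, 1.0]]
noise = [[0.1, -0.2, 0.05, 0.0], [0.0, 0.1, -0.1, 0.2]]  # fixed for repeatability

scores = [softmax(r) for r in logits]  # normalized without noise
noisy = [[l + n for l, n in zip(lr, nr)] for lr, nr in zip(logits, noise)]
topk = [sorted(r, reverse=True)[:2] for r in noisy]  # top-2 noisy logits per token

loss = load_importance_loss_sketch(scores, topk, num_experts=4, gate_noise=1.0)
print(loss)
```

In this sketch the noise only influences which threshold each token gets, while the quantities being summed and balanced come from the noise-free scores.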
