Hi, when studying the load importance loss, I found the parameters passed to the function load_importance_loss are softmax normalized scores and logits with noise (see moe_layer.py L281). I am wondering why we use the softmax normalized score to calculate the diff against the raw logits with noise? Why not consistently use the softmax output for both? Thanks!
Hi, the standard GShard MoE loss follows the branch self.is_gshard_loss == True, while the loss option you pointed out was designed for, and is preferred by, Swin-Transformer MoE.
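For readers following along, here is a minimal sketch of what a load importance loss of this kind can look like. It is not Tutel's code. It assumes the loss follows the familiar noisy top-k gating recipe: an importance term built from the softmax scores, plus a load term that compares each score against the k-th largest noisy logit through a normal CDF. The names `load_importance_loss_sketch`, `cv_squared`, `top_k` and `noise_std` are hypothetical, and the use of population variance is an arbitrary choice for the sketch.

```python
import math
from statistics import NormalDist, mean, pvariance


def softmax(row):
    """Numerically stable softmax over one token's expert logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cv_squared(values, eps=1e-10):
    """Squared coefficient of variation: variance / mean^2."""
    mu = mean(values)
    return pvariance(values) / (mu * mu + eps)


def load_importance_loss_sketch(scores, noisy_logits, top_k, noise_std):
    """Hypothetical sketch, not the library implementation.

    scores:       per-token softmax scores (computed without noise)
    noisy_logits: per-token raw logits with noise added
    top_k:        number of experts selected per token
    noise_std:    assumed standard deviation of the gating noise (> 0)
    """
    num_experts = len(scores[0])

    # Importance: total softmax mass each expert receives over the batch.
    importance = [sum(row[e] for row in scores) for e in range(num_experts)]

    # Load: smooth estimate of how often each expert clears the top-k cut.
    # The cut is the k-th largest *noisy logit*, while the value compared
    # against it is the *softmax score*, which is the pairing asked about.
    normal = NormalDist(0.0, noise_std)
    load = [0.0] * num_experts
    for score_row, noisy_row in zip(scores, noisy_logits):
        threshold = sorted(noisy_row, reverse=True)[top_k - 1]
        for e in range(num_experts):
            load[e] += normal.cdf(score_row[e] - threshold)

    return 0.5 * (cv_squared(importance) + cv_squared(load))


# Two tokens, two experts. Noise is omitted here to keep the run deterministic.
balanced_logits = [[1.0, 0.0], [0.0, 1.0]]
skewed_logits = [[2.0, 0.0], [2.0, 0.0]]

balanced = load_importance_loss_sketch(
    [softmax(r) for r in balanced_logits], balanced_logits, top_k=1, noise_std=0.5
)
skewed = load_importance_loss_sketch(
    [softmax(r) for r in skewed_logits], skewed_logits, top_k=1, noise_std=0.5
)
```

Under these assumptions the loss is near zero when both experts receive the same share of tokens and grows when every token prefers the same expert. The sketch only illustrates the shape of the computation; the actual scaling and inputs should be checked against moe_layer.py.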