[transformer] Add moe_noisy_gate #2495
base: main
Conversation
Merge main first.
If there is a paper link, please post it here.
Sure, this follows the Google paper: https://arxiv.org/pdf/1701.06538.pdf
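For context, the gate in that paper adds input-dependent Gaussian noise to the router logits before top-k expert selection. A minimal PyTorch sketch of the idea (class and argument names are mine, not this PR's):

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    """Noisy top-k gating, sketched from Shazeer et al. (2017).

    H(x)_i = (x @ W_g)_i + StandardNormal() * softplus((x @ W_noise)_i)
    The top-k logits are kept and softmaxed; the rest get weight 0.
    """

    def __init__(self, idim: int, n_expert: int, topk: int = 2):
        super().__init__()
        self.gate = nn.Linear(idim, n_expert, bias=False)        # W_g
        self.noisy_gate = nn.Linear(idim, n_expert, bias=False)  # W_noise
        self.topk = topk

    def forward(self, xs: torch.Tensor) -> torch.Tensor:
        router = self.gate(xs)  # clean logits, shape (..., n_expert)
        # Input-dependent Gaussian noise; softplus keeps the std positive.
        noise_std = F.softplus(self.noisy_gate(xs))
        router = router + torch.randn_like(router) * noise_std
        # Keep the top-k logits, softmax over them, scatter back to full size.
        topk_val, topk_idx = router.topk(self.topk, dim=-1)
        weights = torch.zeros_like(router).scatter(
            -1, topk_idx, torch.softmax(topk_val, dim=-1))
        return weights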
Put the link below the class. Does this mainly speed up convergence, or does the final result improve as well?
if self.gate_type == 'noisy':
    # Input-dependent noise: std = softplus(noise logits), as in the
    # linked paper; perturbs the router logits before expert selection.
    noisy_router = self.noisy_gate(xs)
    noisy_router = torch.randn_like(router) * F.softplus(noisy_router)
    router = router + noisy_router
Is this also needed at inference? My understanding is that it mainly serves training, to keep some experts from never being selected and therefore never trained.
In theory it is not needed at inference; we could run an experiment to measure how much difference it makes.
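A sketch of that train-only variant, written as a helper the forward pass could call with `self.training` (the function name is mine; the tensor names follow the diff above):

import torch
import torch.nn.functional as F

def apply_noisy_gate(router: torch.Tensor,
                     noise_logits: torch.Tensor,
                     training: bool) -> torch.Tensor:
    """Perturb router logits only during training (sketch).

    `router` are the clean gate logits; `noise_logits` is the output of
    the noise projection (the diff's `self.noisy_gate(xs)`).
    """
    if not training:
        # Inference routes with the clean logits.
        return router
    return router + torch.randn_like(router) * F.softplus(noise_logits)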
I'll run a full training pass and check. In my earlier tests the final result also improved, but back then the MoE was not used in the encoder.
Got it, Zhou; I'll look into this.
How is it going? Any final results yet?
The model is still training; the GPU is a bit slow... results should be ready tomorrow.
Still waiting for the results.
Tested noisy-MoE in the decoder. As with large language models, it performs better when used in the decoder. I don't have enough GPU memory to run MoE in both the encoder and the decoder, so that setup still needs someone else to verify.
Why does it work better in the decoder?
My guess is that with a small dataset, adding noise to an encoder MoE makes training more balanced across experts, but each expert is then hard to train sufficiently, so the result gets worse. I'm also trying MoE only in the last few layers to see how that performs.
Updating the experimental results for the method Zhou posted: the number of encoder experts has to be chosen according to the data size; routing that is too sparse hurts performance.
May I ask…
Thanks.
No need to test the streaming case; I just wanted to know the decoding strategy here.
Added a noisy gate.
Experimental results (AISHELL-1, 20 epochs):
how to use:
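The usage note above appears to be truncated. As a hypothetical example, the NoisyTopKGate sketch from earlier in the thread could be exercised like this (shapes and dimensions are assumptions, not this PR's verified API):

import torch

# Hypothetical usage of the NoisyTopKGate sketch above; the dimensions
# mirror a typical 256-dim encoder layer with 8 experts and top-2 routing.
gate = NoisyTopKGate(idim=256, n_expert=8, topk=2)

xs = torch.randn(4, 100, 256)  # (batch, time, dim)
weights = gate(xs)             # (4, 100, 8); only 2 experts nonzero per position
print(weights.sum(-1))         # selected routing weights sum to 1.0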