[transformer] Add moe_noisy_gate #2495
base: main
Conversation
Merge main first.
If there is a paper link, please post it here.
Sure, this follows the Google paper: https://arxiv.org/pdf/1701.06538.pdf
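For context, the gate in that paper adds input-dependent Gaussian noise to the router logits before top-k expert selection. A minimal PyTorch sketch of the idea (class and argument names are mine, not this PR's):

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKGate(nn.Module):
    """Noisy top-k gating, sketched from Shazeer et al. (2017).

    H(x)_i = (x @ W_g)_i + StandardNormal() * softplus((x @ W_noise)_i)
    The top-k logits are kept and softmaxed; the rest get weight 0.
    """

    def __init__(self, idim: int, n_expert: int, topk: int = 2):
        super().__init__()
        self.gate = nn.Linear(idim, n_expert, bias=False)        # W_g
        self.noisy_gate = nn.Linear(idim, n_expert, bias=False)  # W_noise
        self.topk = topk

    def forward(self, xs: torch.Tensor) -> torch.Tensor:
        router = self.gate(xs)  # clean logits, shape (..., n_expert)
        # Input-dependent Gaussian noise; softplus keeps the std positive.
        noise_std = F.softplus(self.noisy_gate(xs))
        router = router + torch.randn_like(router) * noise_std
        # Keep the top-k logits, softmax over them, scatter back to full size.
        topk_val, topk_idx = router.topk(self.topk, dim=-1)
        weights = torch.zeros_like(router).scatter(
            -1, topk_idx, torch.softmax(topk_val, dim=-1))
        return weights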
Put the link below the class. Does this mainly speed up convergence, or does the final result improve as well?
if self.gate_type == 'noisy':
    # Input-dependent noise: std = softplus(noise logits), as in the
    # linked paper; perturbs the router logits before expert selection.
    noisy_router = self.noisy_gate(xs)
    noisy_router = torch.randn_like(router) * F.softplus(noisy_router)
    router = router + noisy_router
Is this also needed at inference? My understanding is that it mainly serves training, to keep some experts from never being selected and therefore never trained.
In theory it is not needed at inference; we could run an experiment to measure how much difference it makes.
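A sketch of that train-only variant, written as a helper the forward pass could call with `self.training` (the function name is mine; the tensor names follow the diff above):

import torch
import torch.nn.functional as F

def apply_noisy_gate(router: torch.Tensor,
                     noise_logits: torch.Tensor,
                     training: bool) -> torch.Tensor:
    """Perturb router logits only during training (sketch).

    `router` are the clean gate logits; `noise_logits` is the output of
    the noise projection (the diff's `self.noisy_gate(xs)`).
    """
    if not training:
        # Inference routes with the clean logits.
        return router
    return router + torch.randn_like(router) * F.softplus(noise_logits)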
I'll run a full training pass and check. In my earlier tests the final result also improved, but back then the MoE was not used in the encoder.
Got it, Zhou; I'll look into this.
How is it going? Any final results yet?
The model is still training; the GPU is a bit slow... results should be ready tomorrow.
Still waiting for the results.
Tested noisy-MoE in the decoder. As with large language models, it performs better when used in the decoder. I don't have enough GPU memory to run MoE in both the encoder and the decoder, so that setup still needs someone else to verify.
Why does it work better in the decoder?
My guess is that with a small dataset, adding noise to an encoder MoE makes training more balanced across experts, but each expert is then hard to train sufficiently, so the result gets worse. I'm also trying MoE only in the last few layers to see how that performs.
Updating the experimental results for the method Zhou posted: the number of encoder experts has to be chosen according to the data size; routing that is too sparse hurts performance.
May I ask…
Thanks.
No need to test the streaming case; I just wanted to know the decoding strategy here.
Added a noisy gate.
Experimental results (AISHELL-1, 20 epochs):
how to use:
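The usage note above appears to be truncated. As a hypothetical example, the NoisyTopKGate sketch from earlier in the thread could be exercised like this (shapes and dimensions are assumptions, not this PR's verified API):

import torch

# Hypothetical usage of the NoisyTopKGate sketch above; the dimensions
# mirror a typical 256-dim encoder layer with 8 experts and top-2 routing.
gate = NoisyTopKGate(idim=256, n_expert=8, topk=2)

xs = torch.randn(4, 100, 256)  # (batch, time, dim)
weights = gate(xs)             # (4, 100, 8); only 2 experts nonzero per position
print(weights.sum(-1))         # selected routing weights sum to 1.0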