zipformer wenetspeech #1130
Hi @pkufool, could you explain why the blank penalty improves accuracy?
The best results:
The model (non-streaming): https://huggingface.co/pkufool/icefall-asr-zipformer-wenetspeech-20230615
Comparison with other open-source results:
Comparison with our previous results (pruned_transducer_stateless5):
Did you use CTC loss during training?
No.
@csukuangfj I made some changes to
Agreed. You can use symlinks to avoid additional copies.
Then, you can have a look at the changes under
We don't really know why we had to compensate for deletion errors in this particular setup, because we haven't seen this effect in other zipformer examples or in other types of system on this data. If it recurs we may develop a better theory.
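To make the mechanism under discussion concrete, here is a minimal sketch of one greedy-search step. It assumes the blank penalty is a constant subtracted from the blank logit before the argmax, and that blank has ID 0. The function name `greedy_step` and the toy logits are hypothetical and are not taken from the recipe.

```python
from typing import List


def greedy_step(logits: List[float], blank_penalty: float = 0.0, blank_id: int = 0) -> int:
    """Pick the best token for one frame after penalizing the blank logit.

    Assumption: the penalty is subtracted from the blank logit only.
    """
    adjusted = list(logits)  # copy so the caller's list is not modified
    adjusted[blank_id] -= blank_penalty
    # argmax over token IDs
    return max(range(len(adjusted)), key=adjusted.__getitem__)


# Toy frame: blank (ID 0) narrowly beats token 1.
frame = [2.0, 1.5, 0.3]

no_penalty = greedy_step(frame)                        # blank is emitted
with_penalty = greedy_step(frame, blank_penalty=1.0)   # token 1 is emitted instead
```

Under this assumption, a frame that would have been decoded as blank can produce a token once the penalty is applied, which is how a penalty of this kind can trade deletion errors for insertions.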
params.vocab_size = sp.get_piece_size()
token_table = k2.SymbolTable.from_file(params.tokens)
params.vocab_size = num_tokens(token_table)
It should be
params.vocab_size = num_tokens(token_table) + 1
The +1 is missing.
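A possible reason for the +1, sketched under an assumption that is not stated in this thread: the `num_tokens` helper skips ID 0 (blank) and disambiguation symbols such as `#0`, so the model's output dimension has to add one slot back for blank. The function `count_tokens` and the toy table below are hypothetical stand-ins, not the icefall code.

```python
import re
from typing import Dict

# Assumed pattern for disambiguation symbols such as "#0", "#1".
DISAMBIG = re.compile(r"^#\d+$")


def count_tokens(sym2id: Dict[str, int]) -> int:
    """Count tokens, excluding disambiguation symbols and ID 0 (assumed behavior)."""
    ids = [i for sym, i in sym2id.items() if not DISAMBIG.match(sym)]
    n = len(ids)
    if 0 in ids:
        n -= 1  # ID 0 is reserved for blank and is not counted
    return n


# Hypothetical tokens file contents: symbol -> ID.
toy_table = {"<blk>": 0, "<sos/eos>": 1, "<unk>": 2, "你": 3, "好": 4, "#0": 5}

# The output layer needs one unit per ID in 0..4, so add the blank slot back.
vocab_size = count_tokens(toy_table) + 1
```

With this toy table the count is 4 and the vocabulary size is 5, which matches the largest non-disambiguation ID plus one.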
@pkufool I noticed that you used 6 GPUs to train the zipformer2, and that each epoch takes 22 hours. What kind of GPU did you use?
32 GB NVIDIA Tesla V100.
I see. Have you noticed that the latest lhotse makes zipformer training much faster than before? Do you have a similar experience?
We haven't trained it recently; we will try it. Thanks!
Does the deletion error usually occur at the beginning, middle, or end of the decoded result?
Hi guys, I added a similar penalty to a CTC-based conformer and found that it really helps. I guess this is caused by the training dataset (wenetspeech), which contains many low-quality paired samples. For more information, please see wenet-e2e/wenet#2278
This is a very interesting phenomenon, and I believe it's worth our time to delve deeper into the underlying principles together.
@xingchensong FYI, blank-penalty does not help on the large zipformer model (around 148M params). Yes, a very interesting phenomenon.
@xingchensong What's your model size? We found that small and medium zipformer models need blank-penalty, but the large model (more powerful?) does not. Maybe you can try increasing the number of parameters to see whether the same holds for your model.
116.9M, trained under unified streaming & non-streaming mode. https://e2qq6pi6j9.feishu.cn/docx/EFpod2n30omXITx08OAcMSjlnxd
Hi, I used 4 GPUs to train the streaming zipformer model (rnnt-loss) on wenetspeech. The parameter settings are the same as those provided on Hugging Face (https://huggingface.co/pkufool/icefall-asr-zipformer-streaming-wenetspeech-20230615/tree/main/logs/training), and the training/validation loss looks good, but the test results (WER) are not as good as the pre-trained model's: about 1-4% WER higher than the models provided on Hugging Face at each epoch. For example, at epoch 9 (1 average, chunk-size=32; left-context-frames=256), I got 19.6% WER on the MEETING test set.
What value of max-duration are you using? For the pretrained model, what's the result at epoch 9 with the decoding setup: 1 average, chunk-size=32; left-context-frames=256?
This is the wenetspeech recipe on the latest zipformer model (modeling with characters).
Non streaming model
The training command (use the default medium size model)
Best results for each epoch
The influence of blank-penalty (greedy search result at epoch 12)
Streaming model
The training command (use the default medium size model)
Best results for each epoch (--chunk-size=16; --left-context-frames=128)
The influence of blank-penalty (greedy search result at epoch 12, chunk-size=32,left-context-frames=128)
The decoding result for different latency (greedy search results)
--chunk-size=16; --left-context-frames=64
--chunk-size=32; --left-context-frames=128
--chunk-size=64; --left-context-frames=256
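To relate the three decoding setups above to latency, here is a rough sketch. It assumes that chunk-size and left-context-frames are counted in frames of 20 ms each; that frame duration is an assumption and is not stated in this thread. The names `FRAME_MS` and `frames_to_ms` are hypothetical.

```python
from typing import List, Tuple

FRAME_MS = 20  # assumed duration of one frame in milliseconds


def frames_to_ms(num_frames: int, frame_ms: int = FRAME_MS) -> int:
    """Convert a frame count to milliseconds under the assumed frame duration."""
    return num_frames * frame_ms


# (chunk-size, left-context-frames) pairs listed above.
setups: List[Tuple[int, int]] = [(16, 64), (32, 128), (64, 256)]

# (chunk duration in ms, left-context duration in ms) for each setup.
durations = [(frames_to_ms(c), frames_to_ms(lc)) for c, lc in setups]

for (chunk, left), (chunk_ms, left_ms) in zip(setups, durations):
    print(f"chunk-size={chunk} ({chunk_ms} ms), left-context-frames={left} ({left_ms} ms)")
```

Under this assumption the chunk durations double from one setup to the next, so each step trades a longer wait per chunk for more context.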