zipformer BF16 training recipe #1700
Conversation
Performance benchmark on LibriSpeech train-clean-100. All models are trained with 2 A100 GPUs, and WERs are obtained with greedy search.
The WERs are about the same (AMP slightly better than bf16), but bf16 training consumes much less GPU memory. The training speed is the same, although we still need to check this with larger max-duration and larger model sizes.
Experiments on full LibriSpeech. The model is trained for 30 epochs using 4 A100 GPUs. WERs are obtained using greedy search.
There is a notable performance gap (around 5% relative) between AMP (2.25/5.06) and full bf16 (2.4/5.31). We will investigate the reason and try to reduce the gap.
Since AMP also supports bf16 as the autocast dtype, I tried this as well (with minor modifications to the original code). See this post for more instructions on how to use AMP. The results are shown below.
So AMP+bf16 training achieves better WER than full bf16, while the peak GPU memory and time per epoch increase slightly.
Detailed experiment results:
The models are uploaded to Hugging Face:
- The AMP bf16 zipformer-M model: https://huggingface.co/marcoyang/icefall-zipformer-M-librispeech-amp-bf16
- The full bf16 zipformer-M model: https://huggingface.co/marcoyang/icefall-zipformer-M-librispeech-bf16
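If you want to fetch either checkpoint locally, a minimal sketch using the `huggingface_hub` package (not part of the recipe itself) could look like this:

```python
from huggingface_hub import snapshot_download

# Download each model repo into the local Hugging Face cache and
# return the path to the downloaded snapshot.
amp_bf16_dir = snapshot_download(
    repo_id="marcoyang/icefall-zipformer-M-librispeech-amp-bf16"
)
full_bf16_dir = snapshot_download(
    repo_id="marcoyang/icefall-zipformer-M-librispeech-bf16"
)
print(amp_bf16_dir, full_bf16_dir)
```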
I decided to only support AMP+bf16 training in this PR, since full bf16 training loses too much accuracy.
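For reference, here is a minimal sketch of what AMP+bf16 training looks like at the PyTorch level. The model, optimizer, and data below are toy placeholders for illustration only, not the actual Zipformer recipe code:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the model, optimizer, and dataloader (assumptions for
# illustration; the real recipe sets these up in its training script).
model = nn.Linear(80, 500).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
batches = [torch.randn(8, 80, device="cuda") for _ in range(3)]

# bf16 shares fp32's exponent range, so loss scaling is unnecessary; keeping a
# disabled GradScaler leaves the existing AMP code path unchanged.
scaler = torch.cuda.amp.GradScaler(enabled=False)

for feats in batches:
    optimizer.zero_grad()
    # Autocast runs the forward pass in bf16 where safe; parameters and
    # optimizer states stay in fp32.
    with torch.cuda.amp.autocast(dtype=torch.bfloat16):
        loss = model(feats).float().pow(2).mean()  # dummy loss
    scaler.scale(loss).backward()  # scaling is a no-op when the scaler is disabled
    scaler.step(optimizer)         # falls back to a plain optimizer.step()
    scaler.update()
```

Compared with fp16 AMP, the only changes are the autocast dtype and the fact that the scaler no longer has to perform real loss scaling.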
This PR shows how to enable bf16 training of Zipformer. It's not intended for merging at this moment.
Advantages of using bf16 training:
- Lower peak GPU memory usage (see the benchmark above).
- No gradient scaler is needed, which avoids failures such as the "grad_scale is too small" error (#1550); see the sketch after this list.

Disadvantages of bf16 training:
- Reduced numerical precision (bf16 has fewer mantissa bits than fp16/fp32), which can hurt accuracy; see the WER gap reported above for full bf16 training.
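For contrast, "full" bf16 training here means keeping the parameters themselves in bf16 rather than only autocasting the forward pass. A minimal sketch with the same toy placeholders as above (not the actual recipe code):

```python
import torch
import torch.nn as nn

# Toy stand-ins for illustration only.
model = nn.Linear(80, 500).cuda().to(torch.bfloat16)  # weights stored in bf16
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
batches = [torch.randn(8, 80, device="cuda") for _ in range(3)]

for feats in batches:
    optimizer.zero_grad()
    # The whole forward/backward pass runs in bf16; no GradScaler is needed
    # because bf16 gradients do not underflow the way fp16 gradients can.
    loss = model(feats.to(torch.bfloat16)).float().pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()
```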
TODO: