zipformer BF16 training recipe #1700
Conversation
Performance benchmark on LibriSpeech train-clean-100. All models are trained with 2 A100 GPUs, and WERs are obtained with greedy search.
The WERs are about the same (AMP slightly better than bf16), but bf16 training consumes much less GPU memory. The training speed is the same, although we still need to check this with larger max-duration and larger model sizes.
Experiments on full LibriSpeech. The model is trained for 30 epochs using 4 A100 GPUs. WERs are obtained using greedy search.
There is a notable performance gap (around 5% relative) between AMP (2.25/5.06) and full bf16 (2.4/5.31). We will investigate the reason and try to reduce the gap.
Since AMP also supports bf16 as the autocast dtype, I tried this as well (with minor modifications to the original code). See this post for more instructions on how to use AMP. The results are shown below.
So AMP+bf16 training achieves better WER than full bf16, while the peak GPU memory and time per epoch increase slightly.
Detailed experiment results:
The models are uploaded to Hugging Face:
- The AMP bf16 zipformer-M model: https://huggingface.co/marcoyang/icefall-zipformer-M-librispeech-amp-bf16
- The full bf16 zipformer-M model: https://huggingface.co/marcoyang/icefall-zipformer-M-librispeech-bf16
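If you want to fetch either checkpoint locally, a minimal sketch using the `huggingface_hub` package (not part of the recipe itself) could look like this:

```python
from huggingface_hub import snapshot_download

# Download each model repo into the local Hugging Face cache and
# return the path to the downloaded snapshot.
amp_bf16_dir = snapshot_download(
    repo_id="marcoyang/icefall-zipformer-M-librispeech-amp-bf16"
)
full_bf16_dir = snapshot_download(
    repo_id="marcoyang/icefall-zipformer-M-librispeech-bf16"
)
print(amp_bf16_dir, full_bf16_dir)
```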
I decided to only support AMP+bf16 training in this PR, since full bf16 training loses too much accuracy.
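For reference, here is a minimal sketch of what AMP+bf16 training looks like at the PyTorch level. The model, optimizer, and data below are toy placeholders for illustration only, not the actual Zipformer recipe code:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the model, optimizer, and dataloader (assumptions for
# illustration; the real recipe sets these up in its training script).
model = nn.Linear(80, 500).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
batches = [torch.randn(8, 80, device="cuda") for _ in range(3)]

# bf16 shares fp32's exponent range, so loss scaling is unnecessary; keeping a
# disabled GradScaler leaves the existing AMP code path unchanged.
scaler = torch.cuda.amp.GradScaler(enabled=False)

for feats in batches:
    optimizer.zero_grad()
    # Autocast runs the forward pass in bf16 where safe; parameters and
    # optimizer states stay in fp32.
    with torch.cuda.amp.autocast(dtype=torch.bfloat16):
        loss = model(feats).float().pow(2).mean()  # dummy loss
    scaler.scale(loss).backward()  # scaling is a no-op when the scaler is disabled
    scaler.step(optimizer)         # falls back to a plain optimizer.step()
    scaler.update()
```

Compared with fp16 AMP, the only changes are the autocast dtype and the fact that the scaler no longer has to perform real loss scaling.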
This PR shows how to enable bf16 training of Zipformer. It's not intended for merging at this moment.
Advantages of using bf16 training:
- Lower peak GPU memory usage (see the benchmark above).
- No gradient scaler is needed, which avoids failures such as the "grad_scale is too small" error (#1550); see the sketch after this list.

Disadvantages of bf16 training:
- Reduced numerical precision (bf16 has fewer mantissa bits than fp16/fp32), which can hurt accuracy; see the WER gap reported above for full bf16 training.
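For contrast, "full" bf16 training here means keeping the parameters themselves in bf16 rather than only autocasting the forward pass. A minimal sketch with the same toy placeholders as above (not the actual recipe code):

```python
import torch
import torch.nn as nn

# Toy stand-ins for illustration only.
model = nn.Linear(80, 500).cuda().to(torch.bfloat16)  # weights stored in bf16
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
batches = [torch.randn(8, 80, device="cuda") for _ in range(3)]

for feats in batches:
    optimizer.zero_grad()
    # The whole forward/backward pass runs in bf16; no GradScaler is needed
    # because bf16 gradients do not underflow the way fp16 gradients can.
    loss = model(feats.to(torch.bfloat16)).float().pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()
```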
TODO: