Commit f28c05f
Documentation for adapter fine-tuning (k2-fsa#1545)
1 parent eb132da

2 files changed: +226 -0 lines changed

@@ -0,0 +1,225 @@
Finetune from a pre-trained Zipformer model with adapters
=========================================================

This tutorial shows you how to fine-tune a pre-trained **Zipformer**
transducer model on a new dataset with adapters.
Adapters are compact and efficient modules that can be integrated into a pre-trained model
to improve its performance on a new domain. They are injected
between different modules of the well-trained neural network, and during training only the
parameters of the adapters are updated. This approach achieves competitive performance
while requiring much less GPU memory than full fine-tuning. For more details about adapters,
please refer to the original `paper <https://arxiv.org/pdf/1902.00751.pdf#/>`_.
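
To make the idea concrete, here is a minimal PyTorch sketch of a bottleneck adapter with a
residual connection. The class and attribute names below are illustrative only and are not the
actual icefall implementation, which may differ in details such as activation, normalization,
scaling and initialization.

.. code-block:: python

  import torch
  import torch.nn as nn

  class Adapter(nn.Module):
      """A bottleneck adapter: project down, apply a non-linearity,
      project back up, and add the result to the input (residual)."""

      def __init__(self, embed_dim: int, adapter_dim: int = 8):
          super().__init__()
          self.down = nn.Linear(embed_dim, adapter_dim)  # down-projection to the bottleneck
          self.up = nn.Linear(adapter_dim, embed_dim)    # up-projection back to the model dimension
          self.activation = nn.ReLU()

      def forward(self, x: torch.Tensor) -> torch.Tensor:
          # Because of the residual connection, a disabled (or zero-initialized)
          # adapter leaves the pre-trained model's output unchanged.
          return x + self.up(self.activation(self.down(x)))

Because the adapter sits on a residual branch, turning it off recovers the behaviour of the
original pre-trained model, which is what makes it possible to reuse the same fine-tuned
checkpoint on the original domain (see the decoding section below).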

.. HINT::

  We assume you have read the page :ref:`install icefall` and have set up
  the environment for ``icefall``.

.. HINT::

  We recommend that you use one or more GPUs to run this recipe.

For illustration purposes, we fine-tune the Zipformer transducer model
pre-trained on `LibriSpeech`_ on the small subset of `GigaSpeech`_. You can use your
own data for fine-tuning if you create a manifest for your new dataset (a sketch of how to
do this is given at the end of the data preparation section).

Data preparation
----------------

Please follow the instructions in the `GigaSpeech recipe <https://github.com/k2-fsa/icefall/tree/master/egs/gigaspeech/ASR>`_
to prepare the fine-tuning data used in this tutorial. Only the small subset of GigaSpeech is required.
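
If you want to fine-tune on your own data instead, you need lhotse manifests (a ``RecordingSet``,
a ``SupervisionSet`` and a ``CutSet``) for it. The snippet below is a simplified, hypothetical
sketch of how such manifests could be built with `lhotse <https://github.com/lhotse-speech/lhotse>`_;
the paths, IDs and text are placeholders, and the actual icefall data pipeline additionally
computes fbank features before training.

.. code-block:: python

  from lhotse import (
      CutSet,
      Recording,
      RecordingSet,
      SupervisionSegment,
      SupervisionSet,
  )

  # Describe the audio files of your dataset.
  recording = Recording.from_file("audio/utt1.wav", recording_id="utt1")
  recordings = RecordingSet.from_recordings([recording])

  # Describe the transcripts, aligned to the recordings via recording_id.
  supervisions = SupervisionSet.from_segments(
      [
          SupervisionSegment(
              id="utt1-seg1",
              recording_id="utt1",
              start=0.0,
              duration=recording.duration,
              text="HELLO WORLD",
          )
      ]
  )

  # Combine them into a CutSet, which is what the training script consumes.
  cuts = CutSet.from_manifests(recordings=recordings, supervisions=supervisions)
  cuts.to_file("data/manifests/my_dataset_cuts.jsonl.gz")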


Model preparation
-----------------

We use the Zipformer model trained on the full LibriSpeech dataset (960 hours) as the initialization.
The checkpoint of the model can be downloaded with the following commands:

.. code-block:: bash

  $ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15
  $ cd icefall-asr-librispeech-zipformer-2023-05-15/exp
  $ git lfs pull --include "pretrained.pt"
  $ ln -s pretrained.pt epoch-99.pt
  $ cd ../data/lang_bpe_500
  $ git lfs pull --include bpe.model
  $ cd ../../..

The symlink ``epoch-99.pt`` allows the pre-trained checkpoint to be loaded with ``--epoch 99 --avg 1`` below.

Before fine-tuning, let's test the model's WER on the new domain. The following command performs
decoding on the GigaSpeech test sets:

.. code-block:: bash

  ./zipformer/decode_gigaspeech.py \
      --epoch 99 \
      --avg 1 \
      --exp-dir icefall-asr-librispeech-zipformer-2023-05-15/exp \
      --use-averaged-model 0 \
      --max-duration 1000 \
      --decoding-method greedy_search

You should see the following numbers:

.. code-block::

  For dev, WER of different settings are:
  greedy_search 20.06 best for dev

  For test, WER of different settings are:
  greedy_search 19.27 best for test


Fine-tune with adapter
----------------------

We insert 4 adapters with residual connections into each ``Zipformer2EncoderLayer``.
The original model parameters remain untouched during training; only the parameters of
the adapters are updated. The following command starts a fine-tuning experiment with adapters:

.. code-block:: bash

  $ do_finetune=1
  $ use_adapters=1
  $ adapter_dim=8

  $ ./zipformer_adapter/train.py \
      --world-size 2 \
      --num-epochs 20 \
      --start-epoch 1 \
      --exp-dir zipformer_adapter/exp_giga_finetune_adapters${use_adapters}_adapter_dim${adapter_dim} \
      --use-fp16 1 \
      --base-lr 0.045 \
      --use-adapters $use_adapters --adapter-dim $adapter_dim \
      --bpe-model data/lang_bpe_500/bpe.model \
      --do-finetune $do_finetune \
      --master-port 13022 \
      --finetune-ckpt icefall-asr-librispeech-zipformer-2023-05-15/exp/pretrained.pt \
      --max-duration 1000

The following arguments are related to fine-tuning:

- ``--do-finetune``

  If True, do fine-tuning by initializing the model from a pre-trained checkpoint.
  **Note that if you want to resume your fine-tuning experiment from certain epochs, you
  need to set this to False.**

- ``--use-adapters``

  Whether adapters are used during fine-tuning.

- ``--adapter-dim``

  The bottleneck dimension of the adapter modules. Typically a small number.

You should notice that the total number of trainable parameters is shown in the training log:

.. code-block::

  2024-02-22 21:22:03,808 INFO [train.py:1277] A total of 761344 trainable parameters (1.148% of the whole model)

The trainable parameters make up only about 1.15% of the entire model, so training is much faster
and requires less memory than full fine-tuning.
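
This number comes from freezing everything except the adapters. As a rough, hypothetical sketch
(the real training script identifies the adapter parameters itself and may use a different naming
convention), the freezing and the reported percentage could be computed like this:

.. code-block:: python

  import torch.nn as nn

  def freeze_all_but_adapters(model: nn.Module) -> None:
      # Keep gradients only for parameters that belong to adapter modules;
      # here we assume their names contain the substring "adapter".
      for name, param in model.named_parameters():
          param.requires_grad = "adapter" in name

  def trainable_fraction(model: nn.Module) -> float:
      # Ratio of trainable parameters to all parameters.
      trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
      total = sum(p.numel() for p in model.parameters())
      return trainable / total  # e.g. roughly 0.0115 for the run shown above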


Decoding
--------

After training, let's test the WERs on the GigaSpeech test sets. You can execute the following command:

.. code-block:: bash

  $ epoch=20
  $ avg=10
  $ use_adapters=1
  $ adapter_dim=8

  $ ./zipformer/decode.py \
      --epoch $epoch \
      --avg $avg \
      --use-averaged-model 1 \
      --exp-dir zipformer_adapter/exp_giga_finetune_adapters${use_adapters}_adapter_dim${adapter_dim} \
      --max-duration 600 \
      --use-adapters $use_adapters \
      --adapter-dim $adapter_dim \
      --decoding-method greedy_search

You should see the following numbers:

.. code-block::

  For dev, WER of different settings are:
  greedy_search 15.44 best for dev

  For test, WER of different settings are:
  greedy_search 15.42 best for test

The WER on the test set improves from 19.27 to 15.42, demonstrating the effectiveness of the adapters.

The same model can be used to perform decoding on the LibriSpeech test sets. You can deactivate the
adapters to retain the performance of the original model; because each adapter is applied on a
residual branch, disabling the adapters makes the model behave exactly like the original one:

.. code-block:: bash

  $ epoch=20
  $ avg=1
  $ use_adapters=0
  $ adapter_dim=8

  $ ./zipformer/decode.py \
      --epoch $epoch \
      --avg $avg \
      --use-averaged-model 1 \
      --exp-dir zipformer_adapter/exp_giga_finetune_adapters${use_adapters}_adapter_dim${adapter_dim} \
      --max-duration 600 \
      --use-adapters $use_adapters \
      --adapter-dim $adapter_dim \
      --decoding-method greedy_search

Note that ``--exp-dir`` should still point to the experiment directory created during fine-tuning so
that the checkpoints can be found. You should see the following numbers:

.. code-block::

  For dev, WER of different settings are:
  greedy_search 2.23 best for test-clean

  For test, WER of different settings are:
  greedy_search 4.96 best for test-other

The numbers are the same as those reported in `icefall <https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md#normal-scaled-model-number-of-model-parameters-65549011-ie-6555-m>`_, so adapter-based
fine-tuning is also very flexible: the same model can be used for decoding on both the original and the target domain.

Export the model
----------------

After training, the model can easily be exported to ONNX format with the following command. Make sure
that ``use_adapters`` and ``adapter_dim`` match the values used during fine-tuning:

.. code-block:: bash

  $ use_adapters=1
  $ adapter_dim=16

  $ ./zipformer_adapter/export-onnx.py \
      --tokens icefall-asr-librispeech-zipformer-2023-05-15/data/lang_bpe_500/tokens.txt \
      --use-averaged-model 1 \
      --epoch 20 \
      --avg 10 \
      --exp-dir zipformer_adapter/exp_giga_finetune_adapters${use_adapters}_adapter_dim${adapter_dim} \
      --use-adapters $use_adapters \
      --adapter-dim $adapter_dim \
      --num-encoder-layers "2,2,3,4,3,2" \
      --downsampling-factor "1,2,4,8,4,2" \
      --feedforward-dim "512,768,1024,1536,1024,768" \
      --num-heads "4,4,4,8,4,4" \
      --encoder-dim "192,256,384,512,384,256" \
      --query-head-dim 32 \
      --value-head-dim 12 \
      --pos-head-dim 4 \
      --pos-dim 48 \
      --encoder-unmasked-dim "192,192,256,256,256,192" \
      --cnn-module-kernel "31,31,15,15,15,31" \
      --decoder-dim 512 \
      --joiner-dim 512 \
      --causal False \
      --chunk-size "16,32,64,-1" \
      --left-context-frames "64,128,256,-1"
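
As a quick, hypothetical sanity check, you can inspect an exported model with ``onnxruntime``; the
file path below is a placeholder, so replace it with the actual file written by ``export-onnx.py``
into your experiment directory:

.. code-block:: python

  import onnxruntime as ort

  # Replace the path with the encoder file produced by export-onnx.py.
  session = ort.InferenceSession(
      "zipformer_adapter/exp_giga_finetune_adapters1_adapter_dim16/encoder.onnx",
      providers=["CPUExecutionProvider"],
  )

  # Print the expected inputs and outputs of the exported encoder.
  for inp in session.get_inputs():
      print("input:", inp.name, inp.shape, inp.type)
  for out in session.get_outputs():
      print("output:", out.name, out.shape, out.type)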

docs/source/recipes/Finetune/index.rst (+1 line)

@@ -13,3 +13,4 @@ data to improve the performance on new domains.
    :caption: Table of Contents

    from_supervised/finetune_zipformer
+   adapter/finetune_adapter