Commit 5df24c1

yuekaizhang and JinZr authored
Whisper large fine-tuning on wenetspeech, multi-hans-zh (k2-fsa#1483)
* add whisper fbank for wenetspeech
* add whisper fbank for other dataset
* add str to bool
* add decode for wenetspeech
* add requirements.txt
* add original model decode with 30s
* test feature extractor speed
* add aishell2 feat
* change compute feature batch
* fix overwrite
* fix executor
* regression
* add kaldifeatwhisper fbank
* fix io issue
* parallel jobs
* use multi machines
* add wenetspeech fine-tune scripts
* add monkey patch codes
* remove useless file
* fix subsampling factor
* fix too long audios
* add remove long short
* fix whisper version to support multi batch beam
* decode all wav files
* remove utterance more than 30s in test_net
* only test net
* using soft links
* add kespeech whisper feats
* fix index error
* add manifests for whisper
* change to LilcomChunkyWriter
* add missing option
* decrease cpu usage
* add speed perturb for kespeech
* fix kespeech speed perturb
* add dataset
* load checkpoint from specific path
* add speechio
* add speechio results

Co-authored-by: zr_jin <peter.jin.cn@gmail.com>
1 parent cdb3fb5 commit 5df24c1

64 files changed (+7844 -129 lines changed)

egs/aishell/ASR/RESULTS.md

+1 -1
@@ -75,7 +75,7 @@ It's reworked Zipformer with Pruned RNNT loss, trained with Byte-level BPE, `voc
 | fast beam search | 4.43 | 4.17 | --epoch 40 --avg 10 |
 
 ```bash
-./prepare.sh
+./prepare.sh
 
 export CUDA_VISIBLE_DEVICES="0,1"
 

egs/aishell2/ASR/RESULTS.md

+1 -1
@@ -1,6 +1,6 @@
 ## Results
 
-### Aishell2 char-based training results
+### Aishell2 char-based training results
 
 #### Pruned transducer stateless 5
 

egs/aishell2/ASR/local/compute_fbank_aishell2.py

+28 -8
@@ -29,7 +29,14 @@
 from pathlib import Path
 
 import torch
-from lhotse import CutSet, Fbank, FbankConfig, LilcomChunkyWriter
+from lhotse import (
+    CutSet,
+    Fbank,
+    FbankConfig,
+    LilcomChunkyWriter,
+    WhisperFbank,
+    WhisperFbankConfig,
+)
 from lhotse.recipes.utils import read_manifests_if_cached
 
 from icefall.utils import get_executor, str2bool
@@ -42,10 +49,12 @@
 torch.set_num_interop_threads(1)
 
 
-def compute_fbank_aishell2(num_mel_bins: int = 80, perturb_speed: bool = False):
+def compute_fbank_aishell2(
+    num_mel_bins: int = 80, perturb_speed: bool = False, whisper_fbank: bool = False
+):
     src_dir = Path("data/manifests")
     output_dir = Path("data/fbank")
-    num_jobs = min(15, os.cpu_count())
+    num_jobs = min(8, os.cpu_count())
 
     dataset_parts = (
         "train",
@@ -68,8 +77,12 @@ def compute_fbank_aishell2(num_mel_bins: int = 80, perturb_speed: bool = False):
         list(manifests.keys()),
         dataset_parts,
     )
-
-    extractor = Fbank(FbankConfig(num_mel_bins=num_mel_bins))
+    if whisper_fbank:
+        extractor = WhisperFbank(
+            WhisperFbankConfig(num_filters=num_mel_bins, device="cuda")
+        )
+    else:
+        extractor = Fbank(FbankConfig(num_mel_bins=num_mel_bins))
 
     with get_executor() as ex:  # Initialize the executor only once.
         for partition, m in manifests.items():
@@ -82,7 +95,7 @@ def compute_fbank_aishell2(num_mel_bins: int = 80, perturb_speed: bool = False):
                 supervisions=m["supervisions"],
             )
             if "train" in partition and perturb_speed:
-                logging.info(f"Doing speed perturb")
+                logging.info("Doing speed perturb")
                 cut_set = (
                     cut_set + cut_set.perturb_speed(0.9) + cut_set.perturb_speed(1.1)
                 )
@@ -111,7 +124,12 @@ def get_args():
         default=False,
         help="Enable 0.9 and 1.1 speed perturbation for data augmentation. Default: False.",
     )
-
+    parser.add_argument(
+        "--whisper-fbank",
+        type=str2bool,
+        default=False,
+        help="Use WhisperFbank instead of Fbank. Default: False.",
+    )
     return parser.parse_args()
 
 
@@ -122,5 +140,7 @@ def get_args():
 
     args = get_args()
     compute_fbank_aishell2(
-        num_mel_bins=args.num_mel_bins, perturb_speed=args.perturb_speed
+        num_mel_bins=args.num_mel_bins,
+        perturb_speed=args.perturb_speed,
+        whisper_fbank=args.whisper_fbank,
     )
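
The new `--whisper-fbank` option is declared with `type=str2bool`, which is what lets `prepare.sh` pass the literal string `true`. A minimal, self-contained sketch of that pattern follows. The `str2bool` below is a hypothetical stand-in written for illustration (it is not copied from `icefall.utils`), the `--num-mel-bins` default is assumed, and `choose_extractor_name` only returns a label so the sketch runs without lhotse or a GPU.

```python
import argparse


def str2bool(value: str) -> bool:
    """Hypothetical stand-in for icefall.utils.str2bool (illustration only)."""
    if isinstance(value, bool):
        return value
    lowered = value.lower()
    if lowered in ("yes", "true", "t", "y", "1"):
        return True
    if lowered in ("no", "false", "f", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(f"Boolean value expected, got {value!r}")


def get_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Default of 80 is an assumption taken from the function signature in the diff.
    parser.add_argument("--num-mel-bins", type=int, default=80)
    parser.add_argument("--perturb-speed", type=str2bool, default=False)
    parser.add_argument("--whisper-fbank", type=str2bool, default=False)
    return parser


def choose_extractor_name(whisper_fbank: bool) -> str:
    # Mirrors the if/else added in the diff, but returns a label
    # instead of constructing a lhotse feature extractor.
    return "WhisperFbank" if whisper_fbank else "Fbank"


if __name__ == "__main__":
    args = get_parser().parse_args(["--whisper-fbank", "true"])
    print(choose_extractor_name(args.whisper_fbank))
```

A plain `type=bool` would not work for this flag, because `bool("false")` is `True` in Python: any non-empty string is truthy.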

egs/aishell2/ASR/prepare.sh

+10 -0
@@ -108,6 +108,16 @@ if [ $stage -le 3 ] && [ $stop_stage -ge 3 ]; then
   fi
 fi
 
+whisper_mel_bins=80
+if [ $stage -le 30 ] && [ $stop_stage -ge 30 ]; then
+  log "Stage 30: Compute whisper fbank for aishell2"
+  if [ ! -f data/fbank/.aishell2.whisper.done ]; then
+    mkdir -p data/fbank
+    ./local/compute_fbank_aishell2.py --perturb-speed ${perturb_speed} --num-mel-bins ${whisper_mel_bins} --whisper-fbank true
+    touch data/fbank/.aishell2.whisper.done
+  fi
+fi
+
 if [ $stage -le 4 ] && [ $stop_stage -ge 4 ]; then
   log "Stage 4: Compute fbank for musan"
   if [ ! -f data/fbank/.msuan.done ]; then

egs/aishell4/ASR/README.md

+1 -1
@@ -3,7 +3,7 @@
 
 This recipe contains some various ASR models trained with Aishell4 (including S, M and L three subsets).
 
-The AISHELL-4 is a sizable real-recorded Mandarin speech dataset collected by 8-channel circular microphone array for speech processing in conference scenarios. The dataset consists of 211 recorded meeting sessions, each containing 4 to 8 speakers, with a total length of 120 hours. This dataset aims to bridge the advanced research on multi-speaker processing and the practical application scenario in three aspects. With real recorded meetings, AISHELL-4 provides realistic acoustics and rich natural speech characteristics in conversation such as short pause, speech overlap, quick speaker turn, noise, etc. Meanwhile, the accurate transcription and speaker voice activity are provided for each meeting in AISHELL-4. This allows the researchers to explore different aspects in meeting processing, ranging from individual tasks such as speech front-end processing, speech recognition and speaker diarization, to multi-modality modeling and joint optimization of relevant tasks.
+The AISHELL-4 is a sizable real-recorded Mandarin speech dataset collected by 8-channel circular microphone array for speech processing in conference scenarios. The dataset consists of 211 recorded meeting sessions, each containing 4 to 8 speakers, with a total length of 120 hours. This dataset aims to bridge the advanced research on multi-speaker processing and the practical application scenario in three aspects. With real recorded meetings, AISHELL-4 provides realistic acoustics and rich natural speech characteristics in conversation such as short pause, speech overlap, quick speaker turn, noise, etc. Meanwhile, the accurate transcription and speaker voice activity are provided for each meeting in AISHELL-4. This allows the researchers to explore different aspects in meeting processing, ranging from individual tasks such as speech front-end processing, speech recognition and speaker diarization, to multi-modality modeling and joint optimization of relevant tasks.
 
 (From [Open Speech and Language Resources](https://www.openslr.org/111/))
 

egs/aishell4/ASR/local/compute_fbank_aishell4.py

+29 -8
@@ -29,7 +29,14 @@
 from pathlib import Path
 
 import torch
-from lhotse import ChunkedLilcomHdf5Writer, CutSet, Fbank, FbankConfig
+from lhotse import (
+    CutSet,
+    Fbank,
+    FbankConfig,
+    LilcomChunkyWriter,
+    WhisperFbank,
+    WhisperFbankConfig,
+)
 from lhotse.recipes.utils import read_manifests_if_cached
 
 from icefall.utils import get_executor, str2bool
@@ -42,10 +49,12 @@
 torch.set_num_interop_threads(1)
 
 
-def compute_fbank_aishell4(num_mel_bins: int = 80, perturb_speed: bool = False):
+def compute_fbank_aishell4(
+    num_mel_bins: int = 80, perturb_speed: bool = False, whisper_fbank: bool = False
+):
     src_dir = Path("data/manifests/aishell4")
     output_dir = Path("data/fbank")
-    num_jobs = min(15, os.cpu_count())
+    num_jobs = min(8, os.cpu_count())
 
     dataset_parts = (
         "train_S",
@@ -70,7 +79,12 @@ def compute_fbank_aishell4(num_mel_bins: int = 80, perturb_speed: bool = False):
         dataset_parts,
     )
 
-    extractor = Fbank(FbankConfig(num_mel_bins=num_mel_bins))
+    if whisper_fbank:
+        extractor = WhisperFbank(
+            WhisperFbankConfig(num_filters=num_mel_bins, device="cuda")
+        )
+    else:
+        extractor = Fbank(FbankConfig(num_mel_bins=num_mel_bins))
 
     with get_executor() as ex:  # Initialize the executor only once.
         for partition, m in manifests.items():
@@ -84,7 +98,7 @@ def compute_fbank_aishell4(num_mel_bins: int = 80, perturb_speed: bool = False):
                 supervisions=m["supervisions"],
             )
             if "train" in partition and perturb_speed:
-                logging.info(f"Doing speed perturb")
+                logging.info("Doing speed perturb")
                 cut_set = (
                     cut_set + cut_set.perturb_speed(0.9) + cut_set.perturb_speed(1.1)
                 )
@@ -95,7 +109,7 @@ def compute_fbank_aishell4(num_mel_bins: int = 80, perturb_speed: bool = False):
                 # when an executor is specified, make more partitions
                 num_jobs=num_jobs if ex is None else 80,
                 executor=ex,
-                storage_type=ChunkedLilcomHdf5Writer,
+                storage_type=LilcomChunkyWriter,
             )
 
             logging.info("About splitting cuts into smaller chunks")
@@ -121,7 +135,12 @@ def get_args():
         default=False,
         help="Enable 0.9 and 1.1 speed perturbation for data augmentation. Default: False.",
     )
-
+    parser.add_argument(
+        "--whisper-fbank",
+        type=str2bool,
+        default=False,
+        help="Use WhisperFbank instead of Fbank. Default: False.",
+    )
     return parser.parse_args()
 
 
@@ -132,5 +151,7 @@ def get_args():
 
     args = get_args()
     compute_fbank_aishell4(
-        num_mel_bins=args.num_mel_bins, perturb_speed=args.perturb_speed
+        num_mel_bins=args.num_mel_bins,
+        perturb_speed=args.perturb_speed,
+        whisper_fbank=args.whisper_fbank,
     )
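
The speed-perturbation line in this file, `cut_set + cut_set.perturb_speed(0.9) + cut_set.perturb_speed(1.1)`, concatenates the original training cuts with a slowed-down and a sped-up copy, so the training partition ends up with three times as many cuts. The sketch below illustrates that arithmetic in plain Python. `ToyCut`, `perturb_speed`, and `augment` are hypothetical stand-ins for illustration; they are not the lhotse API, and the id suffix is an assumed naming scheme.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ToyCut:
    """Hypothetical stand-in for a lhotse cut: just an id and a duration."""

    id: str
    duration: float  # seconds


def perturb_speed(cuts: List[ToyCut], factor: float) -> List[ToyCut]:
    # Playing audio `factor` times faster divides its duration by `factor`.
    return [
        ToyCut(id=f"{c.id}_sp{factor}", duration=c.duration / factor) for c in cuts
    ]


def augment(cuts: List[ToyCut]) -> List[ToyCut]:
    # Same shape as the expression in the diff: original + 0.9x + 1.1x copies.
    return cuts + perturb_speed(cuts, 0.9) + perturb_speed(cuts, 1.1)


if __name__ == "__main__":
    for cut in augment([ToyCut("utt1", 9.9)]):
        print(cut.id, round(cut.duration, 2))
```

Because the 0.9x copies are longer than their originals, duration-based filtering (such as the 30-second limit mentioned in the commit message) may remove some perturbed cuts whose originals would pass.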

egs/aishell4/ASR/prepare.sh

+15 -14
@@ -6,7 +6,7 @@ export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
 set -eou pipefail
 
 stage=-1
-stop_stage=100
+stop_stage=7
 perturb_speed=true
 
 
@@ -76,11 +76,21 @@ if [ $stage -le 1 ] && [ $stop_stage -ge 1 ]; then
 fi
 
 if [ $stage -le 2 ] && [ $stop_stage -ge 2 ]; then
-  log "Stage 2: Process aishell4"
+  log "Stage 2: Compute fbank for aishell4"
   if [ ! -f data/fbank/aishell4/.fbank.done ]; then
-    mkdir -p data/fbank/aishell4
+    mkdir -p data/fbank
     ./local/compute_fbank_aishell4.py --perturb-speed ${perturb_speed}
-    touch data/fbank/aishell4/.fbank.done
+    touch data/fbank/.fbank.done
+  fi
+fi
+
+whisper_mel_bins=80
+if [ $stage -le 20 ] && [ $stop_stage -ge 20 ]; then
+  log "Stage 20: Compute whisper fbank for aishell4"
+  if [ ! -f data/fbank/aishell4/.fbank.done ]; then
+    mkdir -p data/fbank
+    ./local/compute_fbank_aishell4.py --perturb-speed ${perturb_speed} --num-mel-bins ${whisper_mel_bins} --whisper-fbank true
+    touch data/fbank/.fbank.done
   fi
 fi
 
@@ -106,16 +116,7 @@ if [ $stage -le 4 ] && [ $stop_stage -ge 4 ]; then
 fi
 
 if [ $stage -le 5 ] && [ $stop_stage -ge 5 ]; then
-  log "Stage 5: Compute fbank for aishell4"
-  if [ ! -f data/fbank/.aishell4.done ]; then
-    mkdir -p data/fbank
-    ./local/compute_fbank_aishell4.py --perturb-speed ${perturb_speed}
-    touch data/fbank/.aishell4.done
-  fi
-fi
-
-if [ $stage -le 6 ] && [ $stop_stage -ge 6 ]; then
-  log "Stage 6: Prepare char based lang"
+  log "Stage 5: Prepare char based lang"
   lang_char_dir=data/lang_char
   mkdir -p $lang_char_dir
 
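
Each block in this script is gated by `[ $stage -le N ] && [ $stop_stage -ge N ]`, so stage N runs only when it lies in the closed interval from `stage` to `stop_stage`. With the defaults shown above (`stage=-1`, `stop_stage=7`), the new stage 20 is therefore skipped unless the range is changed. A small Python sketch of the same predicate follows; `should_run` and `stages_to_run` are hypothetical helper names used only for illustration.

```python
from typing import Iterable, List


def should_run(stage: int, stop_stage: int, n: int) -> bool:
    """Python equivalent of `[ $stage -le n ] && [ $stop_stage -ge n ]`."""
    return stage <= n <= stop_stage


def stages_to_run(stage: int, stop_stage: int, defined: Iterable[int]) -> List[int]:
    # Keep only the stages whose number falls inside [stage, stop_stage].
    return [n for n in defined if should_run(stage, stop_stage, n)]


if __name__ == "__main__":
    defined = [0, 1, 2, 20, 3, 4, 5]  # illustrative stage numbers only
    print(stages_to_run(-1, 7, defined))   # default range: stage 20 is excluded
    print(stages_to_run(20, 20, defined))  # only the whisper fbank stage
```

Assuming the script accepts `--stage` and `--stop-stage` overrides in the way other icefall recipes do, selecting the range 20 to 20 would be the way to run only the Whisper feature extraction.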

egs/alimeeting/ASR/local/compute_fbank_alimeeting.py

+29 -8
@@ -29,7 +29,14 @@
 from pathlib import Path
 
 import torch
-from lhotse import CutSet, Fbank, FbankConfig, LilcomChunkyWriter
+from lhotse import (
+    CutSet,
+    Fbank,
+    FbankConfig,
+    LilcomChunkyWriter,
+    WhisperFbank,
+    WhisperFbankConfig,
+)
 from lhotse.recipes.utils import read_manifests_if_cached
 
 from icefall.utils import get_executor, str2bool
@@ -42,18 +49,20 @@
 torch.set_num_interop_threads(1)
 
 
-def compute_fbank_alimeeting(num_mel_bins: int = 80, perturb_speed: bool = False):
+def compute_fbank_alimeeting(
+    num_mel_bins: int = 80, perturb_speed: bool = False, whisper_fbank: bool = False
+):
     src_dir = Path("data/manifests/alimeeting")
     output_dir = Path("data/fbank")
-    num_jobs = min(15, os.cpu_count())
+    num_jobs = min(8, os.cpu_count())
 
     dataset_parts = (
         "train",
         "eval",
         "test",
     )
 
-    prefix = "alimeeting"
+    prefix = "alimeeting-far"
     suffix = "jsonl.gz"
     manifests = read_manifests_if_cached(
         dataset_parts=dataset_parts,
@@ -70,7 +79,12 @@ def compute_fbank_alimeeting(num_mel_bins: int = 80, perturb_speed: bool = False
         dataset_parts,
     )
 
-    extractor = Fbank(FbankConfig(num_mel_bins=num_mel_bins))
+    if whisper_fbank:
+        extractor = WhisperFbank(
+            WhisperFbankConfig(num_filters=num_mel_bins, device="cuda")
+        )
+    else:
+        extractor = Fbank(FbankConfig(num_mel_bins=num_mel_bins))
 
     with get_executor() as ex:  # Initialize the executor only once.
         for partition, m in manifests.items():
@@ -83,7 +97,7 @@ def compute_fbank_alimeeting(num_mel_bins: int = 80, perturb_speed: bool = False
                 supervisions=m["supervisions"],
             )
             if "train" in partition and perturb_speed:
-                logging.info(f"Doing speed perturb")
+                logging.info("Doing speed perturb")
                 cut_set = (
                     cut_set + cut_set.perturb_speed(0.9) + cut_set.perturb_speed(1.1)
                 )
@@ -121,7 +135,12 @@ def get_args():
         default=False,
         help="Enable 0.9 and 1.1 speed perturbation for data augmentation. Default: False.",
     )
-
+    parser.add_argument(
+        "--whisper-fbank",
+        type=str2bool,
+        default=False,
+        help="Use the Whisper Fbank feature extractor. Default: False.",
+    )
     return parser.parse_args()
 
 
@@ -132,5 +151,7 @@ def get_args():
 
     args = get_args()
     compute_fbank_alimeeting(
-        num_mel_bins=args.num_mel_bins, perturb_speed=args.perturb_speed
+        num_mel_bins=args.num_mel_bins,
+        perturb_speed=args.perturb_speed,
+        whisper_fbank=args.whisper_fbank,
     )
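
This file, like the two before it, lowers the local worker cap from 15 to 8 with `num_jobs = min(8, os.cpu_count())`. One detail worth keeping in mind is that `os.cpu_count()` is documented to return `None` when the CPU count cannot be determined, and `min(8, None)` raises `TypeError` in Python 3. The sketch below shows one defensive variant; `pick_num_jobs` is a hypothetical helper and is not part of this commit.

```python
import os
from typing import Optional


def pick_num_jobs(cap: int = 8, cpu_count: Optional[int] = None) -> int:
    """Return min(cap, number of CPUs), falling back to 1 if the count is unknown.

    `cpu_count` can be passed explicitly (handy for testing); otherwise
    os.cpu_count() is used.
    """
    detected = cpu_count if cpu_count is not None else os.cpu_count()
    # `detected or 1` covers the case where os.cpu_count() returns None.
    return max(1, min(cap, detected or 1))


if __name__ == "__main__":
    print(pick_num_jobs())        # depends on the machine, never more than 8
    print(pick_num_jobs(8, 4))    # a 4-core machine stays at 4 jobs
    print(pick_num_jobs(8, 64))   # a 64-core machine is capped at 8 jobs
```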

egs/alimeeting/ASR/prepare.sh

+16 -14
@@ -6,7 +6,7 @@ export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
 set -eou pipefail
 
 stage=-1
-stop_stage=100
+stop_stage=7
 perturb_speed=true
 
 # We assume dl_dir (download dir) contains the following
@@ -66,10 +66,21 @@ if [ $stage -le 1 ] && [ $stop_stage -ge 1 ]; then
 fi
 
 if [ $stage -le 2 ] && [ $stop_stage -ge 2 ]; then
-  log "Stage 2: Process alimeeting"
-  if [ ! -f data/fbank/alimeeting/.fbank.done ]; then
-    mkdir -p data/fbank/alimeeting
+  log "Stage 2: compute fbank for alimeeting"
+  if [ ! -f data/fbank/.fbank.done ]; then
+    mkdir -p data/fbank
     ./local/compute_fbank_alimeeting.py --perturb-speed ${perturb_speed}
+    touch data/fbank/.fbank.done
+  fi
+fi
+
+whisper_mel_bins=80
+if [ $stage -le 20 ] && [ $stop_stage -ge 20 ]; then
+  log "Stage 20: compute whisper fbank for alimeeting"
+  if [ ! -f data/fbank/.fbank.done ]; then
+    mkdir -p data/fbank
+    ./local/compute_fbank_alimeeting.py --perturb-speed ${perturb_speed} --num-mel-bins ${whisper_mel_bins} --whisper-fbank true
+    touch data/fbank/.fbank.done
   fi
 fi
 
@@ -95,16 +106,7 @@ if [ $stage -le 4 ] && [ $stop_stage -ge 4 ]; then
 fi
 
 if [ $stage -le 5 ] && [ $stop_stage -ge 5 ]; then
-  log "Stage 5: Compute fbank for alimeeting"
-  if [ ! -f data/fbank/.alimeeting.done ]; then
-    mkdir -p data/fbank
-    ./local/compute_fbank_alimeeting.py --perturb-speed True
-    touch data/fbank/.alimeeting.done
-  fi
-fi
-
-if [ $stage -le 6 ] && [ $stop_stage -ge 6 ]; then
-  log "Stage 6: Prepare char based lang"
+  log "Stage 5: Prepare char based lang"
   lang_char_dir=data/lang_char
   mkdir -p $lang_char_dir
 
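
In the alimeeting script, stage 2 and stage 20 both test for and then create the same marker file, `data/fbank/.fbank.done`. If stage 2 has already finished in a given working directory, the stage 20 guard would presumably find the marker and skip the Whisper feature computation. The aishell2 script in this same commit avoids this by using a separate marker, `.aishell2.whisper.done`. The sketch below simulates the run-once pattern in a temporary directory; `run_once`, `demo`, and the `.fbank.whisper.done` name are hypothetical and only illustrate the behavior.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def run_once(marker: Path, job: Callable[[], None]) -> bool:
    """Run `job` unless `marker` exists, then create the marker.

    Mirrors the shell pattern `if [ ! -f marker ]; then ...; touch marker; fi`.
    Returns True if the job ran and False if it was skipped.
    """
    if marker.is_file():
        return False
    marker.parent.mkdir(parents=True, exist_ok=True)
    job()
    marker.touch()
    return True


def demo() -> List[bool]:
    log: List[str] = []
    with tempfile.TemporaryDirectory() as tmp:
        fbank_dir = Path(tmp) / "data" / "fbank"
        shared = fbank_dir / ".fbank.done"
        # Stage 2 runs and creates the shared marker.
        ran_stage2 = run_once(shared, lambda: log.append("fbank"))
        # Stage 20 checks the same marker, so it is skipped.
        ran_stage20 = run_once(shared, lambda: log.append("whisper fbank"))
        # With its own (hypothetical) marker name, the whisper stage runs.
        separate = fbank_dir / ".fbank.whisper.done"
        ran_separate = run_once(separate, lambda: log.append("whisper fbank"))
    return [ran_stage2, ran_stage20, ran_separate]


if __name__ == "__main__":
    print(demo())
```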
