k2-fsa · csukuangfj · Nov 26, 2022 · Nov 20, 2022
diff --git a/egs/ami/ASR/README.md b/egs/ami/ASR/README.md
@@ -0,0 +1,48 @@
+# AMI
+
+This is an ASR recipe for the AMI corpus. AMI provides recordings from each speaker's
+headset and lapel microphones, as well as from 2 microphone arrays with 8 channels each.
+We pool the following 4 types of data and train a single model on the pooled data:
+
+(i) Individual headset microphone (IHM)
+(ii) IHM with simulated reverb
+(iii) Single distant microphone (SDM)
+(iv) GSS-enhanced array microphones
+
+Speed perturbation and MUSAN noise augmentation are additionally performed on the pooled
+data. Here are the statistics of the combined training data:
+
+```python
+>>> cuts_train.describe()
+Cuts count: 1222053
+Total duration (hh:mm:ss): 905:00:28
+Speech duration (hh:mm:ss): 905:00:28 (99.9%)
+Duration statistics (seconds):
+mean    2.7
+std     2.8
+min     0.0
+25%     0.6
+50%     1.6
+75%     3.8
+99%     12.3
+99.5%   13.9
+99.9%   18.4
+max     36.8
+```
+
+**Note:** This recipe additionally uses [GSS](https://github.com/desh2608/gss) for enhancement
+of far-field array microphones, but this is optional (see `prepare.sh` for details).
+
+## Performance Record
+
+### pruned_transducer_stateless7
+
+The following WERs (%) are obtained with `modified_beam_search` decoding:
+
+| Evaluation set     | dev WER | test WER |
+|--------------------|---------|----------|
+| IHM                | 19.23   | 18.06    |
+| SDM                | 31.16   | 32.61    |
+| MDM (GSS-enhanced) | 22.08   | 23.03    |
+
+See [RESULTS](/egs/ami/ASR/RESULTS.md) for details.
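
Reviewer note on the README statistics above: the reported mean cut duration can be cross-checked against the cut count and the total duration. The snippet below is a minimal sketch in plain Python (lhotse is not needed); the helper name `hms_to_seconds` is hypothetical, and the numbers are copied from the `cuts_train.describe()` output.

```python
def hms_to_seconds(hms: str) -> int:
    """Convert an 'hh:mm:ss' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Values copied from the describe() output in the README.
num_cuts = 1222053
total_seconds = hms_to_seconds("905:00:28")

# Mean duration per cut, which should agree with the reported "mean 2.7".
mean_duration = total_seconds / num_cuts
print(f"total: {total_seconds} s, mean: {mean_duration:.1f} s")
```

The computed mean rounds to the 2.7 seconds shown in the table, so the count, total, and mean are mutually consistent.
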
diff --git a/egs/ami/ASR/RESULTS.md b/egs/ami/ASR/RESULTS.md
@@ -0,0 +1,93 @@
+## Results
+
+### AMI training results (Pruned Transducer)
+
+#### 2022-11-20
+
+#### Zipformer (pruned_transducer_stateless7)
+
+Zipformer encoder + non-recurrent decoder. The decoder
+contains only an embedding layer, a Conv1d (with kernel size 2) and a linear
+layer (to transform tensor dim).
+
+All results below come from a single model trained on the combination of the following
+data: IHM, IHM+reverb, SDM, and GSS-enhanced MDM. Speed perturbation and MUSAN noise
+augmentation are applied on top of the pooled data.
+
+**WERs for IHM:**
+
+|                           | dev | test | comment                                  |
+|---------------------------|------------|------------|------------------------------------------|
+| beam search               |  19.18  |  18.00  | --avg-last-n 10 --max-duration 500 |
+| modified beam search      |  19.23  |  18.06  | --avg-last-n 10 --max-duration 500 --beam-size 4 |
+| fast beam search          |  19.46  |  18.35  | --avg-last-n 10 --max-duration 500 --beam-size 4 --max-contexts 4 --max-states 8 |
+
+**WERs for SDM:**
+
+|                           | dev | test | comment                                  |
+|---------------------------|------------|------------|------------------------------------------|
+| beam search               |  31.28  |  32.63  | --avg-last-n 10 --max-duration 500 |
+| modified beam search      |  31.16  |  32.61  | --avg-last-n 10 --max-duration 500 --beam-size 4 |
+| fast beam search          |  31.14  |  32.52  | --avg-last-n 10 --max-duration 500 --beam-size 4 --max-contexts 4 --max-states 8 |
+
+**WERs for GSS-enhanced MDM:**
+
+|                           | dev | test | comment                                  |
+|---------------------------|------------|------------|------------------------------------------|
+| beam search               |  22.09  |  23.03  | --avg-last-n 10 --max-duration 500 |
+| modified beam search      |  22.08  |  23.03  | --avg-last-n 10 --max-duration 500 --beam-size 4 |
+| fast beam search          |  22.45  |  23.38  | --avg-last-n 10 --max-duration 500 --beam-size 4 --max-contexts 4 --max-states 8 |
+
+The training command to reproduce these results is:
+
+```
+export CUDA_VISIBLE_DEVICES="0,1,2,3"
+
+./pruned_transducer_stateless7/train.py \
+  --world-size 4 \
+  --num-epochs 15 \
+  --exp-dir pruned_transducer_stateless7/exp \
+  --max-duration 150 \
+  --max-cuts 150 \
+  --prune-range 5 \
+  --lr-factor 5 \
+  --lm-scale 0.25 \
+  --use-fp16 True
+```
+
+The decoding commands are:
+```
+# modified beam search
+./pruned_transducer_stateless7/decode.py \
+        --iter 105000 \
+        --avg 10 \
+        --exp-dir ./pruned_transducer_stateless7/exp \
+        --max-duration 500 \
+        --decoding-method modified_beam_search \
+        --beam-size 4
+
+# fast beam search
+./pruned_transducer_stateless7/decode.py \
+        --iter 105000 \
+        --avg 10 \
+        --exp-dir ./pruned_transducer_stateless7/exp \
+        --max-duration 500 \
+        --decoding-method fast_beam_search \
+        --beam 4 \
+        --max-contexts 4 \
+        --max-states 8
+
+# beam search
+./pruned_transducer_stateless7/decode.py \
+        --iter 105000 \
+        --avg 10 \
+        --exp-dir ./pruned_transducer_stateless7/exp \
+        --max-duration 500 \
+        --decoding-method beam_search \
+        --beam-size 4
+```
+
+The pretrained model is available at <https://huggingface.co/desh2608/icefall-asr-ami-pruned-transducer-stateless7>
+
+The tensorboard training log can be found at
+<https://tensorboard.dev/experiment/VH10QOTBTbuYpWx994Onrg/#scalars>
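
Reviewer note on the WER tables above: WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The following is a minimal sketch of that standard definition. It is not the scoring code used by this recipe, and the function name `word_error_rate` and the example sentences are made up for illustration.

```python
from typing import List


def word_error_rate(ref: List[str], hyp: List[str]) -> float:
    """Return WER in percent using Levenshtein distance over words."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    # At the start of iteration i, prev[j] is the edit distance
    # between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Hypothetical utterance: one deleted word and one substituted word.
ref = "we should discuss the remote control design".split()
hyp = "we should discuss remote control designs".split()
print(f"{word_error_rate(ref, hyp):.2f}")
```

Corpus-level WER is usually computed from the total number of errors over the total number of reference words, rather than by averaging per-utterance values.
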
diff --git a/egs/ami/ASR/local/__init__.py b/egs/ami/ASR/local/__init__.py
diff --git a/egs/ami/ASR/local/compute_fbank_ami.py b/egs/ami/ASR/local/compute_fbank_ami.py
@@ -0,0 +1,194 @@
+#!/usr/bin/env python3
+# Copyright    2022  Johns Hopkins University        (authors: Desh Raj)
+#
+# See ../../../../LICENSE for clarification regarding multiple authors
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+"""
+This file computes fbank features of the AMI dataset.
+For the training data, we pool together IHM, reverberated IHM, SDM, and
+GSS-enhanced audio. For the test data, we separately prepare the IHM, SDM, and
+GSS-enhanced parts (which are the 3 evaluation settings).
+It looks for manifests in the directory data/manifests.
+
+The generated fbank features are saved in data/fbank.
+"""
+import logging
+import math
+from pathlib import Path
+
+import torch
+import torch.multiprocessing
+from lhotse import CutSet, LilcomChunkyWriter
+from lhotse.features.kaldifeat import (
+    KaldifeatFbank,
+    KaldifeatFbankConfig,
+    KaldifeatFrameOptions,
+    KaldifeatMelOptions,
+)
+from lhotse.recipes.utils import read_manifests_if_cached
+
+# Torch's multithreaded behavior needs to be disabled or
+# it wastes a lot of CPU and slows things down.
+# Do this outside of main() in case it needs to take effect
+# even when we are not invoking the main (e.g. when spawning subprocesses).
+torch.set_num_threads(1)
+torch.set_num_interop_threads(1)
+torch.multiprocessing.set_sharing_strategy("file_system")
+
+
+def compute_fbank_ami():
+    src_dir = Path("data/manifests")
+    output_dir = Path("data/fbank")
+
+    sampling_rate = 16000
+    num_mel_bins = 80
+
+    extractor = KaldifeatFbank(
+        KaldifeatFbankConfig(
+            frame_opts=KaldifeatFrameOptions(sampling_rate=sampling_rate),
+            mel_opts=KaldifeatMelOptions(num_bins=num_mel_bins),
+            device="cuda",
+        )
+    )
+
+    logging.info("Reading manifests")
+    manifests_ihm = read_manifests_if_cached(
+        dataset_parts=["train", "dev", "test"],
+        output_dir=src_dir,
+        prefix="ami-ihm",
+        suffix="jsonl.gz",
+    )
+    manifests_sdm = read_manifests_if_cached(
+        dataset_parts=["train", "dev", "test"],
+        output_dir=src_dir,
+        prefix="ami-sdm",
+        suffix="jsonl.gz",
+    )
+    # For GSS we already have cuts so we read them directly.
+    manifests_gss = read_manifests_if_cached(
+        dataset_parts=["train", "dev", "test"],
+        output_dir=src_dir,
+        prefix="ami-gss",
+        suffix="jsonl.gz",
+    )
+
+    def _extract_feats(cuts: CutSet, storage_path: Path, manifest_path: Path) -> None:
+        cuts = cuts + cuts.perturb_speed(0.9) + cuts.perturb_speed(1.1)
+        _ = cuts.compute_and_store_features_batch(
+            extractor=extractor,
+            storage_path=storage_path,
+            manifest_path=manifest_path,
+            batch_duration=5000,
+            num_workers=8,
+            storage_type=LilcomChunkyWriter,
+        )
+
+    logging.info(
+        "Preparing training cuts: IHM + reverberated IHM + SDM + GSS (optional)"
+    )
+
+    logging.info("Processing train split IHM")
+    cuts_ihm = (
+        CutSet.from_manifests(**manifests_ihm["train"])
+        .trim_to_supervisions(keep_overlapping=False, keep_all_channels=False)
+        .modify_ids(lambda x: x + "-ihm")
+    )
+    _extract_feats(
+        cuts_ihm,
+        output_dir / "feats_train_ihm",
+        src_dir / "cuts_train_ihm.jsonl.gz",
+    )
+
+    logging.info("Processing train split reverberated IHM")
+    cuts_ihm_rvb = cuts_ihm.reverb_rir()
+    _extract_feats(
+        cuts_ihm_rvb,
+        output_dir / "feats_train_ihm_rvb",
+        src_dir / "cuts_train_ihm_rvb.jsonl.gz",
+    )
+
+    logging.info("Processing train split SDM")
+    cuts_sdm = (
+        CutSet.from_manifests(**manifests_sdm["train"])
+        .trim_to_supervisions(keep_overlapping=False)
+        .modify_ids(lambda x: x + "-sdm")
+    )
+    _extract_feats(
+        cuts_sdm,
+        output_dir / "feats_train_sdm",
+        src_dir / "cuts_train_sdm.jsonl.gz",
+    )
+
+    logging.info("Processing train split GSS")
+    cuts_gss = (
+        CutSet.from_manifests(**manifests_gss["train"])
+        .trim_to_supervisions(keep_overlapping=False)
+        .modify_ids(lambda x: x + "-gss")
+    )
+    _extract_feats(
+        cuts_gss,
+        output_dir / "feats_train_gss",
+        src_dir / "cuts_train_gss.jsonl.gz",
+    )
+
+    logging.info("Preparing test cuts: IHM, SDM, GSS (optional)")
+    for split in ["dev", "test"]:
+        logging.info(f"Processing {split} IHM")
+        cuts_ihm = (
+            CutSet.from_manifests(**manifests_ihm[split])
+            .trim_to_supervisions(keep_overlapping=False, keep_all_channels=False)
+            .compute_and_store_features_batch(
+                extractor=extractor,
+                storage_path=output_dir / f"feats_{split}_ihm",
+                manifest_path=src_dir / f"cuts_{split}_ihm.jsonl.gz",
+                batch_duration=5000,
+                num_workers=8,
+                storage_type=LilcomChunkyWriter,
+            )
+        )
+        logging.info(f"Processing {split} SDM")
+        cuts_sdm = (
+            CutSet.from_manifests(**manifests_sdm[split])
+            .trim_to_supervisions(keep_overlapping=False)
+            .compute_and_store_features_batch(
+                extractor=extractor,
+                storage_path=output_dir / f"feats_{split}_sdm",
+                manifest_path=src_dir / f"cuts_{split}_sdm.jsonl.gz",
+                batch_duration=500,
+                num_workers=4,
+                storage_type=LilcomChunkyWriter,
+            )
+        )
+        logging.info(f"Processing {split} GSS")
+        cuts_gss = (
+            CutSet.from_manifests(**manifests_gss[split])
+            .trim_to_supervisions(keep_overlapping=False)
+            .compute_and_store_features_batch(
+                extractor=extractor,
+                storage_path=output_dir / f"feats_{split}_gss",
+                manifest_path=src_dir / f"cuts_{split}_gss.jsonl.gz",
+                batch_duration=500,
+                num_workers=4,
+                storage_type=LilcomChunkyWriter,
+            )
+        )
+
+
+if __name__ == "__main__":
+    formatter = "%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] %(message)s"
+    logging.basicConfig(format=formatter, level=logging.INFO)
+
+    compute_fbank_ami()
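
Reviewer note on `_extract_feats` above: each training subset is tripled by concatenating the original cuts with 0.9x and 1.1x speed-perturbed copies, and the `-ihm`/`-sdm`/`-gss` ID suffixes keep IDs unique once the subsets are pooled. The sketch below mimics this bookkeeping in plain Python without lhotse. The `ToyCut` class, the `_sp` suffix format, and the example IDs are assumptions for illustration only and do not reproduce lhotse's API.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class ToyCut:
    """Hypothetical stand-in for a lhotse cut: just an ID and a duration."""

    id: str
    duration: float  # seconds


def perturb_speed(cuts: List[ToyCut], factor: float) -> List[ToyCut]:
    # Playing audio `factor` times faster divides its duration by `factor`.
    return [
        replace(c, id=f"{c.id}_sp{factor}", duration=c.duration / factor)
        for c in cuts
    ]


def add_suffix(cuts: List[ToyCut], suffix: str) -> List[ToyCut]:
    # Mirrors the intent of modify_ids(lambda x: x + suffix) in the script.
    return [replace(c, id=c.id + suffix) for c in cuts]


# The same segment recorded under two microphone conditions.
ihm = add_suffix([ToyCut("seg0", 9.9)], "-ihm")
sdm = add_suffix([ToyCut("seg0", 9.9)], "-sdm")

pooled: List[ToyCut] = []
for subset in (ihm, sdm):
    pooled += subset + perturb_speed(subset, 0.9) + perturb_speed(subset, 1.1)

ids = [c.id for c in pooled]
total = sum(c.duration for c in pooled)
print(len(pooled), len(set(ids)), round(total, 1))
```

Under these assumptions, 3-way speed perturbation multiplies the audio duration of each subset by 1 + 1/0.9 + 1/1.1, or about 3.02, and without the per-condition suffixes the two `seg0` cuts would collide in the pooled set.
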