Update Frame-VAD doc #6902

Merged
merged 4 commits on Jun 22, 2023
Changes from all commits
12 changes: 9 additions & 3 deletions examples/asr/asr_vad/README.md
@@ -8,10 +8,16 @@ There are two types of input
- A manifest passed to `manifest_filepath`,
- A directory containing audio files passed to `audio_dir`, with `audio_type` also specified (defaults to `wav`).

The input manifest must be a manifest json file, where each line is a Python dictionary. The fields ["audio_filepath", "offset", "duration"] are required. An example of a manifest file is:
```json
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "text": "a b c d e"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "text": "f g h i j"}
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000}
```

If you want to calculate WER, provide `text` in the manifest as the ground truth. An example of a manifest file is:
```json
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "text": "hello world"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "text": "hello world"}
```
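
For reference, a manifest in this JSON-lines format can be written with a short Python snippet; the output file name and the example entries below are placeholders rather than values expected by the scripts.
```python
import json

# Placeholder entries; "text" is only needed when you want WER to be computed.
entries = [
    {"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "text": "hello world"},
    {"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "text": "hello world"},
]

with open("manifest.json", "w", encoding="utf-8") as fout:
    for entry in entries:
        fout.write(json.dumps(entry) + "\n")  # one JSON object per line
```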

## Output
97 changes: 80 additions & 17 deletions examples/asr/speech_classification/README.md
@@ -1,25 +1,88 @@
# Speech Classification

This directory contains example scripts to train speech classification and voice activity detection models. There are two types of VAD models: Frame-VAD and Segment-VAD.

## Frame-VAD

The frame-level VAD model predicts, for each frame of the audio, whether it contains speech. For example, with the default config file (`../conf/marblenet/marblenet_3x2x64_20ms.yaml`), the model outputs a speech probability for each 20ms frame.

### Training
```sh
python speech_to_label.py \
--config-path=<path to directory of configs, e.g. "../conf/marblenet"> \
--config-name=<name of config without .yaml, e.g. "marblenet_3x2x64_20ms"> \
model.train_ds.manifest_filepath="[<path to train manifest1>,<path to train manifest2>]" \
model.validation_ds.manifest_filepath=["<path to val manifest1>","<path to val manifest2>"] \
trainer.devices=-1 \
trainer.accelerator="gpu" \
strategy="ddp" \
trainer.max_epochs=100
```

The input manifest must be a manifest json file, where each line is a Python dictionary. The fields ["audio_filepath", "offset", "duration", "label"] are required. An example of a manifest file is:
```
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "label": "0 1 0 0 1"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "label": "0 0 0 1 1 1 1 0 0"}
```
For example, if you have a 1s audio file, you'll need to have 50 frame labels in the manifest entry like "0 0 0 0 1 1 0 1 .... 0 1".
However, shorter label strings are also supported to keep manifest files small. For example, you can prepare the `label` at a 40ms frame length, and the model will repeat each label to cover the corresponding 20ms frames.
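
As a rough sketch of how such label strings could be prepared, the snippet below turns a list of (start, end) speech segments into a frame label string; the helper name, the mid-frame labelling rule, and the example times are illustrative assumptions, not the reference NeMo data-preparation code.
```python
# Minimal sketch: build a frame label string from speech segments given in seconds.
# frame_len=0.02 matches the default 20ms config; use 0.04 to write 40ms labels,
# which the model will then repeat for each 20ms frame.
def frame_labels(duration_sec: float, speech_segments, frame_len: float = 0.02) -> str:
    num_frames = int(round(duration_sec / frame_len))
    labels = []
    for i in range(num_frames):
        t = (i + 0.5) * frame_len  # label each frame by its center time
        is_speech = any(start <= t < end for start, end in speech_segments)
        labels.append("1" if is_speech else "0")
    return " ".join(labels)

# A 1s file with speech from 0.1s to 0.5s yields 50 labels at 20ms resolution.
print(frame_labels(1.0, [(0.1, 0.5)]))
```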


### Inference
```sh
python frame_vad_infer.py \
--config-path="../conf/vad" --config-name="frame_vad_infer_postprocess" \
dataset=<Path of manifest file containing evaluation data. Audio files should have unique names>
```

The manifest json file should have the following format (each line is a Python dictionary):
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000}
{"audio_filepath": "/path/to/audio_file2.wav", "offset": 0, "duration": 10000}
```
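
Since the audio files are expected to have unique names, it can help to check the manifest before running inference. Below is a minimal sketch; the manifest path is a placeholder.
```python
import json
from collections import Counter
from pathlib import Path

# Collect the base names of all audio files listed in the manifest.
names = []
with open("evaluation_manifest.json", encoding="utf-8") as fin:
    for line in fin:
        names.append(Path(json.loads(line)["audio_filepath"]).stem)

# Report any name that appears more than once.
duplicates = [name for name, count in Counter(names).items() if count > 1]
if duplicates:
    print("Duplicate audio file names:", duplicates)
```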

#### Evaluation
If you want to evaluate the model's AUROC and DER performance, you need to set `evaluate: True` in the config yaml (e.g., `../conf/vad/frame_vad_infer_postprocess.yaml`), and also provide the ground truth as label strings:
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000, "label": "0 1 0 0 0 1 1 1 0"}
```
or RTTM files:
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000, "rttm_filepath": "/path/to/rttm_file1.rttm"}
```
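
If the ground truth only exists as RTTM files, the speech segments can be read from the `SPEAKER` lines and, if needed, converted to a label string with a helper like the one sketched in the Training section above. The sketch below assumes the standard RTTM field layout and a placeholder path.
```python
# Minimal sketch: extract (start, end) speech segments from an RTTM file.
def rttm_to_segments(rttm_path: str):
    segments = []
    with open(rttm_path, encoding="utf-8") as fin:
        for line in fin:
            fields = line.split()
            if not fields or fields[0] != "SPEAKER":
                continue
            onset, dur = float(fields[3]), float(fields[4])  # tbeg and tdur fields
            segments.append((onset, onset + dur))
    return segments

# segments = rttm_to_segments("/path/to/rttm_file1.rttm")
```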


## Segment-VAD

Segment-level VAD predicts a single label for each segment of audio (0.63s long by default).

### Training
```sh
python speech_to_label.py \
--config-path=<path to dir of configs, e.g. "../conf/marblenet"> \
--config-name=<name of config without .yaml, e.g., "marblenet_3x2x64"> \
model.train_ds.manifest_filepath="[<path to train manifest1>,<path to train manifest2>]" \
model.validation_ds.manifest_filepath=["<path to val manifest1>","<path to val manifest2>"] \
trainer.devices=-1 \
trainer.accelerator="gpu" \
strategy="ddp" \
trainer.max_epochs=100
```

The input manifest must be a manifest json file, where each line is a Python dictionary. The fields ["audio_filepath", "offset", "duration", "label"] are required. An example of a manifest file is:
```
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 0.63, "label": "0"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 0.63, "label": "1"}
```
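
As an illustration only, fixed-length entries like the ones above could be generated from a longer recording as sketched below; the 0.63s length follows the default mentioned above, while the midpoint-based labelling rule, helper name, and paths are assumptions rather than the reference recipe.
```python
import json

# Minimal sketch: slice one recording into 0.63s segments and label each segment
# by whether its midpoint falls inside a speech region.
def segment_entries(audio_filepath, total_dur, speech_segments, seg_len=0.63):
    entries, offset = [], 0.0
    while offset + seg_len <= total_dur:
        mid = offset + seg_len / 2
        label = "1" if any(s <= mid < e for s, e in speech_segments) else "0"
        entries.append({"audio_filepath": audio_filepath, "offset": round(offset, 2),
                        "duration": seg_len, "label": label})
        offset += seg_len
    return entries

with open("segment_vad_manifest.json", "w", encoding="utf-8") as fout:
    for entry in segment_entries("/path/to/audio_file1", 10.0, [(1.0, 4.0)]):
        fout.write(json.dumps(entry) + "\n")
```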


### Inference
```sh
python vad_infer.py \
--config-path="../conf/vad" \
--config-name="vad_inference_postprocessing.yaml"
dataset=<Path of json file of evaluation data. Audio files should have unique names>
```
The manifest json file should have the following format (each line is a Python dictionary):
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000}
{"audio_filepath": "/path/to/audio_file2.wav", "offset": 0, "duration": 10000}
```
7 changes: 7 additions & 0 deletions examples/asr/speech_classification/frame_vad_infer.py
@@ -26,6 +26,13 @@
The manifest json file should have the following format (each line is a Python dictionary):
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000}

If you want to evaluate the model's AUROC and DER performance, you need to set `evaluate=True` in the config yaml,
and also provide the ground truth as either RTTM files or label strings:
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "label": "0 1 0 0 0 1 1 1 0"}
or
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "rttm_filepath": "/path/to/rttm_file1.rttm"}

"""

import os
@@ -32,7 +32,7 @@
The input manifest must be a manifest json file, where each line is a Python dictionary. The fields ["audio_filepath", "offset", "duration", "label"] are required. An example of a manifest file is:
```
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "label": "0 1 0 0 1"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "text": "0 0 0 1 1 1 1 0 0"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "label": "0 0 0 1 1 1 1 0 0"}
```
For example, if you have a 1s audio file, you'll need to have 50 frame labels in the manifest entry like "0 0 0 0 1 1 0 1 .... 0 1".
However, shorter label strings are also supported to keep manifest files small. For example, you can prepare the `label` at a 40ms frame length, and the model will repeat each label to cover the corresponding 20ms frames.
17 changes: 16 additions & 1 deletion tutorials/asr/Offline_ASR_with_VAD_for_CTC_models.ipynb
@@ -50,13 +50,15 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Offline ASR+VAD"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -72,6 +74,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -132,13 +135,15 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use offline VAD to extract speech segments"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -154,6 +159,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -182,6 +188,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -198,6 +205,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -215,6 +223,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -239,6 +248,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -255,13 +265,15 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Stitch the prediction text of speech segments"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -289,6 +301,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -313,13 +326,15 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluate the performance of offline VAD with ASR "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -374,7 +389,7 @@
"source": [
"# Further Reading\n",
"\n",
"There are two ways to incorporate VAD into ASR pipeline. The first strategy is to drop the frames that are predicted as `non-speech` by VAD, as already discussed in this tutorial. The second strategy is to keep all the frames and mask the `non-speech` frames with zero-signal values. Also, instead of using segment-VAD as shown in this tutorial, we can use frame-VAD model for faster inference and better accuracy. For more information, please refer to the two scripts [speech_to_text_with_vad.py](https://github.com/NVIDIA/NeMo/blob/stable/examples/asr_vad/speech_to_text_with_vad.py)."
"There are two ways to incorporate VAD into ASR pipeline. The first strategy is to drop the frames that are predicted as `non-speech` by VAD, as already discussed in this tutorial. The second strategy is to keep all the frames and mask the `non-speech` frames with zero-signal values. Also, instead of using segment-VAD as shown in this tutorial, we can use frame-VAD model for faster inference and better accuracy. For more information, please refer to the script [speech_to_text_with_vad.py](https://github.com/NVIDIA/NeMo/blob/stable/examples/asr_vad/speech_to_text_with_vad.py)."
]
}
],
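
As a minimal sketch of the two strategies described in the Further Reading cell above (dropping versus zero-masking `non-speech` frames), using toy NumPy arrays whose shapes and alignment are illustrative assumptions:
```python
import numpy as np

# Toy data: 100 frames of 80-dim features plus a 0/1 VAD decision per frame.
feats = np.random.randn(100, 80).astype(np.float32)
vad = (np.random.rand(100) > 0.4).astype(np.float32)  # 1 = speech, 0 = non-speech

# Strategy 1: drop the frames predicted as non-speech.
kept = feats[vad.astype(bool)]

# Strategy 2: keep all frames but mask non-speech frames with zero values.
masked = feats * vad[:, None]

print(kept.shape, masked.shape)
```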