Update Frame-VAD doc (#6902)
* update fvad doc

Signed-off-by: stevehuang52 <heh@nvidia.com>

* fix typo

Signed-off-by: stevehuang52 <heh@nvidia.com>

---------

Signed-off-by: stevehuang52 <heh@nvidia.com>
Co-authored-by: fayejf <36722593+fayejf@users.noreply.github.com>
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
2 people authored and sudhakarsingh27 committed Jun 27, 2023
1 parent 32184bf commit e17b4a5
Showing 6 changed files with 143 additions and 23 deletions.
12 changes: 9 additions & 3 deletions examples/asr/asr_vad/README.md
@@ -8,10 +8,16 @@ There are two types of input
- A manifest passed to `manifest_filepath`,
- A directory containing audio files passed to `audio_dir`, with `audio_type` also specified (defaults to `wav`).

The input manifest must be a manifest json file, where each line is a Python dictionary. The fields ["audio_filepath", "offset", "duration"] are required. An example of a manifest file is:
```json
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "text": "a b c d e"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "text": "f g h i j"}
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000}
```

If you want to calculate WER, provide the `text` field in the manifest as the groundtruth transcript. An example of a manifest file is:
```json
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "text": "hello world"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "text": "hello world"}
```
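
For reference, WER is the word-level edit distance between the predicted transcript and the groundtruth `text`, divided by the number of groundtruth words. Below is a minimal pure-Python sketch of the metric (illustrative only, not the implementation used by the script):
```python
def word_error_rate(hypothesis: str, reference: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = edit distance between the first i reference words and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + sub)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("hello there world", "hello world"))  # 0.5: one error against two reference words
```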

## Output
97 changes: 80 additions & 17 deletions examples/asr/speech_classification/README.md
@@ -1,25 +1,88 @@
# Speech Classification

This directory contains example scripts to train speech classification and voice activity detection (VAD) models. There are two types of VAD models: Frame-VAD and Segment-VAD.

## Frame-VAD

The frame-level VAD model predicts, for each frame of the audio, whether it contains speech. For example, with the default config file (`../conf/marblenet/marblenet_3x2x64_20ms.yaml`), the model outputs a speech probability for each 20ms frame.

### Training
```sh
python speech_to_label.py \
--config-path=<path to directory of configs, e.g. "../conf/marblenet"> \
--config-name=<name of config without .yaml, e.g. "marblenet_3x2x64_20ms"> \
model.train_ds.manifest_filepath="[<path to train manifest1>,<path to train manifest2>]" \
model.validation_ds.manifest_filepath=["<path to val manifest1>","<path to val manifest2>"] \
trainer.devices=-1 \
trainer.accelerator="gpu" \
trainer.strategy="ddp" \
trainer.max_epochs=100
```

The input manifest must be a manifest json file, where each line is a Python dictionary. The fields ["audio_filepath", "offset", "duration", "label"] are required. An example of a manifest file is:
```
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "label": "0 1 0 0 1"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "label": "0 0 0 1 1 1 1 0 0"}
```
For example, a 1-second audio file needs 50 frame labels in the manifest entry, like "0 0 0 0 1 1 0 1 .... 0 1".
However, shorter label strings are also supported to keep manifest files small. For example, you can prepare the `label` at a 40ms frame rate, and the model will properly repeat the label for each 20ms frame.
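
If your groundtruth is available as speech segments (start/end times), the frame label string can be generated with a small script. Below is a minimal sketch assuming a 20ms frame length; the `segments_to_frame_labels` helper is hypothetical and not part of NeMo:
```python
import json

def segments_to_frame_labels(segments, duration, frame_len=0.02):
    """Convert speech segments [(start_sec, end_sec), ...] into a frame-level label string."""
    num_frames = int(round(duration / frame_len))
    labels = ["0"] * num_frames
    for start, end in segments:
        start_idx = max(int(round(start / frame_len)), 0)
        end_idx = min(int(round(end / frame_len)), num_frames)
        for i in range(start_idx, end_idx):
            labels[i] = "1"
    return " ".join(labels)

# A 1s file at 20ms frames -> 50 labels; speech from 0.2-0.5s and 0.7-0.9s.
entry = {
    "audio_filepath": "/path/to/audio_file1",
    "offset": 0,
    "duration": 1.0,
    "label": segments_to_frame_labels([(0.2, 0.5), (0.7, 0.9)], duration=1.0),
}
print(json.dumps(entry))
```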


### Inference
```sh
python frame_vad_infer.py \
--config-path="../conf/vad" --config-name="frame_vad_infer_postprocess" \
dataset=<Path of manifest file containing evaluation data. Audio files should have unique names>
```

The manifest json file should have the following format (each line is a Python dictionary):
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000}
{"audio_filepath": "/path/to/audio_file2.wav", "offset": 0, "duration": 10000}
```

#### Evaluation
If you want to evaluate the model's AUROC and DER performance, you need to set `evaluate: True` in the config yaml (e.g., `../conf/vad/frame_vad_infer_postprocess.yaml`), and also provide the groundtruth as label strings:
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000, "label": "0 1 0 0 0 1 1 1 0"}
```
or RTTM files:
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000, "rttm_filepath": "/path/to/rttm_file1.rttm"}
```
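
As a rough sanity check outside the script, the frame-level AUROC can be computed with scikit-learn from per-frame speech probabilities and the groundtruth labels. The values below are illustrative placeholders:
```python
from sklearn.metrics import roc_auc_score

# Hypothetical per-frame speech probabilities from the model and the matching groundtruth labels
frame_probs = [0.10, 0.85, 0.20, 0.15, 0.05, 0.70, 0.95, 0.90, 0.30]
frame_labels = [0, 1, 0, 0, 0, 1, 1, 1, 0]

print(f"Frame-level AUROC: {roc_auc_score(frame_labels, frame_probs):.4f}")
```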


## Segment-VAD

Segment-level VAD predicts a single label for each segment of audio (0.63s long by default).

### Training
```sh
python speech_to_label.py \
--config-path=<path to dir of configs, e.g. "../conf/marblenet"> \
--config-name=<name of config without .yaml, e.g., "marblenet_3x2x64"> \
model.train_ds.manifest_filepath="[<path to train manifest1>,<path to train manifest2>]" \
model.validation_ds.manifest_filepath=["<path to val manifest1>","<path to val manifest2>"] \
trainer.devices=-1 \
trainer.accelerator="gpu" \
trainer.strategy="ddp" \
trainer.max_epochs=100
```

The input manifest must be a manifest json file, where each line is a Python dictionary. The fields ["audio_filepath", "offset", "duration", "label"] are required. An example of a manifest file is:
```
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 0.63, "label": "0"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 0.63, "label": "1"}
```
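
To build such a manifest from longer recordings, one option is to slice each file into fixed-length segments and write one line per segment. Below is a minimal sketch, assuming 0.63s segments and pre-computed per-segment labels; the `write_segment_manifest` helper is hypothetical, not part of NeMo:
```python
import json
import math

def write_segment_manifest(audio_filepath, total_duration, labels, manifest_path, seg_len=0.63):
    """Write one manifest line per fixed-length segment; labels[i] is "0" or "1" for segment i."""
    num_segments = math.floor(total_duration / seg_len + 1e-6)
    with open(manifest_path, "w") as fout:
        for idx in range(num_segments):
            entry = {
                "audio_filepath": audio_filepath,
                "offset": round(idx * seg_len, 3),
                "duration": seg_len,
                "label": labels[idx],
            }
            fout.write(json.dumps(entry) + "\n")

# e.g., a 3.15s file -> 5 segments of 0.63s, with one label per segment
write_segment_manifest("/path/to/audio_file1", 3.15, ["0", "1", "1", "0", "1"], "train_manifest.json")
```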


### Inference
```sh
python vad_infer.py \
--config-path="../conf/vad" \
--config-name="vad_inference_postprocessing.yaml" \
dataset=<Path of json file of evaluation data. Audio files should have unique names>
```
The manifest json file should have the following format (each line is a Python dictionary):
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000}
{"audio_filepath": "/path/to/audio_file2.wav", "offset": 0, "duration": 10000}
```
7 changes: 7 additions & 0 deletions examples/asr/speech_classification/frame_vad_infer.py
@@ -26,6 +26,13 @@
The manifest json file should have the following format (each line is a Python dictionary):
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000}
If you want to evaluate the model's AUROC and DER performance, you need to set `evaluate=True` in the config yaml,
and also provide groundtruth in either RTTM files or label strings:
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "label": "0 1 0 0 0 1 1 1 0"}
or
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "rttm_filepath": "/path/to/rttm_file1.rttm"}
"""

import os
@@ -32,7 +32,7 @@
The input manifest must be a manifest json file, where each line is a Python dictionary. The fields ["audio_filepath", "offset", "duration", "label"] are required. An example of a manifest file is:
```
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "label": "0 1 0 0 1"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "text": "0 0 0 1 1 1 1 0 0"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "label": "0 0 0 1 1 1 1 0 0"}
```
For example, a 1-second audio file needs 50 frame labels in the manifest entry, like "0 0 0 0 1 1 0 1 .... 0 1".
However, shorter label strings are also supported to keep manifest files small. For example, you can prepare the `label` at a 40ms frame rate, and the model will properly repeat the label for each 20ms frame.
17 changes: 16 additions & 1 deletion tutorials/asr/Offline_ASR_with_VAD_for_CTC_models.ipynb
@@ -50,13 +50,15 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Offline ASR+VAD"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -72,6 +74,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -132,13 +135,15 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use offline VAD to extract speech segments"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -154,6 +159,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -182,6 +188,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -198,6 +205,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -215,6 +223,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -239,6 +248,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -255,13 +265,15 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Stitch the prediction text of speech segments"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -289,6 +301,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -313,13 +326,15 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluate the performance of offline VAD with ASR "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -374,7 +389,7 @@
"source": [
"# Further Reading\n",
"\n",
"There are two ways to incorporate VAD into ASR pipeline. The first strategy is to drop the frames that are predicted as `non-speech` by VAD, as already discussed in this tutorial. The second strategy is to keep all the frames and mask the `non-speech` frames with zero-signal values. Also, instead of using segment-VAD as shown in this tutorial, we can use frame-VAD model for faster inference and better accuracy. For more information, please refer to the two scripts [speech_to_text_with_vad.py](https://github.com/NVIDIA/NeMo/blob/stable/examples/asr_vad/speech_to_text_with_vad.py)."
"There are two ways to incorporate VAD into ASR pipeline. The first strategy is to drop the frames that are predicted as `non-speech` by VAD, as already discussed in this tutorial. The second strategy is to keep all the frames and mask the `non-speech` frames with zero-signal values. Also, instead of using segment-VAD as shown in this tutorial, we can use frame-VAD model for faster inference and better accuracy. For more information, please refer to the script [speech_to_text_with_vad.py](https://github.com/NVIDIA/NeMo/blob/stable/examples/asr_vad/speech_to_text_with_vad.py)."
]
}
],