Update Frame-VAD doc #6902

Merged
merged 4 commits on Jun 22, 2023
Changes from all commits
12 changes: 9 additions & 3 deletions examples/asr/asr_vad/README.md
@@ -8,10 +8,16 @@ There are two types of input
- A manifest passed to `manifest_filepath`,
- A directory containing audio files passed to `audio_dir`, with `audio_type` also specified (defaults to `wav`).

The input manifest must be a manifest json file, where each line is a Python dictionary. The fields ["audio_filepath", "offset", "duration"] are required. An example of a manifest file is:
```json
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "text": "a b c d e"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "text": "f g h i j"}
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000}
```

If you want to calculate WER, provide `text` in the manifest as the ground truth. An example of a manifest file is:
```json
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "text": "hello world"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "text": "hello world"}
```
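
For reference, a manifest in this JSON-lines format can be written with a short Python snippet; the output file name and the example entries below are placeholders rather than values expected by the scripts.
```python
import json

# Placeholder entries; "text" is only needed when you want WER to be computed.
entries = [
    {"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "text": "hello world"},
    {"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "text": "hello world"},
]

with open("manifest.json", "w", encoding="utf-8") as fout:
    for entry in entries:
        fout.write(json.dumps(entry) + "\n")  # one JSON object per line
```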

## Output
97 changes: 80 additions & 17 deletions examples/asr/speech_classification/README.md
@@ -1,25 +1,88 @@
# Speech Classification

This directory contains example scripts to train speech classification and voice activity detection models. There are two types of VAD models: Frame-VAD and Segment-VAD.

## Frame-VAD

The frame-level VAD model predicts, for each frame of the audio, whether it contains speech. For example, with the default config file (`../conf/marblenet/marblenet_3x2x64_20ms.yaml`), the model outputs a speech probability for each 20ms frame.

### Training
```sh
python speech_to_label.py \
--config-path=<path to directory of configs, e.g. "../conf/marblenet"> \
--config-name=<name of config without .yaml, e.g. "marblenet_3x2x64_20ms"> \
model.train_ds.manifest_filepath="[<path to train manifest1>,<path to train manifest2>]" \
model.validation_ds.manifest_filepath=["<path to val manifest1>","<path to val manifest2>"] \
trainer.devices=-1 \
trainer.accelerator="gpu" \
strategy="ddp" \
trainer.max_epochs=100
```

The input manifest must be a manifest json file, where each line is a Python dictionary. The fields ["audio_filepath", "offset", "duration", "label"] are required. An example of a manifest file is:
```
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "label": "0 1 0 0 1"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "label": "0 0 0 1 1 1 1 0 0"}
```
For example, if you have a 1s audio file, you'll need to have 50 frame labels in the manifest entry like "0 0 0 0 1 1 0 1 .... 0 1".
However, shorter label strings are also supported to keep manifest files small. For example, you can prepare the `label` at a 40ms frame length, and the model will repeat each label to cover the corresponding 20ms frames.
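
As a rough sketch of how such label strings could be prepared, the snippet below turns a list of (start, end) speech segments into a frame label string; the helper name, the mid-frame labelling rule, and the example times are illustrative assumptions, not the reference NeMo data-preparation code.
```python
# Minimal sketch: build a frame label string from speech segments given in seconds.
# frame_len=0.02 matches the default 20ms config; use 0.04 to write 40ms labels,
# which the model will then repeat for each 20ms frame.
def frame_labels(duration_sec: float, speech_segments, frame_len: float = 0.02) -> str:
    num_frames = int(round(duration_sec / frame_len))
    labels = []
    for i in range(num_frames):
        t = (i + 0.5) * frame_len  # label each frame by its center time
        is_speech = any(start <= t < end for start, end in speech_segments)
        labels.append("1" if is_speech else "0")
    return " ".join(labels)

# A 1s file with speech from 0.1s to 0.5s yields 50 labels at 20ms resolution.
print(frame_labels(1.0, [(0.1, 0.5)]))
```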


### Inference
```sh
python frame_vad_infer.py \
--config-path="../conf/vad" --config-name="frame_vad_infer_postprocess" \
dataset=<Path of manifest file containing evaluation data. Audio files should have unique names>
```

The manifest json file should have the following format (each line is a Python dictionary):
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000}
{"audio_filepath": "/path/to/audio_file2.wav", "offset": 0, "duration": 10000}
```
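
Since the audio files are expected to have unique names, it can help to check the manifest before running inference. Below is a minimal sketch; the manifest path is a placeholder.
```python
import json
from collections import Counter
from pathlib import Path

# Collect the base names of all audio files listed in the manifest.
names = []
with open("evaluation_manifest.json", encoding="utf-8") as fin:
    for line in fin:
        names.append(Path(json.loads(line)["audio_filepath"]).stem)

# Report any name that appears more than once.
duplicates = [name for name, count in Counter(names).items() if count > 1]
if duplicates:
    print("Duplicate audio file names:", duplicates)
```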

#### Evaluation
If you want to evaluate the model's AUROC and DER performance, you need to set `evaluate: True` in the config yaml (e.g., `../conf/vad/frame_vad_infer_postprocess.yaml`), and also provide the ground truth as label strings:
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000, "label": "0 1 0 0 0 1 1 1 0"}
```
or RTTM files:
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000, "rttm_filepath": "/path/to/rttm_file1.rttm"}
```
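
If the ground truth only exists as RTTM files, the speech segments can be read from the `SPEAKER` lines and, if needed, converted to a label string with a helper like the one sketched in the Training section above. The sketch below assumes the standard RTTM field layout and a placeholder path.
```python
# Minimal sketch: extract (start, end) speech segments from an RTTM file.
def rttm_to_segments(rttm_path: str):
    segments = []
    with open(rttm_path, encoding="utf-8") as fin:
        for line in fin:
            fields = line.split()
            if not fields or fields[0] != "SPEAKER":
                continue
            onset, dur = float(fields[3]), float(fields[4])  # tbeg and tdur fields
            segments.append((onset, onset + dur))
    return segments

# segments = rttm_to_segments("/path/to/rttm_file1.rttm")
```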


## Segment-VAD

Segment-level VAD predicts a single label for each segment of audio (0.63s long by default).

### Training
```sh
python speech_to_label.py \
--config-path=<path to dir of configs, e.g. "../conf/marblenet"> \
--config-name=<name of config without .yaml, e.g., "marblenet_3x2x64"> \
model.train_ds.manifest_filepath="[<path to train manifest1>,<path to train manifest2>]" \
model.validation_ds.manifest_filepath=["<path to val manifest1>","<path to val manifest2>"] \
trainer.devices=-1 \
trainer.accelerator="gpu" \
strategy="ddp" \
trainer.max_epochs=100
```

The input manifest must be a manifest json file, where each line is a Python dictionary. The fields ["audio_filepath", "offset", "duration", "label"] are required. An example of a manifest file is:
```
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 0.63, "label": "0"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 0.63, "label": "1"}
```
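
As an illustration only, fixed-length entries like the ones above could be generated from a longer recording as sketched below; the 0.63s length follows the default mentioned above, while the midpoint-based labelling rule, helper name, and paths are assumptions rather than the reference recipe.
```python
import json

# Minimal sketch: slice one recording into 0.63s segments and label each segment
# by whether its midpoint falls inside a speech region.
def segment_entries(audio_filepath, total_dur, speech_segments, seg_len=0.63):
    entries, offset = [], 0.0
    while offset + seg_len <= total_dur:
        mid = offset + seg_len / 2
        label = "1" if any(s <= mid < e for s, e in speech_segments) else "0"
        entries.append({"audio_filepath": audio_filepath, "offset": round(offset, 2),
                        "duration": seg_len, "label": label})
        offset += seg_len
    return entries

with open("segment_vad_manifest.json", "w", encoding="utf-8") as fout:
    for entry in segment_entries("/path/to/audio_file1", 10.0, [(1.0, 4.0)]):
        fout.write(json.dumps(entry) + "\n")
```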


### Inference
```sh
python vad_infer.py \
--config-path="../conf/vad" \
--config-name="vad_inference_postprocessing.yaml"
dataset=<Path of json file of evaluation data. Audio files should have unique names>
```
The manifest json file should have the following format (each line is a Python dictionary):
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000}
{"audio_filepath": "/path/to/audio_file2.wav", "offset": 0, "duration": 10000}
```
7 changes: 7 additions & 0 deletions examples/asr/speech_classification/frame_vad_infer.py
@@ -26,6 +26,13 @@
The manifest json file should have the following format (each line is a Python dictionary):
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000}

If you want to evaluate the model's AUROC and DER performance, you need to set `evaluate=True` in the config yaml,
and also provide the ground truth as either RTTM files or label strings:
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "label": "0 1 0 0 0 1 1 1 0"}
or
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "rttm_filepath": "/path/to/rttm_file1.rttm"}

"""

import os
@@ -32,7 +32,7 @@
The input manifest must be a manifest json file, where each line is a Python dictionary. The fields ["audio_filepath", "offset", "duration", "label"] are required. An example of a manifest file is:
```
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "label": "0 1 0 0 1"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "text": "0 0 0 1 1 1 1 0 0"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "label": "0 0 0 1 1 1 1 0 0"}
```
For example, if you have a 1s audio file, you'll need to have 50 frame labels in the manifest entry like "0 0 0 0 1 1 0 1 .... 0 1".
However, shorter label strings are also supported to keep manifest files small. For example, you can prepare the `label` at a 40ms frame length, and the model will repeat each label to cover the corresponding 20ms frames.
17 changes: 16 additions & 1 deletion tutorials/asr/Offline_ASR_with_VAD_for_CTC_models.ipynb
@@ -50,13 +50,15 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Offline ASR+VAD"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -72,6 +74,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -132,13 +135,15 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use offline VAD to extract speech segments"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -154,6 +159,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -182,6 +188,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -198,6 +205,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -215,6 +223,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -239,6 +248,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -255,13 +265,15 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Stitch the prediction text of speech segments"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -289,6 +301,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -313,13 +326,15 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluate the performance of offline VAD with ASR "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
@@ -374,7 +389,7 @@
"source": [
"# Further Reading\n",
"\n",
"There are two ways to incorporate VAD into ASR pipeline. The first strategy is to drop the frames that are predicted as `non-speech` by VAD, as already discussed in this tutorial. The second strategy is to keep all the frames and mask the `non-speech` frames with zero-signal values. Also, instead of using segment-VAD as shown in this tutorial, we can use frame-VAD model for faster inference and better accuracy. For more information, please refer to the two scripts [speech_to_text_with_vad.py](https://github.com/NVIDIA/NeMo/blob/stable/examples/asr_vad/speech_to_text_with_vad.py)."
"There are two ways to incorporate VAD into ASR pipeline. The first strategy is to drop the frames that are predicted as `non-speech` by VAD, as already discussed in this tutorial. The second strategy is to keep all the frames and mask the `non-speech` frames with zero-signal values. Also, instead of using segment-VAD as shown in this tutorial, we can use frame-VAD model for faster inference and better accuracy. For more information, please refer to the script [speech_to_text_with_vad.py](https://github.com/NVIDIA/NeMo/blob/stable/examples/asr_vad/speech_to_text_with_vad.py)."
]
}
],
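
As a minimal sketch of the two strategies described in the Further Reading cell above (dropping versus zero-masking `non-speech` frames), using toy NumPy arrays whose shapes and alignment are illustrative assumptions:
```python
import numpy as np

# Toy data: 100 frames of 80-dim features plus a 0/1 VAD decision per frame.
feats = np.random.randn(100, 80).astype(np.float32)
vad = (np.random.rand(100) > 0.4).astype(np.float32)  # 1 = speech, 0 = non-speech

# Strategy 1: drop the frames predicted as non-speech.
kept = feats[vad.astype(bool)]

# Strategy 2: keep all frames but mask non-speech frames with zero values.
masked = feats * vad[:, None]

print(kept.shape, masked.shape)
```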