# Speech Classification

This directory contains example scripts to train speech classification and voice activity detection (VAD) models. There are two types of VAD models: Frame-VAD and Segment-VAD.
## Frame-VAD

The frame-level VAD model predicts, for each frame of the audio, whether it contains speech. For example, with the default config file (`../conf/marblenet/marblenet_3x2x64_20ms.yaml`), the model outputs a speech probability for every 20ms frame.

### Training
```sh
python speech_to_label.py \
    --config-path=<path to directory of configs, e.g. "../conf/marblenet"> \
    --config-name=<name of config without .yaml, e.g. "marblenet_3x2x64_20ms"> \
    model.train_ds.manifest_filepath="[<path to train manifest1>,<path to train manifest2>]" \
    model.validation_ds.manifest_filepath="[<path to val manifest1>,<path to val manifest2>]" \
    trainer.devices=-1 \
    trainer.accelerator="gpu" \
    trainer.strategy="ddp" \
    trainer.max_epochs=100
```

The input manifest must be a manifest json file, where each line is a JSON dictionary. The fields ["audio_filepath", "offset", "duration", "label"] are required. An example of a manifest file is:
```
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "label": "0 1 0 0 1"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "label": "0 0 0 1 1 1 1 0 0"}
```
For example, a 1-second audio file needs 50 frame labels in its manifest entry, such as "0 0 0 0 1 1 0 1 ... 0 1".
However, shorter label strings are also supported to keep manifest files small: for example, you can prepare the `label` at a 40ms frame resolution, and the labels will be repeated so that every 20ms frame is covered.
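
If you already have speech segment timestamps for a recording (from an aligner, an RTTM file, or manual annotation), the frame-level label string can be generated programmatically. The following is a minimal sketch, not part of the NeMo scripts; the helper name, paths, and 20ms frame length are assumptions chosen to match the default config.

```python
# Hypothetical helper: build a frame-level label string from speech segments.
# Assumes speech regions are known as (start, end) pairs in seconds and that
# frame_length_sec matches the model config (20ms by default here).
import json

def segments_to_frame_labels(speech_segments, duration, frame_length_sec=0.02):
    num_frames = int(round(duration / frame_length_sec))
    labels = ["0"] * num_frames
    for start, end in speech_segments:
        first = int(round(start / frame_length_sec))
        last = min(num_frames, int(round(end / frame_length_sec)))
        for i in range(first, last):
            labels[i] = "1"
    return " ".join(labels)

# Example: a 1s file with speech from 0.2s to 0.5s -> 50 labels, 15 of them "1".
entry = {
    "audio_filepath": "/path/to/audio_file1.wav",   # placeholder path
    "offset": 0,
    "duration": 1.0,
    "label": segments_to_frame_labels([(0.2, 0.5)], duration=1.0),
}
print(json.dumps(entry))
```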

### Inference
```sh
python frame_vad_infer.py \
    --config-path="../conf/vad" --config-name="frame_vad_infer_postprocess" \
    dataset=<Path of manifest file containing evaluation data. Audio files should have unique names>
```

The manifest json file should have the following format (each line is a JSON dictionary):
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000}
{"audio_filepath": "/path/to/audio_file2.wav", "offset": 0, "duration": 10000}
```
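
For inference you typically start from a directory of audio files, so a manifest like the one above can be generated with a short script. This is a minimal sketch, assuming the third-party `soundfile` package for reading durations; the directory and output paths are placeholders, and filenames are assumed to be unique.

```python
# Build an inference manifest (one JSON dictionary per line) from a folder of wavs.
import glob
import json
import os

import soundfile as sf  # assumed to be installed

with open("vad_infer_manifest.json", "w") as fout:
    for path in sorted(glob.glob("/path/to/audio_dir/*.wav")):  # placeholder dir
        info = sf.info(path)
        entry = {
            "audio_filepath": os.path.abspath(path),
            "offset": 0,
            "duration": round(info.frames / info.samplerate, 3),
        }
        fout.write(json.dumps(entry) + "\n")
```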

#### Evaluation
If you want to evaluate the model's AUROC and DER performance, you need to set `evaluate: True` in the config yaml (e.g., `../conf/vad/frame_vad_infer_postprocess.yaml`), and also provide ground truth, either as frame-level label strings:
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000, "label": "0 1 0 0 0 1 1 1 0"}
```
or as RTTM files:
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000, "rttm_filepath": "/path/to/rttm_file1.rttm"}
```
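
For context, the AUROC reported here is the standard binary ROC-AUC computed over per-frame speech probabilities against the ground-truth frame labels. The toy example below (using scikit-learn, not the NeMo evaluation code itself) only illustrates the metric on made-up numbers.

```python
# Illustrative only: frame-level AUROC from predicted speech probabilities
# and ground-truth labels. Requires scikit-learn.
from sklearn.metrics import roc_auc_score

frame_probs = [0.1, 0.8, 0.9, 0.2, 0.05, 0.7]   # made-up model probabilities
frame_labels = [0, 1, 1, 0, 0, 1]               # made-up ground-truth labels
print(f"AUROC: {roc_auc_score(frame_labels, frame_probs):.3f}")
```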

## Segment-VAD

Segment-level VAD predicts a single label for each segment of audio (e.g., 0.63s by default).

### Training
```sh
python speech_to_label.py \
    --config-path=<path to dir of configs, e.g. "../conf/marblenet"> \
    --config-name=<name of config without .yaml, e.g. "marblenet_3x2x64"> \
    model.train_ds.manifest_filepath="[<path to train manifest1>,<path to train manifest2>]" \
    model.validation_ds.manifest_filepath="[<path to val manifest1>,<path to val manifest2>]" \
    trainer.devices=-1 \
    trainer.accelerator="gpu" \
    trainer.strategy="ddp" \
    trainer.max_epochs=100
```

The input manifest must be a manifest json file, where each line is a JSON dictionary. The fields ["audio_filepath", "offset", "duration", "label"] are required. An example of a manifest file is:
```
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 0.63, "label": "0"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 0.63, "label": "1"}
```
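
One way to produce such entries is to slice a longer labeled recording into fixed-length windows and assign each window a label from known speech regions. The sketch below is only an illustration under that assumption; the function name, paths, and labeling rule (window midpoint inside a speech region) are hypothetical choices, not NeMo code.

```python
# Hypothetical helper: write segment-VAD manifest entries for one recording
# by slicing it into fixed-length windows (0.63s by default here).
import json

def write_segment_manifest(audio_path, total_duration, speech_segments,
                           out_path, window=0.63):
    with open(out_path, "w") as fout:
        offset = 0.0
        while offset + window <= total_duration:
            mid = offset + window / 2
            is_speech = any(start <= mid < end for start, end in speech_segments)
            entry = {
                "audio_filepath": audio_path,
                "offset": round(offset, 2),
                "duration": window,
                "label": "1" if is_speech else "0",
            }
            fout.write(json.dumps(entry) + "\n")
            offset += window

# Example: a 10s recording with speech between 2.0s and 5.5s (placeholder paths).
write_segment_manifest("/path/to/audio_file1.wav", 10.0, [(2.0, 5.5)],
                       "segment_vad_train_manifest.json")
```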

### Inference
```sh
python vad_infer.py \
    --config-path="../conf/vad" \
    --config-name="vad_inference_postprocessing.yaml" \
    dataset=<Path of json file of evaluation data. Audio files should have unique names>
```
The manifest json file should have the following format (each line is a JSON dictionary):
```
{"audio_filepath": "/path/to/audio_file1.wav", "offset": 0, "duration": 10000}
{"audio_filepath": "/path/to/audio_file2.wav", "offset": 0, "duration": 10000}
```

# Model execution overview

The training scripts in this directory execute in the following order. When preparing your own training-from-scratch / fine-tuning scripts, please follow this order for correct training/inference.

```mermaid
graph TD
    A[Hydra Overrides + Yaml Config] --> B{Config}
    B --> |Init| C[Trainer]
    C --> D[ExpManager]
    B --> D[ExpManager]
    C --> E[Model]
    B --> |Init| E[Model]
    E --> |Constructor| F(Change Labels)
    F --> G(Setup Train + Validation + Test Data loaders)
    G --> H(Setup Optimization)
    H --> I[Maybe init from pretrained]
    I --> J["trainer.fit(model)"]
```

During restoration of the model, you may pass the Trainer to the restore_from / from_pretrained call, or set it after the model has been initialized by using `model.set_trainer(Trainer)`.