Commit 8ef137c

Merge README.md, added function comment

zyingt committed Feb 8, 2024
1 parent b4a1d3d commit 8ef137c
Showing 4 changed files with 44 additions and 183 deletions.
4 changes: 2 additions & 2 deletions bins/tts/preprocess.py
@@ -88,11 +88,11 @@ def extract_phonme_sequences(dataset, output_path, cfg, dataset_types):
dataset_file = os.path.join(dataset_output, "{}.json".format(dataset_type))
with open(dataset_file, "r") as f:
metadata.extend(json.load(f))
-phone_extractor.extract_utt_phone_sequence(cfg, metadata)
+phone_extractor.extract_utt_phone_sequence(dataset, cfg, metadata)


def preprocess(cfg, args):
"""Proprocess raw data of single or multiple datasets (in cfg.dataset)
"""Preprocess raw data of single or multiple datasets (in cfg.dataset)
Args:
cfg (dict): dictionary that stores configurations
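The change above passes the dataset name explicitly instead of reading `cfg.dataset[0]`, so each dataset in a multi-dataset run gets its own phone sequences. A minimal sketch (not the actual Amphion code) of a per-dataset driver using the new signature; `dataset_output`, `dataset_types`, and `phone_extractor` are assumed to be supplied by the caller:

```python
import json
import os

def extract_phone_sequences_for(dataset, dataset_output, dataset_types, cfg, phone_extractor):
    # Collect metadata from every split (e.g. train/valid/test) of this dataset
    metadata = []
    for dataset_type in dataset_types:
        dataset_file = os.path.join(dataset_output, "{}.json".format(dataset_type))
        with open(dataset_file, "r") as f:
            metadata.extend(json.load(f))
    # Forward the dataset name so per-dataset output paths can be derived downstream
    phone_extractor.extract_utt_phone_sequence(dataset, cfg, metadata)
```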
215 changes: 37 additions & 178 deletions egs/tts/VITS/README.md
@@ -3,8 +3,8 @@
[![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Spaces-yellow)](https://huggingface.co/spaces/amphion/Text-to-Speech)
[![openxlab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/Amphion/Text-to-Speech)

-In this recipe, we will show how to train VITS using Amphion's infrastructure. [VITS](https://arxiv.org/abs/2106.06103) is an end-to-end TTS architecture that utilizes conditional variational autoencoder with adversarial learning. The detailed instructions for training [single speaker](#single-speaker-vits) and [multi-speaker](#multi-speaker-vits) VITS can be found below:
-## Single Speaker VITS
+In this recipe, we will show how to train VITS using Amphion's infrastructure. [VITS](https://arxiv.org/abs/2106.06103) is an end-to-end TTS architecture that utilizes a conditional variational autoencoder with adversarial learning.

There are four stages in total:

1. Data preparation
@@ -20,7 +20,7 @@ There are four stages in total:
## 1. Data Preparation
### Dataset Download
-You can use the commonly used TTS dataset to train TTS model, e.g., LJSpeech, VCTK, LibriTTS, etc. We strongly recommend you use LJSpeech to train TTS model for the first time. How to download dataset is detailed [here](../../datasets/README.md).
+You can use any of the commonly used TTS datasets to train a TTS model, e.g., LJSpeech, VCTK, Hi-Fi TTS, LibriTTS, etc. We strongly recommend using LJSpeech when training a single-speaker TTS model for the first time, and Hi-Fi TTS when training a multi-speaker TTS model for the first time. The process of downloading the datasets is detailed [here](../../datasets/README.md).
### Configuration
@@ -29,26 +29,35 @@
After downloading the dataset, you can set the dataset paths in `exp_config.json`. Note that you can change the `dataset` list to use your preferred datasets.
```json
"dataset": [
"LJSpeech",
//"hifitts"
],
"dataset_path": {
// TODO: Fill in your dataset path
"LJSpeech": "[LJSpeech dataset path]",
//"hifitts": "[Hi-Fi TTS dataset path]
},
```
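Note that the config uses `//`-style comments, so it is not strictly valid JSON. Below is a minimal sketch (not part of Amphion) for sanity-checking the dataset paths before preprocessing, assuming no `//` occurs inside string values:

```python
import json
import os
import re

def load_config(path):
    # Strip the //-style comments used in exp_config.json before parsing
    with open(path, "r") as f:
        text = re.sub(r"//[^\n]*", "", f.read())
    return json.loads(text)

cfg = load_config("egs/tts/VITS/exp_config.json")
for name in cfg["dataset"]:
    path = cfg["dataset_path"][name]
    print(name, "->", path, "(found)" if os.path.isdir(path) else "(MISSING)")
```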
## 2. Features Extraction
### Configuration
-Specify the `processed_dir` and the `log_dir` and for saving the processed data and the checkpoints in `exp_config.json`:
+In `exp_config.json`, specify the `log_dir` for saving checkpoints and logs, and the `processed_dir` for saving processed data. For preprocessing a multi-speaker TTS dataset, set `extract_audio` and `use_spkid` to `true`:
```json
// TODO: Fill in the output log path. The default value is "Amphion/ckpts/tts"
"log_dir": "ckpts/tts",
"preprocess": {
//"extract_audio": true,//set to true for multi-speaker TTS model
"use_phone": true,
// linguistic features
"extract_phone": true,
"phone_extractor": "espeak", // "espeak, pypinyin, pypinyin_initials_finals, lexicon (only for language=en-us right now)"
// TODO: Fill in the output data path. The default value is "Amphion/data"
"processed_dir": "data",
...
"sample_rate": 22050, //target sampling rate
"valid_file": "valid.json", //validation set
//"use_spkid": true, //set to true for multi-speaker TTS model
},
```
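To give a feel for what `use_spkid` enables downstream, here is a hypothetical sketch of how a speaker-name-to-id table like the `spk2id.json` mentioned in the inference section might be built; the metadata field name `Speaker` is an assumption for illustration, not Amphion's exact schema:

```python
import json

def build_spk2id(metadata):
    # Assign each speaker a stable integer id in order of first appearance
    spk2id = {}
    for utt in metadata:
        spk = utt["Speaker"]
        if spk not in spk2id:
            spk2id[spk] = len(spk2id)
    return spk2id

spk2id = build_spk2id([
    {"Uid": "0001", "Speaker": "hifitts_92"},
    {"Uid": "0002", "Speaker": "hifitts_6097"},
    {"Uid": "0003", "Speaker": "hifitts_92"},
])
with open("spk2id.json", "w") as f:
    json.dump(spk2id, f, indent=2)  # {"hifitts_92": 0, "hifitts_6097": 1}
```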
@@ -67,11 +76,16 @@ sh egs/tts/VITS/run.sh --stage 1
### Configuration
We provide the default hyperparameters in `exp_config.json`. They can work on a single NVIDIA 24G GPU. You can adjust them based on your GPU machines.
For training a multi-speaker TTS model, specify `n_speakers` according to the number of speakers in your dataset(s) and set `multi_speaker_training` to `true`:
-```
-"train": {
-    "batch_size": 16,
-}
+```json
+"model": {
+    //"n_speakers": 10 // for multi-speaker TTS model: fill in the number of speakers according to the dataset used (default is 0 if not specified)
+},
+"train": {
+    "batch_size": 16,
+    //"multi_speaker_training": true, // for multi-speaker TTS model
+}
```
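For intuition, `n_speakers` typically sizes a learned speaker-embedding table that conditions the model on a speaker id, as in the original VITS design. A minimal PyTorch sketch; the channel size is an assumption, not Amphion's exact module:

```python
import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    def __init__(self, n_speakers: int, gin_channels: int = 256):
        super().__init__()
        # One learned vector per speaker id (ids come from spk2id.json)
        self.emb_g = nn.Embedding(n_speakers, gin_channels)

    def forward(self, sid: torch.Tensor) -> torch.Tensor:
        # sid: (batch,) integer speaker ids -> (batch, gin_channels, 1) global condition
        return self.emb_g(sid).unsqueeze(-1)

cond = SpeakerConditioning(n_speakers=10)
g = cond(torch.tensor([3, 7]))
print(g.shape)  # torch.Size([2, 256, 1])
```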
### Train From Scratch
@@ -139,11 +153,13 @@
For inference, you need to specify the following configurations when running `run.sh`:

| Parameters | Description | Example |
| --------------------- | ----------- | ------- |
| `--infer_expt_dir` | The experimental directory which contains `checkpoint` | `Amphion/ckpts/tts/[YourExptName]` |
| `--infer_output_dir` | The output directory to save inferred audios. | `Amphion/ckpts/tts/[YourExptName]/result` |
| `--infer_mode` | The inference mode, e.g., "`single`", "`batch`". | "`single`" to generate a clip of speech, "`batch`" to generate a batch of speech at a time. |
-| `--infer_dataset` | The dataset used for inference. | For LJSpeech dataset, the inference dataset would be `LJSpeech`. |
-| `--infer_testing_set` | The subset of the inference dataset used for inference, e.g., train, test, golden_test | For LJSpeech dataset, the testing set would be "`test`" split from LJSpeech at the feature extraction, or "`golden_test`" cherry-picked from test set as template testing set. |
+| `--infer_dataset` | The dataset used for inference. | For the LJSpeech dataset, the inference dataset would be `LJSpeech`.<br> For the Hi-Fi TTS dataset, the inference dataset would be `hifitts`. |
+| `--infer_testing_set` | The subset of the inference dataset used for inference, e.g., train, test, golden_test | For the LJSpeech dataset, the testing set would be "`test`", split from LJSpeech at feature extraction, or "`golden_test`", cherry-picked from the test set as a template testing set.<br>For the Hi-Fi TTS dataset, the testing set would be "`test`", split from Hi-Fi TTS during the feature extraction process. |
| `--infer_text` | The text to be synthesized. | "`This is a clip of generated speech with the given text from a TTS model.`" |
+| `--infer_speaker_name` | The target speaker's voice to be synthesized.<br> (***Note: only applicable to multi-speaker TTS model***) | For the Hi-Fi TTS dataset, the list of available speakers includes: "`hifitts_11614`", "`hifitts_11697`", "`hifitts_12787`", "`hifitts_6097`", "`hifitts_6670`", "`hifitts_6671`", "`hifitts_8051`", "`hifitts_9017`", "`hifitts_9136`", "`hifitts_92`".<br> You may find the list of available speakers in the `spk2id.json` file generated in `log_dir/[YourExptName]`, which you have specified in `exp_config.json`. |
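As the last row notes, valid `--infer_speaker_name` values can be read from `spk2id.json`. A small helper (not part of Amphion) to list them, assuming the file location given in the table:

```python
import json
import os

log_dir = "ckpts/tts"
expt_name = "[YourExptName]"  # replace with your experiment name

with open(os.path.join(log_dir, expt_name, "spk2id.json"), "r") as f:
    spk2id = json.load(f)
print("valid --infer_speaker_name values:", ", ".join(sorted(spk2id)))
```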
### Run
#### Batch inference:
For example, if you want to generate speech for the whole testing set split from LJSpeech, just run:
@@ -154,185 +170,28 @@
```bash
sh egs/tts/VITS/run.sh --stage 3 --gpu "0" \
--infer_expt_dir Amphion/ckpts/tts/[YourExptName] \
--infer_output_dir Amphion/ckpts/tts/[YourExptName]/result \
--infer_mode "batch" \
--infer_dataset "LJSpeech" \
--infer_testing_set "test"
```

-Or, if you want to generate a single clip of speech from a given text, just run:

+The same procedure applies to inference on a multi-speaker dataset, with `LJSpeech` replaced by `hifitts`.
-```bash
-sh egs/tts/VITS/run.sh --stage 3 --gpu "0" \
---infer_expt_dir Amphion/ckpts/tts/[YourExptName] \
---infer_output_dir Amphion/ckpts/tts/[YourExptName]/result \
---infer_mode "single" \
---infer_text "This is a clip of generated speech with the given text from a TTS model."
-```

-We released a pre-trained Amphion VITS model trained on LJSpeech. So you can download the pre-trained model [here](https://huggingface.co/amphion/vits-ljspeech) and generate speech following the above inference instruction.

## Multi-speaker VITS
There are four stages in total:

1. Data preparation
2. Features extraction
3. Training
4. Inference

> **NOTE:** You need to run every command of this recipe in the `Amphion` root path:
> ```bash
> cd Amphion
> ```
## 1. Data Preparation
### Dataset Download
You can use the commonly used multi-speaker TTS dataset to train TTS model, i.e., Hi-Fi TTS, LibriTTS etc. We strongly recommend you use Hi-Fi TTS to train TTS model for the first time. The process of downloading dataset is detailed [here](../../datasets/README.md).
### Configuration
After downloading the dataset, you can set the dataset paths in `exp_config.json`. Note that you can change the `dataset` list to use your preferred datasets.
```json
"dataset": [
"hifitts",
],
"dataset_path": {
// TODO: Fill in your dataset path
"hifitts": "[Hi-Fi TTS dataset path]",
},
```
## 2. Features Extraction

### Configuration

In `exp_config.json`, specify the `log_dir` for saving the checkpoints and logs, specify the `processed_dir` for saving processed data, set `extract_audio` and `use_spkid` to `true`.

```json
// TODO: Fill in the output log path. The default value is "Amphion/ckpts/tts"
"log_dir": "ckpts/tts",
"preprocess": {
"extract_audio": true,
"use_phone": true,
// linguistic features
"extract_phone": true,
"phone_extractor": "espeak", // "espeak, pypinyin, pypinyin_initials_finals, lexicon (only for language=en-us right now)"
// TODO: Fill in the output data path. The default value is "Amphion/data"
"processed_dir": "data",
"sample_rate": 24000, //target sampling rate
"valid_file": "valid.json", //validation set
"use_spkid": true, //true: use speaker id for multi-speaker dataset
},
```

### Run

Run the `run.sh` as the preprocess stage (set `--stage 1`):

```bash
sh egs/tts/VITS/run.sh --stage 1
```

> **NOTE:** The `CUDA_VISIBLE_DEVICES` is set as `"0"` in default. You can change it when running `run.sh` by specifying such as `--gpu "1"`.
## 3. Training

### Configuration

We provide the default hyparameters in the `exp_config.json`. They can work on single NVIDIA-24g GPU. You can adjust them based on your GPU machines. Remember to specify the `n_speakers` according to the number of speakers in your dataset and set `multi_speaker_training` to `true`.

```json
"model": {
// TODO: Fill in the number of speakers according to dataset used. The default value is 0 if not specified.
"n_speakers": 10
},
"train": {
"batch_size": 16,
"multi_speaker_training": true,
}
```

### Train From Scratch

Run the `run.sh` as the training stage (set `--stage 2`). Specify a experimental name to run the following command. The tensorboard logs and checkpoints will be saved in `Amphion/ckpts/tts/[YourExptName]`.

```bash
sh egs/tts/VITS/run.sh --stage 2 --name [YourExptName]
```

### Train From Existing Source

We support training from existing source for various purposes. You can resume training the model from a checkpoint or fine-tune a model from another checkpoint.

Setting `--resume true`, the training will resume from the **latest checkpoint** from the current `[YourExptName]` by default. For example, if you want to resume training from the latest checkpoint in `Amphion/ckpts/tts/[YourExptName]/checkpoint`, run:

```bash
sh egs/tts/VITS/run.sh --stage 2 --name [YourExptName] \
--resume true
```

You can also choose a **specific checkpoint** for retraining by `--resume_from_ckpt_path` argument. For example, if you want to resume training from the checkpoint `Amphion/ckpts/tts/[YourExptName]/checkpoint/[SpecificCheckpoint]`, run:

```bash
sh egs/tts/VITS/run.sh --stage 2 --name [YourExptName] \
--resume true \
--resume_from_ckpt_path "Amphion/ckpts/tts/[YourExptName]/checkpoint/[SpecificCheckpoint]"
```

If you want to **fine-tune from another checkpoint**, just use `--resume_type` and set it to `"finetune"`. For example, If you want to fine-tune the model from the checkpoint `Amphion/ckpts/tts/[AnotherExperiment]/checkpoint/[SpecificCheckpoint]`, run:


```bash
sh egs/tts/VITS/run.sh --stage 2 --name [YourExptName] \
--resume true \
--resume_from_ckpt_path "Amphion/ckpts/tts/[AnotherExperiment]/checkpoint/[SpecificCheckpoint]" \
--resume_type "finetune"
```
> **NOTE:** The `--resume_type` is set as `"resume"` in default. It's not necessary to specify it when resuming training.
>
> The difference between `"resume"` and `"finetune"` is that the `"finetune"` will **only** load the pretrained model weights from the checkpoint, while the `"resume"` will load all the training states (including optimizer, scheduler, etc.) from the checkpoint.
Here are some example scenarios to better understand how to use these arguments:
| Scenario | `--resume` | `--resume_from_ckpt_path` | `--resume_type` |
| ------ | -------- | ----------------------- | ------------- |
| You want to train from scratch | no | no | no |
| The machine breaks down during training and you want to resume training from the latest checkpoint | `true` | no | no |
| You find the latest model is overfitting and you want to re-train from the checkpoint before | `true` | `SpecificCheckpoint Path` | no |
| You want to fine-tune a model from another checkpoint | `true` | `SpecificCheckpoint Path` | `"finetune"` |
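The `"resume"`/`"finetune"` distinction described in the note above can be sketched as follows; the checkpoint keys (`"model"`, `"optimizer"`, `"scheduler"`, `"step"`) are assumptions for illustration, not Amphion's exact checkpoint layout:

```python
import torch

def load_for_training(ckpt_path, model, optimizer, scheduler, resume_type="resume"):
    ckpt = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(ckpt["model"])  # both modes restore the model weights
    if resume_type == "resume":
        # "resume": also restore optimizer/scheduler and continue from the saved step
        optimizer.load_state_dict(ckpt["optimizer"])
        scheduler.load_state_dict(ckpt["scheduler"])
        return ckpt.get("step", 0)
    # "finetune": weights only; optimizer and scheduler start fresh
    return 0
```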


> **NOTE:** The `CUDA_VISIBLE_DEVICES` is set as `"0"` in default. You can change it when running `run.sh` by specifying such as `--gpu "0,1,2,3"`.

## 4. Inference

### Configuration

For inference, you need to specify the following configurations when running `run.sh`:


| Parameters | Description | Example |
| --------------------- | -------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `--infer_expt_dir` | The experimental directory which contains `checkpoint` | `Amphion/ckpts/tts/[YourExptName]` |
| `--infer_output_dir` | The output directory to save inferred audios. | `Amphion/ckpts/tts/[YourExptName]/result` |
| `--infer_mode` | The inference mode, e.g., "`single`", "`batch`". | "`single`" to generate a clip of speech, "`batch`" to generate a batch of speech at a time. |
| `--infer_dataset` | The dataset used for inference. | For Hi-Fi TTS dataset, the inference dataset would be `hifitts`. |
| `--infer_testing_set` | The subset of the inference dataset used for inference, e.g., train, test | For Hi-Fi TTS dataset, the testing set would be "`test`" split from Hi-Fi TTS during the feature extraction process. |
| `--infer_text` | The text to be synthesized. | "`This is a clip of generated speech with the given text from a TTS model.`" |
| `--infer_speaker_name` | The target speaker's voice to be synthesized. | For Hi-Fi TTS dataset, the list of available speakers includes: "`hifitts_11614`", "`hifitts_11697`", "`hifitts_12787`", "`hifitts_6097`", "`hifitts_6670`", "`hifitts_6671`", "`hifitts_8051`", "`hifitts_9017`", "`hifitts_9136`", "`hifitts_92`". <br> You may find the list of available speakers from `spk2id.json` file generated in ```log_dir/[YourExptName]``` that you have specified in `exp_config.json`. |

### Run
-For example, if you want to generate speech from all testing set split from Hi-Fi TTS, just run:
+#### Single text inference:
+For a single-speaker TTS model, if you want to generate a single clip of speech from a given text, just run:
```bash
sh egs/tts/VITS/run.sh --stage 3 --gpu "0" \
--infer_expt_dir Amphion/ckpts/tts/[YourExptName] \
--infer_output_dir Amphion/ckpts/tts/[YourExptName]/result \
--infer_mode "batch" \
--infer_dataset "hifitts" \
--infer_testing_set "test"
--infer_mode "single" \
--infer_text "This is a clip of generated speech with the given text from a TTS model."
```
-Or, if you want to generate a single clip of speech from a given text, just run:

+For a multi-speaker TTS model, in addition to the above-mentioned arguments, you need to add the `--infer_speaker_name` argument, and run:
```bash
sh egs/tts/VITS/run.sh --stage 3 --gpu "0" \
--infer_expt_dir Amphion/ckpts/tts/[YourExptName] \
--infer_output_dir Amphion/ckpts/tts/[YourExptName]/result \
--infer_mode "single" \
--infer_text "This is a clip of generated speech with the given text from a TTS model." \
--infer_speaker_name "hifitts_92"
```
-We will release a pre-trained multi-speaker VITS model trained on Hi-Fi TTS soon. Stay tuned!
+We have released a pre-trained Amphion VITS model trained on LJSpeech, so you can download the pre-trained model [here](https://huggingface.co/amphion/vits-ljspeech) and generate speech following the above inference instructions. Meanwhile, the pre-trained multi-speaker VITS model trained on Hi-Fi TTS will be released soon. Stay tuned.
```bibtex
7 changes: 4 additions & 3 deletions processors/phone_extractor.py
@@ -45,7 +45,7 @@ def __init__(self, cfg, dataset_name=None, phone_symbol_file=None):
assert cfg.preprocess.lexicon_path != ""
self.g2p_module = LexiconModule(cfg.preprocess.lexicon_path)
else:
print("No suppert to", cfg.preprocess.phone_extractor)
print("No support to", cfg.preprocess.phone_extractor)
raise

def extract_phone(self, text):
@@ -93,16 +93,17 @@ def save_dataset_phone_symbols_to_table(self):
phone_symbol_dict.to_file(self.phone_symbols_file)


-def extract_utt_phone_sequence(cfg, metadata):
+def extract_utt_phone_sequence(dataset, cfg, metadata):
"""
Extract phone sequence from text
Args:
+dataset (str): name of dataset, e.g. opencpop
cfg: config
metadata: list of dict, each dict contains "Uid", "Text"
"""

-dataset_name = cfg.dataset[0]
+dataset_name = dataset

# output path
out_path = os.path.join(