Custom Dataset support + Gentle-based custom dataset preprocessing support (#78)
* Fixed TypeError (torch.index_select received an invalid combination of arguments)
File "synthesis.py", line 137, in <module>
model, text, p=replace_pronunciation_prob, speaker_id=speaker_id, fast=True)
File "synthesis.py", line 66, in tts
sequence, text_positions=text_positions, speaker_ids=speaker_ids)
File "H:\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "H:\Tensorflow_Study\git\deepvoice3_pytorch\deepvoice3_pytorch\__init__.py", line 79, in forward
text_positions, frame_positions, input_lengths)
File "H:\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "H:\Tensorflow_Study\git\deepvoice3_pytorch\deepvoice3_pytorch\__init__.py", line 116, in forward
text_sequences, lengths=input_lengths, speaker_embed=speaker_embed)
File "H:\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "H:\Tensorflow_Study\git\deepvoice3_pytorch\deepvoice3_pytorch\deepvoice3.py", line 75, in forward
x = self.embed_tokens(text_sequences) <- change this to long!
File "H:\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "H:\envs\pytorch\lib\site-packages\torch\nn\modules\sparse.py", line 103, in forward
self.scale_grad_by_freq, self.sparse
File "H:\envs\pytorch\lib\site-packages\torch\nn\_functions\thnn\sparse.py", line 59, in forward
output = torch.index_select(weight, 0, indices.view(-1))
TypeError: torch.index_select received an invalid combination of arguments - got (torch.cuda.FloatTensor, int, torch.cuda.IntTensor), but expected (torch.cuda.FloatTensor source, int dim, torch.cuda.LongTensor index)
Changed `text_sequences` to long, as required by `torch.index_select`.
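For reference, a minimal sketch of the fix (tensor sizes here are illustrative, not the project's actual hyperparameters): the embedding lookup ultimately calls `torch.index_select`, which requires a `LongTensor` index.

```python
import torch
import torch.nn as nn

embed_tokens = nn.Embedding(num_embeddings=256, embedding_dim=128)

# text_sequences may arrive as an IntTensor (e.g. built from int32 numpy arrays);
# cast it to int64 before the embedding lookup to satisfy torch.index_select.
text_sequences = torch.IntTensor([[12, 34, 56, 7]])
x = embed_tokens(text_sequences.long())
```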
* Fixed NoneType error in collect_features
* requirements.txt fix
* Memory Leakage bugfix + hparams change
* Pre-PR modifications
* Pre-PR modifications 2
* Pre-PR modifications 3
* Post-PR modification
* remove requirements.txt
* num_workers to 1 in train.py
* Windows log filename bugfix
* Revert "Windows log filename bugfix"
This reverts commit 5214c24.
* merge 2
* Windows Filename bugfix
On Windows, the old filename caused WinError 123 ("The filename, directory name, or volume label syntax is incorrect").
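The exact change isn't shown here; as an assumption, a common cause of WinError 123 is a timestamped log/run name containing characters such as ':' that Windows does not allow in file names. A minimal illustration of building a safe name:

```python
from datetime import datetime

# Hypothetical illustration: format the timestamp without ':' or spaces,
# which Windows rejects in file and directory names.
run_name = "run-test-%s" % datetime.now().strftime("%Y%m%d-%H%M%S")
print(run_name)  # e.g. run-test-20180101-123000
```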
* Cleanup before PR
* JSON format Metadata support
Supports JSON format for dataset creation. Ensures compatibility with http://github.com/carpedm20/multi-Speaker-tacotron-tensorflow
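The expected layout of these JSON metadata files is not spelled out in this PR text. As a rough, hedged sketch (the exact schema consumed by the json_meta preprocessor may differ), one common carpedm20-style layout simply maps each audio path to its transcript:

```python
import json

# Hypothetical example metadata; paths and transcripts are placeholders.
metadata = {
    "./datasets/datasetA/audio/utt_0001.wav": "First transcript of speaker A.",
    "./datasets/datasetA/audio/utt_0002.wav": "Second transcript of speaker A.",
}

# In practice this would be saved as ./datasets/datasetA/alignment.json
with open("alignment.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, ensure_ascii=False, indent=2)
```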
* Web based Gentle aligner support
* README change + gentle patch
* .gitignore change
* Flake8 Fix
* Post-PR commit - Also fixed the #5#53 (comment) issue, solved in PyTorch 0.4
* Post-PR 2 - .gitignore
README.md (48 additions, 5 deletions)
@@ -23,8 +23,8 @@ A notebook supposed to be executed on https://colab.research.google.com is avail
 - Convolutional sequence-to-sequence model with attention for text-to-speech synthesis
 - Multi-speaker and single speaker versions of DeepVoice3
 - Audio samples and pre-trained models
-- Preprocessor for [LJSpeech (en)](https://keithito.com/LJ-Speech-Dataset/), [JSUT (jp)](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and [VCTK](http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html) datasets
-- Language-dependent frontend text processor for English and Japanese
+- Preprocessor for [LJSpeech (en)](https://keithito.com/LJ-Speech-Dataset/), [JSUT (jp)](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and [VCTK](http://homepages.inf.ed.ac.uk/jyamagis/page3/page58/page58.html) datasets, as well as [carpedm20/multi-speaker-tacotron-tensorflow](https://github.com/carpedm20/multi-Speaker-tacotron-tensorflow) compatible custom dataset (in JSON format)
+- Language-dependent frontend text processor for English and Japanese
 When this is done, you will see extracted features (mel-spectrograms and linear spectrograms) in `./data/ljspeech`.
+
+#### 1-1. Building custom dataset. (using json_meta)
+Building your own dataset, with metadata in JSON format (compatible with [carpedm20/multi-speaker-tacotron-tensorflow](https://github.com/carpedm20/multi-Speaker-tacotron-tensorflow)), is currently supported.
+You may need to modify the pre-existing preset JSON file, especially `n_speakers`. For English multi-speaker training, start with `presets/deepvoice3_vctk.json`.
+
+Assuming you have dataset A (Speaker A) and dataset B (Speaker B), each described by the JSON metadata files `./datasets/datasetA/alignment.json` and `./datasets/datasetB/alignment.json`, you can preprocess the data by:
+
+```
+python preprocess.py json_meta "./datasets/datasetA/alignment.json,./datasets/datasetB/alignment.json" "./datasets/processed_A+B" --preset=(path to preset json file)
+```
+
+#### 1-2. Preprocessing custom English datasets with long silence. (Based on [vctk_preprocess](vctk_preprocess/))
+
+Some datasets, especially automatically generated ones, may include long silences and undesirable leading/trailing noise, undermining the char-level seq2seq model
+(e.g. VCTK, although this is covered in vctk_preprocess).
+
+To deal with the problem, `gentle_web_align.py` will
+- **Prepare phoneme alignments for all utterances**
+- Cut silences during preprocessing
+
+`gentle_web_align.py` uses [Gentle](https://github.com/lowerquality/gentle), a Kaldi-based speech-text alignment tool. It accesses a web-served Gentle application, aligns the given sound segments with their transcripts, and converts the results to HTK-style label files to be processed by `preprocess.py`. Gentle can be run on Linux/Mac/Windows (via Docker).
+
+Preliminary results show that while the HTK/festival/merlin-based method in `vctk_preprocess/prepare_vctk_labels.py` works better on VCTK, Gentle is more stable with audio clips containing ambient noise (e.g. movie excerpts).
+
+Usage:
+(Assuming Gentle is running at `localhost:8567` (default when not specified))
+1. When sound files and transcript files are saved in separate folders (e.g. sound files are at `datasetA/wavs` and transcripts are at `datasetA/txts`)
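For illustration, a hedged sketch of the kind of request `gentle_web_align.py` automates, assuming Gentle's documented `/transcriptions?async=false` endpoint and the port used above; the file names are hypothetical and this is not the script's actual code:

```python
import requests

GENTLE_URL = "http://localhost:8567/transcriptions?async=false"

# Align one utterance: upload the audio file and pass the transcript text.
with open("datasetA/wavs/utt_0001.wav", "rb") as audio_f, \
        open("datasetA/txts/utt_0001.txt", "r", encoding="utf-8") as txt_f:
    response = requests.post(
        GENTLE_URL,
        data={"transcript": txt_f.read()},
        files={"audio": audio_f},
    )

result = response.json()
# Each aligned word carries start/end times, which can then be converted into
# HTK-style .lab files and used to trim silences during preprocessing.
for word in result.get("words", [])[:5]:
    print(word.get("alignedWord"), word.get("start"), word.get("end"))
```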
Model checkpoints (.pth) and alignments (.png) are saved in the `./checkpoints` directory every 10000 steps by default.
@@ -249,7 +290,9 @@ From my experience, it can get reasonable speech quality very quickly rather tha
 There are two important options used above:
 
 -`--restore-parts=<N>`: It specifies where to load model parameters from. The differences from the option `--checkpoint=<N>` are: 1) `--restore-parts=<N>` ignores all invalid parameters, while `--checkpoint=<N>` doesn't; 2) `--restore-parts=<N>` tells the trainer to start from step 0, while `--checkpoint=<N>` tells the trainer to continue from the last step. `--checkpoint=<N>` should be fine if you are using exactly the same model and continuing training, but `--restore-parts=<N>` is useful if you want to customize your model architecture and still take advantage of a pre-trained model.
 -`--speaker-id=<N>`: It specifies which speaker's data is used for training. This should only be specified if you are using a multi-speaker dataset. As for VCTK, speaker ids are automatically assigned incrementally (0, 1, ..., 107) according to the `speaker_info.txt` in the dataset.
+
+If you are training a multi-speaker model, speaker adaptation will only work **when `n_speakers` is identical**.
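As a rough sketch of the idea behind `--restore-parts` (this is not the repository's actual implementation, and the checkpoint key name is an assumption): copy only the parameters whose names and shapes match the current model, drop everything else, and let training restart from step 0.

```python
import torch

def restore_parts(model, checkpoint_path):
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    # "state_dict" as the checkpoint key is an assumption; fall back to a raw state dict.
    pretrained = checkpoint.get("state_dict", checkpoint)
    own = model.state_dict()
    # Keep only parameters whose names and shapes match the current model.
    compatible = {name: tensor for name, tensor in pretrained.items()
                  if name in own and own[name].size() == tensor.size()}
    own.update(compatible)
    model.load_state_dict(own)
    return sorted(compatible)  # names of the parameters that were actually restored
```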
hparams.py (8 additions, 0 deletions)
@@ -125,6 +125,14 @@
 # Forced garbage collection probability
 # Use only when MemoryError continues in Windows (Disabled by default)
 #gc_probability = 0.001,
+
+# json_meta mode only
+# 0: "use all",
+# 1: "ignore only unmatched_alignment",
+# 2: "fully ignore recognition",
+ignore_recognition_level=2,
+min_text=20,  # when dealing with non-dedicated speech dataset (e.g. movie excerpts), setting min_text above 15 is desirable. Can be adjusted by dataset.
+process_only_htk_aligned=False,  # if true, data without phoneme alignment file (.lab) will be ignored
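As a minimal sketch (assuming the project's `hparams` object supports the usual comma-separated `HParams.parse()` overrides, as used with a `--hparams` option), the new json_meta settings can be adjusted without editing hparams.py:

```python
# Override the json_meta-related defaults at run time; the values here are examples.
from hparams import hparams

hparams.parse("ignore_recognition_level=1,min_text=15")
print(hparams.ignore_recognition_level, hparams.min_text)
```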