Multi-Speaker Tacotron in TensorFlow

TensorFlow implementation of:

Audio samples (in Korean) can be found here.

Prerequisites

Python 3.6+
FFmpeg
TensorFlow 1.3

Usage

1. Install prerequisites

After installing TensorFlow, install the remaining prerequisites with:

pip3 install -r requirements.txt
python3 -c "import nltk; nltk.download('punkt')"

If you want to synthesize speech in Korean directly, follow 2-3. Download pre-trained models.

2-1. Generate custom datasets

The datasets directory should look like:

datasets
├── son
│   ├── alignment.json
│   └── audio
│       ├── 1.mp3
│       ├── 2.mp3
│       ├── 3.mp3
│       └── ...
└── YOUR_DATASET
    ├── alignment.json
    └── audio
        ├── 1.mp3
        ├── 2.mp3
        ├── 3.mp3
        └── ...

and YOUR_DATASET/alignment.json should look like:

{
    "./datasets/YOUR_DATASET/audio/001.mp3": "My name is Taehoon Kim.",
    "./datasets/YOUR_DATASET/audio/002.mp3": "The buses aren't the problem.",
    "./datasets/YOUR_DATASET/audio/003.mp3": "They have discovered a new particle."
}
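
As a rough illustration of the layout above, alignment.json could be written from Python instead of by hand. This is only a sketch: `build_alignment`, `write_alignment` and the `transcripts` dictionary are hypothetical names that are not part of this repository. Writing the file with `json.dump` also guarantees valid JSON (no trailing comma after the last entry).

```python
import json
from pathlib import Path


def build_alignment(audio_dir, transcripts):
    """Map each audio path to its transcript.

    `transcripts` is a hypothetical dict of {file name: sentence};
    it is an assumption made for this sketch, not a repository format.
    """
    alignment = {}
    for name, text in sorted(transcripts.items()):
        # Keys follow the "./datasets/.../audio/NAME" form used in the example.
        key = "./" + (Path(audio_dir) / name).as_posix()
        alignment[key] = text
    return alignment


def write_alignment(alignment, out_path):
    # ensure_ascii=False keeps Korean transcripts human-readable in the file.
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(alignment, f, ensure_ascii=False, indent=4)
```

For example, `write_alignment(build_alignment("datasets/YOUR_DATASET/audio", {"001.mp3": "My name is Taehoon Kim."}), "datasets/YOUR_DATASET/alignment.json")` would produce a one-entry file in the format shown above.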

After preparing the files as described, generate the preprocessed data with:

python3 -m datasets.generate_data ./datasets/YOUR_DATASET/alignment.json

2-2. Generate Korean datasets

Follow the commands below (explained using the son dataset).

To automate the alignment between audio and text, set GOOGLE_APPLICATION_CREDENTIALS so that the Google Speech Recognition API can be used. To get credentials, read this.
```
export GOOGLE_APPLICATION_CREDENTIALS="YOUR-GOOGLE.CREDENTIALS.json"
```
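
A missing or mistyped credentials path tends to surface only once recognition starts, so it can help to check the variable up front. The following is a minimal sketch; `credentials_path` is a hypothetical helper, not something this repository provides.

```python
import os


def credentials_path(env=os.environ):
    """Return the credentials file path, or raise if it is unset or missing."""
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise FileNotFoundError("credentials file not found: " + path)
    return path
```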
Download the audio (or video) from YouTube links. The JSON file contains a folder name (such as "son") and a list of YouTube video URLs.
```
python3 -m datasets.youtube_audio_downloader --url_type=video --json_filename=videolist.json
```
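New code: split audio by silence and run recognition (using the Aibril STT API). With the existing code, the silence-based split needed many corrections (sentence endings were not cut cleanly), so the recognition results were also poor and manual work was required. Instead, the raw audio file is recognized as-is, and the audio file is then cut using the resulting duration and transcript (see Aibril STT).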
```
python3 -m recognition.watson --audio_pattern "./datasets/son/audio/*.wav" --method=pydub
```
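
The idea of cutting audio from recognition results rather than from silence can be sketched as follows. This is not the code in `recognition.watson`: the `(start_sec, end_sec, transcript)` tuple format, the `cut_points` name and the `pad_ms` padding are all assumptions made for illustration, and the real STT response format may differ.

```python
def cut_points(segments, total_ms, pad_ms=100):
    """Turn recognized segments into padded millisecond ranges.

    `segments` is a hypothetical list of (start_sec, end_sec, transcript)
    tuples; segments with an empty transcript are dropped.
    """
    cuts = []
    for start_sec, end_sec, transcript in segments:
        # Pad both ends slightly so sentence endings are not clipped,
        # while staying inside the bounds of the recording.
        start_ms = max(0, int(round(start_sec * 1000)) - pad_ms)
        end_ms = min(total_ms, int(round(end_sec * 1000)) + pad_ms)
        text = transcript.strip()
        if end_ms > start_ms and text:
            cuts.append((start_ms, end_ms, text))
    return cuts
```

With pydub, each range could then be applied as a slice of an `AudioSegment`, for example `audio[start_ms:end_ms]`, since pydub indexes audio in milliseconds.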
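Step 2 above performs the silence-based split and the recognition, and writes the audio/text pairs to a JSON file. Everything below this point is currently being revised (as of 191120). Finally, generate the numpy files that will be used in training.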
```
python3 -m datasets.generate_data ./datasets/son/alignment.json
```

Because the automatic generation is extremely naive, the dataset is noisy. However, if you have enough data (20+ hours with random initialization or 5+ hours with pretrained model initialization), you can expect acceptable audio synthesis quality.
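
To compare a dataset against the thresholds above, the total duration of the audio can be measured. The sketch below uses only the standard library and therefore handles `.wav` files only; `total_hours` and `enough_data` are hypothetical helpers, and the 20-hour and 5-hour figures are simply the rules of thumb quoted above.

```python
import glob
import wave


def total_hours(pattern):
    """Sum the duration of all .wav files matching a glob pattern, in hours."""
    seconds = 0.0
    for path in glob.glob(pattern):
        with wave.open(path, "rb") as w:
            seconds += w.getnframes() / w.getframerate()
    return seconds / 3600.0


def enough_data(hours, pretrained):
    # Rule of thumb from the text: 5+ hours with a pretrained model,
    # 20+ hours when training from random initialization.
    return hours >= (5.0 if pretrained else 20.0)
```

For example, `enough_data(total_hours("./datasets/son/audio/*.wav"), pretrained=True)` gives a quick indication of whether pretrained initialization is likely to be sufficient.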

3. Train a model

The important hyperparameters for a model are defined in hparams.py.

(To train on an English dataset, change cleaners in hparams.py from korean_cleaners to english_cleaners.)

To train a single-speaker model:

python3 train.py --data_path=datasets/son
python3 train.py --data_path=datasets/son --initialize_path=PATH_TO_CHECKPOINT

To train a multi-speaker model:

# after changing `model_type` in `hparams.py` to `deepvoice` or `simple`
python3 train.py --data_path=datasets/son1,datasets/son2

To resume training from a previous experiment such as logs/son-20171015:

python3 train.py --data_path=datasets/son --load_path logs/son-20171015

If you don't have a large, clean dataset (10+ hours), it is better to use --initialize_path so that a well-trained model supplies the initial parameters.
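
When launching several runs from a script, the training commands above can be assembled programmatically. This is a minimal sketch that uses only the flags shown in this section (`--data_path`, `--load_path`, `--initialize_path`); `train_command` itself is a hypothetical helper.

```python
def train_command(data_paths, load_path=None, initialize_path=None):
    """Build the argument list for train.py.

    Several dataset directories are joined with commas, which is the
    form used above for multi-speaker training.
    """
    cmd = ["python3", "train.py", "--data_path=" + ",".join(data_paths)]
    if load_path:
        # Resume a previous experiment.
        cmd += ["--load_path", load_path]
    if initialize_path:
        # Start from the parameters of a well-trained model.
        cmd.append("--initialize_path=" + initialize_path)
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.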

4. Synthesize audio

You can synthesize audio from a trained model by running app.py:

python3 app.py --load_path logs/son-20171015 --num_speakers=1

or generate audio directly with:

python3 synthesizer.py --load_path logs/son-20171015 --text "이거 실화냐?"

4-1. Synthesizing non-Korean (English) audio

To generate non-Korean audio, you must pass the argument --is_korean=False.

python3 app.py --load_path logs/LJSpeech_1_0-20180108 --num_speakers=1 --is_korean=False
python3 synthesizer.py --load_path logs/LJSpeech_1_0-20180108 --text="Winter is coming." --is_korean=False
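
When synthesizing a mixed batch of sentences, the `--is_korean` flag could be chosen per sentence by checking for Hangul characters. The sketch below is an assumption about how one might wrap `synthesizer.py`, not part of the repository; `contains_hangul` and `synth_command` are hypothetical names, and a model still has to be trained for the language in question.

```python
def contains_hangul(text):
    """Return True if the text contains any Hangul syllable or jamo."""
    return any(
        "\uac00" <= ch <= "\ud7a3"      # Hangul syllables
        or "\u1100" <= ch <= "\u11ff"   # Hangul jamo
        or "\u3130" <= ch <= "\u318f"   # Hangul compatibility jamo
        for ch in text
    )


def synth_command(load_path, text):
    """Build the synthesizer.py argument list for one sentence."""
    cmd = ["python3", "synthesizer.py", "--load_path", load_path, "--text", text]
    if not contains_hangul(text):
        # Non-Korean text needs the flag described above.
        cmd.append("--is_korean=False")
    return cmd
```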

Results

Training attention on single speaker model:

Training attention on multi speaker model:

Disclaimer

This is not an official DEVSISTERS product. This project is not responsible for misuse or for any damage that you may cause. You agree that you use this software at your own risk.

References

Author

Taehoon Kim / @carpedm20
