Commit 270bf4e (0 parents): 55 changed files with 118,459 additions and 0 deletions.

**.gitignore**
# IDE configuration
.idea/

# configuration
config/*
!config/default.yaml
temp-restore.yaml

# logs, checkpoints
chkpt/
logs/

# just a temporary folder
temp/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
**LICENSE**
BSD 3-Clause License

Copyright (c) 2019, Seungwon Park 박승원
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
   list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
   this list of conditions and the following disclaimer in the documentation
   and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its
   contributors may be used to endorse or promote products derived from
   this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
**README.md**
# UnivNet

**UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation**

This is an unofficial PyTorch implementation of *Jang et al.* (Kakao), [UnivNet](https://arxiv.org/abs/2106.07889).

[![arXiv](https://img.shields.io/badge/arXiv-2106.07889-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2106.07889) [![githubio](https://img.shields.io/static/v1?message=Audio%20Samples&logo=Github&labelColor=grey&color=blue&logoColor=white&label=%20&style=flat-square)](https://mindslab-ai.github.io/univnet/) ![GitHub](https://img.shields.io/github/license/mindslab-ai/univnet)

## To-Do List

- [ ] Release checkpoint of pre-trained model
- [ ] Extract wav samples for audio sample page
- [ ] Add results including validation loss graph
## Key Features

<img src="docs/model_architecture.png" width="100%">

- According to the authors, UnivNet obtained the best objective results among recent GAN-based neural vocoders (including HiFi-GAN) and also outperformed HiFi-GAN in a subjective evaluation, with inference about 1.5 times faster than HiFi-GAN.

- This repository uses the same mel-spectrogram function as the [official HiFi-GAN](https://github.com/jik876/hifi-gan), which is compatible with [NVIDIA/tacotron2](https://github.com/NVIDIA/tacotron2).

- Our default mel-spectrogram hyperparameters follow the original paper:

```yaml
audio:
  n_mel_channels: 100
  filter_length: 1024
  hop_length: 256 # WARNING: this can't be changed.
  win_length: 1024
  sampling_rate: 24000
  mel_fmin: 0.0
  mel_fmax: 12000.0
```
You can modify the hyperparameters to be compatible with your acoustic model.
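For reference, the parameters above can be exercised with a self-contained mel-spectrogram sketch. This is an illustration only, not this repository's actual function (which follows the official HiFi-GAN implementation); the HTK-style mel filterbank below is an assumption and will differ numerically from librosa's Slaney-scaled one.

```python
import numpy as np

# Hyperparameters from the config above.
SR, N_FFT, HOP, WIN, N_MELS = 24000, 1024, 256, 1024, 100
FMIN, FMAX = 0.0, 12000.0

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank():
    # Triangular filters spaced evenly on the (HTK) mel scale.
    mels = np.linspace(hz_to_mel(FMIN), hz_to_mel(FMAX), N_MELS + 2)
    bins = np.floor((N_FFT + 1) * mel_to_hz(mels) / SR).astype(int)
    fb = np.zeros((N_MELS, N_FFT // 2 + 1))
    for i in range(N_MELS):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(wav):
    # Frame, window, FFT, then project magnitudes onto the mel filterbank.
    window = np.hanning(WIN)
    n_frames = 1 + (len(wav) - WIN) // HOP
    frames = np.stack([wav[i * HOP:i * HOP + WIN] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n=N_FFT, axis=1))  # (frames, 513)
    # Log compression with a floor, as is common in neural vocoder pipelines.
    return np.log(np.clip(mel_filterbank() @ mag.T, 1e-5, None))  # (100, frames)
```

One second of 24 kHz audio yields 90 frames here ((24000 - 1024) // 256 + 1); implementations that center-pad the signal produce slightly more.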
## Prerequisites

The implementation needs the following dependencies:

1. Python 3.6
2. [PyTorch](https://pytorch.org/) 1.6.0
3. [NumPy](https://numpy.org/) 1.17.4 and [SciPy](https://www.scipy.org/) 1.5.4
4. Other dependencies in [requirements.txt](./requirements.txt), installed with:

```bash
pip install -r requirements.txt
```
## Datasets

**Preparing Data**

- Download the training dataset. Any wav files with a sampling rate of 24,000 Hz will work; the original paper used LibriTTS.
  - LibriTTS train-clean-360 split [tar.gz link](https://www.openslr.org/resources/60/train-clean-360.tar.gz)
  - Unzip and place its contents under `datasets/LibriTTS/train-clean-360`.
- If you want to use wav files with a different sampling rate, please edit the configuration file (see below).

Note: The mel-spectrogram calculated from each audio file is saved as `**.mel` on first use and loaded from disk afterwards.
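The caching behaviour described in this note can be sketched as follows. This is a hypothetical helper, not the repository's loader; the `wav_path + ".mel"` naming and the NumPy serialization are assumptions for illustration.

```python
import os
import numpy as np

def load_mel(wav_path, compute_mel):
    # Caching pattern from the note above: compute the mel-spectrogram once,
    # store it beside the wav, and read it back from disk on later epochs.
    mel_path = wav_path + ".mel"  # hypothetical naming; the repo's scheme may differ
    if os.path.exists(mel_path):
        return np.load(mel_path)
    mel = compute_mel(wav_path)   # any function producing an ndarray mel
    with open(mel_path, "wb") as f:
        np.save(f, mel)           # a file object stops numpy appending ".npy"
    return mel
```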
**Preparing Metadata**

Following the format from [NVIDIA/tacotron2](https://github.com/NVIDIA/tacotron2), the metadata should be formatted as:

```
path_to_wav|transcript|speaker_id
path_to_wav|transcript|speaker_id
...
```

Train/validation metadata for the LibriTTS train-clean-360 split are already prepared in `datasets/metadata`; 5% of the train-clean-360 utterances were randomly sampled for validation.

Since this model is a vocoder, the transcripts are **NOT** used during training.
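A minimal parser for this metadata format might look like the following. This is a hypothetical helper, not part of the repo; `parse_metadata` and its return shape are illustrative only.

```python
from pathlib import Path

def parse_metadata(meta_path, root="datasets/"):
    """Parse NVIDIA/tacotron2-style lines: path_to_wav|transcript|speaker_id."""
    entries = []
    for line in Path(meta_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        wav, transcript, speaker = line.split("|")
        # For vocoder training only the wav path matters; the transcript is unused.
        entries.append({"wav": str(Path(root) / wav), "speaker": speaker})
    return entries
```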
## Train

**Preparing Configuration Files**

- Run `cp config/default.yaml config/config.yaml` and then edit `config.yaml`.

- Write down the root paths of the train/validation data in the `data` section. The data loader parses the list of files within each path recursively.
```yaml
data:
  train_dir: 'datasets/' # root path of train data (either relative or absolute path is ok)
  train_meta: 'metadata/libritts_train_clean_360_train.txt' # path of metadata file, relative to train_dir
  val_dir: 'datasets/' # root path of validation data
  val_meta: 'metadata/libritts_train_clean_360_val.txt' # path of metadata file, relative to val_dir
```

We provide the default metadata for the LibriTTS train-clean-360 split.
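The recursive parse mentioned above amounts to a recursive glob; a sketch with `pathlib` (illustrative only, not the repo's actual loader):

```python
from pathlib import Path

def list_wavs(root):
    # Recursively collect every .wav under root, in a deterministic order.
    return sorted(Path(root).rglob("*.wav"))
```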
- Modify `channel_size` in `gen` to switch between UnivNet-c16 and UnivNet-c32.

```yaml
gen:
  noise_dim: 64
  channel_size: 32 # 32 or 16
  dilations: [1, 3, 9, 27]
  strides: [8, 8, 4]
  lReLU_slope: 0.2
```
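Note that the generator's upsampling strides must multiply out to the mel hop size, which is why `hop_length: 256` is marked as unchangeable in the audio config: each mel frame has to expand into exactly `hop_length` waveform samples.

```python
import math

strides = [8, 8, 4]   # generator upsampling strides from the config above
hop_length = 256      # mel hop size from the audio config

# One mel frame passes through all stride stages, so the total upsampling
# factor is the product of the strides and must equal hop_length.
upsample_factor = math.prod(strides)
assert upsample_factor == hop_length
```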
**Training**

```bash
python trainer.py -c CONFIG_YAML_FILE -n NAME_OF_THE_RUN
```

**Tensorboard**

```bash
tensorboard --logdir logs/
```

If you are running TensorBoard on a remote machine, you can open the TensorBoard page from outside by adding the `--bind_all` option.
## Inference

```bash
python inference.py -p CHECKPOINT_PATH -i INPUT_MEL_PATH
```
## Pre-trained Model

A pre-trained model will be released soon. The model was trained on the LibriTTS train-clean-360 split.
## Results

See audio samples at https://mindslab-ai.github.io/univnet/

**Comparison with the results in the paper**

| Model                          | MOS       | PESQ(↑) | RMSE(↓) |
| :----------------------------: | :-------: | :-----: | :-----: |
| Recordings                     | 4.16±0.09 | 4.50    | 0.000   |
| Results in paper (UnivNet-c32) | 3.93±0.09 | 3.70    | 0.316   |
| **Ours (UnivNet-c32)**         | -         | **TBD** | **TBD** |
## Note

This code is an unofficial implementation, so there may be some differences from the original paper:

- Our UnivNet generator has fewer parameters (c32: 5.11M, c16: 1.42M) than reported in the paper (c32: 14.89M, c16: 4.00M). So far, we have not encountered any issues caused by the smaller model size. If you run into any problems, please report them as an issue.
## Implementation Authors

- [Kang-wook Kim](https://github.com/wookladin) @ [MINDsLab Inc.](https://maum.ai/) (<a href="mailto:full324@snu.ac.kr">full324@snu.ac.kr</a>, <a href="mailto:kwkim@mindslab.ai">kwkim@mindslab.ai</a>)
- [Wonbin Jung](https://github.com/Wonbin-Jung) @ [MINDsLab Inc.](https://maum.ai/) (<a href="mailto:santabin@kaist.ac.kr">santabin@kaist.ac.kr</a>, <a href="mailto:wbjung@mindslab.ai">wbjung@mindslab.ai</a>)

Special thanks to

- [Seungu Han](https://github.com/Seungwoo0326) @ [MINDsLab Inc.](https://maum.ai/)
- [Junhyeok Lee](https://github.com/junjun3518) @ [MINDsLab Inc.](https://maum.ai/)
- [Sang Hoon Woo](https://github.com/tonyswoo) @ [MINDsLab Inc.](https://maum.ai/)
## License

This code is licensed under the BSD 3-Clause License.

We referred to the following code and repositories:

- The overall structure of the repository is based on [https://github.com/seungwonpark/melgan](https://github.com/seungwonpark/melgan).
- [datasets/dataloader.py](./datasets/dataloader.py) from https://github.com/NVIDIA/waveglow (BSD 3-Clause License)
- [model/mpd.py](./model/mpd.py) from https://github.com/jik876/hifi-gan (MIT License)
- [model/lvcnet.py](./model/lvcnet.py) from https://github.com/zceng/LVCNet (Apache License 2.0)
- [utils/stft_loss.py](./utils/stft_loss.py) (Copyright 2019 Tomoki Hayashi, MIT License)
## References

Papers

- *Jang et al.*, [UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation](https://arxiv.org/abs/2106.07889)
- *Zeng et al.*, [LVCNet: Efficient Condition-Dependent Modeling Network for Waveform Generation](https://arxiv.org/abs/2102.10815)
- *Kong et al.*, [HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis](https://arxiv.org/abs/2010.05646)

Datasets

- [LibriTTS](https://openslr.org/60/)