FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion


In this paper, we adopt the end-to-end framework of VITS for high-quality waveform reconstruction and propose strategies for extracting clean content information without text annotation. We disentangle content information by imposing an information bottleneck on WavLM features, and we propose spectrogram-resize (SR) based data augmentation to improve the purity of the extracted content information.
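To make the spectrogram-resize idea more concrete, here is a minimal sketch of one plausible reading of it: each mel-spectrogram frame is stretched or squeezed along the frequency axis, then cropped or padded back to the original number of bins. This is an assumption for illustration, not the repo's implementation (see preprocess_sr.py for that). The names `resize_column`, `vertical_sr`, and `noise_std` are hypothetical, and plain lists are used only to keep the example free of dependencies.

```python
import random


def resize_column(column, new_len):
    """Linearly interpolate a list of frequency-bin values to new_len bins."""
    old_len = len(column)
    if new_len == 1 or old_len == 1:
        return [column[0]] * new_len
    scale = (old_len - 1) / (new_len - 1)
    out = []
    for i in range(new_len):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, old_len - 1)
        frac = pos - lo
        out.append(column[lo] * (1 - frac) + column[hi] * frac)
    return out


def vertical_sr(frames, new_height, noise_std=0.0, rng=None):
    """Resize each frame along frequency, then restore the original bin count.

    frames: list of frames, each a list of mel-bin values ordered from
    low to high frequency (hypothetical layout for this sketch).
    """
    rng = rng or random.Random(0)
    height = len(frames[0])
    result = []
    for frame in frames:
        resized = resize_column(frame, new_height)
        if new_height >= height:
            # Stretched: drop the extra bins at the high-frequency end.
            resized = resized[:height]
        else:
            # Squeezed: pad the high-frequency end with the top value plus noise.
            top = resized[-1]
            pad = [top + rng.gauss(0.0, noise_std)
                   for _ in range(height - new_height)]
            resized = resized + pad
        result.append(resized)
    return result


frames = [[0.0, 1.0, 2.0, 3.0]]
print(vertical_sr(frames, 7))  # [[0.0, 0.5, 1.0, 1.5]]
print(vertical_sr(frames, 3))  # [[0.0, 1.5, 3.0, 3.0]]
```

Stretching shifts spectral content upward and squeezing shifts it downward, which changes speaker-dependent characteristics while leaving the frame order, and therefore the timing of the content, untouched.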

🤗 Play online at HuggingFace Spaces.

Visit our demo page for audio samples.

We also provide the pretrained models.

Figure: (a) Training, (b) Inference

Updates

  • Code release. (Nov 27, 2022)
  • Online demo at HuggingFace Spaces. (Dec 14, 2022)
  • Supports 24kHz outputs. See here for details. (Dec 15, 2022)
  • Fix data loading bug. (Jan 10, 2023)

Pre-requisites

  1. Clone this repo: git clone https://github.com/OlaWod/FreeVC.git

  2. Change into the repo directory: cd FreeVC

  3. Install Python requirements: pip install -r requirements.txt

  4. Download WavLM-Large and put it under the 'wavlm/' directory

  5. Download the VCTK dataset (for training only)

  6. Download the HiFi-GAN model and put it under the 'hifigan/' directory (only for training with SR-based augmentation)
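Before running inference or training, it can help to confirm that the downloaded models landed in the expected directories. The sketch below only checks that 'wavlm/' (and optionally 'hifigan/') exist and are non-empty. The function name `missing_prerequisites` and the `with_sr` flag are hypothetical, and the check deliberately makes no assumption about checkpoint file names.

```python
from pathlib import Path


def missing_prerequisites(repo_root, with_sr=False):
    """Return the names of required model directories that are absent or empty."""
    required = ["wavlm"]
    if with_sr:
        # HiFi-GAN is only needed for training with SR-based augmentation.
        required.append("hifigan")
    missing = []
    for name in required:
        directory = Path(repo_root) / name
        if not directory.is_dir() or not any(directory.iterdir()):
            missing.append(name)
    return missing
```

Calling `missing_prerequisites(".", with_sr=True)` from the repo root returns an empty list when both directories are populated.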

Inference Example

Download the pretrained checkpoints and run:

# inference with FreeVC
CUDA_VISIBLE_DEVICES=0 python convert.py --hpfile logs/freevc.json --ptfile checkpoints/freevc.pth --txtpath convert.txt --outdir outputs/freevc

# inference with FreeVC-s
CUDA_VISIBLE_DEVICES=0 python convert.py --hpfile logs/freevc-s.json --ptfile checkpoints/freevc-s.pth --txtpath convert.txt --outdir outputs/freevc-s
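The commands above read conversion pairs from convert.txt. As a hedged sketch, the helper below writes such a file under the assumption that each line has the form `title|source_wav|target_wav`. Check convert.py and the sample convert.txt in the repo before relying on this layout. The function name `write_convert_list` is hypothetical.

```python
from pathlib import Path


def write_convert_list(pairs, txt_path):
    """Write (title, source_wav, target_wav) triples, one per line.

    Assumed line format: title|source_wav|target_wav
    """
    lines = []
    for title, src, tgt in pairs:
        for field in (title, src, tgt):
            if "|" in field:
                raise ValueError(f"field must not contain '|': {field!r}")
        lines.append(f"{title}|{src}|{tgt}")
    Path(txt_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

For example, `write_convert_list([("demo", "src.wav", "tgt.wav")], "convert.txt")` produces a one-line file that could then be passed via `--txtpath convert.txt`. The wav names here are placeholders.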

Training Example

  1. Preprocess
python downsample.py --in_dir </path/to/VCTK/wavs>
ln -s dataset/vctk-16k DUMMY

# run this if you want a different train-val-test split
python preprocess_flist.py

# run this if you want to use pretrained speaker encoder
CUDA_VISIBLE_DEVICES=0 python preprocess_spk.py

# run this if you want to train without SR-based augmentation
CUDA_VISIBLE_DEVICES=0 python preprocess_ssl.py

# run these if you want to train with SR-based augmentation
CUDA_VISIBLE_DEVICES=1 python preprocess_sr.py --min 68 --max 72
CUDA_VISIBLE_DEVICES=1 python preprocess_sr.py --min 73 --max 76
CUDA_VISIBLE_DEVICES=2 python preprocess_sr.py --min 77 --max 80
CUDA_VISIBLE_DEVICES=2 python preprocess_sr.py --min 81 --max 84
CUDA_VISIBLE_DEVICES=3 python preprocess_sr.py --min 85 --max 88
CUDA_VISIBLE_DEVICES=3 python preprocess_sr.py --min 89 --max 92
  2. Train
# train freevc
CUDA_VISIBLE_DEVICES=0 python train.py -c configs/freevc.json -m freevc

# train freevc-s
CUDA_VISIBLE_DEVICES=2 python train.py -c configs/freevc-s.json -m freevc-s
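The six preprocess_sr.py commands in the Preprocess step split the range 68 to 92 into contiguous chunks, two per GPU. The sketch below regenerates that kind of split for a different number of GPUs. It assumes `--min` and `--max` are both inclusive, which the non-overlapping ranges above suggest but which should be confirmed in preprocess_sr.py. The helper names `split_range` and `sr_commands` are hypothetical.

```python
def split_range(lo, hi, n_chunks):
    """Split the inclusive range [lo, hi] into n_chunks contiguous inclusive chunks."""
    total = hi - lo + 1
    base, extra = divmod(total, n_chunks)
    chunks, start = [], lo
    for i in range(n_chunks):
        # The first `extra` chunks take one additional value each.
        size = base + (1 if i < extra else 0)
        chunks.append((start, start + size - 1))
        start += size
    return chunks


def sr_commands(lo, hi, gpus, jobs_per_gpu=2):
    """Build one preprocess_sr.py command per chunk, assigning chunks to GPUs in order."""
    chunks = split_range(lo, hi, len(gpus) * jobs_per_gpu)
    commands = []
    for i, (a, b) in enumerate(chunks):
        gpu = gpus[i // jobs_per_gpu]
        commands.append(
            f"CUDA_VISIBLE_DEVICES={gpu} python preprocess_sr.py --min {a} --max {b}"
        )
    return commands


for command in sr_commands(68, 92, gpus=[1, 2, 3]):
    print(command)
```

With `gpus=[1, 2, 3]` this prints the same six commands listed in the Preprocess step. Passing a different GPU list or `jobs_per_gpu` rebalances the chunks.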

References