Disentangled Representation Learning and Generative Adversarial Networks for Emotional Voice Cloning
- Given a recorded speech sample, we would like to generate new samples in which qualitative aspects such as the speaker's voice timbre, prosody or emotion are altered.
- Naively applying state-of-the-art image style transfer GANs does not deliver good results, because such models are generally not well suited to sequential data like speech.
- The goal is to design an adversarially trained network capable of generating high-quality speech samples in our setting.
- SpeechSplit is an autoencoder neural network which decomposes speech into disentangled latent representations corresponding to four main perceptual aspects of speech: pitch, rhythm, lexical content and the speaker's voice timbre.
- The latents can be synthesized back into speech, hence it may be possible to perform style transfer simply by generating and substituting some of the latents before synthesizing the altered sample.
- The authors of SpeechSplit confirm that this method works when the latents are swapped between parallel utterances (i.e. actual recordings of people uttering the same sentence).
- We show that some latents can also be successfully generated by a relatively simple GAN (see the sketch after this list), which yields a significant sample-quality improvement over baseline end-to-end GANs in our voice style transfer task.
- In other words, the SpeechSplit autoencoder is used in our proposed model to simplify the structure of the data so that it can be more easily captured by a GAN.
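To make "a relatively simple GAN on latents" concrete, here is a minimal sketch of a GAN that operates on fixed-size latent code vectors instead of raw spectrograms. The dimensions, network shapes and the single training step are illustrative assumptions only, not the actual CodeGAN/VoiceGAN architectures used in this repository.

```python
# Illustrative sketch only: a tiny GAN over fixed-size latent code vectors
# (stand-ins for SpeechSplit codes or speaker embeddings). Sizes, networks
# and the training step below are assumptions, not the real CodeGAN/VoiceGAN.
import torch
import torch.nn as nn

NOISE_DIM, CODE_DIM = 64, 256  # assumed sizes

generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 512), nn.ReLU(),
    nn.Linear(512, CODE_DIM),
)
discriminator = nn.Sequential(
    nn.Linear(CODE_DIM, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 1),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

# Placeholder "real" codes; in the actual pipeline these would be latents
# produced by the frozen SpeechSplit encoder (or Resemblyzer embeddings).
real_codes = torch.randn(32, CODE_DIM)

# One discriminator update.
fake_codes = generator(torch.randn(32, NOISE_DIM)).detach()
d_loss = bce(discriminator(real_codes), torch.ones(32, 1)) + \
         bce(discriminator(fake_codes), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# One generator update.
fake_codes = generator(torch.randn(32, NOISE_DIM))
g_loss = bce(discriminator(fake_codes), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

Working in this low-dimensional code space is what lets such a small generator compete with end-to-end GANs that must model entire spectrograms.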
- A mel-spectrogram and a pitch contour are extracted from the raw waveform (see the first sketch after this list).
- Resemblyzer (an independent neural network trained on a speaker verification task) computes a speaker embedding (a vector giving a high-level representation of the speaker's voice) from a mel-spectrogram (second sketch below).
- Style codes for pitch, rhythm and lexical content are provided by the SpeechSplit encoder.
- New latent representations for pitch and rhythm are generated by CodeGAN, whereas new speaker embeddings are sampled from VoiceGAN.
- The SpeechSplit decoder synthesizes the output mel-spectrogram from the speaker embedding and the pitch, rhythm and content codes.
- Mel-spectrograms are converted to the output waveform by WaveGlow (third sketch below).
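The feature-extraction step could look roughly like the sketch below, using librosa for the mel-spectrogram and pyworld for the F0 contour. The sample rate, FFT/hop sizes, frame period and mel-bin count are assumptions made for illustration; the repository's own preprocessing may use different values.

```python
# Hedged sketch of feature extraction with librosa and pyworld.
# "sample.wav" is a placeholder input path; parameter values are assumed.
import librosa
import numpy as np
import pyworld as pw

y, sr = librosa.load("sample.wav", sr=16000)

# 80-bin log-mel-spectrogram.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel)

# Pitch (F0) contour via WORLD: coarse estimate with DIO, refined by StoneMask.
y64 = y.astype(np.float64)
f0, t = pw.dio(y64, sr, frame_period=16.0)  # 16 ms frames (assumed)
f0 = pw.stonemask(y64, f0, t, sr)

print(log_mel.shape, f0.shape)
```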
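The speaker-embedding step can be sketched with Resemblyzer's public API. Note that `VoiceEncoder.embed_utterance` takes a preprocessed waveform and derives the mel-like features internally; the file path below is only a placeholder.

```python
# Hedged sketch of the speaker-embedding step with Resemblyzer.
# "sample.wav" is a placeholder path, not a file shipped with this repo.
from resemblyzer import VoiceEncoder, preprocess_wav

wav = preprocess_wav("sample.wav")        # resample, normalize, trim silence
encoder = VoiceEncoder()                  # pretrained speaker-verification model
embedding = encoder.embed_utterance(wav)  # 256-dim L2-normalized vector
print(embedding.shape)
```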
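For the final vocoding step, one convenient way to obtain a pretrained WaveGlow (trained on LJSpeech, matching the training setup described below) is NVIDIA's torch.hub entry point. The random mel tensor is only a placeholder; in the real pipeline the SpeechSplit decoder's output mel-spectrogram (with matching normalization) would be passed in, and this repository may load its own checkpoint instead.

```python
# Hedged sketch of mel-to-waveform synthesis with a pretrained WaveGlow
# loaded from NVIDIA's torch.hub (needs a CUDA device and internet access).
# The random mel below is only a placeholder input.
import torch

waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
                          'nvidia_waveglow', model_math='fp32')
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to('cuda').eval()

mel = torch.randn(1, 80, 400, device='cuda')  # placeholder (batch, mels, frames)
with torch.no_grad():
    audio = waveglow.infer(mel)               # waveform samples (LJSpeech models use 22.05 kHz)
print(audio.shape)
```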
See requirements.txt
git clone https://github.com/alexbrx/emo-clone.git
cd emo-clone
bash setup.sh
pip install -r requirements.txt
python utils/fake_cvoice_samples.py
- IEMOCAP
- VCTK
- Common Voice
- LJSpeech
- SpeechSplit is trained on VCTK and fine-tuned on IEMOCAP, after which its weights are frozen (see the sketch after this list).
- Latent codes are computed by the frozen SpeechSplit for the IEMOCAP samples; CodeGAN is subsequently trained on these codes.
- A dataset of speaker embeddings is created from the Common Voice dataset; VoiceGAN is subsequently trained on these embeddings.
- The WaveGlow vocoder is independently trained on LJSpeech to convert mel-spectrograms into waveforms.
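The "freeze, then precompute" pattern from the first two points can be sketched as follows. The tiny GRU stands in for the fine-tuned SpeechSplit encoder and the random tensors stand in for IEMOCAP mel-spectrograms; the real model, inputs and file layout in this repository will differ.

```python
# Hedged sketch of freezing a fine-tuned encoder and caching its latent codes.
# The GRU and random mels are stand-ins, not the real SpeechSplit or IEMOCAP data.
import numpy as np
import torch
import torch.nn as nn

encoder = nn.GRU(input_size=80, hidden_size=8, batch_first=True)  # stand-in encoder

# Freeze the fine-tuned weights so later GAN training cannot alter them.
for p in encoder.parameters():
    p.requires_grad = False
encoder.eval()

# Precompute latent codes for a (placeholder) set of mel-spectrograms and
# cache them to disk; CodeGAN would then be trained on the cached codes.
mels = [torch.randn(1, 128, 80) for _ in range(4)]  # placeholder utterances
with torch.no_grad():
    codes = [encoder(m)[0].squeeze(0).numpy() for m in mels]
np.save("iemocap_codes.npy", np.stack(codes))
```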