sosuperic/speakerboxxx

speakerboxxx

Clean data, extract features, and train the backend of a deep LSTM-based speech synthesizer

Overview

  • Pre-process data by extracting linguistic input features, duration target features (phone durations), and acoustic target features
  • Duration model
  • Acoustic model
  • Test
    • Loss: calculate the loss of each model on the test set
    • Full pipeline: pass the linguistic features of the test set into the duration model, then use that output plus the linguistic features as input to the acoustic model. The acoustic (spectral) features generated by the acoustic model are then fed into the WORLD vocoder
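
The full-pipeline test can be sketched roughly as follows. This is a minimal Python illustration, not the repository's Lua code: `duration_model` and `acoustic_model` are hypothetical callables, and repeating each phone's features once per predicted frame (with a relative-position value appended) is only an assumed way of combining the duration output with the linguistic features.

```python
def run_full_pipeline(linguistic_feats, duration_model, acoustic_model):
    """Chain the two models: phone-level features -> durations -> acoustic features.

    linguistic_feats: list of per-phone feature vectors (lists of floats).
    duration_model:   hypothetical callable, per-phone features -> frame counts.
    acoustic_model:   hypothetical callable, per-frame inputs -> acoustic vectors.
    """
    durations = duration_model(linguistic_feats)
    frame_inputs = []
    for feats, n_frames in zip(linguistic_feats, durations):
        n_frames = max(1, int(round(n_frames)))  # assumed: at least one frame per phone
        for i in range(n_frames):
            # Assumed input layout: phone features plus position within the phone.
            frame_inputs.append(list(feats) + [i / n_frames])
    # The returned acoustic features would then be handed to the WORLD vocoder.
    return acoustic_model(frame_inputs)
```

With stub models such as `lambda f: [2, 3]` and `lambda x: x`, two phones expand into five frame-level input vectors.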

Notable dependencies

  • Python: data munging
  • Lua and Torch: data munging, training neural nets
  • Julia: interface to C++ WORLD vocoder, extracting acoustic features
  • HTK, Prosodylab-Aligner, textgrid: forced alignment on prompt and wav to get phone and word timings

Datasets

  • CMUArctic
    • 4 sets of 1 hour single-speaker, 2 male 2 female
    • phonetic timing labels provided, word timing tables not provided
  • Blizzard2013
    • 19 hours single speaker, female
    • phonetic and word timing labels extracted using Prosodylab-Aligner

Using each dataset

  • To generate and save input features / targets:
    • Blizzard2013:
      python BlizzardFeatureExtractor.py 
      
    • CMUArctic:
      python ArcticFeatureExtractor.py
      julia ArcticAcousticFeatureExtractor.jl
      
  • To create train, valid, test splits:
    • Blizzard2013:
      th DataSplitter.lua -dataset blizzard2013
      
    • CMUArctic:
      th DataSplitter.lua -dataset cmuarctic
      
  • When training / testing with main.lua, pass "cmuarctic" or "blizzard2013" to the 'dataset' flag

Force-aligning to get phone and word times

This is used for Blizzard2013 and future datasets where only wav files and prompts are provided.

Installing HTK, Prosodylab-Aligner, textgrid

NOTE on lexicon when force-aligning

  • ProsodyLab-Aligner uses CMUDict for grapheme-to-phoneme (g2p) conversion. When the corpus being aligned contains out-of-vocabulary (OOV) words, it produces an OOV.txt file and fails.
  • To train on Blizzard2013, which does contain OOV words, there are two options:
      1. Avoid all prompts that have OOV words
      2. Provide ProsodyLab-Aligner with a dictionary of g2p mappings for the OOV words (check its README for instructions)
    • I have chosen option 1 because a) 7438 prompts still remain, down from the original 9734, and b) the LSM speech corpus is being created using FestVox, which also uses CMUDict
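
Option 1 amounts to filtering prompts against the lexicon before alignment. A minimal Python sketch, assuming prompts are held in a dict and the lexicon is a set of upper-case words; the function name and the simple regex tokenization are illustrative and not taken from this repository or from ProsodyLab-Aligner:

```python
import re

def split_by_oov(prompts, lexicon):
    """Split prompts into (kept, dropped) by whether every word is in the lexicon.

    prompts: dict mapping a prompt id to its text.
    lexicon: set of upper-case words, e.g. the entries of CMUDict.
    """
    kept, dropped = {}, {}
    for prompt_id, text in prompts.items():
        # Assumed tokenization: runs of letters and apostrophes, upper-cased.
        words = re.findall(r"[A-Za-z']+", text.upper())
        if all(word in lexicon for word in words):
            kept[prompt_id] = text
        else:
            dropped[prompt_id] = text
    return kept, dropped
```

Keeping the dropped prompts around makes it easy to revisit option 2 later for just those utterances.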

NOTE on silences

  • Force-alignments include silent phones and silent words
  • These occur both in the beginning/end and in the middle
  • Currently, these are handled by
    • ignoring the beginning/end silences (e.g. when extracting target durations, these are skipped)
    • splitting the middle silences (e.g. the time for that silence is split between the previous and the next phone/word)
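
The silence handling described above could look roughly like this in Python. It is a sketch under assumptions: the silence label `"sil"` and the equal split of a middle silence between its neighbours are illustrative choices, not confirmed details of the repository's code.

```python
SILENCE = "sil"  # assumed silence label; the aligner's actual label may differ

def handle_silences(segments):
    """segments: list of (label, start, end) tuples in time order.

    Drops leading/trailing silences and splits each interior silence
    equally between the previous and the next segment.
    """
    segs = [list(s) for s in segments]
    # Ignore silences at the beginning and end.
    while segs and segs[0][0] == SILENCE:
        segs.pop(0)
    while segs and segs[-1][0] == SILENCE:
        segs.pop()
    out = []
    sil_start = None  # start time of the silence run currently being skipped
    for label, start, end in segs:
        if label == SILENCE:
            if sil_start is None:
                sil_start = start
            continue
        if sil_start is not None:
            # Give half of the silence to the previous segment, half to this one.
            mid = (sil_start + start) / 2.0
            out[-1][2] = mid
            start = mid
            sil_start = None
        out.append([label, start, end])
    return [tuple(s) for s in out]
```

The same function applies to phone-level and word-level alignments, since both are just labelled time intervals.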

Features

  • Linguistic Input
    • Dimension:
    • 39 (phone), etc.
  • Duration Target
    • Dimension: (# of phonemes in seq) x 1
  • Acoustic Target
    • Dimension: (# of frames) x (1 + 1 + sp + ap)
      • 1 for voiced/unvoiced, 1 for f0, sp = TODO, ap = TODO
    • # of frames = (length of spoken phonemes in ms) / 5 ms
      • Approximately. In reality, each phoneme's length is divided by 5 and rounded
    • where (length of spoken phonemes) excludes silent phonemes
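
The frame count can be illustrated with a short Python sketch. The silence labels `"sil"` and `"sp"` are assumptions, and Python's `round` resolves exact halves to the nearest even integer, which may differ from the rounding used in the actual feature extractors.

```python
FRAME_MS = 5.0  # frame shift stated above

def frames_per_phone(phones, silence_labels=("sil", "sp")):
    """phones: list of (label, duration_ms) pairs.

    Returns one frame count per spoken (non-silent) phone.
    """
    return [
        int(round(duration_ms / FRAME_MS))
        for label, duration_ms in phones
        if label not in silence_labels
    ]

def total_frames(phones, silence_labels=("sil", "sp")):
    """Number of rows expected in the acoustic target for this utterance."""
    return sum(frames_per_phone(phones, silence_labels))
```

Because each phone is rounded separately, the total can differ by a few frames from dividing the whole spoken length by 5 ms.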

Running

These commands are also saved among the command-line examples in main.lua

Training (best parameters)

Duration:

th main.lua  -gpuid 0 -model duration -notes linear254to256_linear256to128_lstm128to128_linear128to1 -save_model_every_epoch 10 -maxepochs 100 -lr 0.001 -method adam

Acoustic:

th main.lua  -gpuid 1 -maxepochs 300 -save_model_every_epoch 10 -lr 0.0005 -method adam -model acoustic -notes linear254to512_linear512to512_lstm512to256_lstm256to256_linear256to84__QUINPHONE_f0INTERPOLATE

Testing

Shannon, with GPU:

th main.lua -gpuid 0 -mode test -load_duration_model_path models/duration/2016_8_3___15_5_38/net_e100.t7 -load_acoustic_model_path models/acoustic/2016_8_3___17_24_13/net_e270.t7

Local, no GPU:

th main.lua -mode test -load_duration_model_path models/duration/2016_7_20___5_16_19/net_e9.t7 -load_acoustic_model_path models/acoustic/2016_7_20___14_30_32/net_e1.t7

Current best model, features, and run parameters, etc.

Features:

  • Quinphone identities
  • Interpolating F0
  • Silencing / not silencing
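
As an illustration of F0 interpolation, here is a minimal Python sketch. It assumes unvoiced frames are marked with an F0 of 0.0 and fills them by linear interpolation between the surrounding voiced frames; the repository's actual interpolation scheme may differ.

```python
def interpolate_f0(f0):
    """Linearly interpolate F0 across unvoiced frames (assumed to be 0.0).

    Returns (interpolated_f0, voiced_flags). Leading and trailing unvoiced
    frames are filled with the nearest voiced value.
    """
    voiced = [value > 0 for value in f0]
    idx = [i for i, is_voiced in enumerate(voiced) if is_voiced]
    if not idx:
        return list(f0), voiced  # no voiced frames to interpolate from
    out = list(f0)
    for i in range(idx[0]):
        out[i] = f0[idx[0]]
    for i in range(idx[-1] + 1, len(f0)):
        out[i] = f0[idx[-1]]
    # Fill each gap between consecutive voiced frames.
    for left, right in zip(idx, idx[1:]):
        gap = right - left
        for k in range(1, gap):
            out[left + k] = f0[left] + (f0[right] - f0[left]) * k / gap
    return out, voiced
```

The voiced flags are returned separately so the voiced/unvoiced target is not lost when the F0 track is made continuous.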

Run parameters:

  • Adam

TODO

To try improving performance

  • Feature normalization
    • Acoustic
    • Linguistic in [0.01, 0.99]
  • More linguistic features
    • morpheme-level
    • lexical stress
    • distance from stressed/accented syllable
    • position of syllable in utterance (as opposed to just position of syllable in word)
    • POS of current/preceding/following word
  • Mel-cepstral distortion (MCD) loss instead of MSE
  • Parameter generation
  • Implement Adadec (mentioned in a few papers)
  • Remove silence frames (mentioned in a paper or two)
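
For the feature-normalization item, a min-max scaling of the linguistic features into [0.01, 0.99] could be sketched in Python as follows. The function names are illustrative, and fitting the statistics on the training split only is an assumed (though common) practice, not something this README specifies.

```python
def fit_min_max(rows):
    """Per-dimension minimum and maximum over equal-length feature vectors."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]

def scale_to_range(rows, mins, maxs, lo=0.01, hi=0.99):
    """Map each dimension linearly into [lo, hi]; constant dimensions map to lo."""
    scaled = []
    for row in rows:
        new_row = []
        for x, mn, mx in zip(row, mins, maxs):
            if mx == mn:
                new_row.append(lo)  # avoid division by zero
            else:
                new_row.append(lo + (x - mn) * (hi - lo) / (mx - mn))
        scaled.append(new_row)
    return scaled
```

Values in the validation or test split that fall outside the training range will land slightly outside [0.01, 0.99] unless they are clipped.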

Experiments

  • Multiple speakers
    • Add binary feature to loss that predicts which speaker
    • Can also be used to test deep density mixture
    • Pass speaker id as one hot encoded vector
  • phone2vec
  • How does 1 hour vs 5 hour vs 10 hour affect performance?
  • Softmax classification for Log F0 values instead of regression (inspired by PixelRNN)
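
To make the last idea concrete, softmax classification needs log F0 values turned into discrete classes. A minimal Python sketch, where the 50 to 500 Hz range and the 256 bins are assumed values chosen purely for illustration, and only voiced frames (F0 > 0) are expected as input:

```python
import math

def make_log_f0_quantizer(f0_min=50.0, f0_max=500.0, n_bins=256):
    """Return (to_class, to_hz) for uniform bins in log F0.

    The range and bin count are assumed values, not taken from this repository.
    """
    lo, hi = math.log(f0_min), math.log(f0_max)
    width = (hi - lo) / n_bins

    def to_class(f0_hz):
        # Clamp to the covered range, then index the bin.
        x = min(max(math.log(f0_hz), lo), hi)
        return min(int((x - lo) / width), n_bins - 1)

    def to_hz(cls):
        # Decode a class index to its bin centre.
        return math.exp(lo + (cls + 0.5) * width)

    return to_class, to_hz
```

The class indices would serve as targets for a cross-entropy loss, and `to_hz` maps a predicted class back to a frequency at synthesis time.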

For production on Android

  • Port models to TensorFlow for easy Android integration
  • Write sp2mc for WORLD in C++
  • Build phone-syllable-word contexts from raw text, not just on datasets with forced alignment
    • Requires use of g2p model
  • Shrink, quantize models

SE & other

  • config for dataset and which paths to use
  • better way to keep track of how different features affect performance, e.g. storing features/outputs trained with different features in a folder whose name includes those features (now that quinphones in linguistic inputs are definitively better, it shouldn't really be called linguistic_inputs_plus anymore)
  • some scripts for copying files to and from local and Shannon
    • For example, testing full pipeline is done on Shannon, but generation of wav files is done locally because installing Julia requires some upgrades I don't want to make (plus we don't need to take up space on Shannon)

Other notes

Phoneset: ['AA', 'AE', 'AH', 'AO', 'AW', 'AY', 'B', 'CH', 'D', 'DH', 'EH', 'ER', 'EY', 'F', 'G', 'HH', 'IH', 'IY', 'JH', 'K', 'L', 'M', 'N', 'NG', 'OW', 'OY', 'P', 'R', 'S', 'SH', 'T', 'TH', 'UH', 'UW', 'V', 'W', 'Y', 'Z', 'ZH']

POS set: ['VERB', 'NOUN', 'PRON', 'ADJ', 'ADV', 'ADP', 'CONJ', # 'DET', 'NUM', 'PRT', 'X', '.']
