Why do we add a blank token between phones?

Shivam Mehta edited this page Sep 3, 2024 · 1 revision

This was asked in an issue. I think it is useful for everyone to know, so I am moving it to the wiki.

TLDR: The idea comes from using multiple states per phone in Hidden Markov Model (HMM) based speech synthesisers for better modelling. [Our previous works, Neural-HMM and OverFlow, also used this.] Since Monotonic Alignment Search (MAS) (introduced in Glow-TTS) is a Viterbi approximation to the forward algorithm, the idea has its roots in the same literature. You can use multiple states to model the transition between different sounds.

More details: In the days of Statistical Parametric Speech Synthesis (SPSS) (you can read more about it here, in section 2.2 right below equation 2.28), people used multiple states to model each phoneme. They found it beneficial to model certain dynamic features with more states, which was especially useful for certain sounds, for example plosives (in English: p, t, k, b, d, g), where you have silence, a sudden burst of energy, and then silence again. Because each state had its own emission parameters, these sounds were hard to model with a left-to-right algorithm with no skips (like MAS) unless multiple states represented them.
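To make the "left-to-right with no skip" constraint concrete, here is a minimal pure-Python sketch of a Viterbi-style monotonic alignment. It is only an illustration, not the Glow-TTS or Matcha-TTS implementation: the function name `monotonic_alignment_search` and the `log_p` layout (one row per state, one column per frame) are assumptions made for this example.

```python
import math


def monotonic_alignment_search(log_p):
    """Hypothetical sketch: log_p[i][t] is the log-likelihood of frame t under state i.

    Returns the number of frames assigned to each state on the best
    left-to-right path with no skips (every state gets at least one frame).
    """
    n_states = len(log_p)
    n_frames = len(log_p[0])
    if n_states > n_frames:
        raise ValueError("need at least one frame per state")

    neg_inf = -math.inf
    # q[i][t]: best score of any valid path that is in state i at frame t
    q = [[neg_inf] * n_frames for _ in range(n_states)]
    q[0][0] = log_p[0][0]
    for t in range(1, n_frames):
        # state i cannot be reached before frame i, hence the min()
        for i in range(min(t + 1, n_states)):
            stay = q[i][t - 1]
            advance = q[i - 1][t - 1] if i > 0 else neg_inf
            q[i][t] = max(stay, advance) + log_p[i][t]

    # Backtrack from the last state at the last frame
    durations = [0] * n_states
    i = n_states - 1
    for t in range(n_frames - 1, -1, -1):
        durations[i] += 1
        if t > 0 and i > 0 and (i == t or q[i - 1][t - 1] > q[i][t - 1]):
            i -= 1
    return durations


# Toy scores: 3 states, 5 frames (values are made up for illustration)
toy_log_p = [
    [0, 0, -5, -5, -5],
    [-5, -5, 0, -5, -5],
    [-5, -5, -5, 0, 0],
]
toy_durations = monotonic_alignment_search(toy_log_p)
```

At each frame the path can only stay in the current state or advance to the next one, so a sound whose character changes over time can only be captured if more than one state is available for it.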

Modern neural network-based speech synthesisers are much more powerful approximators. So, the idea behind adding an extra state is to provide a placeholder for MAS to learn such dynamic variations and transitions between sounds. Two states per phone seem to be a nice compromise: the model can learn these dynamic variations when needed, or jump directly to the next sound when it doesn't need to learn that variation (some transitions don't need a gap between them). It also puts fewer tensors on the GPU than having 3 states per phone, as in HMM-based synthesisers.
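As a rough illustration of what "adding a blank token between phones" looks like in practice, here is a minimal sketch. The helper name `intersperse`, the `"_"` blank symbol, and the example phones are assumptions for this example (real pipelines typically work on integer symbol IDs rather than strings).

```python
def intersperse(symbols, blank):
    """Hypothetical helper: put `blank` before, between and after all symbols."""
    result = [blank] * (2 * len(symbols) + 1)
    result[1::2] = symbols  # original symbols land on the odd positions
    return result


phones = ["k", "ae", "t"]  # made-up phone sequence
with_blanks = intersperse(phones, "_")

# Sequence lengths the aligner has to handle for N phones
n_with_blanks = len(with_blanks)   # 2 * N + 1
n_three_states = 3 * len(phones)   # 3 states per phone, as in HMM-based systems
```

For N phones this gives 2N + 1 states, compared with 3N for three states per phone, which is where the memory saving mentioned above comes from once N is larger than 1.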