This is a composite speech synthesis model that simultaneously reconstructs a mel-spectrogram and a waveform from text. The model generates the waveform from symbol sequences separated by spaces. It is built on modified versions of the ForwardTacotron and MelGAN frameworks.
Metric | Value |
---|---|
Source framework | PyTorch* |
The text-to-speech-en-0001-duration-prediction model is a ForwardTacotron-based duration predictor for symbols.
Metric | Value |
---|---|
GFlops | 15.84 |
MParams | 13.569 |
- Sequence, name: `input_seq`, shape: `1, 512`, format: `B, C`, where:
  - `B` - batch size
  - `C` - number of symbols in sequence
- Mask for input sequence, name: `input_mask`, shape: `1, 1, 512`, format: `B, D, C`, where:
  - `B` - batch size
  - `D` - extra dimension for multiplication
  - `C` - number of symbols in sequence
- Mask for relative position representation in attention, name: `pos_mask`, shape: `1, 1, 512, 512`, format: `B, D, C, C`, where:
  - `B` - batch size
  - `D` - extra dimension for multiplication
  - `C` - number of symbols in sequence
- Duration for input symbols, name: `duration`, shape: `1, 512, 1`, format: `B, C, H`. Contains the predicted duration for each symbol in the sequence.
  - `B` - batch size
  - `C` - number of symbols in sequence
  - `H` - empty dimension
- Processed embeddings, name: `embeddings`, shape: `1, 512, 256`, format: `B, C, H`. Contains processed embeddings for each symbol in sequence.
  - `B` - batch size
  - `C` - number of symbols in sequence
  - `H` - height of the intermediate feature map
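
As a rough illustration of the input shapes listed above, the sketch below pads a sequence of symbol IDs to 512 entries and builds a matching `input_mask`. The symbol IDs, the padding value `0`, the dtypes, and the mask convention (1 for real symbols, 0 for padding) are assumptions made for illustration, not values taken from the model description. `pos_mask` is left out because its construction is not described here.

```python
import numpy as np

MAX_SYMBOLS = 512  # C: fixed sequence length expected by the model


def prepare_duration_inputs(symbol_ids, pad_id=0):
    """Pad symbol IDs to MAX_SYMBOLS and build a validity mask.

    pad_id, the dtypes, and the 1/0 mask convention are assumptions.
    """
    n = len(symbol_ids)
    if n > MAX_SYMBOLS:
        raise ValueError(f"sequence has {n} symbols, limit is {MAX_SYMBOLS}")

    # input_seq, format B, C
    input_seq = np.full((1, MAX_SYMBOLS), pad_id, dtype=np.int64)
    input_seq[0, :n] = symbol_ids

    # input_mask, format B, D, C
    input_mask = np.zeros((1, 1, MAX_SYMBOLS), dtype=np.float32)
    input_mask[0, 0, :n] = 1.0
    return input_seq, input_mask


seq, mask = prepare_duration_inputs([5, 12, 7, 3])  # hypothetical symbol IDs
print(seq.shape, mask.shape, int(mask.sum()))
```

Sequences longer than 512 symbols are rejected here rather than truncated; splitting long text into several requests is another option.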
The text-to-speech-en-0001-regression model accepts processed embeddings aligned by duration and produces a mel-spectrogram. For example, if the duration is [2, 3] and the processed embeddings are [[1, 2], [3, 4]], the aligned embeddings are [[1, 2], [1, 2], [3, 4], [3, 4], [3, 4]].
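
The alignment step can be sketched with NumPy: each embedding row is repeated as many times as its duration. This is a minimal sketch, assuming the predicted durations have already been rounded to non-negative integers. The helper name `align_by_duration` is hypothetical.

```python
import numpy as np


def align_by_duration(embeddings, durations):
    """Repeat each embedding row durations[i] times along the time axis."""
    durations = np.asarray(durations, dtype=np.int64)
    return np.repeat(np.asarray(embeddings), durations, axis=0)


aligned = align_by_duration([[1, 2], [3, 4]], [2, 3])
print(aligned.tolist())
# [[1, 2], [1, 2], [3, 4], [3, 4], [3, 4]]
```

The first row appears twice and the second three times, so the aligned length equals the sum of the durations.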
Metric | Value |
---|---|
GFlops | 7.65 |
MParams | 4.96 |
- Processed embeddings aligned by durations, name: `data`, shape: `1, 512, 256`, format: `B, T, C`, where:
  - `B` - batch size
  - `T` - time in mel-spectrogram
  - `C` - processed embedding dimension
- Mask for `data` by time dimension, name: `data_mask`, shape: `1, 1, 512`, format: `B, D, T`, where:
  - `B` - batch size
  - `D` - extra dimension for multiplication
  - `T` - time in mel-spectrogram
- Mask for relative position representation in attention, name: `pos_mask`, shape: `1, 1, 512, 512`, format: `B, D, C, C`, where:
  - `B` - batch size
  - `D` - extra dimension for multiplication
  - `C` - number of symbols in sequence

Mel-spectrogram, name: `mel`, shape: `80, 512`, format: `C, T`, where:

- `T` - time in mel-spectrogram
- `C` - number of rows in mel-spectrogram
The text-to-speech-en-0001-generation model is a MelGAN-based audio generator.
Metric | Value |
---|---|
GFlops | 48.38 |
MParams | 12.77 |
Mel-spectrogram, name: `mel`, shape: `1, 80, 128`, format: `B, C, T`, where:

- `B` - batch size
- `C` - number of rows in mel-spectrogram
- `T` - time in mel-spectrogram

Audio, name: `audio`, shape: `32768`, format: `T`, where:

- `T` - time in audio with a sampling rate of 22050 Hz (~1.5 sec)
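
The shapes above imply some simple arithmetic: one call to the generation model turns 128 mel frames into 32768 audio samples, that is, 256 samples per frame and about 1.49 seconds at 22050 Hz. The sketch below splits an `80, T` mel-spectrogram into `1, 80, 128` chunks. Non-overlapping chunking and zero padding of the last chunk are assumptions for illustration; an actual pipeline may window or pad differently. The helper name `split_mel` is hypothetical.

```python
import numpy as np

MEL_BINS = 80          # C: rows in the mel-spectrogram
CHUNK_FRAMES = 128     # T: frames accepted by the generation model per call
CHUNK_SAMPLES = 32768  # audio samples produced per call
SAMPLE_RATE = 22050    # Hz


def split_mel(mel, pad_value=0.0):
    """Split a (80, T) mel-spectrogram into a list of (1, 80, 128) chunks.

    The last chunk is padded with pad_value (an assumed padding value).
    """
    if mel.ndim != 2 or mel.shape[0] != MEL_BINS:
        raise ValueError(f"expected shape (80, T), got {mel.shape}")
    total = mel.shape[1]
    n_chunks = -(-total // CHUNK_FRAMES)  # ceiling division
    padded = np.full((MEL_BINS, n_chunks * CHUNK_FRAMES), pad_value, dtype=mel.dtype)
    padded[:, :total] = mel
    return [
        padded[None, :, i * CHUNK_FRAMES:(i + 1) * CHUNK_FRAMES]
        for i in range(n_chunks)
    ]


chunks = split_mel(np.zeros((80, 512), dtype=np.float32))
samples_per_frame = CHUNK_SAMPLES // CHUNK_FRAMES
seconds_per_chunk = CHUNK_SAMPLES / SAMPLE_RATE
print(len(chunks), chunks[0].shape, samples_per_frame, round(seconds_per_chunk, 3))
# 4 (1, 80, 128) 256 1.486
```

With this scheme, the full `80, 512` output of the regression model maps to four generator calls.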
The model can be used in the following demos provided by the Open Model Zoo to show its capabilities:
[*] Other names and brands may be claimed as the property of others.