-
Yeah, I believe that is the way to go. Probably also makes sense to have some sort of symbol-to-ID mapping and return int IDs for tokens instead. It should be created via a separate function/class, since it could be useful for other tasks. We'll also need a function that collates the token IDs. I have a prototype in another project but haven't had time to polish it for Lhotse:

```python
from typing import Dict, List, Optional, Union

import torch


def collate_text(
    token_sequences: List[List[str]],
    sym2int: Optional[Dict[str, int]] = None,
    add_eos: bool = False,
    pad_symbol: str = '<pad>',
    eos_symbol: str = '<eos>',
    unk_symbol: str = '<unk>'
) -> Union[List[List[str]], torch.Tensor]:
    # Pad every sequence to the length of the longest one (plus an optional EOS).
    max_len = len(max(token_sequences, key=len))
    seqs = [
        seq + ([eos_symbol] if add_eos else []) + [pad_symbol] * (max_len - len(seq))
        for seq in token_sequences
    ]
    if sym2int is not None:
        # Map symbols to int IDs, falling back to the UNK ID for out-of-vocabulary symbols,
        # and return the batch as an int64 tensor.
        seqs = [
            [sym2int[sym] if sym in sym2int else sym2int[unk_symbol] for sym in seq]
            for seq in seqs
        ]
        return torch.tensor(seqs, dtype=torch.int64)
    # Without a mapping, return the padded symbol sequences as-is.
    return seqs
```

In general, I would like this to eventually be general enough to handle transforms like G2P or tokenization, so that we can use any of graphemes/phonemes/subwords/words, but it is fine to start with supporting just the grapheme case.
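For illustration, a minimal sketch of what such a separate symbol-to-ID mapping could look like; the class name `SymbolTable`, its methods, and the special-symbol defaults are assumptions for this example, not an existing Lhotse API:

```python
from typing import Dict, Iterable, List


class SymbolTable:
    """Hypothetical symbol-to-ID mapping built from token sequences."""

    def __init__(self, special_symbols: Iterable[str] = ('<pad>', '<eos>', '<unk>')):
        self.sym2int: Dict[str, int] = {}
        # Reserve IDs for special symbols first so they get stable, low indices.
        for sym in special_symbols:
            self.add(sym)

    def add(self, symbol: str) -> int:
        # Assign the next free ID to an unseen symbol; return the existing ID otherwise.
        if symbol not in self.sym2int:
            self.sym2int[symbol] = len(self.sym2int)
        return self.sym2int[symbol]

    @classmethod
    def from_sequences(cls, token_sequences: List[List[str]]) -> 'SymbolTable':
        table = cls()
        for seq in token_sequences:
            for sym in seq:
                table.add(sym)
        return table


# Usage together with the collate_text prototype above:
# table = SymbolTable.from_sequences(token_sequences)
# batch = collate_text(token_sequences, sym2int=table.sym2int, add_eos=True)
```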
-
Would it be of interest to add `feature_lens` and `tokens_lens` fields to the SpeechSynthesis dataset? Otherwise the information is lost after padding (at least in the case of features; the tokens are currently passed as lists of chars without padding). It would be useful for the following things:
Would love to hear your ideas @pzelasko :).
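To make the idea concrete, here is a minimal sketch of how the lengths could be recorded before padding in a collate step. It assumes the features arrive as a list of variable-length `(T, F)` tensors; the function name `collate_with_lens` and the output field names are placeholders, not the current SpeechSynthesis API:

```python
from typing import Dict, List

import torch


def collate_with_lens(
    features: List[torch.Tensor],
    tokens: List[List[str]],
) -> Dict[str, torch.Tensor]:
    # Record the true lengths before any padding happens.
    feature_lens = torch.tensor([f.shape[0] for f in features], dtype=torch.int64)
    tokens_lens = torch.tensor([len(t) for t in tokens], dtype=torch.int64)
    # Pad the features to a common length along the time axis.
    padded_features = torch.nn.utils.rnn.pad_sequence(features, batch_first=True)
    return {
        'features': padded_features,
        'feature_lens': feature_lens,
        'tokens_lens': tokens_lens,
    }
```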