-
Yeah, I believe that is the way to go. Probably also makes sense to have some sort of symbol-to-ID mapping and return int IDs for tokens instead. It should be created via a separate function/class, since it could be useful for other tasks. We'll also need a function that collates the token IDs. I have a prototype in another project but haven't had time to polish it for Lhotse:

```python
from typing import Dict, List, Optional, Union

import torch


def collate_text(
    token_sequences: List[List[str]],
    sym2int: Optional[Dict[str, int]] = None,
    add_eos: bool = False,
    pad_symbol: str = '<pad>',
    eos_symbol: str = '<eos>',
    unk_symbol: str = '<unk>'
) -> Union[List[List[str]], torch.Tensor]:
    # Pad every sequence to the length of the longest one (plus an optional EOS).
    max_len = len(max(token_sequences, key=len))
    seqs = [
        seq + ([eos_symbol] if add_eos else []) + [pad_symbol] * (max_len - len(seq))
        for seq in token_sequences
    ]
    if sym2int is not None:
        # Map symbols to int IDs, falling back to the UNK ID for out-of-vocabulary symbols,
        # and return the batch as an int64 tensor.
        seqs = [
            [sym2int[sym] if sym in sym2int else sym2int[unk_symbol] for sym in seq]
            for seq in seqs
        ]
        return torch.tensor(seqs, dtype=torch.int64)
    # Without a mapping, return the padded symbol sequences as-is.
    return seqs
```

In general, I would like this to eventually be general enough to handle transforms like G2P or tokenization, so that we can use any of graphemes/phonemes/subwords/words, but it is fine to start with supporting just the grapheme case.
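For illustration, a minimal sketch of what such a separate symbol-to-ID mapping could look like; the class name `SymbolTable`, its methods, and the special-symbol defaults are assumptions for this example, not an existing Lhotse API:

```python
from typing import Dict, Iterable, List


class SymbolTable:
    """Hypothetical symbol-to-ID mapping built from token sequences."""

    def __init__(self, special_symbols: Iterable[str] = ('<pad>', '<eos>', '<unk>')):
        self.sym2int: Dict[str, int] = {}
        # Reserve IDs for special symbols first so they get stable, low indices.
        for sym in special_symbols:
            self.add(sym)

    def add(self, symbol: str) -> int:
        # Assign the next free ID to an unseen symbol; return the existing ID otherwise.
        if symbol not in self.sym2int:
            self.sym2int[symbol] = len(self.sym2int)
        return self.sym2int[symbol]

    @classmethod
    def from_sequences(cls, token_sequences: List[List[str]]) -> 'SymbolTable':
        table = cls()
        for seq in token_sequences:
            for sym in seq:
                table.add(sym)
        return table


# Usage together with the collate_text prototype above:
# table = SymbolTable.from_sequences(token_sequences)
# batch = collate_text(token_sequences, sym2int=table.sym2int, add_eos=True)
```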
-
Would it be of interest to add `feature_lens` and `tokens_lens` fields to the SpeechSynthesis dataset? Otherwise the information is lost after padding (at least in the case of features; the tokens are currently passed as lists of chars without padding). It would be useful for the following things:
Would love to hear your ideas @pzelasko :).
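To make the idea concrete, here is a minimal sketch of how the lengths could be recorded before padding in a collate step. It assumes the features arrive as a list of variable-length `(T, F)` tensors; the function name `collate_with_lens` and the output field names are placeholders, not the current SpeechSynthesis API:

```python
from typing import Dict, List

import torch


def collate_with_lens(
    features: List[torch.Tensor],
    tokens: List[List[str]],
) -> Dict[str, torch.Tensor]:
    # Record the true lengths before any padding happens.
    feature_lens = torch.tensor([f.shape[0] for f in features], dtype=torch.int64)
    tokens_lens = torch.tensor([len(t) for t in tokens], dtype=torch.int64)
    # Pad the features to a common length along the time axis.
    padded_features = torch.nn.utils.rnn.pad_sequence(features, batch_first=True)
    return {
        'features': padded_features,
        'feature_lens': feature_lens,
        'tokens_lens': tokens_lens,
    }
```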