I am working on training speech recognition models for multi-talker recordings. For training, I created simulated mixtures using LibriSpeech utterances as sources (based on the Lhotse simulation workflow). I also have precomputed features for all the source utterances, as well as for the mixtures. I created a dataset (#951) that returns the mixed features and the supervisions for each mixture, and I am able to train a model using the regular icefall recipes (with some changes, of course). Now I want to add an auxiliary objective that needs the features of the source cuts. Since source utterances were sampled randomly for mixing, I cannot access them sequentially. This means I cannot load the manifest in lazy mode, and loading it in eager mode seems to exhaust the CPU memory of the dataloading worker. Note that there may be arbitrarily many source features. Is there a more efficient way of returning the source features in the dataset?
---
We discussed offline, but TL;DR: the most promising solution is to attach a custom array field manifest that contains the concatenated source features together with their offsets in the time dimension, so we wouldn't need to load the source cuts at all.
---
For completeness, here is what I ended up doing based on Piotr's suggestion:

```python
import numpy as np
from lhotse import CutSet, LilcomChunkyWriter

with CutSet.open_writer(manifest_path) as cut_writer, LilcomChunkyWriter(
    storage_path
) as source_feat_writer:
    for cut in mixed_cuts:
        source_feats = []
        source_feat_offsets = []
        cur_offset = 0
        # Sort supervisions so the concatenation order is deterministic.
        for sup in sorted(cut.supervisions, key=lambda s: (s.start, s.speaker)):
            source_cut = source_cuts[sup.id]
            source_feats.append(source_cut.load_features())
            source_feat_offsets.append(cur_offset)
            cur_offset += source_cut.num_frames
        # Store all source features as one concatenated array, attached to
        # the mixed cut as a custom field, together with the frame offsets.
        cut.source_feats = source_feat_writer.store_array(
            cut.id, np.concatenate(source_feats, axis=0)
        )
        cut.source_feat_offsets = source_feat_offsets
        cut_writer.write(cut)
```
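On the loading side, the per-source features can be recovered from the concatenated array by slicing at the stored offsets. A minimal sketch in plain NumPy (`split_source_feats` is a hypothetical helper name, not part of Lhotse; in the dataset, `concat_feats` would come from loading the custom `source_feats` field of a cut):

```python
import numpy as np

def split_source_feats(concat_feats, offsets):
    # offsets[i] is the starting frame of source i; its end is the next
    # offset, or the total number of frames for the last source.
    bounds = list(offsets) + [concat_feats.shape[0]]
    return [concat_feats[bounds[i]:bounds[i + 1]] for i in range(len(offsets))]

# Toy example: 3 sources with 4, 2, and 3 frames of 5-dim features.
feats = [np.random.rand(n, 5) for n in (4, 2, 3)]
concat = np.concatenate(feats, axis=0)
offsets = [0, 4, 6]
parts = split_source_feats(concat, offsets)
```

Each element of `parts` then lines up with one supervision, in the same sorted order used when writing.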