I am working on training speech recognition models for multi-talker recordings. For training, I created simulated mixtures using LibriSpeech utterances as sources (based on the Lhotse simulation workflow). I also have precomputed features for all the source utterances, as well as for the mixtures. I created a dataset (#951) that returns the mixed features and the supervisions for each mixture, and I am able to train a model using the regular icefall recipes (with some changes, of course). Now I want to add an auxiliary objective that needs the features of the source cuts. Since source utterances were sampled randomly for mixing, I cannot access them sequentially. This means I cannot load the manifest in lazy mode, and loading it in eager mode seems to exhaust the CPU memory of the dataloading worker. Note that there may be arbitrarily many source features. Is there a more efficient way of returning the source features in the dataset?
---
We discussed offline, but TL;DR: the most promising solution is to attach a custom array field manifest that contains the concatenated source features together with their offsets in the time dimension, so we wouldn't need to load the source cuts at all.
---
For completeness, here is what I ended up doing based on Piotr's suggestion:

```python
import numpy as np
from lhotse import CutSet, LilcomChunkyWriter

with CutSet.open_writer(manifest_path) as cut_writer, LilcomChunkyWriter(
    storage_path
) as source_feat_writer:
    for cut in mixed_cuts:
        source_feats = []
        source_feat_offsets = []
        cur_offset = 0
        # Sort supervisions so the concatenation order is deterministic.
        for sup in sorted(cut.supervisions, key=lambda s: (s.start, s.speaker)):
            source_cut = source_cuts[sup.id]
            source_feats.append(source_cut.load_features())
            source_feat_offsets.append(cur_offset)
            cur_offset += source_cut.num_frames
        # Store all source features as one concatenated array, attached to
        # the mixed cut as a custom field, together with the frame offsets.
        cut.source_feats = source_feat_writer.store_array(
            cut.id, np.concatenate(source_feats, axis=0)
        )
        cut.source_feat_offsets = source_feat_offsets
        cut_writer.write(cut)
```
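On the loading side, the per-source features can be recovered from the concatenated array by slicing at the stored offsets. A minimal sketch in plain NumPy (`split_source_feats` is a hypothetical helper name, not part of Lhotse; in the dataset, `concat_feats` would come from loading the custom `source_feats` field of a cut):

```python
import numpy as np

def split_source_feats(concat_feats, offsets):
    # offsets[i] is the starting frame of source i; its end is the next
    # offset, or the total number of frames for the last source.
    bounds = list(offsets) + [concat_feats.shape[0]]
    return [concat_feats[bounds[i]:bounds[i + 1]] for i in range(len(offsets))]

# Toy example: 3 sources with 4, 2, and 3 frames of 5-dim features.
feats = [np.random.rand(n, 5) for n in (4, 2, 3)]
concat = np.concatenate(feats, axis=0)
offsets = [0, 4, 6]
parts = split_source_feats(concat, offsets)
```

Each element of `parts` then lines up with one supervision, in the same sorted order used when writing.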