Chunked HDF5 feature storage + minor recording fixes + adjust GigaSpeech recipe #334
This is to solve an issue with GigaSpeech-like data prep: computing features for a large number of short cuts of long recordings can be very wasteful if:
a) the whole recording needs to be read each time; or
b) we are running some data augmentation with sox etc. on the recordings.
On the other hand, the Lhotse storages we had so far didn't allow for both high compression with lilcom and reading selected chunks of data. This PR fixes that by adding a chunked lilcom storage (for now, just for HDF5).
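
To make the idea concrete, here is a minimal sketch of the chunked-lilcom-in-HDF5 concept, written directly against h5py and lilcom rather than the storage classes added in this PR; the chunk size, helper names, and exact on-disk layout are illustrative assumptions, not the PR's actual code.

```python
# Illustrative sketch only -- not the PR's implementation. Assumes the `lilcom`
# package exposes compress()/decompress() on numpy arrays, and stores each
# lilcom-compressed chunk as one element of an HDF5 variable-length dataset.
import h5py
import lilcom
import numpy as np

CHUNK_FRAMES = 100  # hypothetical chunk size (number of feature frames per chunk)


def write_chunked(h5: h5py.File, key: str, feats: np.ndarray) -> None:
    """Store `feats` as independently lilcom-compressed chunks of CHUNK_FRAMES rows."""
    chunks = [
        np.frombuffer(lilcom.compress(feats[i:i + CHUNK_FRAMES]), dtype=np.uint8)
        for i in range(0, feats.shape[0], CHUNK_FRAMES)
    ]
    ds = h5.create_dataset(key, shape=(len(chunks),), dtype=h5py.vlen_dtype(np.dtype("uint8")))
    for i, chunk in enumerate(chunks):
        ds[i] = chunk


def read_chunked(h5: h5py.File, key: str, start: int, end: int) -> np.ndarray:
    """Decompress only the chunks overlapping the frame range [start, end)."""
    first, last = start // CHUNK_FRAMES, (end - 1) // CHUNK_FRAMES
    parts = [lilcom.decompress(h5[key][i].tobytes()) for i in range(first, last + 1)]
    feats = np.concatenate(parts, axis=0)
    offset = first * CHUNK_FRAMES
    return feats[start - offset:end - offset]
```

The point of the chunking is visible in `read_chunked`: a request for a short span of frames only touches and decompresses the few chunks that overlap it, instead of the whole matrix.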
The new storage is somewhat slower at reading (or decompressing -- I haven't profiled it in detail) whole matrices, but faster when we collect chunks. The benchmark below was done on mini-librispeech dev-clean-2, comparing the old and new lilcom HDF5 storage readers.
The difference in storage size is minimal (36M old vs 37M chunked). Not sure whether to make it the default -- not changing that for now.
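
For reference, a hypothetical usage sketch of how the new storage could be selected during feature computation; the writer class name (`ChunkedLilcomHdf5Writer`) and the exact `compute_and_store_features` arguments shown here are assumptions modeled on how the existing Lhotse HDF5 writers are used, not a confirmed API.

```python
# Hypothetical usage -- class name and arguments are assumptions.
from lhotse import CutSet, Fbank
from lhotse.features.io import ChunkedLilcomHdf5Writer

cuts = CutSet.from_file("cuts.jsonl.gz")
cuts = cuts.compute_and_store_features(
    extractor=Fbank(),
    storage_path="feats.h5",
    # Select the chunked storage instead of the plain lilcom HDF5 writer.
    storage_type=ChunkedLilcomHdf5Writer,
)
```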