Chunked HDF5 feature storage + minor recording fixes + adjust GigaSpeech recipe #334
This is to solve an issue with GigaSpeech-like data prep: computing features for a large number of short cuts of long recordings can be very wasteful if:
a) the whole recording needs to be read each time; or
b) we are running some data augmentation with sox etc. on the recordings.
On the other hand, the Lhotse storages we had so far didn't allow for both high compression with lilcom and reading selected chunks of data. This PR fixes that by adding a chunked lilcom storage (for now, just for HDF5).
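
To make the idea concrete, here is a minimal sketch of the chunked-lilcom-in-HDF5 concept, written directly against h5py and lilcom rather than the storage classes added in this PR; the chunk size, helper names, and exact on-disk layout are illustrative assumptions, not the PR's actual code.

```python
# Illustrative sketch only -- not the PR's implementation. Assumes the `lilcom`
# package exposes compress()/decompress() on numpy arrays, and stores each
# lilcom-compressed chunk as one element of an HDF5 variable-length dataset.
import h5py
import lilcom
import numpy as np

CHUNK_FRAMES = 100  # hypothetical chunk size (number of feature frames per chunk)


def write_chunked(h5: h5py.File, key: str, feats: np.ndarray) -> None:
    """Store `feats` as independently lilcom-compressed chunks of CHUNK_FRAMES rows."""
    chunks = [
        np.frombuffer(lilcom.compress(feats[i:i + CHUNK_FRAMES]), dtype=np.uint8)
        for i in range(0, feats.shape[0], CHUNK_FRAMES)
    ]
    ds = h5.create_dataset(key, shape=(len(chunks),), dtype=h5py.vlen_dtype(np.dtype("uint8")))
    for i, chunk in enumerate(chunks):
        ds[i] = chunk


def read_chunked(h5: h5py.File, key: str, start: int, end: int) -> np.ndarray:
    """Decompress only the chunks overlapping the frame range [start, end)."""
    first, last = start // CHUNK_FRAMES, (end - 1) // CHUNK_FRAMES
    parts = [lilcom.decompress(h5[key][i].tobytes()) for i in range(first, last + 1)]
    feats = np.concatenate(parts, axis=0)
    offset = first * CHUNK_FRAMES
    return feats[start - offset:end - offset]
```

The point of the chunking is visible in `read_chunked`: a request for a short span of frames only touches and decompresses the few chunks that overlap it, instead of the whole matrix.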
The new storage is somewhat slower at reading (or decompressing -- I haven't profiled it in detail) whole matrices, but faster when we collect chunks. The benchmark below was done on mini-librispeech dev-clean-2, comparing the old and new lilcom HDF5 storage readers.
The difference in storage size is minimal (36M old vs 37M chunked). Not sure whether to make it the default -- not changing that for now.
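
For reference, a hypothetical usage sketch of how the new storage could be selected during feature computation; the writer class name (`ChunkedLilcomHdf5Writer`) and the exact `compute_and_store_features` arguments shown here are assumptions modeled on how the existing Lhotse HDF5 writers are used, not a confirmed API.

```python
# Hypothetical usage -- class name and arguments are assumptions.
from lhotse import CutSet, Fbank
from lhotse.features.io import ChunkedLilcomHdf5Writer

cuts = CutSet.from_file("cuts.jsonl.gz")
cuts = cuts.compute_and_store_features(
    extractor=Fbank(),
    storage_path="feats.h5",
    # Select the chunked storage instead of the plain lilcom HDF5 writer.
    storage_type=ChunkedLilcomHdf5Writer,
)
```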