Chunked HDF5 feature storage + minor recording fixes + adjust GigaSpeech recipe #334

Merged Jul 13, 2021 (13 commits)

@pzelasko (Collaborator) commented Jul 7, 2021

This addresses an issue with GigaSpeech-like data prep: computing features for many short cuts of long recordings can be very wasteful if:
a) the whole recording needs to be read each time; or
b) we are running some data augmentation with sox etc. on the recordings.

On the other hand, the Lhotse storages we had so far didn't allow for both high compression with lilcom and reading selected chunks of data. This PR fixes that by adding a chunked lilcom storage (for now, just for HDF5).
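
To illustrate the general idea of chunked compressed storage, here is a minimal sketch. It is not the Lhotse implementation: it uses `zlib` as a stand-in for lilcom and a plain Python list as a stand-in for the HDF5 file, and the names `write_chunked`, `read_chunked`, and `CHUNK_SIZE` are hypothetical. Each fixed-size block of frames is compressed independently, so reading a frame range only decompresses the blocks that overlap it.

```python
import array
import zlib

CHUNK_SIZE = 100  # frames per chunk; hypothetical value for illustration


def write_chunked(frames, num_feats, chunk_size=CHUNK_SIZE):
    """Compress a list of frames (each a list of floats) chunk by chunk.

    zlib stands in for lilcom here; the returned list stands in for
    the on-disk (e.g. HDF5) container.
    """
    chunks = []
    for start in range(0, len(frames), chunk_size):
        flat = array.array("f")  # float32 values, row-major
        for frame in frames[start:start + chunk_size]:
            assert len(frame) == num_feats
            flat.extend(frame)
        chunks.append(zlib.compress(flat.tobytes()))
    return chunks


def read_chunked(chunks, num_feats, left, right, chunk_size=CHUNK_SIZE):
    """Return frames [left, right), decompressing only overlapping chunks."""
    if right <= left or not chunks:
        return []
    first = left // chunk_size
    last = min((right - 1) // chunk_size, len(chunks) - 1)
    frames = []
    for idx in range(first, last + 1):
        flat = array.array("f")
        flat.frombytes(zlib.decompress(chunks[idx]))
        for i in range(0, len(flat), num_feats):
            frames.append(list(flat[i:i + num_feats]))
    # Trim the frames that precede `left` in the first chunk and
    # those that follow `right` in the last chunk.
    offset = left - first * chunk_size
    return frames[offset:offset + (right - left)]


if __name__ == "__main__":
    feats = [[float(i), float(i) + 0.5] for i in range(250)]
    stored = write_chunked(feats, num_feats=2)
    # Frames 95..204 span all three chunks; frames 10..19 touch only the first.
    print(len(stored), len(read_chunked(stored, 2, 95, 205)))
```

The trade-off in this sketch is that reading a whole matrix means decompressing every chunk separately, while a short cut out of a long recording touches only one or two chunks.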

It is somewhat slower to read (or decompress, I haven't profiled it in detail) whole matrices, but it's faster when we read only selected chunks. The benchmark below was run on mini-librispeech dev-clean-2, comparing the old and new lilcom HDF5 storage readers.

[image: benchmark of the old vs. chunked lilcom HDF5 storage readers]

The difference in storage size is minimal (36M old vs 37M chunked). Not sure whether to make it default or not -- not changing that for now.

@pzelasko pzelasko added this to the v0.8 milestone Jul 7, 2021
@pzelasko (Collaborator, Author) commented

This seems to be working well -- merging. I might add some other things in follow-up PRs.

@pzelasko pzelasko changed the title from "Chunked HDF5 feature storage + minor recording fixes" to "Chunked HDF5 feature storage + minor recording fixes + adjust GigaSpeech recipe" Jul 13, 2021
@pzelasko pzelasko merged commit 4acf796 into master Jul 13, 2021