Make the hashtable dataset exactly as big as it needs to be #206

Merged
merged 2 commits into from
Sep 27, 2021


asmeurer
Collaborator

Previously the hashtable dataset was always padded with 0s to be
default_chunk_size aligned, but this is no longer necessary now that we write
to it once at the end. The chunk_size resizing is still used for the in-memory
NumPy array.
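
A minimal sketch of the idea described above, not the code from this PR: the in-memory buffer still grows in chunk-sized steps, but only the rows in use are handed over for the single write at the end. `CHUNK_SIZE`, `GrowableTable`, and `to_disk_array` are hypothetical names for illustration, and the sketch uses plain NumPy rather than an HDF5 dataset.

```python
import numpy as np

CHUNK_SIZE = 4  # hypothetical stand-in for default_chunk_size


class GrowableTable:
    """In-memory table that grows in CHUNK_SIZE steps (illustrative only)."""

    def __init__(self):
        self._buf = np.zeros((CHUNK_SIZE, 2), dtype=np.int64)
        self.largest_index = 0  # number of rows actually in use

    def append(self, row):
        if self.largest_index >= self._buf.shape[0]:
            # Grow by one chunk; the new rows are zero-filled padding.
            grown = np.zeros((self._buf.shape[0] + CHUNK_SIZE, 2),
                             dtype=self._buf.dtype)
            grown[:self.largest_index] = self._buf[:self.largest_index]
            self._buf = grown
        self._buf[self.largest_index] = row
        self.largest_index += 1

    def to_disk_array(self):
        # Only the used rows are returned, so the array written at the
        # end is exactly as big as it needs to be.
        return self._buf[:self.largest_index].copy()


table = GrowableTable()
for i in range(5):
    table.append((i, i * 10))

print(table._buf.shape)             # padded in-memory buffer
print(table.to_disk_array().shape)  # exact-size array for the final write
```

With five rows and a chunk size of four, the in-memory buffer holds eight rows while the array prepared for writing holds five, which is where the on-disk savings would come from.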

This makes the hashtables much smaller on disk for datasets without many
chunks.

This keeps the largest_index attribute for backwards compatibility with hash
tables that were created before this change (which will automatically be
resized down the next time they are written to). It also fixes the
documentation of largest_index.
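
The backwards-compatibility point can be illustrated with a small sketch, assuming `largest_index` records how many rows of the stored table are in use. The helper `load_valid_rows` is hypothetical and the project's real reading logic may differ; the point is only that slicing by `largest_index` gives the same result for an old zero-padded table and a new exact-size one.

```python
import numpy as np


def load_valid_rows(stored, largest_index):
    """Return the used rows of a stored table (hypothetical helper).

    Slicing works for both layouts: an old table padded with zeros
    past largest_index, and a new table with exactly largest_index rows.
    """
    return stored[:largest_index]


# Old layout: 3 used rows, padded with zeros up to 8 rows.
old_style = np.zeros((8, 2), dtype=np.int64)
old_style[:3] = [[1, 10], [2, 20], [3, 30]]

# New layout: exactly the 3 used rows.
new_style = old_style[:3].copy()

old_rows = load_valid_rows(old_style, largest_index=3)
new_rows = load_valid_rows(new_style, largest_index=3)

print(np.array_equal(old_rows, new_rows))
```

Under this assumption, a reader never needs to know which layout it is looking at, and rewriting an old table simply drops the padding.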

Fixes #205.

@ArvidJB
Collaborator

ArvidJB commented Sep 24, 2021

Looks good to me. What are the file sizes now for the example code in #205?

@asmeurer
Collaborator Author

Before, data.h5 was 205K; with this PR it is 11K.

@asmeurer asmeurer merged commit 0bc978a into deshaw:master Sep 27, 2021
@ericdatakelly ericdatakelly added this to the September 2021 milestone Sep 29, 2021
@ArvidJB ArvidJB mentioned this pull request Sep 30, 2021
asmeurer added a commit that referenced this pull request Sep 30, 2021
Fix a bug with the hashtable introduced by #206
Development

Successfully merging this pull request may close these issues.

Empty file taking up lots of disk space
3 participants