Make the hashtable dataset exactly as big as it needs to be #206

asmeurer · 2021-09-23T22:52:16Z

Previously the hashtable dataset was always padded with 0s to be
default_chunk_size aligned, but this is no longer necessary now that we write
to it once at the end. The chunk_size resizing is still used for the in-memory
NumPy array.

This makes the hashtables much smaller on disk for datasets without many
chunks.

This keeps the largest_index attribute for backwards compatibility with hash
tables that were created before this change (which will automatically be
resized down the next time they are written to). It also fixes the
documentation of largest_index.

Fixes #205.
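
A minimal sketch of the idea described above, not the project's actual implementation: the in-memory NumPy array still grows in chunk-sized steps, but only the rows in use are handed over for the single write at the end. The names `GrowableTable`, `CHUNK_SIZE`, `append`, and `data_to_write` are hypothetical and chosen for illustration; `CHUNK_SIZE` stands in for default_chunk_size, and the two-column integer layout is an assumption.

```python
import numpy as np

CHUNK_SIZE = 4096  # hypothetical stand-in for default_chunk_size


class GrowableTable:
    """In-memory table that grows in CHUNK_SIZE steps (hypothetical helper)."""

    def __init__(self):
        # The in-memory buffer starts out one chunk long, filled with 0s.
        self.array = np.zeros((CHUNK_SIZE, 2), dtype=np.int64)
        self.largest_index = 0  # number of rows actually in use

    def append(self, row):
        if self.largest_index >= self.array.shape[0]:
            # Grow the in-memory buffer by one more chunk-sized step.
            new_shape = (self.array.shape[0] + CHUNK_SIZE, 2)
            grown = np.zeros(new_shape, dtype=self.array.dtype)
            grown[: self.largest_index] = self.array[: self.largest_index]
            self.array = grown
        self.array[self.largest_index] = row
        self.largest_index += 1

    def data_to_write(self):
        # Only the used rows are returned, so no zero padding reaches disk.
        return self.array[: self.largest_index]


table = GrowableTable()
for i in range(3):
    table.append((i, i * 10))

print(table.array.shape)            # chunk-aligned in-memory buffer
print(table.data_to_write().shape)  # only the rows that would be written
```

With three entries, the in-memory buffer is still a full chunk long, while the slice that would be written holds only three rows. That gap is the padding that no longer needs to be stored on disk.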

ArvidJB · 2021-09-24T13:42:01Z

Looks good to me. What are the file sizes now for the example code in #205?

asmeurer · 2021-09-24T20:03:07Z

Before, data.h5 was 205K; with this PR it is 11K.

Fix a bug with the hashtable introduced by #206

asmeurer mentioned this pull request Sep 23, 2021

Empty file taking up lots of disk space #205

Closed

Fix slicetools test for the latest version of h5py

a92c8f8

asmeurer merged commit 0bc978a into deshaw:master Sep 27, 2021

ericdatakelly added this to the September 2021 milestone Sep 29, 2021

ArvidJB mentioned this pull request Sep 30, 2021

Bug with #206 #208

Closed

asmeurer added a commit that referenced this pull request Sep 30, 2021

Merge pull request #209 from asmeurer/hashtable-fix

6147e9e

Fix a bug with the hashtable introduced by #206
