
Empty file taking up lots of disk space #205

Closed
ArvidJB opened this issue Sep 10, 2021 · 5 comments · Fixed by #206

Comments

@ArvidJB
Collaborator

ArvidJB commented Sep 10, 2021

A versioned .h5 file which contains no data still takes up a lot of disk space. To reproduce this you can create a file like this:

In [1]: import h5py

In [2]: import numpy as np

In [3]: from versioned_hdf5 import VersionedHDF5File

In [4]: with h5py.File('data.h5', 'w') as f:
   ...:     vf = VersionedHDF5File(f)
   ...:     with vf.stage_version('r0') as sv:
   ...:         sv['foo'] = np.array([], dtype='int')
   ...:

which takes about 200KB of disk space:

~> ls -l data.h5
-rw-rw-r-- 1 bessen bessen 210328 Sep 10 09:54 data.h5

It seems this is due to the hash_table; no other dataset actually ends up taking any space:

~> h5ls -v -r data.h5
...
/_version_data/foo/hash_table Dataset {4096/Inf}
    Attribute: largest_index scalar
        Type:      native long
        Data:  0
    Location:  1:10552
    Links:     1
    Chunks:    {4096} 196608 bytes
    Storage:   196608 logical bytes, 196608 allocated bytes, 100.00% utilization
    Type:      struct {
                   "hash"             +0    [32] native unsigned char
                   "shape"            +32   [2] native long
               } 48 bytes
...

Is there a way to make the hash_table take less space?

@ArvidJB
Collaborator Author

ArvidJB commented Sep 14, 2021

Here are some ideas:

  • change the hash table growth algorithm. Right now it allocates in chunk increments; this could be changed to a doubling strategy that starts from a pretty small initial size (see the sketch after this list).
  • enable compression for the hashtable dataset.
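
A rough sketch of what the doubling strategy could look like (a hypothetical helper for illustration only, not the actual versioned_hdf5 code; it assumes ds is an h5py dataset created with maxshape=(None,)):

def append_with_doubling(ds, new_rows, used):
    # Hypothetical sketch of a doubling growth strategy for a resizable HDF5
    # dataset (not the actual versioned_hdf5 implementation). `used` is the
    # number of rows currently occupied.
    needed = used + len(new_rows)
    capacity = ds.shape[0]
    if needed > capacity:
        # Grow by doubling from a small starting size instead of allocating
        # a full 4096-entry chunk up front.
        while capacity < needed:
            capacity = max(1, capacity * 2)
        ds.resize((capacity,))
    ds[used:needed] = new_rows
    return needed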

@asmeurer
Collaborator

Actually I don't think we need to allocate in chunks at all. I think that code is leftover from when we used to write every entry one at a time, but now we batch the write at the end, and I didn't update the pre-allocation code.

We can play around with compression. I doubt it will do much as only the "shape" part of the data can be compressed. The "hash" is effectively random data so will be incompressible.
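
For reference, a minimal sketch of what enabling compression on a hash_table-like dataset might look like, assuming the struct dtype shown in the h5ls output above (this is not the library's actual creation code):

import h5py
import numpy as np

# dtype mirroring the struct shown by h5ls: a 32-byte hash plus a 2-element shape
hash_dtype = np.dtype([('hash', 'u1', (32,)), ('shape', '<i8', (2,))])

with h5py.File('example.h5', 'w') as f:
    f.create_dataset(
        'hash_table',
        shape=(4096,),
        maxshape=(None,),
        chunks=(4096,),
        dtype=hash_dtype,
        compression='gzip',  # each chunk gets gzip-compressed on disk
    )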

@ArvidJB
Collaborator Author

ArvidJB commented Sep 14, 2021

I agree that the hash is not compressible, but if the hashtable "load" is very low then almost all of the hashes will be all-zero byte strings of length 32. All the unoccupied spots should compress very well!
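
As a quick standalone check (not using versioned_hdf5), a table with a single occupied slot out of 4096 is almost entirely zeros and compresses to a tiny fraction of its raw 196608 bytes:

import zlib
import numpy as np

hash_dtype = np.dtype([('hash', 'u1', (32,)), ('shape', '<i8', (2,))])
table = np.zeros(4096, dtype=hash_dtype)                       # 4096 * 48 = 196608 raw bytes
table[0]['hash'] = np.random.randint(0, 256, 32, dtype='u1')   # one occupied slot
raw = table.tobytes()
print(len(raw), len(zlib.compress(raw)))                       # mostly-zero data compresses very well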

asmeurer added a commit to asmeurer/versioned-hdf5 that referenced this issue Sep 23, 2021
Previously it was always filled with 0s to be default_chunk_size aligned, but
this is no longer necessary now that we write to it once at the end. The
chunk_size resizing is still used for the in-memory NumPy array.

This makes the hashtables much smaller on disk for datasets without many
chunks.

This keeps the largest_index attribute for backwards compatibility with hash
tables that were created before this change (which will automatically be
resized down the next time they are written to). It also fixes the
documentation of largest_index.

Fixes deshaw#205.
@asmeurer
Collaborator

Fix at #206

I did not enable compression there, as the issue is basically fixed (your example dataset now takes only 11 KB). Enabling it does save a little more space for larger datasets where the hash table contains actual data, but I didn't check how that saving compares to the potential performance penalty of compression.

@ArvidJB
Collaborator Author

ArvidJB commented Sep 27, 2021

Can we merge #206 and cut a release?
