
Empty file taking up lots of disk space #205

Closed
ArvidJB opened this issue Sep 10, 2021 · 5 comments · Fixed by #206

Comments

@ArvidJB
Collaborator

ArvidJB commented Sep 10, 2021

A versioned .h5 file which contains no data still takes up a lot of disk space. To reproduce this you can create a file like this:

In [1]: import h5py

In [2]: import numpy as np

In [3]: from versioned_hdf5 import VersionedHDF5File

In [4]: with h5py.File('data.h5', 'w') as f:
   ...:     vf = VersionedHDF5File(f)
   ...:     with vf.stage_version('r0') as sv:
   ...:         sv['foo'] = np.array([], dtype='int')
   ...:

which takes about 200KB of disk space:

~> ls -l data.h5
-rw-rw-r-- 1 bessen bessen 210328 Sep 10 09:54 data.h5

It seems this is due to the hash_table; no other dataset actually ends up taking any space:

~> h5ls -v -r data.h5
...
/_version_data/foo/hash_table Dataset {4096/Inf}
    Attribute: largest_index scalar
        Type:      native long
        Data:  0
    Location:  1:10552
    Links:     1
    Chunks:    {4096} 196608 bytes
    Storage:   196608 logical bytes, 196608 allocated bytes, 100.00% utilization
    Type:      struct {
                   "hash"             +0    [32] native unsigned char
                   "shape"            +32   [2] native long
               } 48 bytes
...

Is there a way to make the hash_table take less space?

@ArvidJB
Collaborator Author

ArvidJB commented Sep 14, 2021

Here are some ideas:

  • change the hash table growth algorithm. Right now it allocates in chunk increments; this could be changed to a doubling strategy that starts from a pretty small initial size (see the sketch after this list).
  • enable compression for the hashtable dataset.
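
A rough sketch of what the doubling strategy could look like (a hypothetical helper for illustration only, not the actual versioned_hdf5 code; it assumes ds is an h5py dataset created with maxshape=(None,)):

def append_with_doubling(ds, new_rows, used):
    # Hypothetical sketch of a doubling growth strategy for a resizable HDF5
    # dataset (not the actual versioned_hdf5 implementation). `used` is the
    # number of rows currently occupied.
    needed = used + len(new_rows)
    capacity = ds.shape[0]
    if needed > capacity:
        # Grow by doubling from a small starting size instead of allocating
        # a full 4096-entry chunk up front.
        while capacity < needed:
            capacity = max(1, capacity * 2)
        ds.resize((capacity,))
    ds[used:needed] = new_rows
    return needed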

@asmeurer
Collaborator

Actually I don't think we need to allocate in chunks at all. I think that code is leftover from when we used to write every entry one at a time, but now we batch the write at the end, and I didn't update the pre-allocation code.

We can play around with compression. I doubt it will do much as only the "shape" part of the data can be compressed. The "hash" is effectively random data so will be incompressible.
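
For reference, a minimal sketch of what enabling compression on a hash_table-like dataset might look like, assuming the struct dtype shown in the h5ls output above (this is not the library's actual creation code):

import h5py
import numpy as np

# dtype mirroring the struct shown by h5ls: a 32-byte hash plus a 2-element shape
hash_dtype = np.dtype([('hash', 'u1', (32,)), ('shape', '<i8', (2,))])

with h5py.File('example.h5', 'w') as f:
    f.create_dataset(
        'hash_table',
        shape=(4096,),
        maxshape=(None,),
        chunks=(4096,),
        dtype=hash_dtype,
        compression='gzip',  # each chunk gets gzip-compressed on disk
    )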

@ArvidJB
Collaborator Author

ArvidJB commented Sep 14, 2021

I agree that the hash is not compressible, but if the hashtable "load" is very low then almost all of the hashes will be all-zero byte strings of length 32. All the unoccupied spots should compress very well!
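
As a quick standalone check (not using versioned_hdf5), a table with a single occupied slot out of 4096 is almost entirely zeros and compresses to a tiny fraction of its raw 196608 bytes:

import zlib
import numpy as np

hash_dtype = np.dtype([('hash', 'u1', (32,)), ('shape', '<i8', (2,))])
table = np.zeros(4096, dtype=hash_dtype)                       # 4096 * 48 = 196608 raw bytes
table[0]['hash'] = np.random.randint(0, 256, 32, dtype='u1')   # one occupied slot
raw = table.tobytes()
print(len(raw), len(zlib.compress(raw)))                       # mostly-zero data compresses very well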

asmeurer added a commit to asmeurer/versioned-hdf5 that referenced this issue Sep 23, 2021
Previously it was always filled with 0s to be default_chunk_size aligned, but
this is no longer necessary now that we write to it once at the end. The
chunk_size resizing is still used for the in-memory NumPy array.

This makes the hashtables much smaller on disk for datasets without many
chunks.

This keeps the largest_index attribute for backwards compatibility with hash
tables that were created before this change (which will automatically be
resized down the next time they are written to). It also fixes the
documentation of largest_index.

Fixes deshaw#205.
@asmeurer
Collaborator

Fix at #206

I did not enable compression there, as the issue is basically fixed (your example dataset now takes only 11 KB). Enabling it does save a little more space for larger datasets where the hash table contains actual data, but I didn't check how that saving compares to the potential performance penalty of compression.

@ArvidJB
Collaborator Author

ArvidJB commented Sep 27, 2021

Can we merge #206 and cut a release?
