Empty file taking up lots of disk space #205
Comments
Here are some ideas:
Actually, I don't think we need to allocate in chunks at all. I think that code is leftover from when we used to write every entry one at a time; now we batch the write at the end, and I didn't update the pre-allocation code. We can play around with compression. I doubt it will do much, as only the "shape" part of the data can be compressed; the "hash" is effectively random data, so it will be incompressible.
I agree that the hash is not compressible, but if the hashtable "load" is very low, then almost all of the hash slots will be 32-byte strings of zeros. All the unoccupied spots should compress very well!
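As a standalone illustration of that point (the slot count and load factor below are made up, and zlib stands in for HDF5's gzip filter):

```python
import zlib

import numpy as np

rng = np.random.default_rng(0)
n_slots, occupied = 10_000, 50  # hypothetical: a very low-load table

# Each slot holds a 32-byte hash; unoccupied slots are all zeros.
table = np.zeros((n_slots, 32), dtype=np.uint8)
table[:occupied] = rng.integers(0, 256, (occupied, 32), dtype=np.uint8)

raw = table.tobytes()
print(len(raw))                 # 320000 bytes uncompressed
print(len(zlib.compress(raw)))  # a few KB: the long zero runs compress very well

# By contrast, a fully occupied table of random hashes barely compresses.
random_table = rng.integers(0, 256, (n_slots, 32), dtype=np.uint8)
print(len(zlib.compress(random_table.tobytes())))  # ~320 KB: incompressible
```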
Previously the hash_table dataset was always filled with 0s to be default_chunk_size aligned, but this is no longer necessary now that we write to it once at the end. The chunk_size resizing is still used for the in-memory NumPy array. This makes the hashtables much smaller on disk for datasets without many chunks. The largest_index attribute is kept for backwards compatibility with hash tables that were created before this change (they will automatically be resized down the next time they are written to). It also fixes the documentation of largest_index. Fixes deshaw#205.
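In h5py terms, the described behavior might look roughly like this (an illustrative sketch, not the library's actual code; flush_hash_table and its arguments are made-up names):

```python
import h5py
import numpy as np

CHUNK_SIZE = 4096  # the in-memory array still grows in chunk-sized steps

def flush_hash_table(f: h5py.File, table: np.ndarray, largest_index: int) -> None:
    """Write the (possibly over-allocated) in-memory table in one batched write,
    keeping only the occupied prefix on disk instead of zero-padding it out to
    a multiple of the chunk size."""
    if "hash_table" in f:
        ds = f["hash_table"]
        ds.resize((largest_index,) + table.shape[1:])  # older padded tables shrink here
    else:
        ds = f.create_dataset(
            "hash_table",
            shape=(largest_index,) + table.shape[1:],
            maxshape=(None,) + table.shape[1:],
            dtype=table.dtype,
            chunks=(CHUNK_SIZE,) + table.shape[1:],
        )
    ds[:] = table[:largest_index]
    ds.attrs["largest_index"] = largest_index  # kept for backwards compatibility
```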
Fix at #206. I did not enable compression there, as the issue is basically fixed (your example dataset now takes only 11 KB). Enabling it does save a little more space for larger datasets where there is actual data in the hashtable, but I didn't check how that saving compares to the potential performance penalty of compression.
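For reference, enabling it would amount to passing HDF5's gzip filter when the hash_table dataset is created, along these lines (illustrative shapes and file name):

```python
import h5py
import numpy as np

with h5py.File("example.h5", "a") as f:
    ds = f.create_dataset(
        "hash_table",
        shape=(1024, 32),
        maxshape=(None, 32),
        dtype=np.uint8,
        chunks=(1024, 32),
        compression="gzip",  # per-chunk DEFLATE; mostly helps zero-filled slack
        shuffle=True,        # byte-shuffle filter, often improves the ratio
    )
```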
Can we merge #206 and cut a release?
Original issue
A versioned .h5 file which contains no data still takes up a lot of disk space. To reproduce this, you can create a file like this:
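(A minimal sketch along those lines, assuming the public versioned-hdf5 API, VersionedHDF5File and stage_version; the file, version, and dataset names and the zero-filled data are illustrative, not the original snippet:)

```python
import os

import h5py
import numpy as np
from versioned_hdf5 import VersionedHDF5File

# Create a versioned file whose only dataset holds essentially no data.
with h5py.File("empty.h5", "w") as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version("r0") as sv:
        sv.create_dataset("values", data=np.zeros(10))

print(os.path.getsize("empty.h5"))  # total bytes on disk
```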
This takes about 200 KB of disk space. It seems this is due to the hash_table; every other dataset does not actually end up taking any space.
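(That breakdown can be checked with something like the following h5py sketch; get_storage_size reports the bytes actually allocated on disk for each dataset, and the file name matches the example above:)

```python
import h5py

with h5py.File("empty.h5", "r") as f:
    def report(name, obj):
        if isinstance(obj, h5py.Dataset):
            # Actual bytes allocated on disk for this dataset's data.
            print(name, obj.id.get_storage_size())
    f.visititems(report)
```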
Is there a way to make the hash_table take less space?