Add fch5 an hdf5-based format for storing framecaches #728
Hello @ChristosT! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2024-11-13 23:27:13 UTC
Force-pushed from b6dee58 to 4cd1453.
hexrd/imageseries/save.py
h5f["nframes"] = nframes
h5f["dtype"] = str(self._ims.dtype).encode()
metadata = h5f.create_group("metadata")
unwrap_dict_to_h5(metadata, self._process_meta())
Suggested change:
- unwrap_dict_to_h5(metadata, self._process_meta())
+ unwrap_dict_to_h5(metadata, self._meta)
Can you just do this instead? I think we don't need self._process_meta() for the fch5 format.
We should also verify that this saves nested metadata correctly!
Gives access to a plethora of compression algorithms for hdf5 files
.fch5 is an hdf5-based format for saving framecaches, comprised mainly of 3 datasets:
- 'data': (m,1) array holding the datavalues of all frames. `m` is evaluated at runtime.
- 'indices': (m,2) array holding the row and column information for the values in 'data'. Together, 'data' and 'indices' represent the data in a coordinate-style (row, column, value) sparse-matrix layout.
- 'frame_ids': (2*nframes) array holding the range that the i-th frame occupies in the above arrays, i.e. the information of the i-th frame can be accessed using data_i = data[frame_ids[2*i]:frame_ids[2*i+1]] and indices_i = indices[frame_ids[2*i]:frame_ids[2*i+1]].
Force-pushed from 4cd1453 to a4968ce.
tests/imageseries/test_formats.py
def test_fmtfc_nested_metadata(self):
    """frame-cache format with nested metadata"""
    metadata = {'int': 1, 'array': np.array([1, 2, 3])}
    self.is_a.metadata["key"] = metadata

    imageseries.write(self.is_a, self.fcfile, self.fmt, style=self.style,
                      threshold=self.thresh, cache_file=self.cache_file)
    is_fc = imageseries.open(self.fcfile, self.fmt, style=self.style)
    self.assertTrue(compare_meta(self.is_a, is_fc))
Added test for checking nested metadata support in the new format.
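The nested-metadata round trip that the test above checks can be illustrated without HDF5 at all. The sketch below is not hexrd's `unwrap_dict_to_h5`; `flatten_metadata` and `unflatten_metadata` are hypothetical helpers that only show how a nested dictionary maps onto group-style paths (as an HDF5 group hierarchy would) and back. A plain list stands in for the NumPy array used in the test.

```python
def flatten_metadata(meta, prefix=""):
    """Flatten a nested dict into {"group/subgroup/key": value} pairs.

    Hypothetical helper for illustration; keys are assumed not to contain "/".
    """
    flat = {}
    for key, value in meta.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            # A nested dict plays the role of an HDF5 group.
            flat.update(flatten_metadata(value, path))
        else:
            # Any other value plays the role of a dataset.
            flat[path] = value
    return flat


def unflatten_metadata(flat):
    """Rebuild the nested dict from the path-keyed mapping."""
    nested = {}
    for path, value in flat.items():
        *groups, leaf = path.split("/")
        node = nested
        for group in groups:
            node = node.setdefault(group, {})
        node[leaf] = value
    return nested


# Same structure as in the test: metadata stored under a top-level "key".
meta = {"key": {"int": 1, "array": [1, 2, 3]}}
flat = flatten_metadata(meta)
restored = unflatten_metadata(flat)
```

Here `flat` has the entries `"key/int"` and `"key/array"`, and `restored` compares equal to `meta`, which is the property the test asserts for the real writer and reader.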
Force-pushed from e16a42b to 6137c15.
Force-pushed from 6137c15 to 01e6bbe.
This is tested with the gui and works as expected.
Looks good to me!
The new format allows storing more complex datasets, including metadata at multiple levels.
The new fch5 file contains the following fields:
- shape, nframes, dtype: similar to before, storing shape as a tuple, nframes as an integer, and the dtype of the imageseries as an encoded string.
- data: an (m,1) array of type dtype holding the datavalues of all frames. m is evaluated at runtime based on the threshold value.
- indices: an (m,2) array of type np.uint16 holding the row and column indices of the datavalues for each frame.
- frame_ids: a (2*nframes,1) array holding the range that the i-th frame occupies in the above, i.e. the information of the i-th frame can be accessed using data_i = data[frame_ids[2*i]:frame_ids[2*i+1]] and indices_i = indices[frame_ids[2*i]:frame_ids[2*i+1]].
Both the reader and the writer are multi-threaded and perform better in terms of space efficiency, speed and memory consumption during writes.
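As a minimal sketch of the `frame_ids` indexing described above, the following rebuilds a dense frame from the three arrays. Plain Python lists stand in for the HDF5 datasets, and `frame_slice` and `dense_frame` are hypothetical names chosen for illustration; they are not part of hexrd.

```python
def frame_slice(frame_ids, i):
    """Return the (start, stop) range of frame i in `data` and `indices`."""
    return frame_ids[2 * i], frame_ids[2 * i + 1]


def dense_frame(data, indices, frame_ids, i, shape, fill=0):
    """Rebuild frame i as a dense list of rows from the sparse arrays."""
    start, stop = frame_slice(frame_ids, i)
    frame = [[fill] * shape[1] for _ in range(shape[0])]
    for value, (row, col) in zip(data[start:stop], indices[start:stop]):
        frame[row][col] = value
    return frame


# Two 2x3 frames: frame 0 has two stored values, frame 1 has three.
shape = (2, 3)
data = [7, 9, 4, 5, 6]
indices = [(0, 1), (1, 2), (0, 0), (1, 0), (1, 1)]
frame_ids = [0, 2, 2, 5]  # [start_0, stop_0, start_1, stop_1]

frame0 = dense_frame(data, indices, frame_ids, 0, shape)
frame1 = dense_frame(data, indices, frame_ids, 1, shape)
```

Because each frame only needs its own slice of `data` and `indices`, frames can be read independently of one another, which fits the multi-threaded reader mentioned above.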
Improvements:
Maximum memory usage (as reported by GNU time) when saving a raw 7.2 GB file:
- npz format: 18.63 GB
- fch5 format: 1.8 GB
In terms of space efficiency, npz uses zip. For fch5 we bring in the hdf5plugin package and chose the blosc compression filter with zstd as the underlying compression algorithm and 5 as the compression level. The choice of the compression configuration was based on trading off between size and speed while being competitive against the current npz writer.
This results in the following sizes for the same raw file:
and their corresponding saving time:
Note that similar to the npz case we store datavalues in their native format so they can be either integers or floats.
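For reference, the Blosc/zstd level 5 configuration described above could be applied to a dataset roughly as follows. This is a sketch under stated assumptions, not the code from this PR: `BLOSC_ZSTD` and `write_compressed` are hypothetical names, the shuffle setting is left at the hdf5plugin default because the description does not mention it, and h5py and hdf5plugin are assumed to be installed.

```python
# Settings taken from the description: Blosc filter, zstd codec, level 5.
BLOSC_ZSTD = {"cname": "zstd", "clevel": 5}


def write_compressed(path, name, array, options=BLOSC_ZSTD):
    """Write `array` to dataset `name` in a new HDF5 file with Blosc compression."""
    # Imported lazily: h5py and hdf5plugin are third-party packages.
    import h5py
    import hdf5plugin

    # hdf5plugin filter objects unpack into create_dataset keyword arguments.
    blosc_filter = hdf5plugin.Blosc(**options)
    with h5py.File(path, "w") as h5f:
        h5f.create_dataset(name, data=array, **blosc_filter)
```

A call such as `write_compressed("frames.h5", "data", values)` would write one compressed dataset. Reading it back with h5py requires `import hdf5plugin` in the reading process so that the Blosc filter is registered.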