Add fch5 an hdf5-based format for storing framecaches #728

ChristosT · 2024-11-08T23:57:52Z

The new format allows storing more complex datasets, including metadata at multiple levels.

The new fch5 file contains the following fields:

shape, nframes, dtype: as before, storing shape as a tuple, nframes as an integer, and the dtype of the imageseries as an encoded string.
data: an (m,1) array of type dtype holding the data values of all frames. m is determined at runtime based on the threshold value.
indices: an (m,2) array of type np.uint16 holding the row and column indices of the data values for each frame.
frame_ids: a (2*nframes,1) array holding the range that the i-th frame occupies in the arrays above, i.e. the information of the i-th frame can be accessed using:

data_i = data[frame_ids[2*i]:frame_ids[2*i+1]]
indices_i = indices[frame_ids[2*i]:frame_ids[2*i+1]]
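To make the layout concrete, here is a minimal NumPy-only sketch of how frames could be packed into the three arrays and how a single frame could be restored from them. The helper names `pack_frames` and `unpack_frame` are hypothetical and not part of hexrd. The strict `>` threshold comparison and the flat 1-D `frame_ids` (used so its entries can serve directly as slice bounds) are assumptions made for illustration.

```python
import numpy as np


def pack_frames(frames, threshold):
    """Pack 2-D frames into flat data / indices / frame_ids arrays.

    Hypothetical helper: keeps only values strictly above `threshold`
    (the exact comparison is an assumption).
    """
    data_parts, index_parts = [], []
    # Flat array of [start_0, stop_0, start_1, stop_1, ...]
    frame_ids = np.zeros(2 * len(frames), dtype=np.int64)
    offset = 0
    for i, frame in enumerate(frames):
        rows, cols = np.nonzero(frame > threshold)
        data_parts.append(frame[rows, cols].reshape(-1, 1))
        index_parts.append(np.column_stack((rows, cols)).astype(np.uint16))
        frame_ids[2 * i] = offset
        offset += len(rows)
        frame_ids[2 * i + 1] = offset
    return np.concatenate(data_parts), np.concatenate(index_parts), frame_ids


def unpack_frame(data, indices, frame_ids, i, shape):
    """Rebuild the dense i-th frame from the packed arrays."""
    start, stop = frame_ids[2 * i], frame_ids[2 * i + 1]
    data_i = data[start:stop]
    indices_i = indices[start:stop]
    frame = np.zeros(shape, dtype=data.dtype)
    frame[indices_i[:, 0], indices_i[:, 1]] = data_i[:, 0]
    return frame


frames = [
    np.array([[0, 5, 0], [7, 0, 0]]),
    np.array([[0, 0, 9], [0, 0, 0]]),
]
data, indices, frame_ids = pack_frames(frames, threshold=1)
restored = unpack_frame(data, indices, frame_ids, 1, frames[0].shape)
```

Because each frame occupies a contiguous slice of `data` and `indices`, a frame can be read without touching any other frame's values.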

Both the reader and the writer are multi-threaded, and the new format outperforms npz in space efficiency, speed, and memory consumption during writes.

Improvements:

memory consumption: During saving we write the framecache of each frame as soon as it is ready instead of holding everything in memory. This significantly reduces the amount of memory required while processing a dataset for writing:

Maximum memory usage (as reported by GNU time) when saving a raw 7.2 GB file:

npz format: 18.63 GB
fch5 format: 1.8 GB
space efficiency: npz uses zip. For fch5 we bring in the hdf5plugin package and choose the Blosc compression filter with zstd as the underlying compression algorithm and 5 as the compression level. This configuration was chosen as a trade-off between size and speed while remaining competitive with the current npz writer.

This results in the following sizes for the same raw file:

npz: 816 MB
fch5: 661 MB

and the corresponding saving times:

npz: 177.43 s
fch5: 106.45 s

Note that, as in the npz case, we store data values in their native format, so they can be either integers or floats.

pep8speaks · 2024-11-08T23:57:59Z

Hello @ChristosT! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2024-11-13 23:27:13 UTC

ChristosT · 2024-11-11T15:38:52Z

@psavery

hexrd/imageseries/load/framecache.py

setup.py

psavery · 2024-11-11T22:30:00Z

hexrd/imageseries/save.py

+            h5f["nframes"] = nframes
+            h5f["dtype"] = str(self._ims.dtype).encode()
+            metadata = h5f.create_group("metadata")
+            unwrap_dict_to_h5(metadata, self._process_meta())


Suggested change

-            unwrap_dict_to_h5(metadata, self._process_meta())
+            unwrap_dict_to_h5(metadata, self._meta)

Can you just do this instead? I think we don't need self._process_meta() for the fch5 format.

We should also verify that this saves nested metadata correctly!

hexrd/imageseries/save.py

Gives access to a plethora of compression algorithms for hdf5 files

.fch5 is an hdf5-based format for saving framecaches, comprising mainly 3 datasets:
 - 'data': (m,1) array holding the data values of all frames. `m` is determined at runtime.
 - 'indices': (m,2) array holding the row & col information for the values in data. 'data' together with 'indices' represent the data using the CSR format for sparse matrices.
 - 'frame_ids': (2*nframes) holds the range that the i-th frame occupies in the above arrays, i.e. the information of the i-th frame can be accessed using:
     data_i = data[frame_ids[2*i]:frame_ids[2*i+1]]
     indices_i = indices[frame_ids[2*i]:frame_ids[2*i+1]]

ChristosT · 2024-11-12T14:41:18Z

tests/imageseries/test_formats.py

+
+    def test_fmtfc_nested_metadata(self):
+        """frame-cache format with nested metadata"""
+        metadata = { 'int': 1, 'array': np.array([1,2,3])}
+        self.is_a.metadata["key"] = metadata
+
+        imageseries.write(self.is_a, self.fcfile, self.fmt, style=self.style,
+            threshold=self.thresh, cache_file=self.cache_file
+        )
+        is_fc = imageseries.open(self.fcfile, self.fmt, style=self.style)
+        self.assertTrue(compare_meta(self.is_a, is_fc))


Added test for checking nested metadata support in the new format.

ChristosT · 2024-11-13T23:41:22Z

This is tested with the gui and works as expected.

psavery

Looks good to me!

Add new optional style argument in FrameCache writer

4bb4280

ChristosT force-pushed the add-hdf5-support branch 5 times, most recently from b6dee58 to 4cd1453 Compare November 11, 2024 14:43

ChristosT changed the title ~~Draft: Add fch5 an hdf5-based format for storing framecaches~~ Add fch5 an hdf5-based format for storing framecaches Nov 11, 2024

psavery reviewed Nov 11, 2024

ChristosT added 4 commits November 12, 2024 09:23

Add hdf5plugin to requirements and conda meta.yml

43f23ad

Gives access to a plethora of compression algorithms for hdf5 files

fch5: add parallel reader

ab11ca4

fch5: Add tests

a4968ce

ChristosT force-pushed the add-hdf5-support branch from 4cd1453 to a4968ce Compare November 12, 2024 14:24

ChristosT commented Nov 12, 2024

ChristosT force-pushed the add-hdf5-support branch from e16a42b to 6137c15 Compare November 12, 2024 14:47

fch5: Add test with nested metadata

01e6bbe

ChristosT force-pushed the add-hdf5-support branch from 6137c15 to 01e6bbe Compare November 12, 2024 14:48

fch5: save dtype as encoded string

301a673

ChristosT mentioned this pull request Nov 13, 2024

Read fch5 files as framecache HEXRD/hexrdgui#1759

Merged

psavery approved these changes Nov 14, 2024

ChristosT merged commit 9c595e1 into HEXRD:master Nov 14, 2024
6 checks passed

ChristosT deleted the add-hdf5-support branch November 14, 2024 15:10

psavery mentioned this pull request Nov 26, 2024

Fix sparsing from eiger stream files #709

Closed
