Add fch5 an hdf5-based format for storing framecaches #728

Merged
merged 7 commits into from
Nov 14, 2024

Conversation

ChristosT
Collaborator

@ChristosT ChristosT commented Nov 8, 2024

The new format can store more complex datasets, including metadata nested at multiple levels.

The new fch5 file contains the following fields:

  • shape, nframes, dtype: as before, shape is stored as a tuple, nframes as an integer, and the dtype of the imageseries as an encoded string.
  • data: an (m,1) array of type dtype holding the data values of all frames. m is determined at runtime and depends on the threshold value.
  • indices: an (m,2) array of type np.uint16 holding the row and column indices of the data values for each frame.
  • frame_ids: a (2*nframes,1) array holding the range that the i-th frame occupies in the two arrays above, i.e. the information of the i-th frame can be accessed using:
data_i = data[frame_ids[2*i]:frame_ids[2*i+1]]
indices_i = indices[frame_ids[2*i]:frame_ids[2*i+1]]
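
As a rough illustration of this layout, the sketch below rebuilds one dense frame from `data`, `indices` and `frame_ids`. Plain NumPy arrays stand in for the HDF5 datasets, and the helper name `frame_from_sparse` and the toy values are assumptions made for this example, not code from the PR.

```python
import numpy as np


def frame_from_sparse(i, shape, dtype, data, indices, frame_ids):
    """Rebuild dense frame i from the flat sparse arrays (hypothetical helper)."""
    ids = np.ravel(frame_ids)  # accept (2*nframes, 1) or (2*nframes,)
    start, stop = int(ids[2 * i]), int(ids[2 * i + 1])
    frame = np.zeros(shape, dtype=dtype)
    rows = indices[start:stop, 0]
    cols = indices[start:stop, 1]
    # Scatter the stored values back to their (row, col) positions.
    frame[rows, cols] = np.ravel(data[start:stop])
    return frame


# Toy dataset: two 3x4 frames holding 2 and 1 stored values respectively.
shape, dtype = (3, 4), np.uint16
data = np.array([[7], [9], [5]], dtype=dtype)                    # (m, 1)
indices = np.array([[0, 1], [2, 3], [1, 0]], dtype=np.uint16)    # (m, 2)
frame_ids = np.array([[0], [2], [2], [3]])                       # (2*nframes, 1)

frame0 = frame_from_sparse(0, shape, dtype, data, indices, frame_ids)
frame1 = frame_from_sparse(1, shape, dtype, data, indices, frame_ids)
```

Pixels that were not stored come back as zero, so the reconstruction is only exact for values that were kept when the file was written.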

Both the reader and the writer are multi-threaded, and the new format improves on npz in space efficiency, speed, and memory consumption during writes.

Improvements:

  • memory consumption: during saving, we write the framecache of each frame as soon as it is ready instead of holding everything in memory. This significantly reduces the memory required while processing a dataset for writing.

Maximum memory usage (as reported by GNU time) when saving a raw 7.2 GB file:

  • npz format: 18.63 GB
  • fch5 format: 1.8 GB
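
The idea of writing each frame as soon as it is ready can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's writer: a flat binary file stands in for a resizable HDF5 dataset, only the data values are stored (not the indices), and the names `stream_frames_to_file` and `lazy_frames` are hypothetical.

```python
import os
import tempfile

import numpy as np


def stream_frames_to_file(frames, threshold, path):
    """Append each frame's above-threshold values to `path` as they arrive.

    Only the per-frame offsets stay in memory; the values themselves are
    flushed to disk frame by frame (assumption: flat file instead of HDF5).
    """
    frame_ids = []
    offset = 0
    with open(path, "wb") as fh:
        for frame in frames:  # `frames` may be a lazy generator
            values = frame[frame > threshold]
            values.tofile(fh)
            frame_ids.extend([offset, offset + values.size])
            offset += values.size
    return np.asarray(frame_ids)


def lazy_frames(nframes, shape, seed=0):
    """Yield synthetic frames one at a time instead of building a full stack."""
    rng = np.random.default_rng(seed)
    for _ in range(nframes):
        yield rng.integers(0, 100, size=shape, dtype=np.uint16)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    frame_ids = stream_frames_to_file(lazy_frames(4, (64, 64)), 90, path)
    stored = np.fromfile(path, dtype=np.uint16)
```

Because the loop never holds more than one frame plus the offset list, peak memory in this sketch scales with the frame size rather than with the number of frames.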

  • space efficiency: npz uses zip compression. For fch5 we bring in the hdf5plugin package and use the blosc compression filter with zstd as the underlying compression algorithm and a compression level of 5. This configuration was chosen as a trade-off between size and speed that stays competitive with the current npz writer.

This results in the following sizes for the same raw file:

  • npz: 816 MB
  • fch5: 661 MB

and the corresponding saving times:

  • npz: 177.43 s
  • fch5: 106.45 s
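
For a quick sense of scale, the quoted figures work out to roughly a tenfold reduction in peak memory, a file about a fifth smaller, and about 40% less saving time. The short calculation below only restates the numbers given in this description; the dictionary keys are illustrative names.

```python
# Figures quoted in this PR description for the same raw file.
npz = {"peak_mem": 18.63, "size": 816, "save_time": 177.43}
fch5 = {"peak_mem": 1.8, "size": 661, "save_time": 106.45}

# Ratios are unit-free, so only matching units per metric are required.
mem_ratio = npz["peak_mem"] / fch5["peak_mem"]
size_saving = 1 - fch5["size"] / npz["size"]
time_saving = 1 - fch5["save_time"] / npz["save_time"]

print(f"peak memory: {mem_ratio:.2f}x lower")
print(f"file size:   {size_saving:.1%} smaller")
print(f"save time:   {time_saving:.1%} shorter")
```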

Note that, as in the npz case, we store data values in their native format, so they can be either integers or floats.

@pep8speaks

pep8speaks commented Nov 8, 2024

Hello @ChristosT! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2024-11-13 23:27:13 UTC

@ChristosT ChristosT force-pushed the add-hdf5-support branch 5 times, most recently from b6dee58 to 4cd1453 Compare November 11, 2024 14:43
@ChristosT ChristosT changed the title Draft: Add fch5 an hdf5-based format for storing framecaches Add fch5 an hdf5-based format for storing framecaches Nov 11, 2024
@ChristosT
Collaborator Author

@psavery

h5f["nframes"] = nframes
h5f["dtype"] = str(self._ims.dtype).encode()
metadata = h5f.create_group("metadata")
unwrap_dict_to_h5(metadata, self._process_meta())
Collaborator

Suggested change
- unwrap_dict_to_h5(metadata, self._process_meta())
+ unwrap_dict_to_h5(metadata, self._meta)

Can you just do this instead? I think we don't need self._process_meta() for the fch5 format.

We should also verify that this saves nested metadata correctly!
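
One way to reason about what "nested metadata saved correctly" means is to list every leaf a recursive writer has to visit. The sketch below walks a nested dict and yields `/`-separated paths, mirroring how groups nest in an HDF5 file. `iter_h5_paths` is a hypothetical helper written for this illustration; it is not hexrd's `unwrap_dict_to_h5` and makes no claim about how that function behaves.

```python
import numpy as np


def iter_h5_paths(meta, prefix=""):
    """Yield (path, value) pairs for a nested dict, one per leaf value.

    Nested dicts become '/'-separated path components, similar to nested
    HDF5 groups (hypothetical helper for illustration only).
    """
    for key, value in meta.items():
        path = f"{prefix}/{key}" if prefix else str(key)
        if isinstance(value, dict):
            yield from iter_h5_paths(value, path)
        else:
            yield path, value


meta = {
    "omega": np.array([0.0, 0.25]),
    "key": {"int": 1, "array": np.array([1, 2, 3])},
}
paths = dict(iter_h5_paths(meta))
```

Each resulting path (for example `key/int`) is a value that should survive a write and read round trip unchanged.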

Gives access to a plethora of compression algorithms for hdf5 files
.fch5 is an hdf5-based format for saving framecaches, consisting mainly of 3 datasets:

- 'data': (m,1) array holding the data values of all frames. `m` is
  evaluated at runtime
- 'indices': (m,2) array holding the row & col information for the
  values in data. 'data' together with 'indices' represent the data
  using the CSR format for sparse matrices.
- 'frame_ids': (2*nframes) holds the range that the i-th frame
  occupies in the above arrays, i.e. the information of the i-th frame
  can be accessed using:

  data_i = data[frame_ids[2*i]:frame_ids[2*i+1]] and
  indices_i = indices[frame_ids[2*i]:frame_ids[2*i+1]]
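
The three datasets described above could be produced from a stack of frames along the following lines. This is a minimal in-memory sketch that assumes a simple "keep values greater than the threshold" rule; the function name `sparsify_frames` is hypothetical and the real writer streams to HDF5 instead of concatenating lists. Storing an explicit (row, col) pair per value is the layout SciPy refers to as coordinate (COO) format, and the sketch follows the arrays as described here rather than any particular SciPy class.

```python
import numpy as np


def sparsify_frames(frames, threshold):
    """Return (data, indices, frame_ids) for an iterable of 2D frames."""
    data_parts, index_parts, frame_ids = [], [], []
    offset = 0
    for frame in frames:
        # Keep only pixels above the threshold (assumed selection rule).
        rows, cols = np.nonzero(frame > threshold)
        data_parts.append(frame[rows, cols])
        index_parts.append(np.column_stack([rows, cols]).astype(np.uint16))
        # Record the [start, stop) range this frame occupies.
        frame_ids.extend([offset, offset + len(rows)])
        offset += len(rows)
    data = np.concatenate(data_parts).reshape(-1, 1)
    indices = np.concatenate(index_parts).reshape(-1, 2)
    return data, indices, np.asarray(frame_ids)


frames = np.array(
    [[[0, 8, 0], [0, 0, 3]],
     [[0, 0, 0], [6, 0, 0]]],
    dtype=np.uint16,
)
data, indices, frame_ids = sparsify_frames(frames, threshold=2)
```

Note that `np.uint16` indices limit each frame dimension to 65535 rows or columns, which follows from the dtype stated for `indices`.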
Comment on lines 140 to 149

def test_fmtfc_nested_metadata(self):
    """frame-cache format with nested metadata"""
    metadata = {'int': 1, 'array': np.array([1, 2, 3])}
    self.is_a.metadata["key"] = metadata

    imageseries.write(
        self.is_a, self.fcfile, self.fmt, style=self.style,
        threshold=self.thresh, cache_file=self.cache_file
    )
    is_fc = imageseries.open(self.fcfile, self.fmt, style=self.style)
    self.assertTrue(compare_meta(self.is_a, is_fc))
Collaborator Author

Added test for checking nested metadata support in the new format.
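
Comparing nested metadata needs some care, because a plain `==` on dicts that hold NumPy arrays is unreliable: elementwise array comparison has no single truth value. The sketch below shows one possible recursive comparison. `nested_meta_equal` is a hypothetical helper for illustration and is not the `compare_meta` function used in the test above.

```python
import numpy as np


def nested_meta_equal(a, b):
    """Recursively compare two metadata values (hypothetical helper)."""
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(
            nested_meta_equal(a[k], b[k]) for k in a
        )
    if isinstance(a, dict) or isinstance(b, dict):
        return False  # one side is nested, the other is a leaf
    # Leaves: compare as arrays so scalars and arrays are handled alike.
    return bool(np.array_equal(np.asarray(a), np.asarray(b)))


original = {"key": {"int": 1, "array": np.array([1, 2, 3])}}
same = {"key": {"int": np.int64(1), "array": np.array([1, 2, 3])}}
different = {"key": {"int": 1, "array": np.array([1, 2, 4])}}
```

Here `same` differs from `original` only in the integer's type, which is the kind of change a round trip through a file format can introduce, so the comparison is by value rather than by Python type.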

@ChristosT
Collaborator Author

This is tested with the gui and works as expected.

Collaborator

@psavery psavery left a comment

Looks good to me!

@ChristosT ChristosT merged commit 9c595e1 into HEXRD:master Nov 14, 2024
6 checks passed
@ChristosT ChristosT deleted the add-hdf5-support branch November 14, 2024 15:10
3 participants