Header as a HDF5 compound datatype #2
> h5py then provides the ability to read individual fields directly.
Presumably this would end up very close to Davis' approach here? https://github.com/janelia-cosem/fibsem-tools/blob/f4bedbfc4ff81ec1b83282908ba6702baf98c734/src/fibsem_tools/io/fibsem.py#L81 It's smart and probably a better representation of what's going on, but is this kind of access standard across common HDF5 implementations? The HDF5 spec is colossal, so it wouldn't surprise me if many APIs only cover a subset of its functionality; in that case I'd prefer to target that common subset of "basic" features rather than go deep into the HDF5 spec to find something which is technically allowed but not available to many users.
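For reference, a minimal sketch of the field access in question, assuming a file `example.h5` containing a compound dataset `header` with a `FileVersion` field (all three names are hypothetical):

```python
import h5py

# Indexing a compound dataset by field name reads only that field from
# disk, without materialising the whole record.
with h5py.File("example.h5", "r") as f:
    ds = f["header"]                  # dataset with a compound dtype
    print(ds.dtype.names)             # the available field names
    version = ds["FileVersion"]       # hypothetical field; reads just this one
```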
I was thinking of this as a way to encode the jeiss-convert TSV files as a datatype in HDF5 itself. In the worst-case scenario, one could always use H5Dread to just read the bytes. Many packages support compound datatypes; perhaps the most common use of compound datatypes is complex numbers. Java: https://bitbucket.hdfgroup.org/pages/HDFFV/hdf5doc/master/browse/html/javadoc/index.html?hdf/hdf5lib/H5.html
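As a sanity check of how routine compound support is: h5py itself round-trips NumPy complex arrays through a two-field compound type, so any reader with basic compound support can consume them. A short demonstration (file name is made up):

```python
import h5py
import numpy as np

# h5py stores complex arrays as an HDF5 compound type with "r"/"i" fields.
with h5py.File("complex_demo.h5", "w") as f:
    f["z"] = np.array([1 + 2j, 3 - 4j])

with h5py.File("complex_demo.h5", "r") as f:
    print(f["z"].dtype, f["z"][()])  # read back as complex128
```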
JHDF5, which is currently used by the Java tools BigDataViewer and SciJava (FIJI, etc.), has a compound datatype reader here:
@clbarnes, let me know if you have time to chat for a few minutes. One concern about embracing HDF5 for this is that we're not sure it works for everyone at Cambridge. Albert in particular seemed to prefer text-based attributes via JSON or similar.
I actually have a fork which writes to zarr, which is exactly that: a JSON file for the metadata, plus an npy-esque binary dump (which can be chunked). Zarr is getting a lot of attention, but the spec is anticipated to change some time soon, in a way which will make it less convenient for this sort of thing. I'm flexible for the rest of the week if we can figure out time differences! I'm in BST.
Yes, I participated in the discussion on the Zarr shard specification that should be part of v3. Extracting that indexing from HDF5 should be quite fast. Another extreme is https://github.com/HDFGroup/hdf5-json. Nonetheless, once we have the data in one standard format, I do not mind investing in tooling to move between standard formats, or using something like kerchunk. The best part is that the tooling may already exist.
I have an implementation of this with a convenient interface. It also gets more complicated to add the zarr/n5 implementations, which don't support compound dtypes (to my knowledge); in those cases you'd need to serialise the metadata as attributes anyway, which is also more convenient for downstream users. I'm not entirely convinced zarr/n5 support is a good way to go: keeping everything contained in the same file, with a single supported workflow from proprietary to open format, is a benefit, and given that these files will almost certainly require post-processing, downstream users can write to other formats at that stage if they want.
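For illustration, a hedged sketch of what the attribute-based route looks like in zarr (the group path, array name, and field names are all made up); the metadata lands in a plain-text `.zattrs` JSON file next to the chunks:

```python
import numpy as np
import zarr

# zarr has no compound attribute type, so header fields become plain
# JSON attributes; values must therefore be JSON-serialisable.
g = zarr.open_group("example.zarr", mode="w")
arr = g.create_dataset("raw", data=np.zeros((8, 8), dtype="i2"))
arr.attrs.update({"FileVersion": 8, "XResolution": 2048})  # made-up fields
```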
This was a stated goal of the last round, to help ensure round-trip durability. Originally it was just going to be an opaque datatype or byte stream, but I realized that we may be able to do better with the compound datatype. We do not want to depend on someone bumping the version number, or on the accuracy of the reader's table of offsets and types, in order to preserve the header. One option might be to save the 1 KB header as a separate file for reference. For Zarr this might just be an opaque block of bytes. N5 has an N5-HDF5 backend that may be able to take advantage of the compound datatype.
My current implementation does store the raw header as well as the exploded metadata values, without using the compound dtype. For HDF5, the raw header is stored as a u8 attribute. The compound dtype is just calculated from the table of offsets and dtypes, so it isn't any more robust in that respect; I don't think there's a better way to do that which doesn't just duplicate the information and introduce a new source of error. The reader doesn't need to explicitly state the version, as it's read from the metadata, and (in my implementation, anyway) it will fail if the version's spec isn't known.
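In h5py, that raw-header attribute looks roughly like this (the dataset name, attribute name, and the all-zero header bytes are placeholders):

```python
import h5py
import numpy as np

header_bytes = b"\x00" * 1024  # stand-in for the real 1 KB Jeiss header

with h5py.File("example.h5", "w") as f:
    ds = f.create_dataset("raw", shape=(8, 8), dtype="i2")
    # Store the untouched header bytes as a uint8 array attribute, so the
    # original file can be reconstructed regardless of how parsing evolves.
    ds.attrs["header"] = np.frombuffer(header_bytes, dtype=np.uint8)
```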
HDF5 has the capability to create a compound dataset, which is analogous to a C struct.
https://portal.hdfgroup.org/display/HDF5/Datatype+Basics#DatatypeBasics-compound
https://api.h5py.org/h5t.html#compound-types
It may also be possible to construct this from a NumPy record array. I suspect it may be easier to use the low-level API with the TSV files that you created.
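A hedged sketch of the record-array route, staying in the high-level API: derive a NumPy structured dtype (with explicit offsets) from rows like those in the version TSVs, reinterpret the raw header bytes through it, and let h5py map the structured scalar to an HDF5 compound dataset. The two rows below are illustrative, not the actual schema.

```python
import h5py
import numpy as np

# (name, byte offset, big-endian dtype) rows, as a TSV might describe them
rows = [("FileMagicNum", 0, ">u4"), ("FileVersion", 4, ">u2")]  # illustrative
names, offsets, formats = zip(*rows)
dtype = np.dtype({"names": list(names), "formats": list(formats),
                  "offsets": list(offsets), "itemsize": 1024})

header_bytes = b"\x00" * 1024                    # stand-in for a real header
record = np.frombuffer(header_bytes, dtype=dtype)[0]

with h5py.File("example.h5", "w") as f:
    f.create_dataset("header", data=record)      # written as an HDF5 compound
```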