Header as a HDF5 compound datatype #2
> h5py then provides the ability to read individual fields directly.
Presumably this would end up very close to Davis' approach here? https://github.com/janelia-cosem/fibsem-tools/blob/f4bedbfc4ff81ec1b83282908ba6702baf98c734/src/fibsem_tools/io/fibsem.py#L81 It's smart and probably a better representation of what's going on, but is this kind of access standard across common HDF5 implementations? The HDF5 spec is colossal, so it wouldn't surprise me if many APIs only cover a subset of its functionality; in that case I'd prefer to target that common subset of "basic" features rather than go deep into the HDF5 spec to find something which is technically allowed but not available to many users.
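For reference, a minimal sketch of the field access in question, assuming a file `example.h5` containing a compound dataset `header` with a `FileVersion` field (all three names are hypothetical):

```python
import h5py

# Indexing a compound dataset by field name reads only that field from
# disk, without materialising the whole record.
with h5py.File("example.h5", "r") as f:
    ds = f["header"]                  # dataset with a compound dtype
    print(ds.dtype.names)             # the available field names
    version = ds["FileVersion"]       # hypothetical field; reads just this one
```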
I was thinking of this as a way to encode the jeiss-convert TSV files as a datatype in HDF5 itself. In the worst-case scenario, one could always use H5Dread to just read the bytes. Many packages support compound datatypes; perhaps the most common use of compound datatypes is complex numbers. Java: https://bitbucket.hdfgroup.org/pages/HDFFV/hdf5doc/master/browse/html/javadoc/index.html?hdf/hdf5lib/H5.html
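As a sanity check of how routine compound support is: h5py itself round-trips NumPy complex arrays through a two-field compound type, so any reader with basic compound support can consume them. A short demonstration (file name is made up):

```python
import h5py
import numpy as np

# h5py stores complex arrays as an HDF5 compound type with "r"/"i" fields.
with h5py.File("complex_demo.h5", "w") as f:
    f["z"] = np.array([1 + 2j, 3 - 4j])

with h5py.File("complex_demo.h5", "r") as f:
    print(f["z"].dtype, f["z"][()])  # read back as complex128
```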
JHDF5, which is currently used by the Java tools BigDataViewer and SciJava (FIJI, etc.), has a compound datatype reader here:
@clbarnes, let me know if you have time to chat for a few minutes. One concern about embracing HDF5 for this is that we're not sure it works for everyone at Cambridge. Albert in particular seemed to prefer text-based attributes via JSON or similar.
I actually have a fork which writes to zarr, which is exactly that: a JSON file for the metadata, plus an npy-esque binary dump (which can be chunked). Zarr is getting a lot of attention, but the spec is anticipated to change some time soon, in a way which will make it less convenient for this sort of thing. I'm flexible for the rest of the week if we can figure out time differences! I'm in BST.
Yes, I participated in the discussion on the Zarr shard specification that should be part of v3. Extracting that indexing from HDF5 should be quite fast. Another extreme is https://github.com/HDFGroup/hdf5-json. Nonetheless, once we have the data in one standard format, I do not mind investing in tooling to move between standard formats, or using something like kerchunk. The best part is that the tooling may already exist.
I have an implementation of this with a convenient interface. It also gets more complicated to add the zarr/n5 implementations, which don't support compound dtypes (to my knowledge); in those cases you'd need to serialise the metadata as attributes anyway, which is also more convenient for downstream users. I'm not entirely convinced zarr/n5 support is a good way to go: keeping everything contained in the same file, with a single supported workflow from proprietary to open format, is a benefit, and given that these files will almost certainly require post-processing, downstream users can write to other formats at that stage if they want.
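For illustration, a hedged sketch of what the attribute-based route looks like in zarr (the group path, array name, and field names are all made up); the metadata lands in a plain-text `.zattrs` JSON file next to the chunks:

```python
import numpy as np
import zarr

# zarr has no compound attribute type, so header fields become plain
# JSON attributes; values must therefore be JSON-serialisable.
g = zarr.open_group("example.zarr", mode="w")
arr = g.create_dataset("raw", data=np.zeros((8, 8), dtype="i2"))
arr.attrs.update({"FileVersion": 8, "XResolution": 2048})  # made-up fields
```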
This was a stated goal of the last round, to help ensure round-trip durability. Originally it was just going to be an opaque datatype or byte stream, but I realized that we may be able to do better with the compound datatype. We do not want to depend on someone bumping the version number, or on the accuracy of the reader's table of offsets and types, in order to preserve the header. One option might be to save the 1 KB header as a separate file for reference. For Zarr this might just be an opaque block of bytes. N5 has an N5-HDF5 backend that may be able to take advantage of the compound datatype.
My current implementation does store the raw header as well as the exploded metadata values, without using the compound dtype. For HDF5, the raw header is stored as a u8 attribute. The compound dtype is just calculated from the table of offsets and dtypes, so it isn't any more robust in that respect; I don't think there's a better way to do that which doesn't just duplicate the information and introduce a new source of error. The reader doesn't need to explicitly state the version, as it's read from the metadata, and (in my implementation, anyway) it will fail if the version's spec isn't known.
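In h5py, that raw-header attribute looks roughly like this (the dataset name, attribute name, and the all-zero header bytes are placeholders):

```python
import h5py
import numpy as np

header_bytes = b"\x00" * 1024  # stand-in for the real 1 KB Jeiss header

with h5py.File("example.h5", "w") as f:
    ds = f.create_dataset("raw", shape=(8, 8), dtype="i2")
    # Store the untouched header bytes as a uint8 array attribute, so the
    # original file can be reconstructed regardless of how parsing evolves.
    ds.attrs["header"] = np.frombuffer(header_bytes, dtype=np.uint8)
```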
HDF5 has the capability to create a compound dataset, which is analogous to a C struct.
https://portal.hdfgroup.org/display/HDF5/Datatype+Basics#DatatypeBasics-compound
https://api.h5py.org/h5t.html#compound-types
It may also be possible to construct this from a NumPy record array. I suspect it may be easier to use the low-level API with the TSV files that you created.
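A hedged sketch of the record-array route, staying in the high-level API: derive a NumPy structured dtype (with explicit offsets) from rows like those in the version TSVs, reinterpret the raw header bytes through it, and let h5py map the structured scalar to an HDF5 compound dataset. The two rows below are illustrative, not the actual schema.

```python
import h5py
import numpy as np

# (name, byte offset, big-endian dtype) rows, as a TSV might describe them
rows = [("FileMagicNum", 0, ">u4"), ("FileVersion", 4, ">u2")]  # illustrative
names, offsets, formats = zip(*rows)
dtype = np.dtype({"names": list(names), "formats": list(formats),
                  "offsets": list(offsets), "itemsize": 1024})

header_bytes = b"\x00" * 1024                    # stand-in for a real header
record = np.frombuffer(header_bytes, dtype=dtype)[0]

with h5py.File("example.h5", "w") as f:
    f.create_dataset("header", data=record)      # written as an HDF5 compound
```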