[ENH] Proposal for multidimensional array file format #197
Comments
Note that there are two versions of npy, so compatibility levels of 1 and 2 should be assessed. My primary concern with HDF5 is that it's just a container, and we will find ourselves defining formats. Perhaps just saying it contains only a dataset with name |
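To make the point about npy versions concrete, here is a minimal sketch of how a reader could check which format version a file declares. It assumes NumPy is available and uses its `numpy.lib.format` helpers; the `npy_version` helper is a name made up for this illustration.

```python
import io

import numpy as np
from numpy.lib import format as npy_format


def npy_version(buf: bytes) -> tuple:
    """Return the (major, minor) version declared in the .npy magic string."""
    return tuple(npy_format.read_magic(io.BytesIO(buf)))


arr = np.arange(6, dtype="<f4").reshape(2, 3)

# np.save chooses the oldest version that can hold the header,
# which is 1.0 for an ordinary array like this one.
v1 = io.BytesIO()
np.save(v1, arr)

# write_array lets the caller request version 2.0 explicitly.
v2 = io.BytesIO()
npy_format.write_array(v2, arr, version=(2, 0))

print(npy_version(v1.getvalue()), npy_version(v2.getvalue()))
```

A validator could use a check of this kind to reject versions the specification does not list.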
in a different world, but probably related to computational models BRAIN has funded development of the NWB standard. to the extent that needs may become similar, it may be worthwhile thinking about supporting NWB in BIDS. this will make the metadata world both easier (included in the NWB file) and harder (non conformant with BIDS), depending on your point of view. however, the NWB folks are also considering alternatives like exdir, which is like HDF5 but with external metadata and binary blobs as numpy files. |
Sorry: could I ask for a bit more context? What kind of data will be stored in these files? If it's large enough to justify parallel processing of its contents, allow me to throw in a plea to consider zarr compatibility. I think that HDF5 could be made to play nice with zarr. |
@satra In principle that seems fine, but their HDF5 format looks basically like HDF5 + some mandatory metadata, so if flexibility is a potential downside, it persists. If it's not a downside, then I have no principled objection. @arokem The issue driving us here is less the size of the data than the dimensionality. That said, there's no reason that the files couldn't get large enough for random and parallel access to be concerns, which is why I think HDF5 is my inclination (despite my above-noted reservations). The goal is wide interoperability (in particular, C, R, MATLAB and Python) and not reinventing the wheel, so if that format fits, I for one am happy to consider it. |
@arokem - the NWB folks are also considering zarr compatibility, especially with the N5 API. which would also constrain HDF5, since N5 doesn't support all aspects of it. |
Yup. For reference: NeurodataWithoutBorders/pynwb#230 |
On one hand I am in strong favor of reusing someone else's "schema" and possibly "tooling" on top of HDF5 (container)! NWB might (do not know how well it aligns with the needs of ComputationalModels metadata) be a good choice. Import/export "compatibility" with other computation-oriented formats (like zarr) might be a plus. BUT thinking in conjunction with 2. -- if we choose a "single file" container format to absorb both (data and metadata), we would step a bit away from "human accessibility" of BIDS. We already have an issue of metadata location duality, e.g. it being present in the data files (nii.gz) headers -- "for machines", and some (often additional but sometimes just duplicate) in sidecar files -- "for machines and humans" (related - recent #196). Sure thing bids-validator could assure consistency, but we are subconsciously trying to avoid such redundancy, and I wonder if that might still be a way to keep going. Maybe there is a lightweight format (or some basic "schema" for HDF5) which would not aim to store any possible metadata, but just store what is minimally sufficient for easy and unambiguous IO of multi-dimensional arrays (if that is the goal here). And then pair it up with the sidecar .json file for convenient access to metadata (defined in BIDS, if there is no existing schema for "ComputationalModels" elsewhere to reuse; not duplicated in the actual data file) for easy human and machine use (without requiring to open the actual data file which would require tooling)? If we end up with a single file format to contain both -- I think we might need to extract/duplicate metadata in a sidecar file anyways for easier human (and at times tools) consumption. |
@yarikoptic sorry, I realize on re-read that I wasn't clear, but your
proposed approach (putting metadata in the json sidecar and only the raw
ndarray in the binary file) is exactly what we seemed to converge on at the
end of the BIDS-CM meeting. (I.e., the sidecar would supply the metadata needed to interpret the common-format array appropriately for the use case specified in the suffix.)
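As an illustration of the sidecar approach described above, here is a rough sketch that writes only the raw ndarray to a binary file and keeps all descriptive metadata in a JSON file next to it. The file name, the `AxisLabels` key, and the `write_with_sidecar` helper are invented for this example and are not part of any BIDS specification.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def write_with_sidecar(stem: Path, data: np.ndarray, metadata: dict) -> None:
    """Save the raw array as <stem>.npy and its description as <stem>.json."""
    np.save(stem.with_suffix(".npy"), data)
    stem.with_suffix(".json").write_text(json.dumps(metadata, indent=2))


with tempfile.TemporaryDirectory() as tmp:
    stem = Path(tmp) / "sub-01_desc-example_model"  # hypothetical file name
    weights = np.zeros((4, 4, 3), dtype="<f4")
    # "AxisLabels" is a hypothetical metadata key used only for illustration.
    write_with_sidecar(stem, weights, {"AxisLabels": ["source", "target", "lag"]})

    # The metadata can be read without opening the binary file at all.
    sidecar = json.loads(stem.with_suffix(".json").read_text())
    loaded = np.load(stem.with_suffix(".npy"))
```

Because the binary file carries no metadata of its own, nothing is duplicated between the two files.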
|
I am delighted to hear that similar minded us independently decided to contribute the XXXX-th model of the wheel to the humanity! FWIW, I ran into https://news.ycombinator.com/item?id=10858189 on https://cyrille.rossant.net/moving-away-hdf5/ (even @chrisgorgo himself commented on there) -- seems a good number of groups/projects ended up switching from HDF5 to some ad-hoc data blob + metadata files "format". May be it would be worth stating the desired features (I think those weren't mentioned)? e.g. which among the following would be most important?
or in other words - aiming for processing or archival? if aiming for archival - probably compression is heavily desired... may be could be optional (we already have both .nii and .nii.gz supported IIRC, so could be |
@yarikoptic - be careful of that blog post (i think it leads a lot of people astray), and do read all the threads that have emanated from it. for every such use case its easy to point to MATLAB and say that they use it for their base data format. also there are enough posts out there to also say that people who moved away ended up requiring many of the facilities of hdf5 and switching back to it. finally you should take a look at exdir and zarr as well as pointed in earlier threads, and in this followup thread to cyrille's original post and it's comments including the earliest one by konrad hinson (https://cyrille.rossant.net/should-you-use-hdf5/). at the end of the day it's mostly about blobs and metadata. what one uses to house and link these things is indeed going to keep evolving depending on use cases. so i think the important thing is to think of the use cases, in both short term and to the extent possible longer term. i like the questions that you have raised, and i think more than the format itself, the thought process should be around those required features, including archiving. i'm not saying hdf5 is the answer here nor am i saying hdf5 is issue free, but i have also used it through MATLAB and Python over many years, for my use cases, without an issue. i would need to know their specific goals, applications, and use cases to make an informed judgment. |
We've made simple use of HDF5 (often just one or two datasets) for heavy numerical data (well, MB to TBs) in TVB, a computational modeling environment, for the last 7 years without the problems cited in Rossant's blogpost, mainly by keeping usage simple and heavily vetting library usage prior to version changes. I'd expect transparent compression (lz4 has nearly no CPU overhead) and memmapping are particularly useful for BIDS CM. |
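For readers unfamiliar with memory mapping, the sketch below shows the idea with a plain `.npy` file rather than HDF5, so it is only an analogy to the usage described above. It assumes NumPy is available; the file name is arbitrary. Note that memory mapping of this kind applies to uncompressed data.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "timeseries.npy"  # arbitrary example file name
    np.save(path, np.arange(1_000_000, dtype="<f8").reshape(1000, 1000))

    # mmap_mode="r" maps the file instead of reading it all into memory.
    mapped = np.load(path, mmap_mode="r")
    is_memmap = isinstance(mapped, np.memmap)

    # Only the pages backing this slice need to be read from disk.
    window = np.array(mapped[10:12, :5])

    # Release the mapping before the temporary directory is removed.
    del mapped
```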
I've asked the participants in the computational models meeting to contribute their specific use cases, but I'll try to summarize according to my memory.
I think there were a couple other examples, but as it became clear that some kind of multidimensional array would likely be the result, we did not compile a specific enumeration of all the needed properties, so hopefully we'll get some feedback. Perhaps @maedoc can clarify the TVB uses that aren't suited to TSV/NIfTI, and what their minimal criteria and additional desiderata are. |
Surfaces & sparse matrices come to mind; these have straightforward serializations to arrays, so I would specify conventions for the serialization (e.g. faces, triangles, 0-based; sparse CSR, 0-based) instead of worrying about a new format. |
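A minimal sketch of the sparse CSR convention mentioned above, using only NumPy: a 2D matrix is serialized into three flat, 0-based arrays (`data`, `indices`, `indptr`) that any array format could store. The function names are made up for this illustration, and the array names follow common CSR usage rather than any agreed BIDS convention.

```python
import numpy as np


def dense_to_csr(dense):
    """Serialize a 2D array into 0-based CSR arrays (data, indices, indptr)."""
    dense = np.asarray(dense)
    rows, cols = np.nonzero(dense)  # returned in row-major order
    data = dense[rows, cols]
    indptr = np.zeros(dense.shape[0] + 1, dtype=np.int64)
    # indptr[i + 1] - indptr[i] is the number of stored values in row i.
    np.cumsum(np.bincount(rows, minlength=dense.shape[0]), out=indptr[1:])
    return data, cols.astype(np.int64), indptr


def csr_to_dense(data, indices, indptr, shape):
    """Rebuild the dense matrix from its CSR arrays."""
    dense = np.zeros(shape, dtype=data.dtype)
    for row in range(shape[0]):
        start, stop = indptr[row], indptr[row + 1]
        dense[row, indices[start:stop]] = data[start:stop]
    return dense


matrix = np.array([[0, 2, 0], [0, 0, 0], [1, 0, 3]])
data, indices, indptr = dense_to_csr(matrix)
restored = csr_to_dense(data, indices, indptr, matrix.shape)
```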
Surfaces will be covered in GIFTI. What do you currently use HDF5 for? |
We don't use HDF5 for relational metadata, which is stored in an SQL DB and sidecar XML files, but just about everything else. |
Okay. To get back to @yarikoptic's desiderata:
Agreed, this is most important IMO.
I see these three as basically related. Whether you want slicing for parallel access or just to avoid loading a ton of memory, if this isn't provided, the thing people are going to do is immediately convert to something that can be chunked for good performance over the desired slices and
I guess I'd say it should be an option. There are dense data that are difficult to compress where I may be prematurely pessimistic, but I don't see much hope for pleasing even a simple majority of people with any of the choices discussed here. (I may be projecting and it is just the case that I won't be pleased by my prediction of the majority's choice.) Another option to consider is not requiring a specific binary format, letting apps deal with the choice, and wait for some consensus to emerge in the community. If in a few years all MD arrays are, say, I would then add these conditions:
|
JSON is hardly ideal, but once it's chosen, use cases and implementations can get done, exploring the positives/negatives of the choice. You should just declare a fiat format ( |
Well, if we can consider JSON an acceptable choice, then I would probably just push on with |
I just want to let everyone know I am currently working on a new neuroimaging data interchange format, called JNIfTI. My current draft of the file specification can be found at https://github.com/fangq/jnifti/ together with a MATLAB nifti-1/2 to jnifti converter and jnii/bnii data samples. https://github.com/fangq/jnifti/blob/master/lib/matlab/nii2jnii.m The basic idea is to use JSON and binary JSON (UBJSON) format to store complex scientific data, and completely get rid of a rigid, difficult-to-extend binary header. This makes the data readable, easy to extend, and easy to mix with scientific data from other domains (like multi-modal data, physiology recordings, or computational models etc). There are also numerous JSON/UBJSON parsers out there, so, without writing any new code, a JNIfTI file can be readily parsed by these existing codes. JNIfTI is designed with a compatibility layer to 100% translate the NIFTI-1/2 header/data/extension to the new structure, but once it is moved to JSON, you gain enormous flexibility to add new metadata and header info, organize multiple datasets inside one document etc. I'd love to hear from this community what additional information is currently lacking, and am happy to accept proposals on defining new "required" metadata headers in this format. My preference is to gradually shift the main metadata container from the https://github.com/fangq/jnifti/blob/master/JNIfTI_specification.md#structure-form look forward to hearing from you. PS: The foundation of the JNIfTI format is another specification called JData - a proposal to systematically serialize complex data structures, such as N-D arrays, trees, graphs, linked lists etc. The JData specification is currently in Draft 1, and can be found at |
I'm also all for @yarikoptic's approach. Note that electrophys derivatives have the same issue with processed data typically in a different format, and we need a common ground. I discussed HDF5 with @GaelVaroquaux who has a strong opinion against it (maybe he can comment on that). I'm sure @jasmainak made a table of pros and cons of various formats already - but I cannot find it? |
as an additional point, I was wondering if you should state somewhere in the specification that any derived data that can be stored using the native format must do so (e.g. keep nii as long as possible and do not start using the 'whatever' other format we decide to support as well) |
I discussed HDF5 with @GaelVaroquaux who have a strong opinion against it
I don't have a strong opinion against it. I just look at the past. A
format using it was proposed years ago in the community. It was rejected
by major actors because of the cost of supporting it.
|
@CPernet I'm not hearing anybody clamoring for HDF5, and several voices at least wary of it. My inclination at this point is to push on with If we do want to resume consideration of options, I can start a list of pros/cons: HDF5Pros:
Cons:
The former can be addressed by the spec and easily validated. And it's possible that parsing an HDF5 file with a single data blob would not be very problematic for an independent implementation. npyPros:
Cons:
I think that might be going a bit far. For instance, per-ROI time series could be encoded in NIfTI, but not very naturally. TSV would make more sense, but a strict reading of this proposed rule would lend itself to contorting to keep things in NIfTI. But the overall sentiment seems reasonable. I think a simple statement along those lines, but with a SHOULD, such that any deviation would need to be made with good reason, would be useful guidance. |
This is offset by HDF5 being a single, C-based, strictly versioned API/ABI implementation deal, e.g. a browser-based app can't ingest these files, a JVM app has to go through JNI, Julians who want pure Julia stuff won't be happy, etc.
Is offset by simple format; asking for simple, fast & small is greedy (have you ever listened to the clock tick while running
You don't have to call it NumPy if you reproduce the definition as part of the standard; NumPy "compatibility" falls out as a happy side effect. If NumPy project decides to change formats down the line, you avoid another problem |
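To give a sense of how small the format definition is, here is a rough sketch of an independent reader that parses the `.npy` preamble with only the Python standard library. NumPy appears solely to produce a sample file. The sketch handles versions 1.0 and 2.0 only and decodes the payload under the assumption of a little-endian float32, C-ordered array; the `read_npy_header` name is invented for this example.

```python
import ast
import io
import struct

import numpy as np  # used only to create the sample file

MAGIC = b"\x93NUMPY"


def read_npy_header(stream):
    """Parse an .npy preamble and return (version, header dict)."""
    if stream.read(len(MAGIC)) != MAGIC:
        raise ValueError("missing .npy magic string")
    major, minor = struct.unpack("<BB", stream.read(2))
    # Version 1.0 stores the header length as uint16, version 2.0 as uint32.
    if major == 1:
        (header_len,) = struct.unpack("<H", stream.read(2))
    elif major == 2:
        (header_len,) = struct.unpack("<I", stream.read(4))
    else:
        raise ValueError(f"unsupported .npy version {major}.{minor}")
    # The header is a Python dict literal padded with spaces.
    header = ast.literal_eval(stream.read(header_len).decode("latin1"))
    return (major, minor), header


sample = io.BytesIO()
np.save(sample, np.arange(6, dtype="<f4").reshape(2, 3))
sample.seek(0)

version, header = read_npy_header(sample)
count = 1
for extent in header["shape"]:
    count *= extent
# Assumes descr "<f4": little-endian 4-byte floats in C order.
values = struct.unpack(f"<{count}f", sample.read(count * 4))
```

The same steps translate directly to other languages, which is the interoperability argument being made here.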
Following @GaelVaroquaux 'weak' opinion :-) if maintenance is an issue we should not go for HDF5. |
Happy with having a statement and use SHOULD (I was not actually thinking .nii that much but .edf for electrophys) |
A few quick comments:
and compression via a
|
@maedoc Thanks for those thoughts.
This is a pretty strong argument against HDF5, IMO. The Javascript validator is critical BIDS infrastructure, so specifying something it can't validate seems like a bad move. There are NodeJS bindings, so one option would be for the browser to warn on ndarrays and say "Use the CLI to fully validate." I don't really like it, but that's an option. I'm not sure that a distaste for C bindings among some language partisans should be a significant criterion. It's obviously not ideal, but I don't think there are ideal solutions, here.
We haven't done something like this, up to this point. Referencing existing standards has been BIDS' modus operandi, and I think changing that shouldn't be done lightly. We can specify a given version of
Unfortunately, there isn't really a language agnostic format for basic, typed, n-dimensional arrays. |
Not that I'm too keen on HDF5 but cannot we expect this to be solved with WebAssembly? And this makes me come across yet another project... |
Is there a reason that the Arrow format isn't being considered |
@Tokazama - are there examples of storing n-d arrays in arrow? and doing chunked (space and/or time) operations on them? |
Arrow is fundamentally a tabular format. You can put "tensors" as items in Arrow column. But Arrow has no way to represent these as chunks of a larger array, nor does it allow the notion of chunking across multiple dimensions. |
Any format that supports memory mapping (no inscrutable compression) should be fine for multi-dimensional chunking. Chunking is only an issue with table like data where you have non-uniform bits encoding that you need to somehow change the stream to encode differently periodically. I'm not sure what operations besides reading and writing you want. Once it is loaded in any language it should be treated like any other array data.
Well, that's how it's most often used but I definitely use it to store non-table data. Is the goal here to find a single format that can be used to represent multidimensional tables and arrays, on disk and on server so we don't have to mess with all the file types we have now? That seems like a tall order (perhaps unrealistic). |
The goal is always the same, share data in a way that is easily understood - nobody says one has to compute with the same format ... IMO this is out of scope to focus on other aspects but accessibility (ie inter language support) |
The goal is to generalize over TSV, which permits a collection of named (via the column headers) 1D arrays. Zarr/h5 permit named ND arrays. The isomorphism ensures that many languages can use a single API to access either. This is a new file type (to BIDS) and is not intended to replace other file types. Last time arrow came up, I did explore it and was able to save/load ND arrays with its Python API. If I recall correctly, this feature was not uniformly implemented for all languages. |
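The analogy between a TSV and a collection of named arrays can be sketched as follows. The TSV side uses the standard `csv` module; the ND side uses NumPy's `.npz` archive purely as a stand-in for Zarr/HDF5, since the point is only the shared "name to array" access pattern. Column names, array names, and shapes are invented for the example.

```python
import csv
import io

import numpy as np

tsv_text = "onset\tduration\n0.0\t1.5\n2.0\t1.5\n4.0\t3.0\n"


def read_tsv_columns(text):
    """Read a TSV as a mapping from column header to 1D array."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(float(value))
    return {name: np.array(values) for name, values in columns.items()}


table = read_tsv_columns(tsv_text)

# Stand-in for a Zarr/HDF5 file: a mapping from name to ND array.
buffer = io.BytesIO()
np.savez(buffer, connectivity=np.zeros((4, 4)), timeseries=np.ones((4, 10, 2)))
buffer.seek(0)
with np.load(buffer) as archive:
    arrays = {name: archive[name] for name in archive.files}
```

In both cases the caller looks up an array by name; only the dimensionality of the values differs.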
Previously, Chris (@neurolabusc) had created a benchmark project with sample data to allow comparisons of sizes and saving/loading speed of various surface-mesh-related formats. I thought that was a great idea. would the group be interested in setting up something like that so that everyone can see/test the pros and cons of various volumetric data formats mentioned above? this also gives people a clear expectation on the type of data structures that this discussion is aimed to generalize. |
Benchmarks are great but this discussion is a bit all over the place. Tables of metadata and multidimensional data have separate issues to be considered. Outperforming a human-readable, text-based format for tables is trivial, so comparison of CSV or TSV to anything isn't necessary. If we can agree on a well-established byte storage format for tables then that's a no-brainer. ND array formats are more difficult to motivate because it's another format that BIDS-compliant software would need to have some level of support for and we already have a method for storing those that can be efficient. If you're looking at mesh formats, then that should be its own venture to converge on a standard. Perhaps I'm the odd man out here, and it's perfectly clear to everyone else. If so my apologies. Otherwise this whole thing needs to be pinned down to a single objective in order to move forward |
it may be good to revive this discussion as i'm seeing a few upcoming use cases that will require a more sophisticated consideration for many things that are now in TSVs. here is a temporary proposal to narrow down the conversation.
there are also similar approaches in the increasingly API-based and commercial offerings for the world of storage (polars - columnar storage for table like data, arraylake - version controlled zarr in the cloud) |
I am +1 for parquet to be adopted for any TSV data files (physio, stim, motion, blood). It's an open spec with broad implementation and readily available command-line tools for inspection. I think it should probably be discouraged if not prohibited for metadata files ( I am not sure that there is an actual "to-do" here for N-dimensional named arrays except to adopt them in principle so that a BEP that needs this structure can use it. I do not think there is any call to allow an |
#1614 is what we proposed for zarr and hdf5 as we (almost all here in this thread) have discussed -- it was reviewed and passed the CI tests -- we just need someone to merge it. No idea about parquet - what are the cases you have in mind? I think as long as we have a good case for that and a BIDS example, good open source formats are what we need |
I don't know how much priority BIDS developers give to Octave - at least for me, for the purpose of batch/parallel processing without being limited by a license, it is quite a useful platform given that lots of data analysis tools were written in the MATLAB language. One thing I want to mention is that it still does not have HDF5 support - I ran into this when processing SNIRF data |
Hey all, I'm in the neurophysiology/NWB community and just recently getting deep into BIDS. This thread was a great read. It's cool to see such a thriving community here working on this. Another +1 for the usage of parquet for tabular data, e.g. physio, stim, motion, etc. I like to call these types of data "measurements" and call e.g. participants.tsv, electrodes.tsv, etc. "records." The current TSV have some problems that are limiting for measurements:
Gzipping the TSVs doesn't really solve any of these issues. Parquet is more performant in read, write, and storage volume, and is an open standard with large cross-platform support. We are looking at adopting BIDS for neurophysiology applications. Without a binary-style filetype option, we would need to convert our efficient data storage solutions into TSV which is a much less efficient/performant file type than the current solution. Being able to use parquet for physio etc. would make me much more comfortable with adopting BIDS. I also think this should be considered separately from a format for ND data. I have thoughts on that as well but will save that for another post. |
For multi-dimensional arrays, @effigies' post #197 (comment) did a good job of summarizing the pros and cons. IMO it's worth using a standard that allows chunking and compression like HDF5 or Zarr. These multi-dimensional arrays can get quite big and being able to compress and have direct access facilitates certain kinds of applications and use-cases. I'm not aware of a similar standard that only allows a single dataset, but it would not be hard to impose that for HDF5 or Zarr. A good example of a use-case that leverages the compression and direct access of HDF5/Zarr is Neurosift developed by @magland. This project also demonstrates that you can indeed read HDF5 files in JavaScript as part of web applications. Here is an example of a view that demonstrates streaming a portion of NWB data on DANDI directly from an HDF5 file on S3 into the browser and plotting it. DANDI wants large datasets to be compressed and Neurosift would not be possible without direct access for large files. As @satra mentioned, Zarr is better than HDF5 for cloud compute. The issue is really the C library h5lib rather than the HDF5 standard itself. @rabernat mentioned kerchunk, which fixes this problem for HDF5 by indexing all the chunks and creating a json file that expresses them as a Zarr dataset. Then you can read the HDF5 file directly using Zarr tools, which can read chunks faster and in parallel. In the last few weeks, @magland built a kerchunk-inspired approach to pre-index HDF5 files on DANDI, which has dramatically sped up Neurosift without moving the data from the original HDF5 files. Going straight Zarr is fine for the cloud, but you may run into problems with certain compute environments, e.g. HPC and local. With Zarr, every data chunk is its own file so you may end up with an inode issue. That is to say, you simply have too many files for your computer to keep track of.
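The inode concern can be estimated with simple arithmetic: in an unsharded, directory-style Zarr store, the number of chunk files is the product, over all dimensions, of the array extent divided by the chunk extent, rounded up. The shapes below are invented for illustration, and metadata files are not counted.

```python
import math


def chunk_file_count(shape, chunks):
    """Number of chunk files for an array split into regular chunks."""
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))


# Hypothetical recording: 100,000 samples by 384 channels.
coarse = chunk_file_count((100_000, 384), (1000, 64))
fine = chunk_file_count((100_000, 384), (100, 8))
print(coarse, fine)
```

Shrinking the chunks by a factor of ten along the first axis and eight along the second multiplies the file count by eighty, which is how a large dataset can exhaust a filesystem's inode quota.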
The Zarr community is working on solving this with sharding as @satra mentioned, which groups chunks into larger files (see ZEP002). This has been accepted by the Zarr community for v3, but to my knowledge has not been implemented. I've been experimenting a bit with an approach that is a combo of @SylvainTakerkart 's suggestion and kerchunk. You can write an external file that can define a small JSON index that points into any chunked or non-chunked binary data file in Zarr style. Then you can use the Zarr API to read from Zarr, npy, HDF5, TIFF, and also lots of proprietary formats like spikeGLX, Blackrock, and OpenEphys binary. You can also define extra metadata in this json to annotate dimensions so this can be read into Xarray with dimensional labels, so you could explicitly label dimensions as well as individual rows/columns. Then you could support HDF5 and Zarr as well as many other source formats without having to copy the data into a standardized file format. I have a gist on this here. |
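As a rough illustration of the external-index idea, the sketch below builds a small JSON description of where the array bytes live inside an existing `.npy` file and then reads the data using only that description. The index layout (`file`, `offset`, `dtype`, `shape`, `order`, `dims`) is invented for this example; it is not kerchunk's reference format, not Zarr metadata, and not the content of the gist mentioned above.

```python
import json
import tempfile
from pathlib import Path

import numpy as np
from numpy.lib import format as npy_format


def index_npy(path, dims):
    """Describe the raw data region of an .npy file (hypothetical schema)."""
    with open(path, "rb") as fp:
        version = npy_format.read_magic(fp)
        if version == (1, 0):
            shape, fortran_order, dtype = npy_format.read_array_header_1_0(fp)
        else:
            shape, fortran_order, dtype = npy_format.read_array_header_2_0(fp)
        offset = fp.tell()  # the array bytes start right after the header
    return {
        "file": Path(path).name,
        "offset": offset,
        "dtype": dtype.str,
        "shape": list(shape),
        "order": "F" if fortran_order else "C",
        "dims": dims,  # dimension labels, e.g. for use with Xarray
    }


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "recording.npy"  # arbitrary example file name
    original = np.arange(24, dtype="<i4").reshape(6, 4)
    np.save(source, original)

    # Round-trip the index through JSON, as an external sidecar would be.
    entry = json.loads(json.dumps(index_npy(source, ["time", "channel"])))

    # Read the bytes directly, without any .npy-specific parsing.
    view = np.memmap(
        Path(tmp) / entry["file"],
        dtype=entry["dtype"],
        mode="r",
        offset=entry["offset"],
        shape=tuple(entry["shape"]),
        order=entry["order"],
    )
    recovered = np.array(view)
    del view  # release the mapping before the directory is removed
```

The source file is never copied or rewritten; only the small JSON entry is new.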
@bendichter this issue should be closed really -- there is a pull request but we need more examples |
moved to pull request 1614 after Copenhagen meeting |
|
agreed with @Remi-Gau @CPernet while I understand your motivation to "move forward" with solving this issue and directing attention to the PR that was discussed in the meeting in Copenhagen 2023, I would personally (and in my role as a BIDS maintainer) prefer if we kept the fruitful discussion in this issue open until all its aspects are resolved. |
Sure - although editing the PR seems more fruitful... |
Agreed.
|
updated the repo and added an example (with the issue it is for data we do not support yet) @effigies |
At the BIDS-ComputationalModels meeting, it became pretty clear that a wide range of applications require (or would at least benefit considerably from) the ability to read in generic n-dimensional arrays from a binary file. There are at least two major questions that should be discussed here, and then we should move to draft a PR modifying the BIDS-Raw specification:
What file format should we use? This should be something generic enough that it can be easily read on all common platforms and languages. The main proposals that came up at the meeting were for numpy (.npy) or HDF5 containers (.h5). While numpy is technically a Python format, it's sufficiently simple and well-supported that there appear to be available libraries for the major languages. Please suggest other potential solutions.
How and where should we represent associated metadata? The generic file format (and naming conventions, etc.) will eventually be described in the BIDS-Raw spec, alongside all of the other valid formats (.tsv, nifti, etc.). But some applications are likely to require fairly specific interpretations of the data contained in the file. There appears to be some convergence on the notion of representing the relevant metadata in relevant sections of the BIDS-Derivatives spec (or current BEPs), i.e., every major use case would describe how data in the binary array format should be interpreted when loaded. We could also associate suffixes with use cases, so that a tool like PyBIDS can automatically detect which rules/interpretations to apply at load time. But if there are other proposals (e.g., a single document describing all use cases), we can discuss that here.
I'm probably forgetting/overlooking other relevant aspects of the discussion; feel free to add to this. Tagging everyone who expressed interest, or who I think might be interested: @JohnGriffiths @maedoc @effigies @yarikoptic @satra.