DOC: zarr spec v3: adds optional dimensions and the "netZDF" format #276
Conversation
Thanks @shoyer for writing this up. I had been using ZCDF as the acronym for this feature set in zarr but also don't have strong feelings about the name at this point. FWIW, @alimanfoo, @rabernat, @mrocklin, and I have had a few offline exchanges on the subject (see https://github.com/jhamman/zcdf_spec for a Zarr subspec that describes what xarray has currently implemented). Without speaking for anyone else, I think there is growing excitement about the concept of a Zarr+NetCDF data model. |
Great to see this. I like the design, it's simple and intuitive. Couple of questions...
|
I find myself agreeing with this. I think that ideally Zarr would remain low-level and that we would provide extra conventions/subspecs on top of it. My understanding is that one reason for HDF's current inertia is that it had a bunch of special features glommed onto it by various user communities. If we can separate these out that might be nice for long term maintenance. |
No, for more sophisticated metadata needs we can simply use a subset of CF Conventions. These are pretty standard for applications that handle netCDF files, like xarray.
This is a good question. Mostly it comes down to having the specs all in one place, so it's obvious where to find this convention for everyone implementing the zarr spec. Dimensions are broadly useful enough for self-described data that I think people in many fields would find them useful. In particular, I would hate to see separate communities develop their own specs for dimensions, just because they didn't think to search for zarr netcdf. I also think there are probably use cases for defining named dimensions on some but not all arrays and/or axes. This wouldn't make sense as part of the "netzdf" spec which xarray would require. Finally, incorporating |
Have not thought about this too deeply yet. So this is just a very rough idea that we can discard if it doesn't fit, but what if we added ZCDF as a library within the org that built off Zarr? This would address some of the discoverability, and feature creep concerns raised thus far. It would also eliminate the need for things like checks as to whether the NetCDF spec is implemented by specific objects. |
If dimensions are applicable enough across other domains then I'm happy to relax my objections. I think that it would be useful to hear from people like @jakirkham (who comes from imaging) if this sort of change would be more useful or burdensome for his domain.
|
FWIW we could add this as a "NetZDF spec" (or whatever name) alongside the existing storage specs in the specs section of the Zarr docs, should be pretty visible (in fact might be more visible as it would get its own heading in the toc tree). I would be keen to minimise disruption for existing users and implementers if possible. A spec version change would imply some inconvenience, even if relatively small, as existing data would need to be migrated. |
My understanding is that this proposal is entirely compatible (both backwards and forwards) with existing data.
|
Indeed. I considered naming this "v2.1" based on semantic versioning until I saw that the zarr spec only uses integer versions. The only backwards incompatibility it introduces is the addition of new optional metadata fields. I would hope that any existing software would simply ignore these, rather than assume that no fields could ever be introduced in the future. |
Yes, this makes some amount of sense. The main downside of incorporating these changes into zarr proper is that for netCDF compatibility we really want the guarantee of consistent dimension sizes between arrays. This would require a small amount of refactoring and additional complexity to achieve within the Zarr library. |
docs/spec/v2.rst (Outdated)

    (any non-``null`` value), MUST also be defined on an ancestor group. Dimension
    sizes can be overwritten in descendent groups, but the size of each named
    dimensions on an array MUST match the size of that dimension on the most direct
    ancestor group on which it is defined.
I think I'm going to change this, to make group dimensions and consistency entirely optional:
If dimensions are set in a group, their sizes on all contained arrays
are REQUIRED to be consistent. Dimension sizes can be overwritten
in descendant groups, but the size of each named dimension (any
non-`null` value) on an array MUST match the size of that dimension
on the most direct ancestor group on which it is defined.
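To make the proposed metadata concrete, here is a minimal sketch (not normative): it assumes the optional keys land in `.zarray`/`.zgroup` as drafted in this PR, and that the group-level form is a name-to-size mapping, which is what the consistency rule above implies. The array names, shapes, and version number are purely illustrative.

```python
import json

# Illustrative .zarray document with the proposed optional "dimensions" key:
# one name (or null/None) per axis, matching the array's shape.
zarray = {
    "zarr_format": 2,          # whether this needs a major version bump is debated below
    "shape": [365, 180, 360],
    "chunks": [30, 90, 90],
    "dtype": "<f4",
    "compressor": None,
    "fill_value": None,
    "order": "C",
    "filters": None,
    "dimensions": ["time", "lat", "lon"],   # proposed optional key
}

# Illustrative .zgroup document; the name-to-size mapping is an assumption
# drawn from the consistency rule quoted above.
zgroup = {
    "zarr_format": 2,
    "dimensions": {"time": 365, "lat": 180, "lon": 360},
}

print(json.dumps(zarray, indent=4))
print(json.dumps(zgroup, indent=4))
```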
From a neuroscience data perspective, this gets pretty complicated pretty fast if one wants to be general. Please see NWB as an example. Personally wouldn't want Zarr to take any of this on. It would be better handled in a library on top of Zarr. Note NWB currently is built on top of HDF5, but it would be reasonable to consider an NWB spec on top of Zarr. Can't speak to the Earth Sciences or what people in this field want out of Zarr. If dimensionality is the one thing desired, maybe this is ok. If there are 5 or 6 more things needed in the pipe, maybe having a library built on top of Zarr would be better. Would be good if some people could answer these sorts of questions. |
Sorry for the multiple posts. GitHub is having some sort of issue that is affecting me. |
Some miscellaneous thoughts about dimensionality in our field since Matt asked.

Naming dimensions has certainly come up before. Here is one example (https://ukoethe.github.io/vigra/doc-release/vigranumpy/index.html#axistag-reference) and another (https://ukoethe.github.io/vigra/doc-release/vigranumpy/index.html#vigra.VigraArray.withAxes). Also some discussion about axes names in this comment and below (imageio/imageio#263). Scientists definitely like having this sort of feature, as it helps them keep track of what something means and is useful if the order ever needs to change for an operation. So this sort of use case benefits from the proposal.

The other thing that typically comes to mind when discussing dimensions, which I don't think has come up thus far, is units. It's pretty useful to know something is in ms, mV, or other relevant units. Libraries like quantities (http://python-quantities.readthedocs.io/en/latest/user/tutorial.html) or pint (http://pint.readthedocs.io/en/latest/) are useful for tracking units and combining them sensibly. This could be an addition to the proposal or perhaps something to add to a downstream data format library.

For tracking time, in some cases we have timestamps. These supplant the need for dimensions or units and often parallel other information (e.g. snapshots of other data at a particular time). This could use existing things like structured arrays.

However, when applying some basic machine learning, dimensions pretty quickly become a mess, especially if various different kinds of dimensions get mixed together. For example, PCA is a pretty common technique to perform in a variety of cases to find the biggest contributors to some distribution. The units of this sort of thing are frequently strange and difficult to think about. This case probably either needs a different proposal, or users have to work with existing metadata information to make this work for their use case. |
Also cc'ing @ambrosejcarr and @Cadair to add some domain breadth to this discussion.
|
From the point of view of the Unidata netcdf group, named dimensions (shared dimensions in netcdf parlance) are essential for managing coordinate variables. So the netcdf extension to Zarr (or possibly TileDB) will include only named dimensions, and anonymous dimensions will probably be suppressed. We went around about this with the HDF5 group long ago. |
Speaking as a user in the genomics domain, I certainly would find this feature useful; it is common to have multiple arrays sharing dimensions. I don't have broad experience in other domains but expect this feature to be generally very useful. So I am very supportive and would like to give this as much prominence as possible. My reasons for leaning towards use of .zattrs are not meant in any way to diminish the importance or broad applicability of this feature; they are based purely on technical considerations, basically on what is easiest to implement and provides the least disruption for existing users and implementers.
Yes in theory, although unfortunately it's not quite that simple in practice. I'll try to unpack some details about versioning and change management in Zarr. Btw, I'm not suggesting this is ideal or the best solution; thinking ahead about possible changes and managing compatibility is quite hard.

This proposal adds a new feature (dimensions) to the Zarr storage spec. This feature is optional in two senses. First, it is optional in that it specifies elements that do not need to be present in the array or group metadata. Second, it is optional for the implementation, i.e., an implementation can ignore these elements if present in the metadata and still be conformant with the storage spec.

When I wrote the v2 storage spec and was thinking about mechanisms for managing change, for better or worse, I did not allow any mechanisms for adding optional features to the spec. There is no concept of minor spec versions, only major versions (a single version number). The only way the spec can change is via a major version change, which implies a break in compatibility. If the current implementation finds anything other than "2" as the spec version number in array metadata, it raises an exception. The spec does not define any concept of optional features or leave open the possibility of introducing them (other than via a major version change).

If I had been farsighted, I might have seen this coming, and I might have defined a notion of optional features, which could be introduced via a minor version increment to the spec, and implementations could include some flexibility in matching the format minor version number when deciding whether they can read some data or not. To be fair, I did give this some thought, although I couldn't have articulated it very well at the time. In the end I decided on a simple versioning policy, I think partly because it was simple to articulate and implement, and also because I thought that the user attributes (.zattrs) always provided a means for optional functionality to be layered on. Also, the separation between .zattrs and core metadata (.zarray, .zgroup) is nice in a way because it makes it very clear where the line is between optional and required features. I.e., to be conformant, a minimal implementation has to understand everything in .zarray, and can ignore everything in .zattrs.

So given all this, there are at least three options for how to introduce this feature. In the following, by "old code" I mean the current version of the zarr package (which does not implement this feature), by "old data" I mean data created using old code, by "new code" I mean the next version of the zarr package (which does implement this feature), and by "new data" I mean data created using new code.

Option 1: Use .zattrs, write this as a separate spec. Full compatibility: old code will be able to read new data, and new code will be able to read old data.

Option 2: Use .zarray/.zgroup, incorporate into the storage spec, major version bump (v3). Old code will not be able to read new data. New code can read old data if the data is migrated (which just requires replacing the version number in metadata files) or if new code is allowed to read both v2 and v3.

Option 3: Use .zarray/.zgroup, incorporate into the storage spec but leave the spec version unchanged (v2). Full compatibility: old code will be able to read new data, and new code will be able to read old data. However, this is potentially confusing because the spec has changed but the spec version number hasn't.
Hence I lean towards option 1 because it has maximum compatibility and seems simplest/least disruptive. But very happy to discuss. And I’m sure there are other options too that I haven’t thought of. |
@jakirkham - both of these issues arise in geoscience use cases. We handle them by providing metadata that follows CF conventions and then using xarray to decode the metadata into appropriate types (like a `numpy.datetime64`). This works today with zarr + xarray and doesn't require any changes to the spec. |
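For readers unfamiliar with that workflow, here is a rough sketch of the mechanism described above. The variable names and values are made up; the point is that a CF-style `units` attribute is decoded by xarray on read, with no change to the zarr spec involved.

```python
import numpy as np
import xarray as xr

# A time coordinate stored as plain integers plus a CF-style "units" attribute.
ds = xr.Dataset(
    {"precip": (("time",), np.array([0.1, 0.0, 2.3]))},
    coords={"time": ("time", [0, 1, 2], {"units": "days since 2000-01-01"})},
)
ds.to_zarr("example.zarr", mode="w")

# On read, xarray decodes the CF metadata into numpy.datetime64 values.
roundtripped = xr.open_zarr("example.zarr")
print(roundtripped["time"].values)  # 2000-01-01, 2000-01-02, 2000-01-03
```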
Yes, this is why I want named dimensions. I don't think we need explicit support for multi-dimensional coordinate variables in Zarr. NetCDF doesn't have explicit support for coordinates at all, and we get along just fine using CF conventions. HDF5's dimension scales include coordinate values as well as dimension names. But in my opinion this is unnecessarily complex. Simple conventions like treating variables with the same name as a dimension as supplying coordinate values are sufficient. |
It might be useful for the discussion if I explain what xarray currently does to add dimension support to zarr stores. This might help clarify some of the tradeoffs between option 1 (just use .zattrs) vs. options 2/3. When xarray creates a zarr store from an xarray dataset, it always creates a group. On each array in the group, it creates an attribute called `_ARRAY_DIMENSIONS` listing the dimension names for that array. When the group is loaded, xarray checks for the presence of this key in the attributes of each array. If it is missing, it raises an error: xarray can't read arbitrary zarr stores, only those that match its de-facto spec. If it finds the attribute, it uses its values as the array's dimension names. |
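A bare-bones sketch of that de-facto convention using plain zarr calls; the attribute name `_ARRAY_DIMENSIONS` is the one xarray uses, while the array names and shapes here are made up for illustration.

```python
import zarr

group = zarr.group()  # in-memory store, just for illustration

# Writer side: each array records its dimension names in an attribute.
t2m = group.create_dataset("t2m", shape=(365, 180, 360), chunks=(30, 90, 90), dtype="f4")
t2m.attrs["_ARRAY_DIMENSIONS"] = ["time", "lat", "lon"]

# Reader side: refuse stores whose arrays do not carry the attribute.
for name, array in group.arrays():
    dims = array.attrs.get("_ARRAY_DIMENSIONS")
    if dims is None:
        raise KeyError(f"array {name!r} has no dimension names; not readable this way")
    print(name, dict(zip(dims, array.shape)))
```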
WRT things like units, you need to be very careful about embedding |
Remember that the same dimension may be used in multiple variables, so it is |
Just wanted to briefly chime in that I'm very happy to see NetCDF folks active in this discussion. |
BTW, one common example of multidimensional coordinate variables is when |
@alimanfoo I suspect this will not be the last change we will want in the zarr spec (e.g., to support complex numbers), so it might make sense to "bite the bullet" now with a major version number increase, and at the same time establish a clear policy on forwards compatibility for Zarr. I am confident that Zarr will only become more popular in the future! I would suggest looking at the forwards compatibility policies from HDF5 and protocol buffers for inspiration:
Going forward, I would suggest the following forward and backwards compatibility policies, which we can add to the spec:
Doing a little more searching, it appears that such a convention is actually widely used. E.g., see "Versioning Conventions" for ASDF and this page on "Designing File Formats" (it calls what I've described "major/minor" versioning). |
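To illustrate the major/minor convention being suggested here: purely an illustrative sketch, since no such policy exists in the current spec, and the function name is made up.

```python
def can_read(format_version: str, supported_major: int = 2) -> bool:
    """Accept any minor revision of a supported major version.

    Under the suggested policy, minor versions only add optional features,
    so an implementation that understands major version N can safely read
    data labelled N, N.1, N.2, ... and must refuse a different major version.
    """
    major, _, _minor = str(format_version).partition(".")
    return int(major) == supported_major

assert can_read("2")        # existing data with an integer version
assert can_read("2.1")      # data using new optional features
assert not can_read("3")    # incompatible major revision
```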
In the netCDF spec I find "coordinates" only mentioned for netCDF4. I see that internally, netCDF4 maintains its own notion of "dimension scales" that support more than 1 dimension (beyond what HDF5 supports), which it appears to use for variables if their first dimension matches the name of the variable. Note that this definition of a multi-dimensional coordinate does not even match the typical interpretation of "coordinates" by software that reads netCDF files. Per CF Conventions, "coordinates" are defined merely by being referenced by a "coordinates" attribute on another variable, without any requirements on their name matching a dimension. I'm getting a little off track here, but I think the conclusions we can draw from the netCDF4 experience for Zarr are:
|
I stand corrected. One discussion of coordinate variables is here as a convention. The multi-dimensional issue is complex because it is most often used with what amounts |
Thanks @mrocklin for ccing me onto this thread. By way of introduction, I'm contributing to a package to process image-based transcriptomics (biological microscopy) data, which is easiest to think about as a bunch of two dimensional tiles that sample the z-dimension of the tissue being imaged for multiple color channels and time points. The tiles can be stuck together as a 5-d tensor. We're defining a simple file format for this data which right now is a JSON schema that specifies how to piece the tensor together from a collection of TIFF files stored on a disk or file system, and stores relevant metadata about each tile. This looks a lot like zarr (and Z5), and we're excited by the prospect of contributing to an existing ecosystem instead of rolling our own. One concern we had was that the zarr format could be harder for someone to pick up and use outside the zarr/python ecosystem (we expect to have many R users, for example). The addition of column names is a really nice step towards greater expressivity and self-description, so we like this change quite a lot. Off-topic, we have some more general questions about the zarr format. Is there someone who would be a good contact? tagging @dganguli @freeman-lab @ttung |
Thanks @ambrosejcarr for getting in touch and sharing your use case, very interesting. Regarding usage from R, I've raised a separate issue (#279), would be great to explore ways of making that possible. If you have general questions please feel free to ask via the issue tracker, no problem if an issue is a bag of different questions. |
My current thought for revising this proposal is that at the very least the "netZDF" or "Zarr-netCDF format" (no new jargon) should be separated into another doc. However, I still think that optional dimension names as described here (on arrays and groups) could have a place in the Zarr format itself -- assuming we figure out backwards/forwards compatibility concerns. As for additional conventions themselves (referenced in my draft netZDF spec), I'm thinking that it could make sense to define a group-level |
@ambrosejcarr could you clarify what you mean by "column names" here? These changes currently only refer to dimension names. In principle, column names could either be represented by different arrays (with different names) or "column" dimension with an additional array providing coordinate labels. In netCDF files, the typical convention is to use a variable with the same name as its sole dimension: the values of that 1D array provide labels for each point. |
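A small sketch of that coordinate-variable convention in zarr terms (illustrative only; it reuses the `_ARRAY_DIMENSIONS` attribute described earlier in the thread, and all names and values are invented):

```python
import numpy as np
import zarr

group = zarr.group()

# Data variable with two named dimensions.
temperature = group.create_dataset("temperature", shape=(3, 4), dtype="f8")
temperature.attrs["_ARRAY_DIMENSIONS"] = ["time", "station"]

# Coordinate variable: a 1D array whose name equals its sole dimension,
# supplying the labels for that dimension (the netCDF convention above).
time = group.create_dataset("time", shape=(3,), dtype="i8")
time[:] = np.array([10, 20, 30])
time.attrs["_ARRAY_DIMENSIONS"] = ["time"]
```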
@shoyer Sorry, I was imprecise in my language, dimension names are what I meant to refer to. If I'm interpreting dimension names properly, they are a vector of values equal in length to the number of dimensions in the array. For our data, that would be |
I can't speak for @constantinpape , who is really the architect of z5py while I've just been tinkering round the edges, but I think it's unlikely that a new zarr spec which created a significant divergence from N5 would make it into z5py any time soon - at v1.0, it only supports a subset of the fairly simple (and flexible) N5 spec. @axtimwalde would need to confirm but I would guess that N5 is likely to stay minimal, with individual applications defining their own schema for handling more complicated attributes. This shouldn't discourage you from making these changes, of course - dimension labelling is certainly something which would be helpful to us in the neuroimaging field where the tool set uncomfortably straddles the fortran-ordered java/imageJ and C-ordered python realms, and enforcing dimension parity between raw volumes, labels and so on might be nice too. The combination of these factors makes me personally lean towards
where there is a well-defined way (and probably library) to represent a netCDF-like schema in zarr but it purely uses zarr as a backend rather than being built into the format itself. Naming collisions with user-defined attributes should obviously be avoided; in N5 this is done simply by keeping all of the application-specific optional fields in a dict within the attributes JSON with a name like
P.S. In general I'm also in favour of supporting minor version changes; serialising it as a string is fine but it's quite nice to have at least the option of returning a |
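To make the collision-avoidance idea concrete, here is a tiny sketch of namespacing convention metadata under a single reserved attribute key. The key name "netzdf" is just a placeholder, not an agreed name.

```python
import zarr

group = zarr.group()
t2m = group.create_dataset("t2m", shape=(10, 20), dtype="f4")

# Convention-specific metadata lives under one reserved key in .zattrs...
t2m.attrs["netzdf"] = {"dimensions": ["time", "lat"]}

# ...so ordinary user attributes cannot collide with it.
t2m.attrs["long_name"] = "2 m air temperature"
```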
Regarding z5py: I am close to a 1.0 release (I only need some time and to get my hands on a Mac and a Windows machine to make sure of cross-platform compatibility). As @clbarnes says, it is unlikely that we will support a zarr format that diverges too far from the N5 specification any time soon (or at least I won't have time to implement this any time soon). |
Thanks @clbarnes for the comments, much appreciated. Thanks also for raising the point about naming collisions, I was just thinking about that. It would be good to figure out some best practices for how metadata conventions are defined and used. I'm more than happy for that conversation to continue here, but I've also surfaced that as a separate issue to give it some prominence: #280. |
The following keys MAY be present:

    dimensions
        A list of string or ``null`` values providing optional names for each ofthe
s/ofthe/of the/
Do we want to make a decision about what to do here? Maybe @WardF and the Unidata team might want to weigh in with their thoughts about the future direction of netCDF / zarr integration? |
Absolutely. Reviewing now for comment :) |
It sounds like the consensus was leaning towards making this a separate "Zarr Dimensions Spec", which we could feature in the Zarr docs. I'd be happy with that, with my main concern being that we should have an unambiguous way to avoid name collisions between spec attributes and arbitrary user-defined attributes. |
I would agree with making this a separate spec as part of the Zarr documentation. We are working towards adding a new format in the core C library; the underlying file storage adheres to the Zarr spec (stored either in object storage or on local disk), with a corresponding data model represented by the intersection of the NetCDF enhanced data model and the Zarr data model. Zarr already supports the "big" features that are primarily used in the enhanced file format: compression and chunking. (As an aside, I was in a meeting yesterday and the topic of alternatives to the enhanced file format in an HPC setting was raised. I have no idea if Zarr is a potential solution for HPC in a POSIX environment yet, but if it is, there seems to be a lot of interest.)

In regards to Zarr/NetCDF integration in the future: I think it would be great to have a defined spec or convention that would be adhered to by those writing netCDF model data using Zarr, as well as those writing model data using the core C library/NetCDF API. Our thinking has been that by adhering to the functionality provided by Zarr, it would be a matter of documenting this new format so that anybody writing the model data via Zarr would have the information they need to ensure compatibility. From a technical standpoint, we need to suss out the architecture of the netCDF-C dispatch layer to map into whatever API we end up using or, as appears more and more likely, implement ourselves. While we'd love to use

So in a nutshell, we are very interested in the potential integration of Zarr/NetCDF, and will participate as best we can on the Zarr side of things in addition to focusing on the NetCDF library development. I'll also try to broadcast our plans a little more clearly; I shared them at a Pangeo workshop last month(?), and at a couple of other meetings since then. I'll see about writing a blog post or finding some other signal-boosting platform on our side. Thanks to everybody who tagged me, and then emailed me, and then emailed me again to make sure I stayed focused on weighing in on this; it shouldn't be necessary now that the 2018 workshop is over :). |
Thanks for the update @WardF. A pure C library that implements the Zarr spec would be a huge win. If this were open sourced with a sufficiently permissive license, I think the community would take to that very quickly; adding various language bindings on top of it. Some discussion along those lines came up in issue ( https://github.com/zarr-developers/zarr/issues/285 ). FWIW most of the people I know (myself included) use Zarr/N5 for large image data processing in an HPC setting. So this is not only a reasonable path for others using HPC, but one that is being actively exercised. Sticking with a pure C base implementation makes perfect sense. The z5 route was mainly interesting from the standpoint of getting something simple up and running quickly and then evolving based on community feedback. Starting with a pure C implementation is definitely the more principled way of doing this and has all the benefits that a pure C implementation comes with. (On a side note: The libxnd container library seems very useful for this effort.) Glad to hear we will be seeing you around more. Looking forward to reading the blog post. Thanks again for the update. :) |
With the caveat that I haven't spoken with @DennisHeimbigner about it, I wonder if having a stand-alone Zarr C library would make the most sense, as opposed to baking it into the rest of the C library. It feels like a good idea insofar as it would be more likely to enjoy broader adoption. |
+1_000_000 Especially if the specification strategy is going to be a base zarr spec with different projects defining their own attribute schemas on top of it, having a base C implementation that everyone can use with whatever language they choose to build the reference implementation of their schema in, would be of immense value compared to every schema having to work from the ground up. So long as the base library could be distributed in such a way that this modular approach is easy, that is! I haven't worked with C myself but I understand the packaging and dependency management ecosystems are non-trivial. I have worked a little with rust, where such concerns are very much trivial. It also seems that generating C bindings at build time is not difficult, and much of the logic could probably be lifted straight from https://github.com/aschampion/rust-n5 . |
It's great to see so much support for the C library. I am all in on this. Who would be a good candidate to actually develop such a library? This is important enough to enough [funded] projects that we are not necessarily talking about a volunteer effort. |
Also +1 on a stand-alone Zarr C library. I'll advocate that such a library should live in the zarr-developers org and that UNIDATA devs should take an active role in helping build/maintain it. I suspect we'll find other interested groups that can help with the development/maintenance burden. I recognize that is somewhat orthogonal to the business as usual development strategy but I think the long term benefits to UNIDATA and other Zarr communities will pay off. |
The ownership and maintenance of any project certainly merits discussion. It would be fantastic if there were an externally-owned project that we could contribute to; there's no inherent reason it should be a Unidata product. There would be questions about how much responsibility we can take for non-Unidata products in the long term, but in the immediate future it is work that needs to be done regardless of who owns the product. It is certainly something we would be interested in collaborating on. Any thoughts, @DennisHeimbigner ? |
Some stream-of-consciousness thoughts about a stand-alone library, and what we need from external libraries adopted for use in netCDF.
No point to these other than they are considerations from the netCDF team, moving forward with any solution. |
From @jhamman
From @WardF
From @WardF
From my perspective, if this project is likely to be mostly used by the earth science community then I have no concern with Unidata owning the code. They've proven to be effective stewards in the past and have a longer time horizon than typical OSS groups. My only concern here would be that people outside of Unidata would need an easy way to also maintain some level of control and activity. This is hard. However, if this project is likely to be used by groups outside of the earth science community, like @jakirkham 's imaging groups or @ambrosejcarr 's genomics/bio groups, then I would suggest that the project be legally owned by a stewardship organization like NumFOCUS, but provide permissive commit/repo-ownership rights to at least one representative of each of the scientific domains, so that no one gets blocked by the inaction of others should they choose to take a different path in the future. Regarding @WardF 's other points, I don't anticipate permissive OSS licenses or cross-platform build objectives to be contentious issues in this crowd. |
This discussion is great! Thanks all for sharing your thoughts. Am thinking we should probably segue into a different thread to discuss the pure C implementation of Zarr. Have raised issue ( https://github.com/zarr-developers/zarr/issues/317 ) for this purpose. Sorry for not doing this sooner. Look forward to discussing with you over there. |
Should have added in regards to @mrocklin's point about organizational structure. We have been discussing this broadly w.r.t. Zarr in issue ( https://github.com/zarr-developers/zarr/issues/291 ) and more detailed discussions on specific points have been broken out from there. |
Checking back in here -- I'm prototyping a zarr-based library for our imaging project now, and we've already swapped our sequencing-based output data to zarr. It seems very likely that we will use zarr in production for at least one aspect of our project. I'll chime in on #291 and will likely have a few features to suggest the next few months. 👍 |
I suggest we close this as it is now quite stale. The v3 spec conversation, along with ZEP 4 and NCZarr have surpassed this design. Feel free to reopen if there is more to do here. |
xref #167
For ease of review, this is currently written by modifying docs/spec/v2.rst, but this would of course actually be submitted as a separate v3 file. This does not yet include any changes to the zarr reference implementation, which would need to grow at least:

- Array.dimensions
- Group.dimensions
- Group.resize, for simultaneously resizing multiple arrays in a group which share the same dimension (conflicting dimension sizes are forbidden by the spec); a rough sketch follows below
- Group.netzdf, for indicating whether a group satisfies the netzdf spec or not.

Note: I do like "netzdf" but I'm open to less punny alternatives :).
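A rough, hypothetical sketch of what a dimension-aware Group.resize could look like, written as a standalone helper. It assumes dimension names are recoverable per array (here read from the `_ARRAY_DIMENSIONS` attribute discussed in the thread) and is not part of any existing implementation.

```python
import zarr

def resize_dimension(group: zarr.Group, dim: str, new_size: int) -> None:
    """Resize every member array along any axis named `dim` (hypothetical helper)."""
    for _name, array in group.arrays():
        dims = array.attrs.get("_ARRAY_DIMENSIONS")
        if not dims:
            continue  # arrays without named dimensions are left untouched
        new_shape = tuple(
            new_size if d == dim else size
            for d, size in zip(dims, array.shape)
        )
        if new_shape != array.shape:
            array.resize(new_shape)

# Example: grow the shared "time" dimension of every array in a group.
group = zarr.group()
a = group.create_dataset("a", shape=(10, 5), dtype="f4")
a.attrs["_ARRAY_DIMENSIONS"] = ["time", "x"]
b = group.create_dataset("b", shape=(10,), dtype="f4")
b.attrs["_ARRAY_DIMENSIONS"] = ["time"]
resize_dimension(group, "time", 20)
print(a.shape, b.shape)  # (20, 5) (20,)
```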