Multiscale convention #125
Comments
👀 I'm very excited to see this moving forward! cc @TomNicholas (DataTree) and @freeman-lab / @katamartin (ndpyramid/maps) |
Status update after a handful of weeks talking with @aurghs, @TomNicholas, @alexamici, and more recently @malmans2 about this:
|
@joshmoore Can you help me understand a bit more the status of multiscale proposals and support? I think it might help if we can distinguish between Python APIs and data formats. I took a quick look at xarray-datatree and I didn't see anything specifically related to multiscale support. Additionally, my understanding is that on top of the base zarr v2 data model, xarray adds only 2 things:
Both of these features seem to be basically completely orthogonal to multiscale. The xarray-multiscale library seems to be a purely in-memory thing with no specific data format. As far as the actual multiscale representation on disk in terms of zarr attributes, it sounds like we are talking about the format proposed here: There was a lot of discussion on that issue so I'm not clear on whether there is an actual final proposed format. But do I understand correctly that the current proposal does not specify downsample factors or offsets for each level? If so I think it is critical that we rectify that as otherwise we must assume e.g. 2x downsampling and zero offset in all dimensions at each level, which obviously is extremely limiting. I would propose that we rectify it as follows: Add to each element of the
Note: Rational numbers allow non-integer downsample factors to be represented without any loss of precision, but in most cases both the At a given downsample level Certainly there are other ways to specify this information, but it is critical that we decide on some way to specify it, and I think what I have proposed here is a reasonable and natural choice. Potentially for simplicity the rational number support could be skipped in a first version, and instead integers could be required. |
@jbms can you explain why explicitly enumerating downscaling factors is preferred over the more explicit approach where each dataset declares its scale and offset? |
I thought this proposal is being discussed in the context of zarr rather than OME and was not aware of a proposal to specify the offset and scale other than the OME coordinate transforms. For applications where you intend to do integer indexing rather than interpolation-based continuous indexing, it is important to be able to represent the relative scales and offsets between levels exactly. Normally when dealing with physical units you would use floating point (and it is often reasonable for that purpose since physical units are surely approximate anyway) which means you must rely on inexact floating point arithmetic to determine relative scales and offsets. I suppose in principle if you represented the offsets and scales using an exact representation like rational numbers it would solve the issue. But in general in the integer indexing case it is the downsample factors rather than the units that may be more relevant so storing the units rather than the downsample factors seems less direct, not more direct. |
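To illustrate the exactness concern above, here is a minimal sketch (not part of any proposal in this thread) that derives the relative factor between two levels once from floating-point scales and once from rational scales. The scale values 0.1 and 0.3 are made-up example pixel sizes, and both function names are hypothetical.

```python
from fractions import Fraction


def relative_factor_float(base_scale: float, level_scale: float) -> float:
    """Relative downsample factor computed with floating-point arithmetic."""
    return level_scale / base_scale


def relative_factor_exact(base_scale: Fraction, level_scale: Fraction) -> Fraction:
    """Relative downsample factor computed with exact rational arithmetic."""
    return level_scale / base_scale


# Hypothetical pixel sizes: base level 0.1 units, coarser level 0.3 units.
float_factor = relative_factor_float(0.1, 0.3)
exact_factor = relative_factor_exact(Fraction(1, 10), Fraction(3, 10))

# The float result is close to 3 but not equal to it, so it cannot be used
# directly for integer indexing; the rational result is exactly 3.
print(float_factor == 3)
print(exact_factor == 3)
```

With floats, a consumer would have to round and hope the intended factor was an integer; with rationals (or with downsample factors stored directly) no such guess is needed.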
Yes, this is what I had in mind. But if, as you note, the idea of this issue is to have a multiscale zarr model that works for xarray, OME coordinate transforms are out of scope and probably redundant, since xarray solves the coordinate specification problem by treating coordinates as data. But if this multiscale zarr entails storing explicit coordinates, it's not clear whether there's any need for special metadata describing downscaling factors. |
My own interest is in a multiscale convention/format generally, not specifically related to xarray, and in particular not tied to the use of coordinate arrays, as I think for arrays defined on regular grids, coordinate arrays are a rather indirect and inconvenient representation for |
Just chiming in to note that:
I'm also not saying that anyone should have to use xarray either. But if Zarr can describe these indexes in its metadata, Xarray should be able to parse them out and turn them into useable index objects (as it does today with dimension coordinates). We have discussed before (#122) where such index metadata conventions should live: in Zarr user attributes or in a special zarr extension? My question is whether the indexing question is separable from the multiscale convention? Or must these be addressed together? |
Thanks for your explanation @rabernat . I think the distinction between zarr user attributes or zarr core attributes is not too important --- it seems quite reasonable to use zarr user attributes, but I would still like to have a standard so that tools can interoperate. To me it doesn't make sense to define a "multiscale array" as a concept without specifying what the scales actually are. Otherwise you are just saying --- here are some arrays that represent the same data at different scales, but good luck in figuring out how they correspond. I don't see how a tool would make any use of that. So I don't think we can address multiscale without addressing these indexing issues. But on the other hand perhaps indexing can be addressed before addressing multiscale. Per the suggestion by @d-v-b that the metadata live in the per-scale array rather than the multiscale metadata attribute, we could simply move |
Agreed. Downsampling data necessarily generates a new coordinate grid for that data; consumers need to know the downsampled coordinate grid in order to meaningfully relate a downsampled image back to the original image. A specification that merely encodes "here are some arrays that all have the same dimension names" isn't of much use without encoding the coordinate grid for each image. |
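As a rough illustration of that point, the sketch below computes where the samples of a downsampled 1-D array sit in the index coordinates of the original array. It assumes block-mean downsampling by an integer factor, with each output sample placed at the centre of the block it averages. Both that convention and the function name `downsampled_coords` are assumptions made for illustration, not something specified in this thread.

```python
from fractions import Fraction


def downsampled_coords(n_base: int, factor: int) -> list[Fraction]:
    """Base-grid coordinates of each sample after block-mean downsampling.

    Assumes output sample i averages base samples [i*factor, (i+1)*factor)
    and is located at the centre of that block.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    offset = Fraction(factor - 1, 2)   # centre of the first block
    n_out = n_base // factor           # incomplete trailing block is dropped
    return [offset + factor * i for i in range(n_out)]


# An 8-sample base array downsampled by 2 and by 4.
coords_2x = downsampled_coords(8, 2)
coords_4x = downsampled_coords(8, 4)
print([str(c) for c in coords_2x])
print([str(c) for c in coords_4x])
```

Under this convention the grid is shifted by a non-zero offset as well as scaled, and a different downsampling method (for example strided subsampling) would give a different offset. A consumer cannot tell which applies unless the metadata records it.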
This is a good point and things are certainly intermingled still. Much of this issue is certainly about the interoperability at the Python level and expressing a desire to support xarray's upcoming hierarchical functionality on the Zarr side. What lessons need to be learned, etc. rather than completely specifying multiscale metadata as we're doing with OME-NGFF.
This issue is definitely for the Zarr side and independent of OME-NGFF, but I don't think we're at the stage of building the entire spec now. One outcome that I think we could shoot for is deciding whether, and if so where, that work will be taken on.
Assuming we develop an extension/convention here, one thing that occurs to me is how to balance it with the metadata in the OME-NGFF spec. Is there overlap? Conversion? ... Confusion?
It would be interesting to hear what others have to say on that front. @christophenoel? At least with the xarray api, conceivably there are some operations that would still be useful even without the metadata. |
Thanks for your comments @joshmoore and @d-v-b. Thinking about this a bit more, I think it might be best if OME can just be made to also work for the discrete indexing case (i.e. without needing to use floating point arithmetic) --- then we could just have a single multiscale spec. The discrete indexing case would just be a special case of the general multiscale array/view where it so happens that no floating point transforms are required. Here is an idea: An OME multiscale has a single associated coordinate space, which typically matches the (integer-indexed) coordinate space of the base array. Coordinate spaces are basically the same as in the current OME spec, except that their units can have arbitrary coefficients, not just powers of 10. That allows us to have coordinate spaces where we can still do useful integer indexing. For example:
For each dataset listed in I think this representation also addresses the concern by @d-v-b that each array stand on its own --- if you open /my_multiscale/s1 on its own, you will see that there is a coordinate transform to the "/my_multiscale" coordinate space and can view it under that space if you wish. |
A benefit of unifying the two in one spec would be the possibility of extracting it wholesale in the future for wider adoption. However, I note that we've now diverted this issue away from its initial purpose of capturing the ongoing integration work with xarray. Can we continue your suggestion in #125 (comment), @jbms, along with @bogovicj's work in ome/ngff#101 ? |
For information only: in the latest GeoZarr version we decided to rely on the 'historical' zoom level conventions: the de facto standard of level 0 as 256x256 pixels covering the entire world (with the default pseudo Plate Carrée non-projection), and the scale doubled on each level as per https://wiki.openstreetmap.org/wiki/Zoom_levels (see also the ArcGIS note "A brief history of zoom levels"). The group name indicates the zoom level. These conventions are simple, well supported by viewers, and very similar to the overviews mechanism in Cloud Optimized GeoTIFF (COG). I will publish a demonstration video very soon, along with an OpenLayers extension that supports GeoZarr including multiscaling. GeoZarr multiscales: https://github.com/christophenoel/geozarr-spec/blob/main/geozarr-spec.md#multiscales |
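The zoom-level convention described above can be sketched as follows, assuming the usual web-map pyramid in which level 0 is a single 256x256 image of the whole world and each further level doubles the resolution along both axes. The function names are hypothetical and are not part of GeoZarr.

```python
TILE_SIZE = 256  # pixels per side at zoom level 0


def level_shape(zoom: int) -> tuple[int, int]:
    """Pixel dimensions (height, width) of the whole-world image at a zoom level."""
    if zoom < 0:
        raise ValueError("zoom level must be non-negative")
    side = TILE_SIZE * 2 ** zoom
    return (side, side)


def downsample_factor(coarse_zoom: int, fine_zoom: int) -> int:
    """Per-axis integer factor between a finer and a coarser zoom level."""
    if coarse_zoom > fine_zoom:
        raise ValueError("coarse_zoom must not exceed fine_zoom")
    return 2 ** (fine_zoom - coarse_zoom)


shape_0 = level_shape(0)
shape_3 = level_shape(3)
factor_0_3 = downsample_factor(0, 3)
print(shape_0, shape_3, factor_0_3)
```

Because the factor between any two levels follows from the group names alone, this convention needs no per-level scale metadata, at the cost of allowing only power-of-two downsampling.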
I'll include something on discrete indexing in the multiscale section next time I edit the ome-zarr spec. Thanks for the mention! |
Hi. Below are the documentation resources I mentioned:
|
As a part of the CZI EOSS4 grant, B-Open will be working on the development of a cross-community convention for the multiscale representation. (See original use case and proposal). This work targets interoperability between the bioimaging and geospatial use cases and especially between Zarr and Xarray, where pydata/xarray#4118 proposes an extension to the Xarray library which will enable data structures like multiscales.
This issue serves as an overarching reference for the work. Tasks include:
Related use cases: