
Schemas? #24

Closed
ivirshup opened this issue Aug 18, 2022 · 8 comments

Comments

@ivirshup
Member

A number of our "elements" will be data structures defined in different libraries with specific contents. How do we define this programmatically?

This has come up many times, so I'm opening this issue to start collecting thoughts and possible solutions.

Tools / approaches

Requirements

I think we would need something fairly extensible. We use a number of classes, including pd.DataFrame, ad.AnnData, and xr.DataArray; ideally we should be able to express a schema for all of them within a single framework.

@giovp
Member

giovp commented Aug 24, 2022

this is super useful info @ivirshup !

trying to put things together at a very high level: do you think what we'd want is something like the spatial-image package, but built using xarray-schema (for pydantic support)? At least for images and labels, it looks relatively straightforward to do.

@ivirshup
Member Author

I think so.

I think the real challenge is that we want to be able to validate xarray, anndata, and json objects (at least) within the same framework. We want to be able to say things like "the 'regions' from this json are the categorical values of a column in this AnnData's obs". I think pydantic will be flexible enough to let us specify this.

I think we'll also want to validate existing objects and not define as many classes as spatial-image does.
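That kind of cross-object constraint is easy to sketch in plain Python. A minimal sketch, assuming anndata's `obs` is a pandas DataFrame (it is) and using a hypothetical `validate_regions` helper name:

```python
import pandas as pd

def validate_regions(regions: list, obs: pd.DataFrame, column: str) -> None:
    """Check that every 'regions' value from the json side appears among
    the categorical values of the given obs column."""
    categories = set(obs[column].cat.categories)
    missing = [r for r in regions if r not in categories]
    if missing:
        raise ValueError(f"regions not found in obs[{column!r}]: {missing}")

obs = pd.DataFrame({"region": pd.Categorical(["a", "a", "b"])})
validate_regions(["a", "b"], obs, "region")  # passes silently
```

In a pydantic-based framework this check would live inside a validator rather than a free function, but the logic would be the same.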

@ivirshup
Member Author

ivirshup commented Oct 5, 2022

Starting to hit some very real limitations of json-schema, and finding what logic would need to be covered by pydantic. The biggest thing is that json-schema doesn't really do any sort of dynamic constraints, everything must be statically defined.

For example, I cannot say that the arrays at properties "a" and "b" must be the same length. I could maybe get around this with something like "must be one of: (len(a) == 1 and len(b) == 1), (len(a) == 2 and len(b) == 2)". But this isn't great.

Beyond that, I wouldn't be able to say: "the value of property 'c' must be a property of a sibling object".
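Both constraints are trivial to state as plain Python, which is the kind of dynamic check a pydantic validator could carry (a hypothetical sketch, not an agreed-on API):

```python
def validate_doc(doc: dict) -> None:
    """Two dynamic constraints that json-schema cannot express directly."""
    # the arrays at "a" and "b" must be the same length
    if len(doc["a"]) != len(doc["b"]):
        raise ValueError("'a' and 'b' must have the same length")
    # the value of "c" must be a property of the sibling object
    if doc["c"] not in doc["sibling"]:
        raise ValueError(f"{doc['c']!r} is not a property of 'sibling'")

validate_doc({"a": [1, 2], "b": [3, 4], "c": "x", "sibling": {"x": 0}})  # ok
```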

@giovp
Member

giovp commented Oct 10, 2022

some more thoughts after taking an in-depth look for the IO

Assumptions

1. Types of Elements in SpatialData
as discussed with @LucaMarconato and @ivirshup this is the Type hierarchy we'd probably like to have (copied from design doc):

- Image `type: Image`
- Regions `type: Union[Labels, Shapes]`
    - Labels `type: Labels`
    - Shapes `type: Shapes` (base type)
        - Polygons `type: Polygons`
        - Circles `type: Circles`
        - Squares `type: Squares`
- Points `type: Points`
- Tables `type: Tables`

2. Native PyData types v. new spatialdata classes
The spatialdata Elements do not have to be new classes (which would anyway be thin wrappers of e.g. xarray, anndata etc.) but could just be common pydata objects (again xarray, anndata, geopandas). Here I'll briefly outline the pro's and con's of using native pydata types:

  • pro: users don't need to learn a new object and can instead stay with what they are already familiar with.
  • con: it's unclear how much flexibility we'd need downstream, and having our own classes would give us more room for certain implementations.

However, what we really need right now from a data type that represents an element is:

  • a place to store metadata (generic key-value pairs)
  • a place to store specific spatialdata classes, such as Transformations and CoordinateSystems.

It seems that we can cover both points by simply saving the above objects in e.g. xarray.DataArray.attrs, adata.uns, geopandas.GeoDataFrame.attrs. Therefore, I'd suggest to first implement Elements as native data types (I think we already kind of agreed on this but didn't settle on a decision).
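As a concrete sketch of that suggestion (pandas `.attrs` is used here since geopandas inherits it; the metadata keys are made up for illustration):

```python
import pandas as pd

# Store element metadata directly on the native object's attrs dict
# instead of wrapping it in a new class.
df = pd.DataFrame({"x": [0.0, 1.0], "y": [2.0, 3.0]})
df.attrs["transform"] = {"type": "scale", "scale": [2.0, 2.0]}
df.attrs["coordinate_system"] = "global"
```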

User interaction with Elements

Users can instantiate spatialdata objects in multiple ways:

  • by initializing an object explicitly passing e.g. numpy arrays, metadata etc to SpatialData(...).
    • if we don't want users to pass numpy arrays to e.g. SpatialData(images={"image1":...}), we'd still need to provide them with some conversion method (aka parsing)
  • by reading in pre-processing outputs (e.g. spaceranger) -> this is being worked on in https://github.com/scverse/spatialdata-io . Part of this work could also be done with parsing methods.
  • by reading in zarr containers that were saved following spatialdata specs, or that were saved following ome-ngff specs. This is where a schema to validate against is useful.

For all the above cases, and assuming that we want to directly use python native types, I think that before having a working IO, we first want a schema to validate and parse Elements (the pydantic docs have an interesting discussion on the topic: https://pydantic-docs.helpmanual.io/usage/models/#data-conversion ).
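A parsing method in that spirit might look like the following (a hypothetical helper, mirroring pydantic's data-conversion behavior; the canonical axis order is an assumption taken from the type hierarchy above):

```python
import numpy as np

def parse_image(data) -> np.ndarray:
    """Coerce user input (nested lists, arrays, ...) into a canonical
    (c, y, x) float array."""
    arr = np.asarray(data, dtype=np.float64)
    if arr.ndim == 2:  # grayscale input: add a channel axis
        arr = arr[np.newaxis, ...]
    if arr.ndim != 3:
        raise ValueError("expected a 2D or 3D image")
    return arr

parse_image([[0, 1], [2, 3]]).shape  # (1, 2, 2)
```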

Discussion

After taking another look at the resources pointed to above, it turns out we can't use one strategy to do both, yet we can mix and match them to achieve a quite powerful outcome. The approach would be the following:

The result looks a tad verbose yet not complicated.

```python
import numpy as np
import xarray as xr
from dataclasses import dataclass
from typing import Literal, Tuple
from xarray_dataclasses import AsDataArray, Data
from xarray_schema import DataArraySchema

Labels_t = np.int_
Image_t = np.float_
X, X_t = "x", Literal["x"]
Y, Y_t = "y", Literal["y"]
C, C_t = "c", Literal["c"]

Labels_s = DataArraySchema(dtype=Labels_t, dims=(X, Y))
Image_s = DataArraySchema(dtype=Image_t, dims=(C, X, Y))


@dataclass
class Labels2D(AsDataArray):
    """2D Label as DataArray."""

    data: Data[Tuple[X_t, Y_t], Labels_t]


@dataclass
class Image2D(AsDataArray):
    """2D Image as DataArray."""

    data: Data[Tuple[C_t, X_t, Y_t], Image_t]


labels_arr = np.ones((200, 300), dtype=np.int_)
da = xr.DataArray(labels_arr, dims=["x", "y"])
da.equals(Labels2D.new(labels_arr))  # True
Labels_s.validate(da)  # passes

image_arr = np.ones((3, 200, 300), dtype=np.float_)
da = xr.DataArray(image_arr, dims=["c", "x", "y"])
da.equals(Image2D.new(image_arr))  # True
Image_s.validate(da)  # passes

image_arr = np.ones((3, 200, 300), dtype=np.int_)  # dtype is different
da = xr.DataArray(image_arr, dims=["c", "x", "y"])
da.equals(Image2D.new(image_arr))  # True, cast to np.float_
Image_s.validate(da)  # raises SchemaError: dtype is np.int_, should be np.float_
```

this would also give us a way to directly parse multiscale images, as done by spatial-image, which is something quite powerful to have from the start imho.

for Tables, we'd have to hack our own solution for now (and maybe use this as a chance to have a minimal implementation in anndata, @ivirshup?)
for polygons, we could use pandera with geopandas, although we'd have to hack our own method for parsing.

curious to hear what you think.

@LucaMarconato
Member

LucaMarconato commented Oct 10, 2022

Thanks for the information, the approach and code look good to me. If I understood correctly, you propose to drop the classes Element, Image, Labels, etc. I'm finally convinced that we should not expose them to the user and should use native types instead, but I think we should keep them for the internals.

For example, it's handy to have types that can be checked and used to act differently depending on the spatial element type. Also, in this way there is only one type Image, which could be a `(c, y, x)`, a `(z, y, x)`, or a `(y, x)` image etc., and which internally contains the information regarding the axes (stored inside `dims` in xarray).

I would propose for the moment to expose only native types to the users, and to store all the information inside them (axes, transformations, coordinate systems), but for convenience to keep thin wrappers around them in the code, at least for now. So for instance .transformations, which lives inside Elements, would now be a property that gets the information from the underlying native types, instead of being an attribute which decouples the image information from the transformations.
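A minimal sketch of that property-based wrapper, with pandas standing in for the native type (for real images it would be an xarray.DataArray) and illustrative names only:

```python
import pandas as pd

class Image:
    """Thin internal wrapper; not exposed to users."""

    def __init__(self, data: pd.DataFrame):
        # the wrapped native object holds ALL the information
        self.data = data

    @property
    def transformations(self):
        # read from the native object's attrs on every access, so the
        # element and its transformations cannot go out of sync
        return self.data.attrs.get("transformations", {})

df = pd.DataFrame({"x": [0.0]})
df.attrs["transformations"] = {"global": "identity"}
img = Image(df)
img.transformations  # {'global': 'identity'}
```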

@ivirshup
Member Author

Short reply for now: @giovp, there's been some previous discussion about this on the zarr gitter.

@giovp
Member

giovp commented Oct 10, 2022

thanks for both answers, I think I'll go ahead and do a first implementation of this. @LucaMarconato I think your comment is worth further discussion so I'll create a new separate issue.

@LucaMarconato
Member

Closing, schemas have been implemented.
