
Schemas? #24

Closed
ivirshup opened this issue Aug 18, 2022 · 8 comments

Comments

@ivirshup
Member

A number of our "elements" will be data structures defined in different libraries with specific contents. How do we define this programmatically?

This has come up many times, so I'm opening this issue to start collecting thoughts and possible solutions.

Tools / approaches

Requirements

I think we would need something fairly extensible. We use a number of classes, including pd.DataFrame, ad.AnnData, and xr.DataArray; ideally we should be able to express a schema for all of them within a single framework.

@giovp
Member

giovp commented Aug 24, 2022

this is super useful info @ivirshup !

trying to put things together at a very high level: do you think what we'd want is something like the spatial-image package, but built using xarray-schema (for pydantic support)? At least for images and labels, it looks relatively straightforward to do.

@ivirshup
Member Author

I think so.

I think the real challenge is that we want to be able to validate xarray, anndata, and json objects (at least) within the same framework. We want to be able to say things like "the 'regions' from this json are the categorical values of a column in this AnnData's obs". I think pydantic will be flexible enough to let us specify this.

I think we'll also want to validate existing objects and not define as many classes as spatial-image does.
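That kind of cross-object constraint is easy to sketch in plain Python. A minimal sketch, assuming anndata's `obs` is a pandas DataFrame (it is) and using a hypothetical `validate_regions` helper name:

```python
import pandas as pd

def validate_regions(regions: list, obs: pd.DataFrame, column: str) -> None:
    """Check that every 'regions' value from the json side appears among
    the categorical values of the given obs column."""
    categories = set(obs[column].cat.categories)
    missing = [r for r in regions if r not in categories]
    if missing:
        raise ValueError(f"regions not found in obs[{column!r}]: {missing}")

obs = pd.DataFrame({"region": pd.Categorical(["a", "a", "b"])})
validate_regions(["a", "b"], obs, "region")  # passes silently
```

In a pydantic-based framework this check would live inside a validator rather than a free function, but the logic would be the same.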

@ivirshup
Member Author

ivirshup commented Oct 5, 2022

Starting to hit some very real limitations of json-schema, and finding what logic would need to be covered by pydantic. The biggest thing is that json-schema doesn't really do any sort of dynamic constraints, everything must be statically defined.

For example, I cannot say that the arrays at properties "a" and "b" must be the same length. I could maybe get around this with something like "must be one of: (len(a) == 1 and len(b) == 1), (len(a) == 2 and len(b) == 2)". But this isn't great.

Beyond that, I wouldn't be able to say: "the value of property 'c' must be a property of a sibling object".
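Both constraints are trivial to state as plain Python, which is the kind of dynamic check a pydantic validator could carry (a hypothetical sketch, not an agreed-on API):

```python
def validate_doc(doc: dict) -> None:
    """Two dynamic constraints that json-schema cannot express directly."""
    # the arrays at "a" and "b" must be the same length
    if len(doc["a"]) != len(doc["b"]):
        raise ValueError("'a' and 'b' must have the same length")
    # the value of "c" must be a property of the sibling object
    if doc["c"] not in doc["sibling"]:
        raise ValueError(f"{doc['c']!r} is not a property of 'sibling'")

validate_doc({"a": [1, 2], "b": [3, 4], "c": "x", "sibling": {"x": 0}})  # ok
```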

@giovp
Member

giovp commented Oct 10, 2022

some more thoughts after taking an in-depth look for the IO

Assumptions

1. Types of Elements in SpatialData
as discussed with @LucaMarconato and @ivirshup this is the Type hierarchy we'd probably like to have (copied from design doc):

- Image `type: Image`
- Regions `type: Union[Labels, Shapes]`
    - Labels `type: Labels`
    - Shapes `type: Shapes` (base type)
        - Polygons `type: Polygons`
        - Circles `type: Circles`
        - Squares `type: Squares`
- Points `type: Points`
- Tables `type: Tables`

2. Native PyData types v. new spatialdata classes
The spatialdata Elements do not have to be new classes (which would anyway be thin wrappers of e.g. xarray, anndata etc.) but could just be common pydata objects (again xarray, anndata, geopandas). Here I'll briefly outline the pro's and con's of using native pydata types:

  • pro: users don't need to learn a new object and can instead stay with what they are already familiar with.
  • con: it's unclear how much flexibility we'd need downstream, and having our own classes would give us more room for certain implementations.

However, what we really need right now from a data type that represents an element is:

  • a place to store metadata (generic key-value pairs)
  • a place to store specific spatialdata classes, such as Transformations and CoordinateSystems.

It seems that we can cover both points by simply saving the above objects in e.g. xarray.DataArray.attrs, adata.uns, geopandas.GeoDataFrame.attrs. Therefore, I'd suggest to first implement Elements as native data types (I think we already kind of agreed on this but didn't settle on a decision).
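As a concrete sketch of that suggestion (pandas `.attrs` is used here since geopandas inherits it; the metadata keys are made up for illustration):

```python
import pandas as pd

# Store element metadata directly on the native object's attrs dict
# instead of wrapping it in a new class.
df = pd.DataFrame({"x": [0.0, 1.0], "y": [2.0, 3.0]})
df.attrs["transform"] = {"type": "scale", "scale": [2.0, 2.0]}
df.attrs["coordinate_system"] = "global"
```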

User interaction with Elements

Users can instantiate spatialdata objects in multiple ways:

  • by initializing an object explicitly passing e.g. numpy arrays, metadata etc to SpatialData(...).
    • if we don't want users to pass numpy arrays to e.g. SpatialData(images={"image1":...}), we'd still need to provide them with some conversion method (aka parsing)
  • by reading in pre-processing outputs (e.g. spaceranger) -> this is being worked on in https://github.com/scverse/spatialdata-io . Part of this work could also be done with parsing methods.
  • by reading in zarr containers that were saved following spatialdata specs, or that were saved following ome-ngff specs. This is where a schema to validate against is useful.

For all the above cases, and assuming that we want to directly use python native types, I think that before having a working IO, we first want a schema to validate and parse Elements (the pydantic docs have an interesting discussion on the topic: https://pydantic-docs.helpmanual.io/usage/models/#data-conversion ).
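A parsing method in that spirit might look like the following (a hypothetical helper, mirroring pydantic's data-conversion behavior; the canonical axis order is an assumption taken from the type hierarchy above):

```python
import numpy as np

def parse_image(data) -> np.ndarray:
    """Coerce user input (nested lists, arrays, ...) into a canonical
    (c, y, x) float array."""
    arr = np.asarray(data, dtype=np.float64)
    if arr.ndim == 2:  # grayscale input: add a channel axis
        arr = arr[np.newaxis, ...]
    if arr.ndim != 3:
        raise ValueError("expected a 2D or 3D image")
    return arr

parse_image([[0, 1], [2, 3]]).shape  # (1, 2, 2)
```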

Discussion

After taking another look at the resources pointed to above, it turns out we can't use one strategy to do both, yet we can mix and match them to achieve a quite powerful outcome. The approach would be the following:

The result looks a tad verbose yet not complicated.

```python
import numpy as np
import xarray as xr
from dataclasses import dataclass
from typing import Literal, Tuple
from xarray_dataclasses import AsDataArray, Data
from xarray_schema import DataArraySchema

Labels_t = np.int_
Image_t = np.float_
X, X_t = "x", Literal["x"]
Y, Y_t = "y", Literal["y"]
C, C_t = "c", Literal["c"]

Labels_s = DataArraySchema(dtype=Labels_t, dims=(X, Y))
Image_s = DataArraySchema(dtype=Image_t, dims=(C, X, Y))


@dataclass
class Labels2D(AsDataArray):
    """2D Label as DataArray."""

    data: Data[Tuple[X_t, Y_t], Labels_t]


@dataclass
class Image2D(AsDataArray):
    """2D Image as DataArray."""

    data: Data[Tuple[C_t, X_t, Y_t], Image_t]


labels_arr = np.ones((200, 300), dtype=np.int_)
da = xr.DataArray(labels_arr, dims=["x", "y"])
da.equals(Labels2D.new(labels_arr))  # True
Labels_s.validate(da)  # passes

image_arr = np.ones((3, 200, 300), dtype=np.float_)
da = xr.DataArray(image_arr, dims=["c", "x", "y"])
da.equals(Image2D.new(image_arr))  # True
Image_s.validate(da)  # passes

image_arr = np.ones((3, 200, 300), dtype=np.int_)  # dtype is different
da = xr.DataArray(image_arr, dims=["c", "x", "y"])
da.equals(Image2D.new(image_arr))  # True, cast to np.float_
Image_s.validate(da)  # raises SchemaError: dtype is np.int_, should be np.float_
```

this would also give us a way to directly parse multiscale images, as done by spatial-image, which is something quite powerful to have from the start imho.

for Tables, we'd have to hack our own solution for now (and maybe use this as a chance to have a minimal implementation in anndata, @ivirshup?)
for polygons, we could use pandera with geopandas, although we'd have to hack our own method for parsing.

curious to hear what you think.

@LucaMarconato
Member

LucaMarconato commented Oct 10, 2022

Thanks for the information, the approach and code look good to me. If I understood correctly, you propose to drop the classes Element, Image, Labels, etc. I'm finally convinced that we should not expose them to the user and should use native types instead, but I think we should keep them for the internals.

For example, it's handy to have types that can be checked and used to act differently depending on the spatial element type. Also, in this way there is only one type Image, which could be a `(c, y, x)`, a `(z, y, x)`, or a `(y, x)` image etc., and which internally contains the information regarding the axes (stored inside `dims` in xarray).

I would propose for the moment to expose only native types to the users, and to store all the information inside them (axes, transformations, coordinate systems), but for convenience to keep thin wrappers around them in the code, at least for now. So for instance .transformations, which lives inside Elements, would now be a property that gets the information from the underlying native types, instead of being an attribute which decouples the image information from the transformations.
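A minimal sketch of that property-based wrapper, with pandas standing in for the native type (for real images it would be an xarray.DataArray) and illustrative names only:

```python
import pandas as pd

class Image:
    """Thin internal wrapper; not exposed to users."""

    def __init__(self, data: pd.DataFrame):
        # the wrapped native object holds ALL the information
        self.data = data

    @property
    def transformations(self):
        # read from the native object's attrs on every access, so the
        # element and its transformations cannot go out of sync
        return self.data.attrs.get("transformations", {})

df = pd.DataFrame({"x": [0.0]})
df.attrs["transformations"] = {"global": "identity"}
img = Image(df)
img.transformations  # {'global': 'identity'}
```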

@ivirshup
Member Author

Short reply for now: @giovp, there's been some previous discussion about this on the zarr gitter.

@giovp
Member

giovp commented Oct 10, 2022

thanks for both answers, I think I'll go ahead and do a first implementation of this. @LucaMarconato I think your comment is worth further discussion so I'll create a new separate issue.

@LucaMarconato
Member

Closing, schemas have been implemented.
