
Implement ImageStack zarr saving and loading #419

Open

ambrosejcarr opened this issue Aug 14, 2018 · 3 comments

@ambrosejcarr (Member)

We want to begin testing the zarr library more earnestly. A good start would be to write a zarr.save and zarr.load function for ImageStack. Eventually this should hook into the experiment API.
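
A rough sketch of what those functions could look like (hypothetical names; this just round-trips the stack's backing xarray through a zarr store):

import xarray

def save_zarr(stack_xarray: xarray.DataArray, store, name: str = "primary") -> None:
    # zarr requires coordinates for every dimension; fill in defaults where missing
    for dim in stack_xarray.dims:
        if dim not in stack_xarray.coords:
            stack_xarray = stack_xarray.assign_coords({dim: range(stack_xarray.sizes[dim])})
    stack_xarray.to_dataset(name=name).to_zarr(store, mode="w")

def load_zarr(store, name: str = "primary") -> xarray.DataArray:
    # returns a lazily-loaded (dask-backed) DataArray; call .load() to materialize
    return xarray.open_zarr(store)[name]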

@ambrosejcarr (Member, Author) commented Jan 2, 2019

I dug into zarr, xarray, and dask a bit.

Learnings from playing with zarr

  1. If we want an "image back-end" where we use TIFF with zarr, that is possible; I think it would go into numcodecs.
  2. If we want listable directories, we'd need to implement a new storage class that tracks data using our json structure (see the sketch after this list). That's possible, and a lot of new stores have gone in over the holidays, including a bunch of database back-ends that show quite clearly how we'd craft our own. However, I wonder if the easiest solution is to make our buckets public (below I just feed in my credentials, which produces a working solution without writing new xarray code).
  3. I was able to save and load xarrays backed by s3-stored, zarr-formatted data. This gets us dask backing for free. I wanted to test dask inside ImageStack, but I didn't have enough time left to figure out how to make dask work with our multiprocessing arrays, so I settled for testing skimage filter functions on the dask chunks.
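
For item 2, a minimal read-only sketch of such a store: zarr (v2) accepts any MutableMapping, so a manifest-backed store is a small class. The manifest format and the fetch callable here are hypothetical, not an existing starfish structure.

from collections.abc import MutableMapping

class ManifestStore(MutableMapping):
    """Hypothetical zarr store that resolves keys through a JSON-derived
    manifest instead of listing a bucket. Read-only sketch."""

    def __init__(self, manifest, fetch):
        # manifest maps zarr keys (e.g. 'fov_000/primary/.zarray') to URLs
        self._manifest = manifest
        self._fetch = fetch  # callable taking a URL and returning bytes

    def __getitem__(self, key):
        return self._fetch(self._manifest[key])  # raises KeyError for unknown keys

    def __setitem__(self, key, value):
        raise NotImplementedError("read-only store")

    def __delitem__(self, key):
        raise NotImplementedError("read-only store")

    def __iter__(self):
        return iter(self._manifest)

    def __len__(self):
        return len(self._manifest)

# usage would be along the lines of zarr.open_group(store=ManifestStore(manifest, fetch))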

Store and load an ImageStack using Zarr and s3fs

# spacetx profile holds my aws credentials for our bucket, granting LIST access.
# as an alternative, we can simply use a public bucket whose root has READ and 
# LIST access.

import boto3
import s3fs
import zarr
import xarray

# start a session, configuring s3 credentials
session = boto3.Session(profile_name="spacetx")

# create an s3 file system
s3 = s3fs.S3FileSystem(
    anon=False, session=session, client_kwargs=dict(region_name="us-west-1")
)

# confirm root list access works
s3.ls('spacetx.starfish.data.public')

# create a MutableMapping to use as a store
store = s3fs.S3Map(
    root='spacetx.starfish.data.public/zarr/baristaseq.zarr', s3=s3, check=False
)

# set the store as a zarr target and look at the contents
root = zarr.group(store=store)
root.tree()

# get a piece of the array
array = root['fov_000']['primary']
subset = array[1, 1, 1]

# try to open this directly from xarray

# generates KeyError: 'Zarr object is missing the attribute `_ARRAY_DIMENSIONS`, 
# which is required for xarray to determine variable dimensions.'
stack = xarray.open_zarr(store=store, group='/fov_000')
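
# (a possible workaround, untested here: write the attribute xarray expects onto
# each array, e.g.
#     root['fov_000']['primary'].attrs['_ARRAY_DIMENSIONS'] = ['r', 'c', 'z', 'y', 'x']
# and retry open_zarr; xarray writes this attribute itself when saving, as below)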

# try to save something with xarray and re-load it

import starfish.data

# get baristaseq data
data = starfish.data.BaristaSeq()
primary = data['fov_000']['primary']
aux = data['fov_000']['nuclei']

# create an xarray dataset
# note that zarr requires that each dimension have coordinates; by default
# our xarrays do not use coordinates for x and y, so we will need to create
# them.
for dataarray in (aux.xarray, primary.xarray):
    for dimension in ('x', 'y'):
        dataarray.coords[dimension] = range(dataarray.sizes[dimension])

# note that this will work, but will broadcast the nuclei; this should be changed,
# which I do below when I explore how to construct an experiment using zarr
ds = xarray.Dataset({'primary': primary.xarray, 'nuclei': aux.xarray})

# define a new s3 store, with create=True set
store = s3fs.S3Map(
    root='spacetx.starfish.data.public/zarr/baristaseq.xarray.zarr',
    s3=s3, check=True, create=True
)

# write the zarr archive
ds.to_zarr(store=store, mode='w')

# get it back and demonstrate that it worked.
from_s3 = xarray.open_zarr(store=store)

# How would we go about implementing an experiment? xarray.open_zarr() 
# appears to create dask arrays by default.
type(from_s3['nuclei'].data)

# the data can be pulled into memory with the following call; xarray lazily
# loads by default. Note that lazy loading is not the same as local caching;
# loading from s3 takes time and I haven't figured out local caching (yet!)
from_s3['nuclei'].data.load()
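
One possible route to local caching (a sketch I haven't run): zarr's LRUStoreCache wraps any store so that repeated chunk reads hit memory instead of s3.

cached_store = zarr.LRUStoreCache(store, max_size=2**30)  # cache up to ~1 GiB
from_cache = xarray.open_zarr(store=cached_store)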

This means we should be able to load an experiment as a zarr dataset and let dask
do our memory management. We have a 2x baristaseq FOV in the form of the
baristaseq.xarray.zarr dataset. Our codebook is already an xarray, so...

Drafting an experiment using zarr backing

What would we prefer the structure to look like?

/
    codebook
    fov_001
        primary
        nuclei
        dots
        ...
        aux_image_N
    ...
    fov_00N

This should be adequate to produce a bunch of xarray objects and store them to zarr. The
requirements seem to be:

  • all objects are stored as DataSets
  • all objects must have coordinates for each dimension
  • all objects in a dataset must have the same dimensions

I think this means each object will need its own dataset (instead of storing
FOVs together), since we can't guarantee that a nuclei object isn't a projection.

We've still got the pieces in memory, so let's build it!

# get the data we need and convert them to datasets
codebook_ds = data.codebook.to_dataset(name='codebook')
primary_ds = xarray.Dataset({'primary': primary.xarray})
nuclei_ds = xarray.Dataset({'nuclei': aux.xarray})

# make a new store that we're going to build hierarchically
root = s3fs.S3Map(
    root='spacetx.starfish.data.public/zarr/baristaseq.experiment.zarr',
    s3=s3, check=True, create=True
)
zarr_root_view = zarr.group(store=root)
# just in case there's something in the way
zarr_root_view.clear()
root.clear()

# create a function that writes a dataset into a new group, avoiding collisions of 
# the extra keys that xarray creates.
def write_dataset(
    root: zarr.hierarchy.Group, group_name: str, dataset: xarray.Dataset,
    s3: s3fs.S3FileSystem
):
    group = root.create_group(group_name)
    group_url = ''.join((group.store.root, group.name))
    s3map = s3fs.S3Map(group_url, s3=s3, check=True)
    dataset.to_zarr(store=s3map, mode='w')
    return group, s3map

# write the codebook
write_dataset(zarr_root_view, group_name='codebook', dataset=codebook_ds, s3=s3)
print('wrote codebook')

# write two different fovs
for n in range(2):
    fov_group = zarr_root_view.create_group(f'fov_00{n}')
    write_dataset(fov_group, group_name='primary', dataset=primary_ds, s3=s3)
    print(f'wrote primary fov {n}')
    write_dataset(fov_group, group_name='nuclei', dataset=nuclei_ds, s3=s3)
    print(f'wrote nuclei fov {n}')

The resulting zarr archive looks like:

 ├── codebook
 │   ├── c (4,) int64
 │   ├── codebook (7, 4, 3) uint8
 │   ├── r (3,) int64
 │   └── target (7,) |S8
 ├── fov_000
 │   ├── nuclei
 │   │   ├── c (1,) int64
 │   │   ├── nuclei (1, 1, 17, 1000, 800) float32
 │   │   ├── r (1,) int64
 │   │   ├── x (800,) int64
 │   │   ├── y (1000,) int64
 │   │   └── z (17,) int64
 │   └── primary
 │       ├── c (4,) int64
 │       ├── primary (3, 4, 17, 1000, 800) float32
 │       ├── r (3,) int64
 │       ├── x (800,) int64
 │       ├── y (1000,) int64
 │       └── z (17,) int64
 └── fov_001
     ├── nuclei
     │   ├── c (1,) int64
     │   ├── nuclei (1, 1, 17, 1000, 800) float32
     │   ├── r (1,) int64
     │   ├── x (800,) int64
     │   ├── y (1000,) int64
     │   └── z (17,) int64
     └── primary
         ├── c (4,) int64
         ├── primary (3, 4, 17, 1000, 800) float32
         ├── r (3,) int64
         ├── x (800,) int64
         ├── y (1000,) int64
         └── z (17,) int64

The c, r, z, y, x arrays are an xarray hack that stores the dimension names until zarr incorporates named dimensions.
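
The convention is visible in the raw store. A quick way to inspect it (a sketch; the key path is inferred from the tree above):

import json
# the store maps keys to raw bytes, so the attribute file can be read directly
attrs = json.loads(root['fov_000/primary/primary/.zattrs'])
print(attrs)  # expect something like {"_ARRAY_DIMENSIONS": ["r", "c", "z", "y", "x"]}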

This thing I built should be adequate to build an experiment around. Note that we're going to get dask for free, so we can play around a bit with the data and see what happens (does anything break in the BaristaSeq workflow?)

I don't foresee any issues with working with these data, provided that an
ImageStack can be created. With this in mind, I'll try loading an xarray from
zarr, test out a basic workflow on BaristaSeq, and see if the dask backing
breaks anything.

import starfish

dataset = xarray.open_zarr(root, group='fov_000/primary')
# unpack the DataArray from the Dataset
dataarray = dataset['primary']

# make an ImageStack - the easiest way I could find (almost certainly not the 
# best way) was to craft a synthetic stack and overwrite the data with the 
# xarray
num_round = dataarray.sizes['r']
num_ch = dataarray.sizes['c']
num_z = dataarray.sizes['z']
tile_height = dataarray.sizes['y']
tile_width = dataarray.sizes['x']
empty = starfish.ImageStack.synthetic_stack(
    num_round=num_round,
    num_ch=num_ch,
    num_z=num_z,
    tile_height=tile_height,
    tile_width=tile_width
)

# this doesn't work because of the multiprocessing stuff, and I can't actually figure 
# out how to make it work...
imagestack = starfish.ImageStack(dataarray)
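
One conceivable workaround, untested and assuming ImageStack exposes a writable xarray buffer (which the multiprocessing backing may not allow):

# copy the loaded values into the synthetic stack's buffer (hypothetical)
empty.xarray.values[:] = dataarray.values
imagestack = empty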

Well, I guess we'll play with applying filters over the xarray to test if things work.

from skimage.filters import gaussian
from itertools import product
for (r, c, z) in product(*(range(v) for v in dataarray.shape[:3])):
    gaussian(dataarray[r, c, z])

That doesn't blow up, so I think the way that xarray works with dask should be transparent to our workflows. In addition, we should be able to rewrite things if we wanted to take advantage of dask.delayed or dask's out-of-core operations.
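
For example (a sketch, assuming dataarray.data is the dask array that open_zarr created and gaussian is the skimage filter imported above), the eager loop could be pushed into the dask graph:

# map the filter over dask chunks instead of looping eagerly
smoothed = dataarray.data.map_blocks(gaussian)
result = smoothed.compute()  # triggers the lazy computation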

@neuromusic (Collaborator)

That's incredible!

@joshmoore (Member)

A few notes from looking at this today:

  • I have a minio setup for docker-ized testing which we can add to the integration tests if there's interest. I've asked the same on the zarr side: S3 anonymous Zarr data examples... zarr-developers/zarr-python#385
  • I looked into using nested chunks rather than dotted chunks, to no avail: Nested cloud storage zarr-developers/zarr-python#395
  • Use of group= on xarray.Dataset.to_zarr also works well.
  • Regarding "if we want listable directories, we'd need to implement a new storage class that tracks data using our json structure": I wonder if this isn't an issue of the s3fs implementation. It seems that with a well-known bucket, group, and dataset (unlistable/example/my_data.zarr), it should be possible to open .zarray and find everything needed (see the sketch below). That was at least my hope for both SPTX manifests and OME-XML.
