Implement ImageStack zarr saving and loading #419
I dug into zarr, xarray, and dask a bit. Learnings from playing with zarr:
Store and load an ImageStack using Zarr and s3fs

```python
# spacetx profile holds my aws credentials for our bucket, granting LIST access.
# as an alternative, we can simply use a public bucket whose root has READ and
# LIST access.
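# note: for a fully public bucket one could instead use
# s3fs.S3FileSystem(anon=True) and skip the boto3 session entirely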
import boto3
import s3fs
import zarr
import xarray
# start a session, configuring s3 credentials
session = boto3.Session(profile_name="spacetx")
# create an s3 file system
s3 = s3fs.S3FileSystem(
    anon=False, session=session, client_kwargs=dict(region_name="us-west-1")
)
# confirm root list access works
s3.ls('spacetx.starfish.data.public')
# create a MutableMapping to use as a store
store = s3fs.S3Map(
    root='spacetx.starfish.data.public/zarr/baristaseq.zarr', s3=s3, check=False
)
# set the store as a zarr target and look at the contents
root = zarr.group(store=store)
root.tree()
# get a piece of the array
array = root['fov_000']['primary']
subset = array[1, 1, 1]
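# zarr fetches only the chunks needed for the requested slice, so this reads a
# small piece of the array from s3 rather than the whole thing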
# try to open this directly from xarray
# generates KeyError: 'Zarr object is missing the attribute `_ARRAY_DIMENSIONS`,
# which is required for xarray to determine variable dimensions.'
stack = xarray.open_zarr(store=store, group='/fov_000')
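# a possible workaround (a sketch, not something tried here): xarray's zarr backend
# looks for an `_ARRAY_DIMENSIONS` attribute listing each array's dimension names,
# so tagging the raw zarr array should let open_zarr work, assuming write access
# to the store:
#     root['fov_000']['primary'].attrs['_ARRAY_DIMENSIONS'] = ['r', 'c', 'z', 'y', 'x']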
# try to save something with xarray and re-load it
import starfish.data
# get baristaseq data
data = starfish.data.BaristaSeq()
primary = data['fov_000']['primary']
aux = data['fov_000']['nuclei']
# create an xarray dataset
# note that zarr requires that each dimension have coordinates; by default
# our xarrays do not use coordinates for x and y, so we will need to create
# them.
for dataarray in (aux.xarray, primary.xarray):
    for dimension in ('x', 'y'):
        dataarray.coords[dimension] = range(dataarray.sizes[dimension])
# note that this will work, but will broadcast the nuclei; this should be changed,
# and is addressed below where I explore how to construct an experiment using zarr
ds = xarray.Dataset({'primary': primary.xarray, 'nuclei': aux.xarray})
# define a new s3 store, with create=True set
store = s3fs.S3Map(
    root='spacetx.starfish.data.public/zarr/baristaseq.xarray.zarr',
    s3=s3, check=True, create=True
)
# write the zarr archive
ds.to_zarr(store=store, mode='w')
# get it back and demonstrate that it worked.
from_s3 = xarray.open_zarr(store=store)
# How would we go about implementing an experiment? xarray.open_zarr()
# appears to create dask arrays by default.
type(from_s3['nuclei'].data)
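# as an aside, dask records how the zarr chunks map onto the array; inspecting
# .chunks is cheap and does not trigger any reads from s3
from_s3['nuclei'].data.chunks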
# these can be loaded with the following command, but xarray appears to
# lazily load by default. Note that this is not the same as lazily downloading;
# loading from s3 takes time and I haven't figured out local caching (yet!)
from_s3['nuclei'].data.load()
```

This should mean that we should be able to load up an experiment as a zarr dataset and work with the lazily-loaded dask arrays.

Drafting an experiment using zarr backing

What would we prefer the structure to look like?
This should be adequate to produce a bunch of xarray objects and store them to zarr.

I think this will mean each object is going to need its own dataset (instead of storing everything in a single dataset, which broadcasts the arrays against each other, as noted above).

We've got the pieces in-memory still, let's build it!

```python
# get the data we need and convert them to datasets
codebook_ds = data.codebook.to_dataset(name='codebook')
primary_ds = xarray.Dataset({'primary': primary.xarray})
nuclei_ds = xarray.Dataset({'nuclei': aux.xarray})
# make a new store that we're going to build hierarchically
root = s3fs.S3Map(
    root='spacetx.starfish.data.public/zarr/baristaseq.experiment.zarr',
    s3=s3, check=True, create=True
)
zarr_root_view = zarr.group(store=root)
zarr_root_view.clear(); root.clear() # just in case there's something in the way
# create a function that writes a dataset into a new group, avoiding collisions of
# the extra keys that xarray creates.
def write_dataset(
    root: zarr.hierarchy.Group, group_name: str, dataset: xarray.Dataset,
    s3: s3fs.S3FileSystem
):
    group = root.create_group(group_name)
    group_url = ''.join((group.store.root, group.name))
    s3map = s3fs.S3Map(group_url, s3=s3, check=True)
    dataset.to_zarr(store=s3map, mode='w')
    return group, s3map
# write the codebook
write_dataset(zarr_root_view, group_name='codebook', dataset=codebook_ds, s3=s3)
print('wrote codebook')
# write two different fovs
for n in range(2):
    fov_group = zarr_root_view.create_group(f'fov_00{n}')
    write_dataset(fov_group, group_name='primary', dataset=primary_ds, s3=s3)
    print(f'wrote primary fov {n}')
    write_dataset(fov_group, group_name='nuclei', dataset=nuclei_ds, s3=s3)
    print(f'wrote nuclei fov {n}')
```

The resulting zarr archive looks like:
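Roughly, based on the writes above (a sketch; the exact tree() output will also show the coordinate arrays xarray adds to each dataset):

```python
# a quick way to inspect what was written; the expected hierarchy is:
#   /codebook
#   /fov_000/primary   /fov_000/nuclei
#   /fov_001/primary   /fov_001/nuclei
# with each leaf group holding the data variable plus the c/r/z/y/x coordinates
zarr_root_view.tree()
```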
The c, r, z, y, x data structures are an xarray hack to store the dimensions until zarr incorporates named dimensions. This thing I built should be adequate to build an experiment around. I don't foresee any issues with working with these data.

```python
import starfish
dataset = xarray.open_zarr(root, group='fov_000/primary')
# unpack the DataArray from the Dataset
dataarray = dataset['primary']
# make an ImageStack - the easiest way I could find to make one (which is almost
# certainly not the best way) was to craft a synthetic stack and overwrite the data
# with the xarray
num_round = dataarray.sizes['r']
num_ch = dataarray.sizes['c']
num_z = dataarray.sizes['z']
tile_height = dataarray.sizes['y']
tile_width = dataarray.sizes['x']
empty = starfish.ImageStack.synthetic_stack(
    num_round=num_round,
    num_ch=num_ch,
    num_z=num_z,
    tile_height=tile_height,
    tile_width=tile_width
)
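# (sketch, not part of the original exploration) the "overwrite the data" step
# would then be something along the lines of copying the loaded values into the
# synthetic stack's backing xarray; whether that backing array can be written in
# place like this is an assumption:
#     empty.xarray.values[:] = dataarray.values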
# this doesn't work because of the multiprocessing stuff, and I can't actually figure
# out how to make it work...
imagestack = starfish.ImageStack(dataarray)
```

Well, I guess we'll play with apply over xarray to test if things work.

```python
from skimage.filters import gaussian
from itertools import product
for (r, c, z) in product(*(range(v) for v in dataarray.shape[:3])):
    gaussian(dataarray[r, c, z])
```

That doesn't blow up, so I think the way that xarray works with numpy-based functions like skimage's filters should be fine.
That's incredible!
A few notes from looking at this today:
We want to begin testing the zarr library more earnestly. A good start would be to write zarr.save and zarr.load functions for ImageStack. Eventually this should hook into the experiment API.
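As a concrete starting point, here is a rough sketch of such helpers built directly on the exploration above. The function names are hypothetical, and round-tripping through an xarray.Dataset is just one possible design, not the starfish API:

```python
import xarray
import starfish


def imagestack_to_zarr(stack: starfish.ImageStack, store) -> None:
    """Sketch: write an ImageStack's backing xarray to a zarr store."""
    dataarray = stack.xarray.copy()
    # as noted above, zarr round-tripping via xarray wants coordinates on every
    # dimension, so add x/y coordinates if the stack doesn't carry them
    for dim in ('x', 'y'):
        if dim not in dataarray.coords:
            dataarray.coords[dim] = range(dataarray.sizes[dim])
    xarray.Dataset({'primary': dataarray}).to_zarr(store=store, mode='w')


def imagestack_from_zarr(store) -> xarray.DataArray:
    """Sketch: load the data back into memory; turning the DataArray into an
    ImageStack is the open question discussed in the comments above."""
    return xarray.open_zarr(store)['primary'].load()
```

Hooking this into the experiment API would then follow the hierarchical layout sketched above: the codebook at the root and one group per field of view.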