Support for transpose and moveaxis #1256

Open
John-P opened this issue Nov 9, 2022 · 8 comments
Labels
enhancement New features or improvements

Comments

@John-P

John-P commented Nov 9, 2022

I keep encountering situations where I would really like to use transpose or moveaxis, as with a numpy array. This is possible by creating a dask array from a zarr array. However, this seems like something that should be a part of zarr-python. Is there any interest in implementing this, perhaps in the same or a similar way to how it is done in dask?

EDIT: Just posting this here as the Zarr-python issues page suggests posting new feature proposals here.
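
For reference, the NumPy behaviour being asked for can be sketched as below. The array is a small in-memory stand-in with an assumed (C, Y, X) layout, not a zarr array; the variable names are illustrative only.

```python
import numpy as np

# A small in-memory stand-in with axes (C, Y, X).
image = np.arange(2 * 3 * 4).reshape(2, 3, 4)

# moveaxis relocates one axis and keeps the others in their original order.
channels_last = np.moveaxis(image, 0, -1)  # shape (3, 4, 2), axes (Y, X, C)

# transpose applies an arbitrary permutation of all axes.
reversed_axes = image.transpose(2, 1, 0)  # shape (4, 3, 2), axes (X, Y, C)

# For NumPy arrays both results are views: no data is copied.
print(channels_last.shape, reversed_axes.shape)
print(np.shares_memory(image, channels_last))
```

The request in this issue is for the same two operations to be available directly on a zarr array.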

@joshmoore
Member

Hi @John-P. Thanks for raising this & sorry for the confusion. If you are suggesting additional methods for zarr-python itself, then I'll transfer the issue back. If you are suggesting that the transpositions should be stored in the file format itself then this would be the place for the issue.

cc: @MSanKeys963

@joshmoore
Member

xref: #1236

@John-P
Author

John-P commented Nov 10, 2022

Ah ok maybe it is better on zarr-python then, sorry for that. I can repost over there if you want to close this.

@joshmoore
Member

It's fine. I'll transfer. (Thanks for bearing with us)

joshmoore transferred this issue from zarr-developers/zarr-specs on Nov 10, 2022

@jakirkham
Member

As noted in the OP, there are already libraries that support this (like Dask). Would add Xarray to that list. There may be even more as the Array API sees broader adoption.

Given there are libraries already solving the computation workflow side and Zarr is focused on the storage side, think keeping a cleaner separation of concerns (workflow from storage) will yield a better user experience (easy to see what to use, where to look, clear sense of how to compose). So would prefer not implementing this.
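
One way to picture the separation described here is a thin wrapper that lives outside the storage library and only permutes axes when data is read. The sketch below is a rough illustration under stated assumptions: `TransposedView` and `moveaxis_order` are hypothetical names, not part of zarr-python, Dask, or Xarray; the wrapped object is assumed to expose `.shape` and slice-based `__getitem__`; and a NumPy array stands in for the stored array. Only integer and slice indices are handled.

```python
import numpy as np


def moveaxis_order(ndim, source, destination):
    """Return the permutation equivalent to np.moveaxis(a, source, destination)."""
    for axis in (source, destination):
        if not -ndim <= axis < ndim:
            raise ValueError(f"axis {axis} is out of bounds for {ndim} dimensions")
    source, destination = source % ndim, destination % ndim
    order = [axis for axis in range(ndim) if axis != source]
    order.insert(destination, source)
    return tuple(order)


class TransposedView:
    """Hypothetical lazy axis permutation over an object with .shape and slicing."""

    def __init__(self, source, axes):
        axes = tuple(axes)
        if sorted(axes) != list(range(len(source.shape))):
            raise ValueError("axes must be a permutation of the source axes")
        self.source = source
        self.axes = axes
        self.shape = tuple(source.shape[axis] for axis in axes)

    def __getitem__(self, key):
        key = key if isinstance(key, tuple) else (key,)
        if len(key) > len(self.axes):
            raise IndexError("too many indices")
        key += (slice(None),) * (len(self.axes) - len(key))
        source_key = [None] * len(self.axes)
        dropped = []  # view axes indexed with an integer, removed at the end
        for view_axis, (index, size) in enumerate(zip(key, self.shape)):
            if isinstance(index, (int, np.integer)):
                if not -size <= index < size:
                    raise IndexError(f"index {index} is out of bounds for axis {view_axis}")
                start = int(index) % size
                # Read a length-1 slice so the block keeps every axis for now.
                index = slice(start, start + 1)
                dropped.append(view_axis)
            elif not isinstance(index, slice):
                raise TypeError("only integers and slices are supported in this sketch")
            # View axis i corresponds to source axis axes[i].
            source_key[self.axes[view_axis]] = index
        # Read only the requested region, then permute that block in memory.
        block = np.asarray(self.source[tuple(source_key)]).transpose(self.axes)
        return block.squeeze(axis=tuple(dropped)) if dropped else block


stored = np.arange(2 * 3 * 4).reshape(2, 3, 4)  # stand-in for a stored (C, Y, X) array
view = TransposedView(stored, moveaxis_order(stored.ndim, 0, -1))
tile = view[1:3, :, 0]  # rows 1-2 of channel 0, read lazily
```

Because the wrapper only translates indices, the storage layer stays unchanged and the permutation can be swapped for Dask or Xarray where chunk-aware computation is needed.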

@John-P
Author

John-P commented Nov 12, 2022

I've actually really struggled to use dask or xarray for this. Dask arrays don't seem to work in subprocesses (they just silently hang, although I have never used dask before and I may be doing this wrong) and xarray does not seem to be able to handle a zarr array unless it is on disk (e.g. it doesn't work with a tifffile zarr store etc.), in a group, and with special metadata. I am just quite baffled as to how difficult it is to do this.

Edit: Any advice on how to actually get this working with dask/xarray or similar would be appreciated, as all of my attempts have encountered some critical issue.

@rabernat
Contributor

> Dask arrays don't seem to work in subprocesses

Dask is optional. You don't have to use Dask with Xarray.

> xarray does not seem to be able to handle a zarr array unless it is on disk

Definitely not true. We use cloud-based Zarr arrays all the time.

If you can share a reproducible example of how you're trying to load data in Xarray, I'd be happy to try to help debug.

@John-P
Author

John-P commented Dec 5, 2022

One of the major appeals of zarr for me was the ability to read arrays from subprocesses. However, I cannot get this to work with xarray. It simply hangs, even though the documentation suggests that this should work if the backend supports multiple processes. It functions just fine with zarr, but when wrapped in xarray it deadlocks. I don't think I am using dask here unless xarray is doing something under the hood. I am able to get it to work if I do use dask and set up a client etc., but that seems like a lot of unnecessary complexity for simply reading the array.

Here is a simplified snippet of how the array is loaded:

import tifffile
import xarray as xr
import zarr

path = ...
tiff = tifffile.TiffFile(path)
# Zarr store contains a group with arrays under keys [0, 1, ...]
zarr_tiff_store = tiff.aszarr()
zarr_group = zarr.open(zarr_tiff_store, mode="r")
dataset = xr.open_zarr(zarr_tiff_store, consolidated=False)
# Xarray sets the dtype wrong so I have to copy over from zarr (a bug?)
for key, array in zarr_group.items():
    dataset[key] = dataset[key].astype(array.dtype)
# Normalise axes to be TZYXC
tzyxc_dataset = dataset.copy()
tzyxc_dataset["0"] = tzyxc_dataset["0"].expand_dims(
    dim=[a for a in "TCZYX" if a not in tzyxc_dataset["0"].dims],
)
tzyxc_dataset["0"] = tzyxc_dataset["0"].transpose(
    "T", "Z", "Y", "X", "C"
)

If I create this from the file path in a couple of subprocesses, they deadlock when both try to read the array.
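
The axis normalisation step in the snippet above (add any missing axes, then reorder to TZYXC) can be checked independently of xarray and the TIFF store. The following is a minimal NumPy-only sketch under assumptions: `to_tzyxc` is a hypothetical helper, the axis labels are passed in as a string, and a plain in-memory array stands in for the data. It does not address the deadlock itself.

```python
import numpy as np


def to_tzyxc(array, axes, target="TZYXC"):
    """Return `array` with its axes expanded and reordered to match `target`."""
    if len(axes) != array.ndim or len(set(axes)) != len(axes):
        raise ValueError("axes must name each dimension of the array exactly once")
    if not set(axes) <= set(target):
        raise ValueError(f"axes must be drawn from {target!r}")
    # Prepend a length-1 dimension for every target axis the array lacks.
    missing = [name for name in target if name not in axes]
    expanded = array.reshape((1,) * len(missing) + array.shape)
    # Look up where each target axis currently sits, then permute.
    current = missing + list(axes)
    return expanded.transpose([current.index(name) for name in target])


page = np.arange(3 * 4 * 5).reshape(3, 4, 5)  # a (C, Y, X) image
normalised = to_tzyxc(page, "CYX")
print(normalised.shape)  # (1, 1, 4, 5, 3)
```

Comparing the shape and a few values from this helper against the xarray result is one way to confirm that the `expand_dims` and `transpose` calls behave as intended before looking into the multiprocessing problem.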
