v3: chunk_memory_layout could be specified as an explicit order rather than C or F #129

Open
jbms opened this issue Feb 8, 2022 · 7 comments
Labels
protocol-extension Protocol extension related issue

Comments

@jbms
Contributor

jbms commented Feb 8, 2022

The current specification allows "C" and "F". But in some cases the optimal memory layout may not match the most natural dimension order, e.g. you might want the dimensions to be "zyxc", but the memory layout to be C order relative to the dimension order of czyx. To address that, instead of using "C" and "F", the memory order can instead be specified as an explicit list of dimensions, e.g. [0, 1, 2] for C order and [2, 1, 0] for Fortran order (assuming 3 dimensions). Numpy supports arbitrary dimension orders just fine.
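To make the proposal concrete, here is a minimal NumPy sketch of an explicit dimension order. It assumes the permutation lists dimensions from slowest- to fastest-varying, consistent with [0, 1, 2] meaning C order above. The helper name `empty_with_order` is hypothetical and not part of zarr or NumPy.

```python
import numpy as np


def empty_with_order(shape, order, dtype=np.float32):
    """Allocate an array with logical shape `shape` whose memory layout
    lists dimensions from slowest- to fastest-varying in `order`.

    Hypothetical helper for illustration only.
    """
    order = list(order)
    if sorted(order) != list(range(len(shape))):
        raise ValueError("order must be a permutation of the dimension indices")
    # C-contiguous storage whose axes are arranged in memory order.
    storage = np.empty([shape[d] for d in order], dtype=dtype)
    # Logical dimension d lives at storage axis order.index(d).
    inverse = [order.index(d) for d in range(len(shape))]
    return storage.transpose(inverse)


c_like = empty_with_order((2, 3, 4), [0, 1, 2])
f_like = empty_with_order((2, 3, 4), [2, 1, 0])
# Dimensions presented as "zyxc", memory laid out as C order over "czyx".
zyxc = empty_with_order((4, 5, 6, 3), [3, 0, 1, 2])
print(c_like.flags.c_contiguous, f_like.flags.f_contiguous, zyxc.strides)
```

In every case the logical shape is the one requested; only the strides differ.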

@jbms
Contributor Author

jbms commented Feb 8, 2022

Note: This is the representation used by TensorStore:
https://google.github.io/tensorstore/schema.html#json-ChunkLayout.inner_order

@joshmoore
Member

@jbms: did you also see the proposal in #126 to remove order?

@meggart
Member

meggart commented Feb 9, 2022

> But in some cases the optimal memory layout may not match the most natural dimension order

Thanks for bringing this up. Can you link to some use cases or other documentation on this? In particular, I am not sure I understand what "natural" dimension order would mean, i.e. why should I present the data to the user in a different order than how it is stored?

@jbms
Contributor Author

jbms commented Feb 9, 2022

Here is one example:

Suppose we are storing volumetric data indexed by x, y, and z. It is natural to order the dimensions [x, y, z], or sometimes [z, y, x] if we want to use C order. But suppose we will be processing [x, z] cross sections of the data, and therefore want the data to be stored in Fortran order relative to [x, z, y] for efficient access. For consistency, though, it may still be desired for the dimension order to be [x, y, z].
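The access-pattern argument can be checked with a small NumPy sketch. The sizes are arbitrary assumptions. Fortran order relative to [x, z, y] means x varies fastest and y slowest, so the storage below is C-contiguous over (y, z, x) and is then viewed with logical dimensions [x, y, z].

```python
import numpy as np

X, Y, Z = 4, 5, 6  # arbitrary sizes chosen for illustration

# Plain C order over the logical dimensions [x, y, z].
plain = np.zeros((X, Y, Z), dtype=np.float32)

# Storage that is C-contiguous over (y, z, x), i.e. Fortran order
# relative to [x, z, y], presented with logical dimensions [x, y, z].
storage = np.zeros((Y, Z, X), dtype=np.float32)
reordered = storage.transpose(2, 0, 1)

# An [x, z] cross section at a fixed y index.
plain_section = plain[:, 2, :]
reordered_section = reordered[:, 2, :]

# The plain section is strided; the reordered one is a single
# contiguous block of memory.
print(plain_section.flags.c_contiguous, plain_section.flags.f_contiguous)
print(reordered_section.flags.f_contiguous)
```

Both arrays expose the same logical shape, but only the reordered layout keeps each [x, z] section in one contiguous run.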

In general I see zarr as already an abstraction layer --- the data isn't actually stored in C order or Fortran order --- it is stored chunked and compressed, and it is only the intermediate uncompressed chunk representation that is in C order or Fortran order. If you use an image codec with zarr (see e.g. the imagecodecs Python package), this uncompressed C or Fortran order representation may not be relevant at all.

@jbms
Contributor Author

jbms commented Feb 10, 2022

A better use case for this feature came up this evening: t5x (https://github.com/google-research/t5x) uses tensorstore to store machine learning model checkpoints. A user had modified the model to transpose the first two parameters of some variables, but wanted to load an existing checkpoint. This was possible without actually modifying the checkpoint or adding any special code to transpose when loading the model, by just modifying the tensorstore specs stored as part of the checkpoint to perform a transpose via an "index transform" (https://google.github.io/tensorstore/index_space.html#json-IndexTransform). However, it would be nice if this could be accomplished purely with zarr just by modifying the metadata file.

If we allow an arbitrary permutation as the chunk_memory_layout, and furthermore use the same order to generate the chunk keys, then we can transpose the dimensions of an array purely by modifying the metadata.
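A rough sketch of the metadata-only transpose, for a single uncompressed chunk. The metadata dictionaries and the `decode_chunk` helper are hypothetical, the key names are only loosely modelled on the draft spec, and the permutation is assumed to list dimensions from slowest- to fastest-varying. Chunk key generation, which would need to follow the same order, is left out.

```python
import numpy as np


def decode_chunk(raw, meta):
    """Interpret raw chunk bytes according to hypothetical metadata."""
    shape = meta["chunk_shape"]
    order = meta["chunk_memory_layout"]
    # The bytes are C-contiguous over the dimensions arranged in memory order.
    storage = np.frombuffer(raw, dtype=meta["dtype"]).reshape(
        [shape[d] for d in order]
    )
    # Present the chunk with its logical dimension order.
    return storage.transpose([order.index(d) for d in range(len(shape))])


# The stored bytes never change.
raw = np.arange(6, dtype="<i4").tobytes()

original_meta = {"chunk_shape": (2, 3), "chunk_memory_layout": [0, 1], "dtype": "<i4"}
# Transposed view: permute the shape and the layout, leave the bytes alone.
transposed_meta = {"chunk_shape": (3, 2), "chunk_memory_layout": [1, 0], "dtype": "<i4"}

original = decode_chunk(raw, original_meta)
transposed = decode_chunk(raw, transposed_meta)
print(np.array_equal(transposed, original.T))
```

The same bytes decode to the original array or to its transpose depending only on the metadata.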

@jstriebel jstriebel added this to ZEP1 Nov 16, 2022
@jstriebel jstriebel added the core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec label Nov 16, 2022
@jstriebel jstriebel moved this to In Discussion in ZEP1 Nov 16, 2022
@jstriebel
Member

IMO it's a benefit to know the underlying data layout easily to be able to reason about efficiency when traversing and indexing an array. C and Fortran order are well-known concepts, whereas an arbitrary order is rather unusual. I'd argue that re-ordering the dimensions might still be allowed in an implementation, but this would not necessarily affect the metadata, similar to numpy.moveaxis not changing the underlying array, just providing a different view of the data.
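For reference, the numpy.moveaxis behaviour mentioned above looks like this in a small sketch: the result is a view with permuted strides, and the underlying buffer is untouched.

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)
original_strides = a.strides

# Move the first axis to the end; this returns a view, not a copy.
view = np.moveaxis(a, 0, -1)

print(view.shape)                 # logical shape seen through the view
print(np.shares_memory(a, view))  # both refer to the same buffer

# Writing through the view changes the original array: view[i, j, k] is a[k, i, j].
view[0, 0, 1] = -1
print(a[1, 0, 0])
```

The array `a` keeps its shape and strides throughout; only the view presents a different dimension order.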

If this seems to be more useful, I'd rather make this an extension than a core feature of zarr, do you agree @jbms?

@jstriebel jstriebel removed this from ZEP1 Nov 16, 2022
@jstriebel jstriebel added protocol-extension Protocol extension related issue and removed core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec labels Nov 16, 2022
@jstriebel
Member

PS: Especially with #162 it's possible for clients to re-order the axes as needed and not rely on an expected order.


4 participants