v3: chunk_memory_layout could be specified as an explicit order rather than C or F #129

jbms · 2022-02-08T18:05:48Z

The current specification allows "C" and "F". But in some cases the optimal memory layout may not match the most natural dimension order, e.g. you might want the dimensions to be "zyxc", but the memory layout to be C order relative to the dimension order "czyx". To address that, the memory order could be specified as an explicit list of dimensions rather than "C" or "F", e.g. [0, 1, 2] for C order and [2, 1, 0] for Fortran order (assuming 3 dimensions). NumPy supports arbitrary dimension orders just fine.
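A minimal NumPy sketch of what such an explicit order could mean. The helper `empty_with_order` is hypothetical (it is not part of zarr or NumPy), and it assumes the list names dimensions from slowest-varying to fastest-varying, matching the [0, 1, 2] = C and [2, 1, 0] = Fortran convention above:

```python
import numpy as np

def empty_with_order(shape, order, dtype=np.uint8):
    """Hypothetical helper: allocate an array with logical `shape` whose
    memory layout lists dimensions from slowest (order[0]) to fastest (order[-1])."""
    order = [int(d) for d in order]
    if sorted(order) != list(range(len(shape))):
        raise ValueError("order must be a permutation of the dimension indices")
    # Allocate a C-order buffer over the permuted shape ...
    permuted_shape = tuple(shape[d] for d in order)
    buf = np.empty(permuted_shape, dtype=dtype)
    # ... then transpose back so the logical dimension order is unchanged.
    inverse = tuple(int(i) for i in np.argsort(order))
    return buf.transpose(inverse)

c_like = empty_with_order((2, 3, 4), [0, 1, 2])     # same layout as "C"
f_like = empty_with_order((2, 3, 4), [2, 1, 0])     # same layout as "F"
zyxc = empty_with_order((2, 3, 4, 5), [3, 0, 1, 2]) # logical zyxc, stored as czyx
```

With this convention the logical shape stays (z, y, x, c) while the strides show that c varies slowest in memory.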

jbms · 2022-02-08T18:07:11Z

Note: This is the representation used by TensorStore:
https://google.github.io/tensorstore/schema.html#json-ChunkLayout.inner_order

joshmoore · 2022-02-09T07:19:33Z

@jbms: did you also see the proposal in #126 to remove order?

meggart · 2022-02-09T16:24:59Z

> But in some cases the optimal memory layout may not match the most natural dimension order

Thanks for bringing this up. Can you link to some use cases or other documentation on this? In particular, I am not sure I understand what "natural" dimension order would mean, i.e. why should I present the data to the user in a different order than how it is stored?

jbms · 2022-02-09T17:24:11Z

Here is one example:

Suppose we are storing volumetric data indexed by x, y, z. It is natural to order the dimensions [x, y, z], or sometimes [z, y, x] if we want to use C order. But suppose we will be processing [x, z] cross sections of the data, and therefore want the data to be stored in Fortran order relative to [x, z, y] for efficient access. For consistency, though, it may still be desired for the dimension order to be [x, y, z].
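To make the cross-section case concrete, here is a small pure-Python sketch. The function names and the chunk shape are made up for illustration, and the order list is assumed to run from slowest- to fastest-varying dimension. Fortran order relative to [x, z, y] then corresponds to the explicit order [1, 2, 0] for logical dimensions [x, y, z]:

```python
def strides_for_order(shape, order):
    """Element strides for a chunk whose dimensions are listed in `order`
    from slowest- to fastest-varying (an assumed convention)."""
    strides = [0] * len(shape)
    step = 1
    for dim in reversed(order):
        strides[dim] = step
        step *= shape[dim]
    return strides

shape = (4, 3, 5)   # logical (x, y, z), illustrative sizes
order = [1, 2, 0]   # y slowest, then z, x fastest
strides = strides_for_order(shape, order)

def cross_section_offsets(y):
    """Sorted element offsets touched by the [x, z] cross section at a fixed y."""
    return sorted(
        x * strides[0] + y * strides[1] + z * strides[2]
        for x in range(shape[0])
        for z in range(shape[2])
    )
```

Under these assumptions each [x, z] cross section occupies one contiguous run of elements, which is the access pattern the layout is meant to favor.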

In general I see zarr as already an abstraction layer: the data isn't actually stored in C order or Fortran order. It is stored chunked and compressed, and only the intermediate uncompressed chunk representation is in C order or Fortran order. If you use an image codec with zarr (see e.g. the imagecodecs Python package), this uncompressed C or Fortran order representation may not be relevant at all.

jbms · 2022-02-10T03:03:35Z

A better use case for this feature came up this evening: t5x (https://github.com/google-research/t5x) uses tensorstore to store machine learning model checkpoints. A user had modified the model to transpose the first two parameters of some variables, but wanted to load an existing checkpoint. This was possible without actually modifying the checkpoint or adding any special code to transpose when loading the model, by just modifying the tensorstore specs stored as part of the checkpoint to perform a transpose via an "index transform" (https://google.github.io/tensorstore/index_space.html#json-IndexTransform). However, it would be nice if this could be accomplished purely with zarr just by modifying the metadata file.

If we allow an arbitrary permutation as the chunk_memory_layout, and furthermore use the same order to generate the chunk keys, then we can transpose the dimensions of an array purely by modifying the metadata.
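A rough sketch of that idea, under loud assumptions: the metadata keys ("shape", "chunk_shape", "order"), the "/" key separator, and both function names are hypothetical and not taken from the zarr spec. The order list runs from slowest- to fastest-varying dimension and is also used to order the components of a chunk key:

```python
def transpose_metadata(meta, perm):
    """Return metadata for the transposed array, where new dimension i is
    old dimension perm[i]. No chunk data is touched."""
    inverse = {old: new for new, old in enumerate(perm)}
    return {
        "shape": [meta["shape"][d] for d in perm],
        "chunk_shape": [meta["chunk_shape"][d] for d in perm],
        # Same physical order, expressed in terms of the new dimension indices.
        "order": [inverse[d] for d in meta["order"]],
    }

def chunk_key(meta, chunk_index):
    """Chunk key whose components follow the storage order (an assumption)."""
    return "/".join(str(chunk_index[d]) for d in meta["order"])

meta = {"shape": [10, 20, 30], "chunk_shape": [5, 10, 15], "order": [0, 1, 2]}
swapped = transpose_metadata(meta, [1, 0, 2])  # swap the first two dimensions

# The chunk at index (1, 0, 1) in the old array is at (0, 1, 1) in the new one,
# and both resolve to the same stored key.
old_key = chunk_key(meta, (1, 0, 1))
new_key = chunk_key(swapped, (0, 1, 1))
```

Because both the in-chunk layout and the key components follow the same permuted order, the existing chunks stay valid after the metadata change.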

jstriebel · 2022-11-16T16:21:17Z

IMO it's a benefit to know the underlying data layout easily to be able to reason about efficiency when traversing and indexing an array. C and Fortran order are well-known concepts, whereas an arbitrary order is rather unusual. I'd argue that re-ordering the dimensions might still be allowed in an implementation, but this would not necessarily affect the metadata, similar to numpy.moveaxis not changing the underlying array, just providing a different view of the data.
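For reference, a short illustration of the numpy.moveaxis behavior mentioned here (array sizes are arbitrary): the result is a view with reordered axes that shares memory with the original array.

```python
import numpy as np

a = np.zeros((2, 3, 4))
v = np.moveaxis(a, 0, -1)      # move axis 0 to the end; no data is copied

shares = np.shares_memory(a, v)
v[0, 0, 1] = 7                 # writes through to a[1, 0, 0]
```

The underlying buffer and its layout are unchanged; only the shape and strides of the view differ.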

If this seems to be more useful, I'd rather make this an extension than a core feature of zarr. Do you agree, @jbms?

jstriebel · 2022-11-16T16:27:44Z

PS: Especially with #162 it's possible for clients to re-order the axes as needed and not rely on an expected order.

This was referenced Feb 10, 2022

Support for non-zero origin #122

Open

Remove 'order' from Specs and make 'C' default #126

Closed

jstriebel added this to ZEP1 Nov 16, 2022

jstriebel added the core-protocol-v3.0 label (Issue relates to the core protocol version 3.0 spec) Nov 16, 2022

jstriebel moved this to In Discussion in ZEP1 Nov 16, 2022

jstriebel removed this from ZEP1 Nov 16, 2022

jstriebel added the protocol-extension label (Protocol extension related issue) and removed the core-protocol-v3.0 label (Issue relates to the core protocol version 3.0 spec) Nov 16, 2022

jstriebel mentioned this issue Nov 22, 2022

ZEP0001 - Core v3.0 spec for review #149

Closed
