Protocol extensions #49
Very timely q, @hammer 😀; @joshmoore, @jakirkham, and I have been discussing a simple but expressive formulation of "plugins" that can be supported by Zarr relatively non-invasively. Here's a writeup of my thinking coming from those discussions; it's by no means authoritative, and I'm not even sure we all mean the same things when we refer to "plugins" vs. "extensions", but those discussions seem to finally be happening more concretely after a bit of a hiatus:

"Plugins" draft proposal

A surprising variety of extensions to core Zarr behaviors can be enabled via a simple hook into Zarr's "load" and "save" paths, intercepting the transformation between [a given "on-disk" hierarchy] and [a Zarr in-memory object].

"Read" side
"Write" sideA straightforward inversion of the above applies; a plugin presents to Zarr's write/save machinery as:
Examples

Lots of commonly-discussed extensions to Zarr can be modeled by the above scheme:

Sparse arrays

Sparse arrays are a great "hello world" example of a plugin; a rough read-side hook is sketched below.
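As a purely illustrative sketch (the group layout, attribute name and function name are assumptions, mirroring how scipy decomposes CSR matrices), a read-side hook might look like:

```python
# Hypothetical read-side hook: if a group holds "data", "indices" and "indptr"
# arrays, reassemble them into a scipy.sparse CSR matrix instead of a Group.
import scipy.sparse
import zarr


def maybe_load_sparse(group: zarr.Group):
    members = set(group.array_keys())
    if {"data", "indices", "indptr"} <= members:
        shape = group.attrs.get("shape")  # assumed attribute recording the logical shape
        return scipy.sparse.csr_matrix(
            (group["data"][:], group["indices"][:], group["indptr"][:]),
            shape=tuple(shape) if shape else None,
        )
    return group  # fall back to the plain Zarr group
```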
This would allow users to load a Zarr tree with a scipy.sparse array substituted for Groups of a certain shape. This is approximately how the zsparse proof-of-concept is implemented.

N5 interop/support

N5 and Zarr have similar specs (cf. #3), and there are plans for Zarr to more seamlessly support and interoperate with N5 in v3. N5 groups and datasets could be loaded virtually, as their Zarr counterparts, by an appropriate "plugin" that made the requisite changes to how paths are loaded from and saved to. N5 support is currently provided by the N5Store, but that doesn't compose with other "stores" (cf. zarr-developers/zarr-python#395). "Plugins" as described here could solve this issue; it's also possible that a cleaner way to nest/compose "stores" would work (while keeping "extensions" more narrowly scoped).

Pyramidal Data

Various ways of encoding "pyramidal data" (e.g. the same image data, duplicated at various power-of-2-downscaled resolutions, common in the imaging and geo communities and a widely requested Zarr feature; cf. #23) can be implemented as "plugins" by injecting a processing step during loading/saving (similarly to the sparse arrays example above). Specific encodings can vary; one possible layout is sketched below.
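For instance (purely illustrative; the level-naming convention and attribute name are assumptions, not a settled encoding), a multiscale group might store each resolution level as a sibling array:

```python
# Hypothetical multiscale layout: level "0" is full resolution, each
# subsequent level is subsampled by a factor of 2 along each axis
# (a real plugin might average/downscale rather than subsample).
import numpy as np
import zarr

root = zarr.group()
pyramid = root.create_group("image")
pyramid.attrs["multiscale_levels"] = 3  # assumed attribute name

data = np.random.rand(1024, 1024)
for level in range(3):
    factor = 2 ** level
    pyramid.create_dataset(str(level), data=data[::factor, ::factor], chunks=(256, 256))
```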
HDF5/TileDB interop/support

Combining ideas from the N5 and Pyramidal examples above, other binary/opaque structures could be embedded in a Zarr tree. An "HDF5" plugin for Zarr could, for example, do something like the sketch below.
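One speculative reading of that idea (the file-naming convention and the hook's behaviour are my assumptions): when the load path encounters a key pointing at an embedded HDF5 file, hand it to h5py instead of the core Zarr machinery.

```python
# Hypothetical hook: expose an embedded .h5 file as an h5py.File so its
# groups/datasets appear alongside native Zarr nodes.
import os
import h5py


def maybe_load_hdf5(directory_path: str, name: str):
    candidate = os.path.join(directory_path, name + ".h5")
    if os.path.exists(candidate):
        return h5py.File(candidate, "r")  # h5py's Group/Dataset API is broadly array-like
    return None  # let the core Zarr load path handle it
```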
Consolidated Metadata

Zarr "plugins" as described can change how metadata is loaded (e.g. loading a single "consolidated metadata" file containing the metadata for all descendants of a Group, and virtually attaching each descendant's metadata to the corresponding node).

Symlinks

A plugin can read a specific zattr key as encoding a pointer or symlink to another Zarr Group or Dataset, allowing linking (cf. zarr-developers/zarr-python#297, zarr-developers/zarr-python#389).

Heterogeneous Chunk Sizes

cf. #40. This is less directly something that only needs a different code path injected at loading/saving time; getitem/setitem on the returned object would need to be handled substantially differently. It may be outside the scope of "plugins" as currently conceived of here.

Non-JSON Metadata, Attributes

cf. #37. Likely implementable as a plugin in this scheme.

Discussion
Next Steps
I want to throw in another use case I have been thinking about for quite some time, which is overlapping chunks. The problem is that for moving-window operations in >1D you cannot process your data chunk by chunk, even if all you need are just a few data points from the boundary of the neighbouring chunk. So the idea would be to duplicate some data at the chunk boundaries to make read operations that read a chunk plus a few neighbouring data points faster. Software implementing this extension would need:
I think (1) could be handled by hooking into load/save as you described, but for (2) and (3) I guess this would not be possible with your currently proposed save/load hook? I don't think this should stop your approach, since it indeed covers a lot of cases, but I just wanted to mention it here; maybe others had similar thoughts already and have ideas towards an implementation. This leads me to the next question: should/can Zarr extensions be language-specific? That means I could potentially implement overlapping chunks as an optional extension to the Zarr Julia package. Would it be necessary to provide a Python implementation as well? I just want to understand whether what you call a "plugin" is conceived from a software side or from a specs side. I could quite easily write a spec extension for my scenario just describing how the data would be stored in the chunks, but the implementation would be language-dependent.
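To make the overlap idea concrete, here is a minimal sketch (the function name and the halo convention are my assumptions, not a proposed spec): each stored chunk carries an extra halo of `overlap` elements duplicated along every boundary, so a windowed read of one chunk never has to touch its neighbours.

```python
# Hypothetical overlapping-chunk layout: each stored chunk of nominal size
# `chunk` is written with `overlap` extra elements duplicated on each side.
import numpy as np


def stored_chunk(data: np.ndarray, index: tuple, chunk: tuple, overlap: int) -> np.ndarray:
    """Extract the chunk at grid position `index`, padded with its halo."""
    slices = []
    for dim, (i, c) in enumerate(zip(index, chunk)):
        start = max(i * c - overlap, 0)
        stop = min((i + 1) * c + overlap, data.shape[dim])
        slices.append(slice(start, stop))
    return data[tuple(slices)]


data = np.arange(100).reshape(10, 10)
halo_chunk = stored_chunk(data, index=(1, 1), chunk=(5, 5), overlap=1)  # chunk plus halo, clipped at array edges
```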
Thanks @ryan-williams, this is really helpful. One quick comment: I think it helps to make a distinction between protocols and implementations. A protocol is a definition of how data and metadata are organised, transferred and processed, and should be described in a way that is independent of any programming language. We are working on a zarr core protocol. Hopefully there will be multiple implementations of that protocol, in various programming languages.

The community will also want to define a variety of protocol extensions. For any given protocol extension, there will then be at least one implementation. Some protocol extensions (e.g., consolidated metadata) may be general purpose and widely used, and so many of the software libraries which implement the core protocol may also choose to implement some of these protocol extensions. Some protocol extensions may be a bit more specialist or domain-specific, and for those it may in some cases make sense to implement them in a separate library, as a plugin to an existing core protocol library.

To put this another way, a software library which implements the core protocol might also provide some hooks where additional "plugin" code can be registered and delegated to, providing special behaviour when a protocol extension has been declared in the zarr metadata.

So for each of the examples given above, I would ask two separate questions:
There is also a third question: does the zarr core protocol provide the necessary extension points to allow such a protocol extension to be declared? Hope that's useful; I'll follow up with some discussion of specific examples shortly.
Speaking generally, for any of the examples given above, I imagine that a spec for a protocol extension would ideally include things like:
To take the example of consolidated metadata, this is a protocol extension which is intended to accelerate reading of zarr metadata, particularly in storage systems with high latency such as cloud object stores. It is used for data that is typically written once and read many times, and thus where the zarr metadata is relatively static once the hierarchy is created. The idea is to combine all the array and group metadata documents from a hierarchy into a single document that can be read via a single storage request. Ideally some kind of persistent URI would be declared for this protocol extension.

This would be a protocol extension that can be safely ignored, i.e., a zarr reader could ignore it and still provide useful functionality, because all of the original metadata documents are still present.

The protocol extension spec would then define the concept of a consolidated metadata document, and define its format (JSON) and how it would be structured internally to store the contents of all the array and group metadata documents. The spec would also declare the storage key under which this document would be stored. The spec would also define how to declare that the protocol extension is in use, along with any configuration options. In this case, because this is a protocol extension that applies to the whole hierarchy, this could be done via the zarr entry point metadata, e.g., a hierarchy with consolidated metadata would have entry point metadata along the lines of the sketch below.
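A minimal sketch of such a document, written here as the Python dict that would be serialised to JSON; the extension URI, field names and configuration keys are illustrative assumptions, not taken from any spec:

```python
# Hypothetical entry point metadata ("zarr.json") declaring that consolidated
# metadata is available; the URI and field names are illustrative only.
entry_point_metadata = {
    "zarr_format": "https://purl.org/zarr/spec/protocol/core/3.0",
    "extensions": [
        {
            "extension": "https://example.org/zarr/extension/consolidated-metadata",
            "must_understand": False,                     # readers may safely ignore it
            "configuration": {"key": "metadata.json"},    # where the document lives
        }
    ],
}
```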
Adding this information to the entry point metadata provides a standard route for implementations to discover that consolidated metadata is present. It would also tell the implementation where to look for the consolidated metadata (in this case under the "metadata.json" key in the store). Any other variations, such as whether the consolidated metadata document is compressed or not, could also be declared here via the extension config.

A zarr reader which implements this extension would start by reading the entry point metadata document and would look to see if the extension had been declared. If it had, it would then retrieve the document under the "metadata.json" storage key, parse it, and use the consolidated metadata instead of the individual array and group metadata documents.

The spec would probably also need to deal with issues such as what happens if the consolidated metadata is out of sync with the array and group metadata documents (e.g., it could be up to the user to keep them in sync in a way that makes sense for the given dataset and its intended usage).

Hope that helps to illustrate what a protocol extension might look like for a particular example. I'll try to do a couple more.
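A reader following that procedure might look roughly like this (a sketch only: `store` is assumed to be a mapping from storage keys to JSON text, and the key/field names mirror the hypothetical entry point metadata above):

```python
# Rough illustration of the reader behaviour described above.
import json

CONSOLIDATED_URI = "https://example.org/zarr/extension/consolidated-metadata"  # illustrative


def load_metadata(store):
    entry = json.loads(store["zarr.json"])
    for ext in entry.get("extensions", []):
        if ext.get("extension") == CONSOLIDATED_URI:
            key = ext.get("configuration", {}).get("key", "metadata.json")
            return json.loads(store[key])   # one storage request instead of many
    return None  # fall back to reading individual metadata documents
```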
To work through another example, here are some thoughts on how to do sparse arrays. The idea of a sparse arrays protocol extension is to define a common convention for how to store the different component arrays of a sparse array. As above, the protocol extension would need some kind of persistent URI.

The sparse arrays protocol extension spec would define conventions for how to store data for each of the common types of sparse arrays. For example, for a sparse array using the compressed sparse row (CSR) representation, the data should be stored in three component 1-D arrays named "data", "indices" and "indptr", which are siblings within the same group. The spec would describe what each of these component arrays should contain, any constraints on dtype and shape, etc.

This would be a protocol extension that can be safely ignored, i.e., a zarr reader could ignore it and still provide some useful functionality, because it might still be useful to view the underlying group and arrays or manually extract data from a component array.

Because this is a protocol extension that applies to individual groups within a hierarchy, the spec would then also define how to declare that a particular group contains a sparse array conforming to the spec. This could be done via the zarr group metadata document, e.g. along the lines of the sketch below.
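A possible group metadata document, again written as the Python dict equivalent of the JSON; the URI, field names and configuration keys are all assumptions for illustration:

```python
# Hypothetical group metadata declaring that this group stores a CSR sparse
# array in its "data", "indices" and "indptr" member arrays.
group_metadata = {
    "extensions": [
        {
            "extension": "https://example.org/zarr/extension/sparse-arrays",  # illustrative URI
            "must_understand": False,
            "configuration": {
                "format": "csr",
                "shape": [100000, 50000],   # logical shape of the sparse array
            },
        }
    ],
}
```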
An implementation might then offer a function to load a sparse array from a given path within a zarr hierarchy. For example, if a sparse array had been stored in a group at some path, calling the load function with that path would return the corresponding in-memory sparse array object. Similarly, an implementation might offer a function to save a sparse array to a given path within a zarr hierarchy. E.g., in Python this function might accept an instance of a scipy.sparse class such as csr_matrix.
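For instance, a hypothetical plugin library might expose a pair of functions like these (all names below are invented for illustration and are not part of any existing zarr package):

```python
# Hypothetical user-facing API for the sparse-arrays extension.
import scipy.sparse
import zarr


def load_sparse(group: zarr.Group) -> scipy.sparse.csr_matrix:
    """Reassemble a CSR matrix from the group's component arrays."""
    shape = tuple(group.attrs["shape"])  # assumed: logical shape recorded in group attributes
    return scipy.sparse.csr_matrix(
        (group["data"][:], group["indices"][:], group["indptr"][:]), shape=shape
    )


def save_sparse(matrix: scipy.sparse.csr_matrix, group: zarr.Group) -> None:
    """Write a CSR matrix into a group as three sibling 1-D arrays."""
    group.attrs["shape"] = list(matrix.shape)
    for name in ("data", "indices", "indptr"):
        group.create_dataset(name, data=getattr(matrix, name), overwrite=True)
```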
Here's an illustration of how the non-regular chunks example could be accommodated as a protocol extension. This example is a bit different from the others above, because it hooks into a specific extension point within the v3 core protocol: it is an alternative chunk grid type.

The protocol extension spec for non-regular chunks would describe how, in some use cases, it is necessary to organise data into chunks that form a rectilinear grid. The spec would also give a precise definition of this grid type. E.g. (borrowing from Wikipedia), a rectilinear grid is a tessellation of an array's index space by rectangles (or rectangular cuboids) that are not, in general, all congruent to each other. The chunks in the grid may still be indexed by integers as in a regular grid, but the mapping from indexes to vertex coordinates is not uniform, unlike in a regular grid. As in the other examples, the extension would need a URI.

The protocol extension would then define how to store the sizes of the chunks along each dimension in the grid. Given a general rectilinear grid, this would require a list of integers for each dimension of the array, where the length of each of these lists corresponds to the number of grid divisions in that dimension, and the integer values give the chunk sizes along that dimension.

Given that this is a protocol extension that applies to an individual array within a hierarchy, this would be declared within a zarr array metadata document. There is a specific hook for this type of extension, which is the "chunk_grid" member of the array metadata, e.g. along the lines of the sketch below.
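A possible array metadata fragment (the grid type name and field names are my assumptions), again as the Python dict equivalent of the JSON document:

```python
# Hypothetical array metadata for a 2-D array of shape (25, 10) using a
# rectilinear chunk grid; the per-dimension chunk sizes must sum to the shape.
array_metadata = {
    "shape": [25, 10],
    "data_type": "float64",
    "chunk_grid": {
        "type": "rectilinear",                    # illustrative grid type name
        "chunk_sizes": [[10, 10, 5], [4, 6]],     # chunk lengths along each dimension
    },
}
```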
The store keys used to store array chunks could still be the same as for a regular grid. I.e., you could still use keys like "0.1.3" for a chunk in a 3D array, because a rectilinear chunk grid can be indexed in the same way that a regular chunk grid can, even though the chunk sizes are not equal.

The protocol extension spec would then need to describe how an implementation would deal with operations that read or write a specific region within an array using this grid type. This would include a description of how to identify the set of chunks that overlap a given array region; a rough sketch of that lookup is below.

Note that an implementation of this protocol extension would need to hook into and override the part of a core protocol implementation that deals with reading and writing regions of an array. You could still think of this as a "plugin" possibly, although it is a specific type of plugin that implements a chunk grid protocol extension. Also note that this is an example of a protocol extension which a zarr reader must understand in order to read data correctly. However, I did not include a
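To illustrate the chunk-lookup step mentioned above, here is a minimal sketch; it assumes the per-dimension `chunk_sizes` lists from the hypothetical metadata sketched earlier, and finds the grid indices of all chunks overlapping a requested region.

```python
# Given per-dimension chunk sizes for a rectilinear grid, find which chunk
# indices overlap a requested half-open [start, stop) range in each dimension.
import bisect
import itertools


def overlapping_chunks(chunk_sizes, region):
    """chunk_sizes: list of per-dimension size lists; region: list of (start, stop)."""
    per_dim = []
    for sizes, (start, stop) in zip(chunk_sizes, region):
        bounds = list(itertools.accumulate(sizes))   # cumulative chunk boundaries, e.g. [10, 20, 25]
        first = bisect.bisect_right(bounds, start)   # chunk containing `start`
        last = bisect.bisect_left(bounds, stop)      # last chunk overlapping before `stop`
        per_dim.append(range(first, min(last + 1, len(sizes))))
    return list(itertools.product(*per_dim))


# For chunk_sizes [[10, 10, 5], [4, 6]] and region [(8, 22), (0, 4)],
# this returns the chunk indices [(0, 0), (1, 0), (2, 0)].
print(overlapping_chunks([[10, 10, 5], [4, 6]], [(8, 22), (0, 4)]))
```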
@DennisHeimbigner gave an update yesterday on the netcdf-c zarr implementation ("nczarr"), during which he touched on the special issues that C implementations will face when trying to discover and instantiate extensions, among other things. I'll leave him to say more.
Sorry I missed that; looks like a lot of great discussion. It's certainly important to pin down the codec interface, decide whether one interface can be used for both compressors and filters, and whether we need to allow data type information to be passed through in addition to raw buffers.
As part of the netcdf zarr, I was looking forward to implementing types like
Protocol extensions are mentioned throughout the v3.0 core protocol docs, but they are still undocumented. Have there been any design discussions around how to specify and implement protocol extensions yet?