Manifest storage transformer #287
In this proposal, what type of thing is …
All three are …
Can we write to arrays that use a manifest storage transformer?
I think we should seriously consider a much lighter-weight concatenation method. What about just storing references to the other arrays? The metadata doc would somehow contain pointers to the other metadata docs. Something like:

```json
"concatenation": {
  "axis": 0,
  "arrays": ["../foo", "../bar"]
}
```

The one part I can't quite see is how to do the references to the arrays. Some sort of URL syntax? Absolute vs. relative paths?
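One hedged answer to the open question about references: treat them as POSIX-style paths relative to the referring metadata document, resolved against the store root. A minimal sketch (the `resolve` helper and the key layout are illustrative, not part of the proposal):

```python
import posixpath

def resolve(metadata_key: str, ref: str) -> str:
    """Resolve an array reference relative to the metadata document's store key.

    References starting with "/" are taken from the store root; everything
    else is resolved relative to the referring document.
    """
    if ref.startswith("/"):
        return ref.lstrip("/")
    base = posixpath.dirname(metadata_key)
    return posixpath.normpath(posixpath.join(base, ref))

# "../foo" next to a concatenated array's metadata resolves within the group:
print(resolve("group/concat/zarr.json", "../foo"))  # group/foo
```

Absolute URLs (`s3://…`) could be passed through unchanged; the sketch only covers store-relative keys.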
Another way of putting it is that I think perhaps "chunk manifest" and "virtual concatenation of Zarr arrays" should be completely separable and orthogonal features.
Note that the kerchunk method and its child here already allow for content-addressable storage, e.g., IPFS. Not sure if you meant something beyond that. There has been chatter elsewhere of chunk checksums and such (stored in metadata, not in the bytes of the chunk).

For the concatenation, I would want special attention paid to the multi-dimension case. Also, some consideration of groups-of-arrays which are concatenated together would be nice, but you might say that this is an xarray concern. Are you at all considering the case where the array chunk grids of the inputs do not align?

Do I understand that you imagine an output metadata structure of the main "these are the arrays" doc and then a JSON for each of the target arrays? Or do you end up concatenating the reference lists somewhere along the way?

One important possible extension to consider along with those given - after a prototype is established - is that we now have a way to pass per-chunk information (analogous to the "context" I fought for), and so can have different behaviours for each chunk, like a different zero point in offset-scale filtering.
I've come around on this, but not for exactly the same reason. I've now redacted my original proposal, which was not 100% thought through.
Certainly some parallels here but this could be done without IPFS. @alimanfoo's proposal in #82 is still a good read, despite using some now-outdated vernacular.
Again, I'm going to remove this from the proposal. But I'll just say that there are some parallels with @d-v-b's proposal to "fix zarr-python's slicing" (zarr-developers/zarr-python#1603, zarr-developers/zarr-python#980) - namely the creation of a lazy Zarr Array or ArrayView that wraps one or more Zarr arrays. If we take serialization off the table for now, we can think of this outside the spec conversation and explore how to address it at the implementation level.
I was thinking of concatenating the references but have walked this back because you have to enforce that all array metadata is equivalent (e.g. codecs) for all concatenated arrays. @rabernat is suggesting another approach which could work to resolve those concerns.
This is very similar to the kerchunk Reference File System format, but it is not exactly the same JSON format. There are also at least a few implementations of the kerchunk JSON format outside of kerchunk itself.
Would it be advantageous to use exactly the same format?
Can you please put references? They might be useful for inspiration.
I updated my comment to include one other known implementation.
@martindurant Is there a document that describes the kerchunk parquet format?
No, but I could make one.
While we can all assume what …
Another issue to consider is the Confused deputy problem: user A might think they are writing to "s3://someone-elses-bucket/path" but actually end up writing with user A's credentials to "s3://user-a-private-bucket/other/path". Similarly, user A may think they are exposing "s3://someone-elses-bucket/path" over an HTTP server but actually end up sharing data from "s3://user-a-private-bucket/other/path" or "file:///etc/passwd".
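One possible mitigation for the confused-deputy concern above is to validate every URL a manifest references against an explicit allowlist before serving or writing through it. A minimal sketch (the function name and allowlist contents are hypothetical, not part of any spec):

```python
# Hypothetical allowlist of URL prefixes a manifest is permitted to reference.
# Real implementations would also need to normalize URLs (e.g. reject ".."
# segments) before this check so the prefix test cannot be bypassed.
ALLOWED_PREFIXES = ("s3://someone-elses-bucket/path/",)

def check_url(url: str) -> str:
    """Reject manifest entries that point outside the allowed locations."""
    if not url.startswith(ALLOWED_PREFIXES):
        raise PermissionError(f"manifest references disallowed location: {url}")
    return url

check_url("s3://someone-elses-bucket/path/foo.nc")  # passes
```

The same check could run when a manifest is loaded, so a hostile manifest cannot redirect reads to `file:///etc/passwd` or writes to a private bucket.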
I think that would be very helpful.
@jbms - I have a few answers to your question of "why not use the kerchunk format":
@rabernat - missed your first comment:
Perhaps! I have not covered this use case yet above, but it could be possible. It would be tricky to update the manifest in a consistent way across multiple updates. I suggest we treat arrays with manifest storage transformers as read-only for this initial conversation.
kerchunk is amenable to change :). Especially if it can also maintain compatibility.
I can see that there are advantages to splitting but I think that is mostly orthogonal to the issue of the metadata format.
Yes, there are some idiosyncrasies, and I suppose kerchunk also assumes URLs are fsspec-compatible. Still, given that this is designed to address essentially exactly the same thing as kerchunk, I think it would be desirable to avoid fragmentation if possible. Particularly since there is mention of not just a JSON format but also a parquet format, which kerchunk also has. Maybe Martin is open to evolving the format used by kerchunk? On the other hand, given the nature of these manifest formats it is relatively easy to support multiple formats, since you can just convert one to the other when you load it.
Yes, of course: we want everything to work well together. In the current design, I suppose it's already possible to "concatenate" a kerchunk-zarr with a normal zarr. (Actually, kerchunk can also reference a zarr, so something like this was already possible on v2.)
Also worth pointing out that kerchunk's current implementation has some specific v2 stuff in it, so something will have to change for v3 no matter what. |
As I see it, this "manifest" format could be used as a key-value store adapter independent of zarr entirely, as a transparent layer below zarr that is not explicitly indicated in the zarr metadata (i.e. as kerchunk is currently used), or as a storage transformer explicitly indicated in the zarr metadata.

Re concatenation: I think, as has been discussed, that it is not an especially practical use case even with variable-size chunks, and instead we could discuss a solution for that independently, e.g. an explicit "concatenation" / "stack" extension for zarr. See this support in tensorstore for constructing virtual stacked/concatenated views: https://google.github.io/tensorstore/driver/stack/index.html
One thing that would likely be important for concatenation is the ability to specify "cropping" and other coordinate transforms -- for that the "index transform" concept in tensorstore may be relevant to consider: https://google.github.io/tensorstore/index_space.html#index-transform
I realized my last answer may have unintentionally come off as critical of the Kerchunk project. Apologies if it came across that way. Kerchunk (@martindurant) has done us all a great service by showing us what is possible here. My point above was really trying to look forward and mesh the ideas Kerchunk has introduced with the Zarr storage transformer framework, while also opening some doors for additional extensions beyond those of the Kerchunk project. Based on @martindurant's comments, it sounds like there is plenty of room to work together on what could be a new spec-compliant storage layout for Kerchunk.
Not at all, that's why we have these conversations. We already have redundant code for "view set of datasets" from xarray and dask, which have particular views on what arrays are and how they work. I will say, though, that kerchunk aims to work beyond the netCDF model alone (xr trees to start, but more complex zarr group trees too) and even beyond zarr (e.g., from the simplest, supermassive compressed CSV with embedded quoted fields, to making parquet directory hierarchies and assembling feather 2 files from buffers). Whether those ideas are worth pursuing remains to be seen, but I expect there will always be some bespoke combine logic in the kerchunk repo.
Yes, from the combine user API to reference storage formats and more.
I'm not sure I understand what this means. Can someone give a concrete example? @jhamman How hard would it be to support appending to one dimension of a chunk manifest? People are asking for that feature in VirtualiZarr (zarr-developers/VirtualiZarr#21), and I could imagine a neat interface like xarray's.
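Appending along one dimension of a chunk manifest could amount to adding entries whose chunk index along the append axis is shifted by the existing chunk count. A rough sketch, assuming a flat JSON-style manifest mapping dot-separated chunk keys to reference dicts (all names here are hypothetical):

```python
def append_manifest(manifest: dict, new_chunks: dict, axis: int, axis_len: int) -> dict:
    """Append new chunk references along one axis of a chunk manifest.

    `manifest` and `new_chunks` map chunk keys like "0.1" to reference dicts
    ({"path": ..., "offset": ..., "length": ...}); `axis_len` is the existing
    number of chunks along `axis`. Codecs/dtype compatibility checks omitted.
    """
    merged = dict(manifest)
    for key, ref in new_chunks.items():
        idx = key.split(".")
        idx[axis] = str(int(idx[axis]) + axis_len)  # shift along the append axis
        merged[".".join(idx)] = ref
    return merged

old = {"0.0": {"path": "s3://bucket/a.nc", "offset": 0, "length": 100}}
new = {"0.0": {"path": "s3://bucket/b.nc", "offset": 0, "length": 100}}
print(sorted(append_manifest(old, new, axis=0, axis_len=1)))  # ['0.0', '1.0']
```

The array's shape metadata would also need updating; this only shows the key-rewriting half of the operation.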
I'm not sure what syntax would be preferred, but let's say instead of using a JSON object we use a JSON array, e.g. the following representation for your initial example:

```json
[
  {"key": "0.0.0", "path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
  {"key": "0.0.1", "path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
  {"key": "0.1.0", "path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
  {"key": "0.1.1", "path": "s3://bucket/foo.nc", "offset": 400, "length": 100}
]
```

Then we could support "prefix" in place of "key" to map an entire prefix:

```json
[
  {"key": "0.0.0", "path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
  {"key": "0.0.1", "path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
  {"prefix": "0.1.", "path": "s3://bucket/bar."}
]
```

This would map "0.0.0" and "0.0.1" as before, but "0.1.0" would map to "s3://bucket/bar.0" and "0.1.1" would map to "s3://bucket/bar.1". It would not be permitted to specify an offset or length together with a prefix entry. A prefix entry could also point through another manifest:

```json
[
  {"key": "0.0.0", "path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
  {"key": "0.0.1", "path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
  {"prefix": "0.1.", "path": "s3://bucket/bar/manifest.json|zarr_chunk_manifest:"}
]
```

This would map "0.1.0" to "s3://bucket/bar/manifest.json|zarr_chunk_manifest:0", which would then get resolved by querying "0" within the manifest at "s3://bucket/bar/manifest.json". Since the array representation creates the possibility of conflicts between keys and prefixes, we can say that later entries always take precedence:

```json
[
  {"prefix": "", "path": "s3://bucket/baz/"},
  {"key": "0.0.0", "path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
  {"key": "0.0.1", "path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
  {"prefix": "0.1.", "path": "s3://bucket/bar/manifest.json|zarr_chunk_manifest:"}
]
```

The initial empty-prefix entry then acts as a fallback for any key not matched by a later entry.

Slightly more general than prefixes is to allow arbitrary lexicographical ranges:

```json
[
  {"key": "0.0.0", "path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
  {"key": "0.0.1", "path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
  {"min": "0.1", "max": "0.9", "strip_prefix": 2, "path": "s3://bucket/bar/"}
]
```

Any key within the range would, after stripping the first strip_prefix characters, be appended to the path. Note that any prefix entry could be represented as a range entry: a prefix of "0.1." is equivalent to {"min": "0.1.", "max": "0.1/", "strip_prefix": 4}. The reason that the max is "0.1/" is because "/" is the Unicode (and ASCII) character that follows ".". In practice I would expect implementations would handle this by converting the mappings to a sorted list of disjoint keys/ranges. Then key lookup can be done with a binary search.
Thank you @jbms! I am now following the conversation again 😅 So whilst that prefix stuff is cool, I do wonder if the additional complexity needs to be considered. The simple key-based mapping seems to cover the immediate use cases. Also it seems to me that the main use case of the prefix mapping might be better served by virtual concatenation.
I completely agree with @TomNicholas.
I have mostly been considering this JSON format in the context of general use, e.g. more as a generic key-value store adapter like the kerchunk reference filesystem, or like a zarr group storage transformer, rather than just a zarr array storage transformer in particular, because the proposed JSON representation really wasn't specific to zarr arrays at all. The prefix and lexicographical range mapping I mentioned would indeed be more useful for non-array uses, e.g. it would allow you to compose a "virtual group" from arrays or groups located in different places. For arrays one potential use would be to define a default prefix mapping (empty prefix) and then override a small number of individual chunks, as a sort of "patch" for an existing array. Other uses for non-empty prefix mappings for arrays would potentially be to override various sub-regions of the array, but indeed that would probably be better represented via the virtual concatenation proposal because doing it at the key level would be rather awkward.

As far as representing the manifest as a single "structured" array or 3 arrays --- are we talking about the on-disk format (i.e. something entirely different from the proposed JSON format), or are we talking about an in-memory representation only?

For the on-disk format, if the mapping is not expected to be sparse, and the total number of chunks is very large, then using a chunked representation in the form of a zarr array to represent the mapping could make sense (where each element of this mapping zarr array corresponds to a chunk within the logical zarr array), and indeed prefix or range mapping doesn't fit into that representation at all. Potentially this could also be viewed as a type of "array -> bytes" codec rather than a storage transformer, but I'm not sure whether that is ultimately better.
I agree that a columnar representation, where within each chunk of the mapping array the urls, offsets, and lengths are compressed independently, would be very helpful in that case. However, it would be unfortunate if the urls, offsets, and lengths for a given chunk were actually stored separately, because that would mean you need to do 3 reads instead of 1 in order to load the mappings for a given chunk, and there would be little reason to want to access the fields separately.

Even if we consider the more general mapping case (i.e. like the kerchunk reference filesystem or a zarr group storage transformer), I agree that a columnar storage format (e.g. perhaps parquet) would be advantageous, though for small mappings there are advantages to JSON. I think that prefix and lexicographical range mappings can fit pretty easily into such a format, though. For example you could have five columns: min_key, max_key, strip_prefix_length, offset, length, where for individual key mappings we use (min_key, offset, length) and for key range mappings we use (min_key, max_key, strip_prefix_length). Parquet, for example, can represent missing fields pretty efficiently, and assuming you have normalized all of the key ranges to be disjoint and ordered the entries by min_key, lookups remain efficient.
I'm still not seeing what the use case for prefixes is that couldn't be supported through redirection via chunk manifests + virtual concatenation.
I was talking about both, and linked to two separate issues in the
What on-disk data types does zarr v3 support? Seems like that page of the spec has not been written yet. I ask because in numpy there is an in-memory data type that contains the url, offset, and length all in one, and if we could save that data type to disk we would not need 3 reads.
For redirecting an entire array, for example, using a chunk manifest means that you have to fetch the list of all of the chunks, which while perhaps desirable in some cases in other cases would be unnecessary and expensive. Additionally, using a chunk manifest means that the linked array must be immutable --- support for writing is lost, and also any further changes to the linked array will, in general, break the manifest. For redirecting an entire group, this issue applies to an even greater extent.
The data types are described here: https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#data-types

In zarr v2, "structured data types" equivalent to numpy structured data types are supported, but they are not part of the zarr v3 spec. Note that numpy structured data types interleave the fields (which may or may not be good for in-memory representation, but usually is not good for on-disk representation), and numpy 2 does not currently allow StringDType in a structured dtype.
kerchunk parquet is chunked and supports fast random access. It is also efficient both on-disc and in memory.
kerchunk parquet supports writing reference sets back to the original or to a new location. You can update one chunk without altering the rest. (There are no locks on this process for multiple writers, but there could be)
This is a terrible idea for storage - there is a reason that parquet has won so comprehensively for bulk tabular data. For numpy-style structured dtypes specifically, only fixed-length string fields can be stored anyway, for which you had better have very similar values and know the max length ahead of time.
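The interleaving and fixed-length-string points above can be made concrete with a tiny numpy example. The field names and the 64-byte path limit are illustrative only:

```python
import numpy as np

# A structured dtype packing path/offset/length per chunk. numpy lays the
# fields out interleaved (record by record), and the string field must be
# fixed-length, so every record pays for the longest possible path.
ref_dtype = np.dtype([("path", "S64"), ("offset", "<u8"), ("length", "<u8")])

refs = np.zeros(2, dtype=ref_dtype)
refs[0] = (b"s3://bucket/foo.nc", 100, 100)
refs[1] = (b"s3://bucket/foo.nc", 200, 100)

print(ref_dtype.itemsize)  # 80 bytes per chunk, regardless of actual path length
```

One read would indeed fetch all three fields for a chunk, but short paths still occupy the full 64 bytes, which is part of why a columnar format compresses so much better.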
I haven't looked at it in detail, but I would indeed be inclined to think that parquet is a good choice for this use case. I think you could also add support for prefix and/or key-range mappings to the kerchunk parquet format pretty easily. For the specific case where a prefix mapping might be used, i.e. map "myarray1/" -> "s3://myarray1-bucket/" and "myarray2/" -> "s3://myarray2-bucket/", the prefix mapping will in general be much more efficient than even the most efficient non-prefix map, since it is constant space. The exception is that listing would almost surely be faster with an explicit manifest.
Basically a prefix map is analogous to a symlink to a directory, while an explicit chunk manifest (i.e. no prefix maps) would be analogous to a directory of symlinks to files. Both have uses, some use cases might be well served by either representation, and certain use cases will favor one representation over the other. With a prefix map, you don't need to perform an "indexing" step to generate the explicit manifest in the first place, and you will automatically pick up any new files that are added to the source location. To me, prefix and key-range maps seem pretty powerful since you can combine them with any other adapters (zip files, another layer of chunk manifest, etc.) supported by the URL syntax. However, I can understand that they may not be helpful for the use cases you may be thinking of, like representing an hdf5 array as a zarr array. It might be reasonable to exclude support for prefix/range maps from an initial version of this chunk manifest format, but it might be helpful to design the format with the possibility of adding that later.
I think I am arguing that prefix maps with links alone are fine, and how concat/merge can work (as in virtualizarr); and all-references like kerchunk already uses are fine; but I would not mix them. It's worth pointing out that the kerchunk spec allows for templating URLs, but the feature wasn't much used. In fact, compression on strings is such that a column of strings sharing a small number of prefixes compresses almost to the same size as those paths without prefixes.
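The compression claim above is easy to check empirically with synthetic paths (the bucket name and file pattern here are made up):

```python
import zlib

# 1000 synthetic chunk paths sharing one long prefix vs. the bare suffixes.
suffixes = [f"file_{i:04d}.nc" for i in range(1000)]
full = "\n".join("s3://my-bucket/some/long/dataset/path/" + s for s in suffixes).encode()
bare = "\n".join(suffixes).encode()

c_full = zlib.compress(full, 9)
c_bare = zlib.compress(bare, 9)
print(len(full), "->", len(c_full))   # the repeated prefix compresses away
print(len(bare), "->", len(c_bare))
```

With DEFLATE's back-references, each repeated 38-byte prefix costs only a few bits, so the full-path column ends up far smaller than its raw size and close to the suffix-only column.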
So, elephant in the room: now that the Icechunk spec is out (even if some of the details are still being settled), is there still much of a use case for the JSON-based, non-versioned chunk manifests as suggested above?
Here is an example use case for an OME-NGFF spec'd zarr. I began working through the Icechunk example workflow, but VirtualiZarr needs some fixes and better test coverage for TIFFs (opened first issue here). I have limited bandwidth but I can help with that. For Icechunk, am I understanding correctly that the data would still be accessible following the OME-NGFF spec but on disk the files would exist in the Icechunk spec? I'm sure some in the OME community would push back on that, especially since raw microscopy data should never be changed, invalidating the main value-add of Icechunk. But I'm happy if I just have something that works.
As the author of this proposal and one of the creators of Icechunk, I can say that Icechunk is where I'm going to be putting my focus. The design of Icechunk was motivated by many of the same needs described above (and more!). That doesn't mean we have to abandon this as an idea, just that I don't have plans to pursue it. I know some folks have built prototypes around this concept, so I'm happy to leave this open.
@elyall - I think you've got one detail wrong. If you are using Icechunk to store virtual references (from Kerchunk or VirtualiZarr), Icechunk will just store the pointers to your data. The original files (tiff, etc.) will remain unchanged.
This issue describes a concept for a Zarr v3 Storage Transformer to enable generic indirection between the Zarr keys and the names of the underlying objects in a store. It is not a new idea (see below), but this design is meant to cover a broader set of use cases.
Goals
Design
There has been a lot written on this subject already (see issues linked above) so I'm going to attempt to jump straight into the design. The key difference between this design and prior proposals is that the manifest will be local to the Array. The reason for this is to increase the scalability, portability, and composability of the manifest concept.
Store layout
The manifest store layout will resemble that of a regular Zarr v3 store. Consider the following directory store representation:

Note: array `a/foo` is a manifest array but array `b/baz` is a regular zarr array.

Array metadata
Manifest style arrays will need to declare a storage transformer configuration:
Note: the small manifests could also be inlined directly into the array metadata object.
Manifest object
In my example above, the array `a/foo` includes a manifest object (`a/foo/manifest.json`) which will store the mapping of chunk keys to keys in the store. `path` would be the only required key; offset/length/checksum/etc. could all be added keys to a) inform the store how to fetch bytes from the chunk or b) provide the store with additional metadata about the chunk.

Note 1: Kerchunk also supports inline data in place of the path. That could also be supported here.
Note 2: I'm using JSON as a manifest type here, but many other options exist, including Parquet or even Zarr arrays.
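To make the indirection concrete, a store adapter backed by such a manifest could resolve each chunk key to a byte range in the underlying file. A minimal read-only sketch (class and parameter names are hypothetical; `fetch` stands in for any byte-range reader such as fsspec's `cat_file`):

```python
import json

class ManifestStore:
    """Read-only store sketch that resolves chunk keys through a manifest."""

    def __init__(self, manifest: dict, fetch):
        self.manifest = manifest  # chunk key -> {"path", "offset", "length", ...}
        self.fetch = fetch        # injected callable: fetch(path, offset, length)

    def __getitem__(self, key: str) -> bytes:
        entry = self.manifest[key]  # KeyError -> chunk does not exist
        return self.fetch(entry["path"], entry.get("offset", 0), entry.get("length"))

manifest = json.loads(
    '{"0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100}}'
)

# Stub fetcher so the example runs without any remote storage:
def fake_fetch(path, offset, length):
    return b"\x00" * length

store = ManifestStore(manifest, fake_fetch)
print(len(store["0.0"]))  # 100
```

Since only `path` is required, `offset` defaults to 0 and a missing `length` could mean "read to the end of the object."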
Concatenating arrays:
Edit: Feb 6 7:20p PT - After thinking about this more, I'm beginning to think serialization of concatenated arrays is a trickier problem than should be addressed in the initial iteration here. The main tricky bit is how to combine arrays with compatible dtypes/shapes/chunks but with differing codecs. Details from my original ideas below but consider this redacted from the proposal for now.
Details
One of the goals above is to enable concatenating multiple Zarr arrays. The manifest approach supports a zero-copy way to achieve this. The concept here closely resembles the approach from [Kerchunk's MultiZarrToZarr](https://fsspec.github.io/kerchunk/tutorial.html#combine-multiple-kerchunked-datasets-into-a-single-logical-aggregate-dataset), except it targets individual arrays and could be made to work with any zarr array (not just Kerchunk references). The idea is that concatenating arrays can be done in Zarr, provided a set of constraints are met, by simply rewriting the keys. Implementations could provide an API for doing this concatenation like:
In this example, `zarr.concatenate` would act similarly to `numpy.concatenate`, returning a new `zarr.Array` object after creating the new manifest in `store_c`. This could also be done in two steps by adding a `save_manifest` method to the Zarr arrays.

Possible extensions
I've tried very hard to keep the scope of this as small as possible. There are currently few v3 storage transformers to emulate so I think the best next step is to try out this simple approach before spending too much time on a spec or elaborating on future options. That said, there are some obvious ways to extend this:
Props
🙌 to those that have done a great job pushing this subject forward already: @martindurant, @alimanfoo, @rabernat among others.