-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Option to use /
as as separator instead of .
in spec v3.
#78
Comments
Thanks @Carreau. So I originally intended to add this feature into the initial v3 spec draft, but it would seem that I only got half-way there. E.g., in the section on regular chunk grids I wrote:
But then I forgot to document this in the section on array metadata. I meant to add to the section on array metadata an example like this: {
"shape": [10000, 1000],
"data_type": "<f8",
"chunk_grid": {
"type": "regular",
"chunk_shape": [1000, 100],
"separator": "/"
},
...
} In this example the separator "/" is given, which would mean that, e.g., a chunk with grid indices (1, 2, 3) would have chunk identifier "1/2/3". If this chunk was part of an array at path "/foo/bar" then this chunk would have the storage key "data/foo/bar/1/2/3". |
Thanks, I think we want to be careful with arbitrary separator, in this section I thought about things like |
I think here we need to be careful about terminology and make a distinction between (node) paths and storage keys. These are different concepts. I.e., a zarr hierarchy comprises a set of nodes (arrays and groups). Each node has a path which uniquely identifies it. Chunks are not nodes. Chunks don't have paths. Similarly, metadata documents are not nodes, they don't have paths. Only arrays and groups have paths. Chunks and metadata documents have storage keys. Here's a summary of these two levels:
Because paths and storage keys are separate concepts, changing the chunk identifier separator does not impact on the processing or handling of paths in any way. It only impacts on the code that constructs storage keys for chunks in order to store or retrieve specific chunks. E.g., given an array with path "/foo/bar", the v3 protocol gives rules for constructing the metadata and chunk storage keys for that array. The array metadata document will be stored using the storage key "meta/root/foo/bar.array". A chunk with grid indices (1, 2, 3) would be stored using the storage key "data/foo/bar/1.2.3" by default, or alternatively "data/foo/bar/1/2/3" if the chunk identifier separator is set to "/". So, if a zarr implementation is given a path "/foo/bar" but doesn't know whether an array or group exists, it would query the set of metadata keys present in the store. If the key "meta/root/foo/bar.array" is present, the path identifies an array. Similarly, if an implementation was given a path like "/foo/bar/1/2/3", it would check for the presence of a metadata document with key "meta/root/foo/bar/1/2/3.array". If the key is not in the store, no array exists. If the store also contains an array with path "/foo/bar" which has chunk keys like "data/foo/bar/1/2/3" there is no confusion, because chunk keys are never used to infer anything about the hierarchy such as what nodes are present. Only metadata keys are used for that purpose. Hope that's vaguely clear, happy to try and elaborate. |
Yes there would need to be some sensible restrictions on allowed separators. |
E.g., "0" would be bad because you could get chunk identifier clashes like...
|
I think that I missed the fact you only had check the metadata store for the existence of an array. So I agree I think we should be fine with allowing / as separator. And yes 0 was s specific bad example chosen on purpose. |
Sorry to have missed the call, but to just cross-link myself from zarr-developers/zarr-python#395 (comment), I'd vote for having |
I would have no objection to "/" as the default. |
I fully agree that "/" should be the default. Actually this is one of the reasons I still fall back to n5 most of the time, because it makes (default) zarr not really usable with large files. |
From @alimanfoo on today's meeting.
Can we add the option to use
/
as as separator instead of.
in spec v3 ?This comme from the potential need on filesystem to limit the number of files in a directory, and using
/
would help. Even if it could be made store only this may need to be exposed through the protocol as other tools may be copying data from a filesystem to a cloud store.My main concern would then be the difficulty to know wether a path either given or returned from a store is an implict group, array or chunk.
Say we have the following path:
/g1/g2/g3/a1/0.1.2.3.4
; it would become/g1/g2/g3/a1/0/1/2/3/4
where I use here for clarityg
for groups,a
for array and numbers for chunks.I think that a store or protocol implementation may have the query at least
/meta/g1/g2/g3/a1/0/1/2/3/4/.array
, to/meta/g1/g2/g3/a1/.array
potentially in order to know how to interpret a key.For example this may complicate things like
list_dir()
that now needs some understanding of the structure. Listing/g1/g2/g3/a1/0/
should typically fail (it is not really a dir/group), unless we walk up to find the.array
.Listing
/g1/g2/g3/a1/
would return all the chunks in case where separator is.
, only the number of chunks along the first dimension if separator is/
, unless it is made aware of the separator.This can likely be taken care of in the protocol, though that feel like extra complexity.
The text was updated successfully, but these errors were encountered: