Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option to use / as as separator instead of . in spec v3. #78

Closed
Carreau opened this issue Jun 10, 2020 · 9 comments
Closed

Option to use / as as separator instead of . in spec v3. #78

Carreau opened this issue Jun 10, 2020 · 9 comments
Labels
core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec

Comments

@Carreau
Copy link
Contributor

Carreau commented Jun 10, 2020

From @alimanfoo on today's meeting.

Can we add the option to use / as as separator instead of . in spec v3 ?
This comme from the potential need on filesystem to limit the number of files in a directory, and using / would help. Even if it could be made store only this may need to be exposed through the protocol as other tools may be copying data from a filesystem to a cloud store.

My main concern would then be the difficulty to know wether a path either given or returned from a store is an implict group, array or chunk.

Say we have the following path:

/g1/g2/g3/a1/0.1.2.3.4; it would become /g1/g2/g3/a1/0/1/2/3/4 where I use here for clarity g for groups, a for array and numbers for chunks.

I think that a store or protocol implementation may have the query at least /meta/g1/g2/g3/a1/0/1/2/3/4/.array, to /meta/g1/g2/g3/a1/.array potentially in order to know how to interpret a key.

For example this may complicate things like list_dir() that now needs some understanding of the structure. Listing /g1/g2/g3/a1/0/ should typically fail (it is not really a dir/group), unless we walk up to find the .array.

Listing /g1/g2/g3/a1/ would return all the chunks in case where separator is ., only the number of chunks along the first dimension if separator is /, unless it is made aware of the separator.

This can likely be taken care of in the protocol, though that feel like extra complexity.

@alimanfoo
Copy link
Member

Thanks @Carreau.

So I originally intended to add this feature into the initial v3 spec draft, but it would seem that I only got half-way there. E.g., in the section on regular chunk grids I wrote:

The identifier for chunk with grid index (i, j, k, …) is formed by joining together ASCII string representations of each index using a separator. The default value for the separator is the period character (“.”), but this may be configured by providing a separator value within the chunk_grid metadata object, see the section on Array metadata below.

But then I forgot to document this in the section on array metadata.

I meant to add to the section on array metadata an example like this:

{
    "shape": [10000, 1000],
    "data_type": "<f8",
    "chunk_grid": {
        "type": "regular",
        "chunk_shape": [1000, 100],
        "separator": "/"
    },
    ...
}

In this example the separator "/" is given, which would mean that, e.g., a chunk with grid indices (1, 2, 3) would have chunk identifier "1/2/3". If this chunk was part of an array at path "/foo/bar" then this chunk would have the storage key "data/foo/bar/1/2/3".

@Carreau
Copy link
Contributor Author

Carreau commented Jun 10, 2020

Thanks, I think we want to be careful with arbitrary separator, in this section I thought about things like - instead of ., asn assumed that the chunk name was still a valid node name (/ is not allowed in node). Like can you use 0 as a separator ? I'll try to find out how much that affect implmnetations.

@alimanfoo
Copy link
Member

alimanfoo commented Jun 10, 2020

My main concern would then be the difficulty to know wether a path either given or returned from a store is an implict group, array or chunk.

I think here we need to be careful about terminology and make a distinction between (node) paths and storage keys. These are different concepts.

I.e., a zarr hierarchy comprises a set of nodes (arrays and groups). Each node has a path which uniquely identifies it.

Chunks are not nodes. Chunks don't have paths. Similarly, metadata documents are not nodes, they don't have paths. Only arrays and groups have paths.

Chunks and metadata documents have storage keys.

Here's a summary of these two levels:

  • arrays and groups (nodes) - paths
  • chunks and metadata documents - storage keys

Because paths and storage keys are separate concepts, changing the chunk identifier separator does not impact on the processing or handling of paths in any way. It only impacts on the code that constructs storage keys for chunks in order to store or retrieve specific chunks.

E.g., given an array with path "/foo/bar", the v3 protocol gives rules for constructing the metadata and chunk storage keys for that array. The array metadata document will be stored using the storage key "meta/root/foo/bar.array". A chunk with grid indices (1, 2, 3) would be stored using the storage key "data/foo/bar/1.2.3" by default, or alternatively "data/foo/bar/1/2/3" if the chunk identifier separator is set to "/".

So, if a zarr implementation is given a path "/foo/bar" but doesn't know whether an array or group exists, it would query the set of metadata keys present in the store. If the key "meta/root/foo/bar.array" is present, the path identifies an array.

Similarly, if an implementation was given a path like "/foo/bar/1/2/3", it would check for the presence of a metadata document with key "meta/root/foo/bar/1/2/3.array". If the key is not in the store, no array exists.

If the store also contains an array with path "/foo/bar" which has chunk keys like "data/foo/bar/1/2/3" there is no confusion, because chunk keys are never used to infer anything about the hierarchy such as what nodes are present. Only metadata keys are used for that purpose.

Hope that's vaguely clear, happy to try and elaborate.

@alimanfoo
Copy link
Member

Thanks, I think we want to be careful with arbitrary separator, in this section I thought about things like - instead of ., asn assumed that the chunk name was still a valid node name (/ is not allowed in node). Like can you use 0 as a separator ? I'll try to find out how much that affect implmnetations.

Yes there would need to be some sensible restrictions on allowed separators.

@alimanfoo
Copy link
Member

Yes there would need to be some sensible restrictions on allowed separators.

E.g., "0" would be bad because you could get chunk identifier clashes like...

  • (100, 1, 0) -> "1000100"
  • (1, 0, 100) -> "1000100"

@Carreau
Copy link
Contributor Author

Carreau commented Jun 10, 2020

I think that I missed the fact you only had check the metadata store for the existence of an array. So I agree I think we should be fine with allowing / as separator.

And yes 0 was s specific bad example chosen on purpose.

@joshmoore
Copy link
Member

Sorry to have missed the call, but to just cross-link myself from zarr-developers/zarr-python#395 (comment), I'd vote for having / as the default. I'm now up to 10s of millions of chunks in a single directory which for non-cloud purposes is essentially a non-starter. @alimanfoo's comment above seems to say that shouldn't be an issue, but just in case it might be, I thought I'd raise here.

@alimanfoo
Copy link
Member

I would have no objection to "/" as the default.

@constantinpape
Copy link

I fully agree that "/" should be the default. Actually this is one of the reasons I still fall back to n5 most of the time, because it makes (default) zarr not really usable with large files.
(I know that there's NestedDirectoryStore, but it would be still be very nice to have this by default.)

@Carreau Carreau closed this as completed Sep 25, 2020
@Carreau Carreau added the core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec label Sep 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec
Projects
None yet
Development

No branches or pull requests

4 participants