Array nesting: Add the ability to use N5-style nested layout #17
Having many millions of chunk files in a single directory causes significant performance issues for filesystems. This PR introduces a "nested" boolean to the ZarrArray class and related helpers in order to choose between the separators "." and "/". The main downside of nested storage is that one must search through the chunk index names in order to determine whether or not an array is nested.
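To make the two layouts concrete, here is a minimal sketch of how a chunk key could be built from a chunk index with either separator. The class and method names (`ChunkKeySketch`, `chunkKey`) are hypothetical and are not taken from the jzarr code base; only the choice between "." and "/" comes from the description above.

```java
import java.util.StringJoiner;

// Hypothetical helper for illustration only; not part of the jzarr API.
final class ChunkKeySketch {

    // Joins the chunk index with "/" (nested) or "." (flat).
    static String chunkKey(int[] chunkIndex, boolean nested) {
        String separator = nested ? "/" : ".";
        StringJoiner joiner = new StringJoiner(separator);
        for (int i : chunkIndex) {
            joiner.add(Integer.toString(i));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        int[] index = {0, 3, 7};
        System.out.println(chunkKey(index, false)); // prints 0.3.7
        System.out.println(chunkKey(index, true));  // prints 0/3/7
    }
}
```

With the flat key, every chunk is a file directly in the array directory; with the nested key, all but the last index component become subdirectories.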
If a ZarrArray is created but no data is written (as with bioformats2raw), then the subsequent open will fail since no chunks can be found for a proper determination.
This commit uses a `Boolean` for storing the nested state to detect whether or not a non-default value was requested. If so, then the value is stored in the .zarray metadata and will be detected on opening the array to prevent the time-consuming workaround.
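The open-time logic described above can be sketched roughly as follows. This is an assumption-laden illustration, not the code in this PR: `NestedResolutionSketch` and `resolveNested` are hypothetical names, the stored flag is modelled as a nullable `Boolean` passed in directly rather than read from .zarray, and chunk keys are assumed to be given relative to the array root.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch for illustration only; not part of the jzarr API.
final class NestedResolutionSketch {

    // storedFlag: value recorded in the metadata, or null if none was stored.
    // chunkKeys:  names found under the array root, relative to that root.
    static Optional<Boolean> resolveNested(Boolean storedFlag, List<String> chunkKeys) {
        if (storedFlag != null) {
            // Fast path: an explicit value was stored, no scan is needed.
            return Optional.of(storedFlag);
        }
        for (String key : chunkKeys) {
            if (key.startsWith(".")) {
                continue; // skip metadata entries such as .zarray and .zattrs
            }
            if (key.contains("/")) {
                return Optional.of(true);
            }
            if (key.contains(".")) {
                return Optional.of(false);
            }
            // A key with no separator (e.g. a 1-D chunk "0") decides nothing.
        }
        // No chunks written yet, or only separator-free keys: undeterminable.
        return Optional.empty();
    }
}
```

The empty result corresponds to the failure case mentioned above: an array that was created but never written to gives the scan nothing to inspect.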
Pushed a fairly significant change to store the state of the "nested" |
Make use of the new ZarrArray.nested flag that's currently open as a PR. This should significantly increase the performance of reading existing zarrs (which also happens during downsampling) when the number of chunk files reaches the millions. see: bcdev/jzarr#17
Dear Josh, I have written down my thoughts on this topic here: #19 Best Regards |
By the way ... Have a great evening! |
Thanks for the heads-up, @SabineEmbacher! I'll do some more testing on my side and get things tidied up. (After reading #19...) |
There are several ideas in #19 that I still need to think through, but for the implementation in this PR, I'm leaning towards changing |
Dear Josh, What we are talking about is the index separator char which shall be used to generate the chunk key. So this character in principle is not equal to a path separator but can be the same. I plan to implement the following: When creating a zarr array:
When opening an array:
If you agree with this plan, you don't have to do anything more to the pull request. I will then implement it as planned in the near future. Best Regards |
True, and to be honest, I don't know how the Python implementations are dealing with this cross-platform!
Sounds great. I will summarize your proposal on the Python side and we'll see if there are any further suggestions (e.g. for the name). All the best, |
Path separator consistency is one area which could possibly be improved when working with jzarr. The Zarr specification is already quite explicit when it comes to "key" uniformity expectations (basically UNIX style path semantics): Everything else is left up to the implementation and corresponding storage. Consequently, there is probably utility in establishing consistency and uniformity across the jzarr API when it comes to group keys in particular. Notably, there is currently |
Good Point! |
done |
Modify nested detection loop
Unidata repository
Having many millions of chunk files in a single directory causes
significant performance issues for filesystems. This PR introduces
a "nested" boolean to the ZarrArray class and related helpers in
order to choose between the separators "." and "/". The main
downside of nested storage is that one must search through the
chunk index names in order to determine whether or not an array
is nested.
Discussion:
A few similar strategies exist in the zarr-python code base. NestedStorage was the original attempt with the downside that it was not composable with other stores. A newer version, FSStore, allows passing a "key_separator". Here, I've chosen the boolean to reduce the burden on the caller. A type other than "boolean" ("ChunkNamer"?) might be preferred. Note also that no other implementation that I know of currently tries to detect "nested or not nested" as we're doing here. It may be overkill.
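As one possible shape for the non-boolean alternative floated above, a "ChunkNamer" could be a small interface that maps a chunk index to its key. Everything in this sketch is hypothetical: neither the interface nor its members exist in jzarr, and the design is only one of several ways such a type could look.

```java
import java.util.StringJoiner;

// Hypothetical interface for illustration only; not part of the jzarr API.
interface ChunkNamer {

    // Maps a chunk index to the key under which the chunk is stored.
    String name(int[] chunkIndex);

    // Builds a namer that joins index components with the given separator.
    static ChunkNamer withSeparator(String separator) {
        return chunkIndex -> {
            StringJoiner joiner = new StringJoiner(separator);
            for (int i : chunkIndex) {
                joiner.add(Integer.toString(i));
            }
            return joiner.toString();
        };
    }

    // The two layouts discussed in this PR.
    ChunkNamer FLAT = withSeparator(".");
    ChunkNamer NESTED = withSeparator("/");
}
```

Compared with a boolean, a type like this leaves room for other separators or naming schemes, at the cost of asking more of the caller.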