I've decided to share an ongoing design we are pursuing, and to seek feedback, guidance, and/or collaboration.
In the DANDI archive (https://dandiarchive.org/), where we use a versioned S3 bucket for the actual data storage, we are also working to allow versioning of Zarr filesets. Notes on the ultimate design can be found in
but in a nutshell it is centered around simple aspects of an S3 versioned bucket: checksums over the files in a Zarr, and a "manifest" file that records the keys/versionIds for a given version of the Zarr (so the ideas are similar to those of git itself). In more detail:
- For a Zarr archive on S3 we collect a "manifest" file with the S3 key versionIds and ETags (checksums) for all files/keys present in the Zarr at its current version.
- Based on the individual ETags, we can compute a deterministic ETag for the entire Zarr archive at any given version.
- Upon changes to the Zarr (including deletions), a new manifest is produced with a name corresponding to the new checksum (so it is like a git tree object pointing to individual file/subtree objects).
- Given a manifest for a specific version of a Zarr, we can redirect to the specific versioned URLs on S3, thus providing access to that particular version of the Zarr.
- More efficient access, without redirects, could be implemented in custom Zarr access libraries or at, e.g., the fsspec level.
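
To make the points above concrete, here is a minimal sketch in Python of how such a manifest could be built and used. It is an illustration only, not DANDI's actual manifest format or checksum algorithm: the `Entry` record, the JSON layout, the use of SHA-256 as the aggregate digest, and the `<checksum>.json` naming are all assumptions made for this example. The only S3 behavior relied upon is that an object version can be addressed with the `versionId` query parameter.

```python
import hashlib
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class Entry:
    """One object of the Zarr at a given version (hypothetical record layout)."""
    key: str         # path relative to the Zarr root, e.g. "0/0"
    version_id: str  # S3 VersionId of that object
    etag: str        # S3 ETag of that object


def zarr_checksum(entries):
    """Aggregate digest over all (key, ETag) pairs; sorting makes it order-independent.

    SHA-256 is an assumption of this sketch, not the digest used by DANDI.
    """
    digest = hashlib.sha256()
    for e in sorted(entries, key=lambda e: e.key):
        digest.update(f"{e.key}\0{e.etag}\n".encode("utf-8"))
    return digest.hexdigest()


def build_manifest(entries):
    """Return (manifest_name, manifest); the name is derived from the checksum."""
    checksum = zarr_checksum(entries)
    manifest = {
        "checksum": checksum,
        "entries": {
            e.key: {"versionId": e.version_id, "etag": e.etag}
            for e in sorted(entries, key=lambda e: e.key)
        },
    }
    return f"{checksum}.json", manifest


def versioned_url(manifest, bucket_url, zarr_prefix, key):
    """URL of `key` as it existed in the Zarr version described by `manifest`."""
    version_id = manifest["entries"][key]["versionId"]
    return f"{bucket_url}/{zarr_prefix}/{quote(key)}?versionId={quote(version_id, safe='')}"
```

A deletion needs no special handling in this sketch: the deleted key is simply absent from the next manifest, which changes the aggregate checksum and therefore the manifest name. A redirecting service (or an fsspec-level filesystem) would call something like `versioned_url` for each requested key.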
But I wondered: is there a way, or a need, to formalize some "zarr manifest" listing which could then be reused across solutions? I am not quite sure it belongs at the level of storage transformers since, IMHO, it should rather be a specification on top of a Zarr instance, as opposed to a specification within Zarr. WDYT?
yarikoptic changed the title from "FTR: a solution for versioned Zarrs based on versioned S3 bucket" to "RFC: a solution for versioned Zarrs based on versioned S3 bucket" on Sep 17, 2024
Inspired by
To show the feasibility of such an approach we provide