RFC: a solution for versioned Zarrs based on versioned S3 bucket #314

Open
yarikoptic opened this issue Sep 17, 2024 · 2 comments

yarikoptic commented Sep 17, 2024

Inspired by

I've decided to share the design we are currently pursuing and to ask for feedback, and possibly guidance and/or collaboration.

In the DANDI archive (https://dandiarchive.org/), where we use a versioned S3 bucket for the actual data storage, we are also working on support for versioning of Zarr filesets. Notes on the ultimate design can be found in

but in a nutshell it is centered on simple features of a versioned S3 bucket, checksums over the files in a Zarr, and a "manifest" file that records the keys/versionIds for a given version of the Zarr (so the ideas are similar to git itself). In more detail:

  • for a Zarr archive on S3 we collect a "manifest" file with the S3 versionIds and ETags (checksums) of all files/keys present in the Zarr at its current version
  • based on the individual ETags we can compute a deterministic ETag for the entire Zarr archive at any given version
  • upon changes to the Zarr (including deletions) a new manifest is produced, named after the new checksum (so it is like a git tree object pointing to individual file/subtree objects)
  • given the manifest for a specific version of a Zarr, we can redirect to the corresponding versioned URLs on S3, thus providing access to that particular version of the Zarr
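The four points above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the DANDI implementation: the manifest layout, the function names, the SHA-256 based aggregate checksum, and the bucket/prefix values are all hypothetical. The only S3 feature it relies on is that an object version can be addressed with a `?versionId=` query parameter.

```python
import hashlib
from urllib.parse import quote


def build_manifest(entries):
    """Build a manifest from (key, version_id, etag) tuples.

    The layout (key -> {"versionId", "etag"}) is an assumption made
    for this sketch, not a format defined anywhere.
    """
    return {
        key: {"versionId": version_id, "etag": etag}
        for key, version_id, etag in sorted(entries)
    }


def zarr_checksum(manifest):
    """Deterministic digest over keys and per-file ETags.

    Hypothetical scheme: SHA-256 over sorted "key\\0etag" lines.
    versionIds are deliberately left out, so the digest depends
    only on content.
    """
    digest = hashlib.sha256()
    for key in sorted(manifest):
        digest.update(f"{key}\0{manifest[key]['etag']}\n".encode("utf-8"))
    return digest.hexdigest()


def manifest_name(manifest):
    """Name the manifest after its checksum, like a git tree object."""
    return f"{zarr_checksum(manifest)}.json"


def versioned_url(bucket, prefix, key, manifest):
    """Resolve a key to the S3 URL of the version pinned by the manifest."""
    version_id = manifest[key]["versionId"]
    path = quote(f"{prefix}/{key}")
    return (
        f"https://{bucket}.s3.amazonaws.com/{path}"
        f"?versionId={quote(version_id, safe='')}"
    )
```

A redirecting service would then look up the manifest for the requested Zarr version and answer each request for a key with a redirect to `versioned_url(...)`.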

To show feasibility of such approach we provide

But I wonder: is there a way, or a need, to formalize some "Zarr manifest" listing which could then be reused across solutions? I am not sure it belongs at the level of storage transformers since, IMHO, it should be a specification on top of a Zarr instance rather than a specification within Zarr. WDYT?

@yarikoptic changed the title from "FTR: a solution for versioned Zarrs based on versioned S3 bucket" to "RFC: a solution for versioned Zarrs based on versioned S3 bucket" on Sep 17, 2024
@rabernat
Contributor

Hi @yarikoptic, thanks for sharing this! Looks cool!

FYI, we are planning on open sourcing the solution we have built at Earthmover later this fall.

@rabernat
Contributor

Hi folks! We released our project! You can read all about it here: https://icechunk.io/
