Design doc for Zarr versioning/publishing support via Zarr Manifest Files #1892

Draft
wants to merge 11 commits into
base: master

jwodder
Member

@jwodder jwodder commented Mar 18, 2024

Unresolved issues:

  • TODO: add reference to CSV with sample sizes of the manifests https://gist.github.com/jwodder/4e9c6e846639b6d5be2b9ab7f8302166

  • The design of the fields key and entry arrays (copied from YOH's prior art) is odd. @jwodder does not foresee any circumstances in which fields would have anything other than the recommended value (and fields values other than the recommended would be tricky to support in dandidav), but @yarikoptic insists on an explicit description in some way and on "future proofing", so we might want to change the manifest file format in one of the following ways:

    • Eliminate the fields key and define the elements of entry arrays to always be version ID, last modified, size, and ETag in that order

      • Con: Manifest files become harder to read for people not sufficiently familiar with the format definition
    • Eliminate the fields key and change the entry arrays to objects with versionId, lastModified, size, and ETag fields

      • Con: Much larger manifest files
    • Eliminate the fields key and define the elements of entry arrays to always be version ID, last modified, size, and ETag in that order, while also adding an @schema URL pointing to a versioned JSON Schema for the manifest file that describes those fields.

  • Describe Archive behavior when publishing Dandisets with Zarrs (See comments below)

    • When a Dandiset is published, presumably the Zarr IDs of the Zarrs in the published version and in the draft version will have to diverge at some point — either at the moment of publication or when each draft Zarr is first changed after publication. When exactly will this happen? Which version gets the "old" Zarr IDs?
      • @yarikoptic: When a Dandiset is published, all assets and all their metadata records are "frozen", and we have access to the corresponding Zarr checksums. So the published Dandiset version would keep the same Zarr ID together with the corresponding checksum.
        Then, whenever the Zarr is modified, it gets a new .version, and some new asset would get that .zarr_version and possibly be published later.
      • How can we reduce the burden on backups2datalad when backing up a published Zarr? If done naïvely, the program will have to recreate already-backed up Zarrs as soon as the Zarr IDs "split."
        • @yarikoptic: if we somehow record the Zarr checksum for the commit (in a git note? in the commit message description, like datalad run does?), we can discover/tag the desired "released" commit. Possible gotcha: if we missed that moment and have no such commit, then recreate from the closest commit by date in a branch off the main trunk.
  • DANDI API changes necessary to support zarr workflows

  • Changes to metadata model (dandischema)

    • Adjustments to contentUrl for Zarrs? (See comments below; might not need a model change)
  • Address garbage collection of old Zarr versions

  • Consider caching/storing manifests within the DB itself as well.
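
To make the size trade-off between the entry-array and entry-object options above concrete, here is a rough sketch that serializes the same entries both ways and compares the byte counts. The four field names come from the list above; the sample values, the helper names, and the use of compact JSON are assumptions made purely for illustration, not the actual manifest format.

```python
import json

# Field order as discussed above; the sample values below are made up.
FIELDS = ["versionId", "lastModified", "size", "ETag"]


def as_arrays(entries):
    """Positional encoding: each entry is a bare array in FIELDS order."""
    return [[e[f] for f in FIELDS] for e in entries]


def as_objects(entries):
    """Self-describing encoding: each entry repeats every field name."""
    return [{f: e[f] for f in FIELDS} for e in entries]


def encoded_size(payload):
    """Number of bytes in the compact JSON serialization of payload."""
    return len(json.dumps(payload, separators=(",", ":")).encode("utf-8"))


# Hypothetical entries; real version IDs and ETags would come from S3.
sample = [
    {
        "versionId": f"v{i:031d}",
        "lastModified": "2024-03-18T00:00:00+00:00",
        "size": 1024 + i,
        "ETag": f"{i:032x}",
    }
    for i in range(1000)
]

array_bytes = encoded_size(as_arrays(sample))
object_bytes = encoded_size(as_objects(sample))
print(f"arrays: {array_bytes} bytes, objects: {object_bytes} bytes")
```

The object form repeats the four key names in every entry, so its overhead grows linearly with the number of entries; compression would shrink the difference, which this sketch does not measure.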

CC @yarikoptic @dandi/dandiarchive

@jwodder jwodder added the design-doc (Involves creating or discussing a design document) and zarr (Issues with Zarr hosting/processing/etc.) labels Mar 18, 2024
@yarikoptic
Member

  • Do we want the Archive to store separate manifest files for each version of each Zarr?

Yes. This is one of the main motivations behind this feature: it lets us "cheaply" support versioning of Zarrs and thereby allow their release. ATM nobody can publish/release a versioned Dandiset if there is a Zarr in it. If we have a version-specific manifest for a Zarr, we can get to that manifest/version without a problem, and thus can release the Zarr and access that specific version of it.

@jwodder
Member Author

jwodder commented Mar 18, 2024

@yarikoptic How should manifests for different versions of the same Zarr be organized in S3? Naming the files after the Zarr checksum like in your repository isn't guaranteed to be stable, as a user could create a Zarr, then change one entry, then change that entry back to its original contents, resulting in an S3 object with three versions while there are only two different checksums for the Zarr across its history.

was already generated, the newer file shall replace the older.

Manifest files shall also be generated for all Zarrs already in the Archive
when this feature is first implemented.
Member

After the versioning support added above, a description should follow of what is to happen when publishing Dandisets that contain Zarrs.

We need to review/analyze what should now happen for Zarr records or assets so that we capture version (checksum) information for a Zarr whenever it becomes part of a released Dandiset. For blobs this is easy, since blobs are not mutable. But since a Zarr can have multiple versions, we need to make sure that the published asset has a versionId for the Zarr that does not change when that asset's Zarr is modified in the draft version.

@yarikoptic
Member

> then change that entry back to its original contents, resulting in an S3 object with three versions while there are only two different checksums for the Zarr across its history.

I think that should be fine, since we would only care about a manifest with a specific checksum of the content. So, similarly to blobs, we would pretty much have "content-addressable storage" (of Zarrs). Then, given a Zarr ID + checksum (as stored in the DB, or in asset dumps that include the Zarr ID and its checksum), we can get the needed manifest and provide access to that particular version, possibly the same one at different points in the version history.
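
A small sketch of that lookup, under the assumption that manifests are keyed purely by (Zarr ID, checksum). The `ManifestStore` class, the in-memory dict, and the checksum strings are hypothetical stand-ins, not the Archive's actual storage layout.

```python
from dataclasses import dataclass, field


@dataclass
class ManifestStore:
    """In-memory stand-in for manifest storage keyed by (zarr_id, checksum)."""

    manifests: dict = field(default_factory=dict)

    def put(self, zarr_id, checksum, manifest):
        # Same content gives the same checksum and therefore the same key;
        # a newer manifest for an existing key replaces the older one.
        self.manifests[(zarr_id, checksum)] = manifest
        return (zarr_id, checksum)

    def get(self, zarr_id, checksum):
        return self.manifests[(zarr_id, checksum)]


store = ManifestStore()
zarr_id = "057f84d5-a88b-490a-bedf-06f3f50e9e62"

# Create the Zarr, change one entry, then change it back (made-up checksums).
store.put(zarr_id, "checksum-A", {"entry": "original"})
store.put(zarr_id, "checksum-B", {"entry": "changed"})
store.put(zarr_id, "checksum-A", {"entry": "original"})

print(len(store.manifests))
```

Three states in the history end up as two stored manifests, and an asset that records the Zarr ID plus `checksum-A` resolves to the same manifest whether it refers to the first or the third state.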

Member

@yarikoptic left a comment

That is a great start, but now comes the hard part: an analysis of the current API/behavior and a proposal for how to augment it so that we have a clear association between a Dandiset asset and the version of the Zarr for that asset, what happens on publish, and how people (and the web UI) would access published versions of Dandisets with the corresponding versions of Zarrs.

doc/design/zarr-manifests.md (two resolved review threads)
@jwodder jwodder force-pushed the zarr-manifest-design branch from f1b4a62 to 82eb89e Compare March 22, 2024 13:04
- Document needed changes to dandidav?
- The bucket for the Archive instance will now be given on the command line (only required if a custom/non-default API URL is given)
- The bucket's region will have to be looked up & stored before starting the webserver
- Zarrs under `/dandisets/` will no longer determine their S3 location via `contentUrl`; instead, they will combine the Archive's bucket & region with the Zarr ID in the asset properties (templated into "zarr/{zarr_id}/")
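
The templating described in the last item could look roughly like the sketch below. This is illustrative Python, not dandidav's actual code: the function names, the bucket name, and the region are placeholders, and the URL uses the standard S3 virtual-hosted style.

```python
def zarr_s3_prefix(zarr_id: str) -> str:
    """Key prefix under which a Zarr's entries live, per the template above."""
    return f"zarr/{zarr_id}/"


def zarr_s3_url(bucket: str, region: str, zarr_id: str) -> str:
    """Virtual-hosted-style S3 URL for the Zarr's prefix."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{zarr_s3_prefix(zarr_id)}"


# Placeholder bucket and region; the real values would come from the
# command line and from the region lookup done at startup.
url = zarr_s3_url("example-bucket", "us-east-1", "057f84d5-a88b-490a-bedf-06f3f50e9e62")
print(url)
```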
Member

@yarikoptic commented Mar 22, 2024

Note to ourselves: requests to modify a Zarr from a previous non-draft version might be "hard to impossible", since they would pretty much cause a "race condition" between different versions if modified in parallel, or otherwise be very inefficient, since they would require large "diff" uploads. It would boil down to having the Zarr in its mutable form assigned to just a single path in a single Dandiset (like now), as it must not be changed from multiple Dandisets/locations. But it could still reside in multiple Dandisets, and even be published in that original version!

Alternatives (just thinking out loud):

  • Operations on a Zarr would act as "patch" operations on a specific version (manifest): simply provide a new key + versionId + ... on S3, and on finalize save a patched copy of the prior manifest without doing a full sweep of the bucket.

    • Cons: a more complex implementation (??? maybe not); a Zarr operation must be completed "in full"; the Zarr on S3 "as is" might not be a legit Zarr to be used directly if it was ever modified for multiple versions; needs to be thought through better.
    • Pros: support for modifying any version of a Zarr; (much) more efficient finalize, since it would just modify the prior manifest with the changes without doing a full sweep of the prefix.
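
A minimal sketch of what such a "patch" finalize might do, assuming (hypothetically) that a manifest can be viewed as a flat mapping from entry path to `[versionId, lastModified, size, ETag]`. The real manifest layout, the function name, and the sample values are not taken from the design doc.

```python
def patch_manifest(prior, upserts, deletions=()):
    """Build a new manifest from a prior one without re-listing the prefix."""
    patched = dict(prior)  # shallow copy; the prior manifest stays untouched
    for path in deletions:
        patched.pop(path, None)  # drop entries removed in this operation
    patched.update(upserts)  # add new entries / replace changed ones
    return patched


# Made-up entries: path -> [versionId, lastModified, size, ETag]
v1 = {
    "0/0": ["ver-a", "2024-03-18T00:00:00+00:00", 100, "etag-a"],
    "0/1": ["ver-b", "2024-03-18T00:00:00+00:00", 200, "etag-b"],
}
v2 = patch_manifest(
    v1,
    upserts={"0/1": ["ver-c", "2024-03-22T00:00:00+00:00", 250, "etag-c"]},
    deletions=["0/0"],
)
print(sorted(v1), sorted(v2))
```

The cost is proportional to the size of the patch plus one copy of the prior manifest, rather than to a listing of every object under the prefix, and the prior version's manifest remains valid.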

Member

Another "mind strike" related to the above: we need an association of a Zarr with a Dandiset for editing, to ensure ownership/rights to modify. This is somewhat different from blobs, which we do not allow to be modified, so people just upload new ones. Overall, it feels like we need some way to distinguish a "canonical asset of a Zarr" (which can still be modified) ... more thinking needed.

Member Author

@yarikoptic

> requests to modify a Zarr from a previous non draft version

What are you talking about? A "non draft version" is a published version, and published versions and their contents can't be modified.

> we need association of zarr to a dandiset

Each Zarr is already associated with a Dandiset. You can see a Zarr's Dandiset by requesting /zarr/{zarr_id}/.

Member

The idea/hope was that we could "break" the need for association with a particular Dandiset. Then the same Zarr could be present in multiple Dandisets, so the "draft" versions could diverge between two Dandisets, and changes to the same Zarr in one Dandiset could "race" with changes in another.

Indeed, the problem should not manifest itself if we keep a Zarr associated with just a single Dandiset. But that is "suboptimal", since it would disallow (cheaply) creating Dandisets with assets (content) from another. And we have already had such use cases, at least for read-only "mix-in" of Zarrs from other Dandisets... so we should see whether we could support that through these proposed changes.


* Publishing Zarrs: Just ensure that the `zarr_version` in Zarr assets is frozen and that no entries/S3 object versions from the referenced version are ever deleted ?

* Does garbage collection of old Zarr versions need to be discussed?
Member

I think it should indeed be touched on. We should also mention that "trailing delete" is to be disabled for the /zarr/ prefix on S3.
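
As a starting point for that discussion, garbage collection could be framed as a set difference: any S3 object version that no retained manifest references is a candidate for deletion. The sketch below assumes the same hypothetical flat manifest shape (path mapped to an entry whose first element is the version ID); the function name and all values are made up.

```python
def collectable_versions(all_versions, retained_manifests):
    """Return (path, version_id) pairs not referenced by any retained manifest."""
    referenced = {
        (path, entry[0])  # entry[0] is assumed to hold the S3 version ID
        for manifest in retained_manifests
        for path, entry in manifest.items()
    }
    return sorted(set(all_versions) - referenced)


# Made-up listing of every object version under one Zarr's prefix.
all_versions = [("0/0", "ver-a"), ("0/1", "ver-b"), ("0/1", "ver-c")]

# Manifests that must stay resolvable, e.g. published and current draft.
published = {"0/0": ["ver-a"], "0/1": ["ver-b"]}
draft = {"0/0": ["ver-a"], "0/1": ["ver-b"]}

print(collectable_versions(all_versions, [published, draft]))
```

Here only `("0/1", "ver-c")` is unreferenced. Anything referenced by a published version's manifest has to be excluded from collection, which is the "never deleted" requirement in the item above.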

@yarikoptic yarikoptic self-assigned this Apr 30, 2024
@yarikoptic yarikoptic changed the title from "Design doc for generating Zarr Manifest Files" to "Design doc for Zarr versioning/publishing support via Zarr Manifest Files" Aug 13, 2024
Two "end-points" within that namespace are provided:

- [webdav.dandiarchive.org/zarrs](https://webdav.dandiarchive.org/zarrs) -- all Zarrs across all dandisets, possibly with multiple versions. E.g. see [zarrs/057/f84/057f84d5-a88b-490a-bedf-06f3f50e9e62](https://webdav.dandiarchive.org/zarrs/057/f84/057f84d5-a88b-490a-bedf-06f3f50e9e62) which ATM has 3 versions.
- [webdav.dandiarchive.org/dandisets/](https://webdav.dandiarchive.org/dandisets/)`{dandiset_id}/{version}/{path}/`. E.g. for aforementioned Zarr - https://webdav.dandiarchive.org/dandisets/000026/draft/sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS09_stain-Somatostatin_SPIM.ome.zarr/ -- a specific version (the latest, currently [6efea0a8e95e67ecb5af7aa028dece14-18147--30560865836](https://webdav.dandiarchive.org/zarrs/057/f84/057f84d5-a88b-490a-bedf-06f3f50e9e62/6efea0a8e95e67ecb5af7aa028dece14-18147--30560865836.zarr/)).
Member Author

@yarikoptic Why is this mentioned here? The /dandisets/ hierarchy gets its information directly from S3; it does not use the Zarr manifest files.

Member

@yarikoptic commented Aug 15, 2024

Ah! Thanks for spotting; good to know, since I assumed that this one also gets it from the manifests! So we need something like

Suggested change
- [webdav.dandiarchive.org/dandisets/](https://webdav.dandiarchive.org/dandisets/)`{dandiset_id}/{version}/{path}/`. E.g. for aforementioned Zarr - https://webdav.dandiarchive.org/dandisets/000026/draft/sub-I48/ses-SPIM/micr/sub-I48_ses-SPIM_sample-BrocaAreaS09_stain-Somatostatin_SPIM.ome.zarr/ -- a specific version (the latest, currently [6efea0a8e95e67ecb5af7aa028dece14-18147--30560865836](https://webdav.dandiarchive.org/zarrs/057/f84/057f84d5-a88b-490a-bedf-06f3f50e9e62/6efea0a8e95e67ecb5af7aa028dece14-18147--30560865836.zarr/)).
- [webdav.dandiarchive.org/dandisets/](https://webdav.dandiarchive.org/dandisets/)`{dandiset_id}/{version}/{path}/`. Although it provides access to Zarr files, it **does not** use manifest files but gets its listing directly from S3, so it points only to the most recent version (possibly not even finalized yet during upload).

?

But given that "not finalized" aspect, we might (not yet sure) want to change that WebDAV behavior and also use the manifest file here, to ensure access to a legit version of the Zarr (a non-finalized one might be partial, etc.).

@yarikoptic yarikoptic force-pushed the zarr-manifest-design branch from 9d425c9 to 6a1a1d9 Compare August 20, 2024 20:39
jwodder and others added 8 commits August 27, 2024 14:38
Signed-off-by: Yaroslav Halchenko <debian@onerussian.com>
@yarikoptic yarikoptic force-pushed the zarr-manifest-design branch from 6a1a1d9 to 1e15c75 Compare August 27, 2024 19:18