
feat: deterministic Zarr checksum #206

Closed

Conversation

@hmacdope hmacdope commented Oct 8, 2024

Changelogs

  • Sort the directories of the dataset before recursively_build_manifest walks them, so the Zarr manifest file, and therefore its MD5 checksum, is generated in a deterministic order.
  • Add a test that recreates the manifest file multiple times and checks that its checksum is stable.

Checklist:

  • Was this PR discussed in an issue? It is recommended to first discuss a new feature in a GitHub issue before opening a PR.
  • Add tests to cover the fixed bug(s) or the newly introduced feature(s) (if appropriate).
  • Update the API documentation if a new function is added, or an existing one is deleted.
  • Write concise and explanatory changelogs above.
  • If possible, assign one of the following labels to the PR: feature, fix, chore, documentation or test (or ask a maintainer to do it for you).

Fixes #188

  • This should (hopefully) take care of file ordering making Zarr manifest checksums non-deterministic.

  • A specific case where you have encountered non-determinism may be helpful, as I struggled to generate a failing counterexample on main.

@hmacdope hmacdope requested a review from cwognum as a code owner October 8, 2024 10:20
@Andrewq11 Andrewq11 left a comment


Thanks for taking the time to open this PR, @hmacdope. You're one of the first external contributors to Polaris and we appreciate your interest in the project.

I've left a comment regarding the use of os.walk. While it would ensure deterministic chunk ordering, it could significantly increase memory usage by loading all file names into memory. Ideally, we’re looking for a solution that avoids this, which I understand can be tricky.

Let me know what you think and thanks again for the contribution.

The inline review comment below is attached to this excerpt from the PR diff:

```python
    root_path: The path to the root of the directory to walk
    """

    for dir_path, dirs, file_names in os.walk(root_path):
```
@Andrewq11 (Contributor) commented:
The main blocker here is that file_names will be held in-memory for each directory that is traversed. Depending on the size of the dataset and the chunking mechanism/pattern chosen by the dataset creator, it is possible to have a directory which contains enough files at a single level to exhaust one's system memory.

We actually explored this idea during our first implementation. Apologies, I probably should've explained this more clearly in the issue.

@hmacdope (Author) replied:
Ah no worries at all! I see why this would be problematic with all the zarr chunks. Feel free to close this PR as it won't address the issue.

While researching this issue I came across zarr_checksum (https://github.com/dandi/zarr_checksum/tree/master), which looks like it does what you need and seems to have some of the Zarr maintainers involved. If that looks like an acceptable solution, perhaps I could give it a try? Perhaps the important question is whether the manifest file system is already deployed (do we need to support two architectures?). Keen to hear your thoughts.

I would love to keep contributing if there are any issues I spot that I can help with. I would also love to contribute to support for XL competitions and benchmarks (with my ASAPDiscovery hat on), so when the time comes I will try to help out.

@Andrewq11 Andrewq11 commented Oct 15, 2024

Building for the scale we hope to support has made for some interesting but tricky problems! It seems like Cas shared some additional context here that I hope clarifies this checksum work a bit.

As for your other questions, the first version of the manifest file system is deployed and being used for V1 datasets which leverage Zarr columns. We're actively developing the second version of the system for V2 datasets, and it will diverge quite strongly in how it works relative to the current version. We plan to maintain both systems for the foreseeable future as deprecating the current system would make it challenging to ensure backward/forward compatibility.

Given this ongoing work, I'm going to close this PR, but we're looking forward to working together when you get the time! Thanks again for taking the time to work on this, and if you ever have any questions or need additional context, please feel free to reach out.

@Andrewq11 Andrewq11 closed this Oct 15, 2024
Linked issue: Generate a deterministic checksum for DatasetV2 Zarr manifests