
feat: deterministic Zarr checksum #206

Closed

Conversation

@hmacdope hmacdope commented Oct 8, 2024

Changelogs

  • Sort the directories of the dataset before recursively_build_manifest walks them, so the Zarr manifest file, and therefore its MD5 checksum, is generated in a deterministic order.
  • Add a test that recreates the manifest file multiple times and checks that its checksum is stable.

Checklist:

  • Was this PR discussed in an issue? It is recommended to first discuss a new feature in a GitHub issue before opening a PR.
  • Add tests to cover the fixed bug(s) or the newly introduced feature(s) (if appropriate).
  • Update the API documentation if a new function is added, or an existing one is deleted.
  • Write concise and explanatory changelogs above.
  • If possible, assign one of the following labels to the PR: feature, fix, chore, documentation or test (or ask a maintainer to do it for you).

Fixes #188

  • This should (hopefully) take care of file ordering making Zarr manifest checksums non-deterministic.

  • A specific case where you have encountered non-determinism may be helpful, as I struggled to generate a failing counterexample on main.

@hmacdope hmacdope requested a review from cwognum as a code owner October 8, 2024 10:20
@Andrewq11 Andrewq11 left a comment


Thanks for taking the time to open this PR, @hmacdope. You're one of the first external contributors to Polaris and we appreciate your interest in the project.

I've left a comment regarding the use of os.walk. While it would ensure deterministic chunk ordering, it could significantly increase memory usage by loading all file names into memory. Ideally, we’re looking for a solution that avoids this, which I understand can be tricky.

Let me know what you think and thanks again for the contribution.

The inline review comment below is attached to this excerpt from the PR diff:

```python
    root_path: The path to the root of the directory to walk
    """

    for dir_path, dirs, file_names in os.walk(root_path):
```
@Andrewq11 (Contributor) commented:
The main blocker here is that file_names will be held in-memory for each directory that is traversed. Depending on the size of the dataset and the chunking mechanism/pattern chosen by the dataset creator, it is possible to have a directory which contains enough files at a single level to exhaust one's system memory.

We actually explored this idea during our first implementation. Apologies, I probably should've explained this more clearly in the issue.

@hmacdope (Author) replied:
Ah no worries at all! I see why this would be problematic with all the zarr chunks. Feel free to close this PR as it won't address the issue.

While researching this issue I came across zarr_checksum (https://github.com/dandi/zarr_checksum/tree/master), which looks like it does what you need and seems to have some of the Zarr maintainers involved. If that looks like an acceptable solution, perhaps I could give it a try? Perhaps the important question is whether the manifest file system is already deployed (do we need to support two architectures?). Keen to hear your thoughts.

I would love to keep contributing if there are any issues I spot that I can help with. I would also love to contribute to support for XL competitions and benchmarks (with my ASAPDiscovery hat on), so when the time comes I will try to help out.

@Andrewq11 Andrewq11 commented Oct 15, 2024

Building for the scale we hope to support has made for some interesting but tricky problems! It seems like Cas shared some additional context here that I hope clarifies this checksum work a bit.

As for your other questions, the first version of the manifest file system is deployed and being used for V1 datasets which leverage Zarr columns. We're actively developing the second version of the system for V2 datasets, and it will diverge quite strongly in how it works relative to the current version. We plan to maintain both systems for the foreseeable future as deprecating the current system would make it challenging to ensure backward/forward compatibility.

Given this ongoing work, I'm going to close this PR, but we're looking forward to working together when you get the time! Thanks again for taking the time to work on this, and if you ever have any questions or need additional context, please feel free to reach out.

@Andrewq11 Andrewq11 closed this Oct 15, 2024
Linked issue: Generate a deterministic checksum for DatasetV2 Zarr manifests