feat: deterministic Zarr checksum #206
Conversation
Thanks for taking the time to open this PR, @hmacdope. You're one of the first external contributors to Polaris and we appreciate your interest in the project.
I've left a comment regarding the use of `os.walk`. While it would ensure deterministic chunk ordering, it could significantly increase memory usage by loading all file names into memory. Ideally, we're looking for a solution that avoids this, which I understand can be tricky.
Let me know what you think and thanks again for the contribution.
Review comment on this diff context:

```python
        root_path: The path to the root of the directory to walk
    """
    for dir_path, dirs, file_names in os.walk(root_path):
```
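For context, the determinism fix being reviewed here can be sketched as sorting `os.walk`'s output in place so traversal order no longer depends on the filesystem. This is a minimal illustration of the technique, not the actual Polaris manifest code:

```python
import hashlib
import os


def deterministic_walk_md5(root_path: str) -> str:
    """Hash relative file paths in a stable order by sorting os.walk output in place."""
    md5 = hashlib.md5()
    for dir_path, dirs, file_names in os.walk(root_path):
        dirs.sort()        # in-place sort controls the traversal order of subdirectories
        file_names.sort()  # stable ordering for files at this level
        for name in file_names:
            rel = os.path.relpath(os.path.join(dir_path, name), root_path)
            md5.update(rel.encode())
    return md5.hexdigest()
```

Note that this still materializes each directory's `file_names` list in memory, which is exactly the concern raised below.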
The main blocker here is that `file_names` will be held in-memory for each directory that is traversed. Depending on the size of the dataset and the chunking mechanism/pattern chosen by the dataset creator, it is possible to have a directory which contains enough files at a single level to exhaust one's system memory.
We actually explored this idea during our first implementation. Apologies, I probably should've explained this more clearly in the issue.
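One common workaround for the memory concern described above (offered here only as a sketch, not as what Polaris settled on) is to stream entries one at a time with `os.scandir` and combine per-file digests with an order-insensitive operation such as modular addition, so no directory listing ever needs to be held or sorted:

```python
import hashlib
import os


def streaming_order_insensitive_md5(root_path: str) -> str:
    """Stream directory entries lazily and fold per-file digests order-insensitively.

    os.scandir yields one entry at a time, so no full file list is materialized,
    and the modular sum makes the result independent of traversal order.
    """
    total = 0
    stack = [root_path]
    while stack:
        current = stack.pop()
        with os.scandir(current) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                else:
                    rel = os.path.relpath(entry.path, root_path)
                    digest = hashlib.md5(rel.encode()).digest()
                    total = (total + int.from_bytes(digest, "big")) % (1 << 128)
    return f"{total:032x}"
```

The trade-off is that an order-insensitive sum is a weaker construction than hashing a canonical ordering, so it would need scrutiny before being used as a content checksum.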
Ah no worries at all! I see why this would be problematic with all the `zarr` chunks. Feel free to close this PR as it won't address the issue.
While researching this issue I did come across `zarr-checksum`, which looks like it does what you need: https://github.com/dandi/zarr_checksum/tree/master, and seems to have some of the `zarr` maintainers involved. If you think that looks like an acceptable solution then perhaps I could give it a try? Perhaps an important question is whether the manifest file system is deployed (do we need to support two architectures?). Keen to hear your thoughts.
I would love to keep contributing if there are any issues I spy that I can help with. I also would love to contribute to XL competitions and benchmarks support (with my ASAPDiscovery hat on), so when the time comes I will try and help out.
Building for the scale we hope to support has made for some interesting but tricky problems! It seems like Cas shared some additional context here that I hope clarifies this checksum work a bit.
As for your other questions, the first version of the manifest file system is deployed and being used for V1 datasets which leverage Zarr columns. We're actively developing the second version of the system for V2 datasets, and it will diverge quite strongly in how it works relative to the current version. We plan to maintain both systems for the foreseeable future as deprecating the current system would make it challenging to ensure backward/forward compatibility.
Given this ongoing work, I'm going to close this PR, but we're looking forward to working together when you get the time! Thanks again for taking the time to work on this, and if you ever have any questions or need additional context, please feel free to reach out.
Changelogs

- Updated `recursively_build_manifest` to yield deterministic ordering of the Zarr manifest file and corresponding MD5 checksum.

Checklist:

- Label this PR as `feature`, `fix`, `chore`, `documentation` or `test` (or ask a maintainer to do it for you).

Fixes #188
This should (hopefully) take care of file ordering messing up checksums for `zarr` manifests. A specific case where you have encountered non-determinism may be helpful, as I struggled to generate a failing counterexample on `main`.
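For what it's worth, the failure mode this PR targets is easy to reproduce in isolation: feeding the same set of names into a running hash in a different order yields a different digest, while sorting first restores determinism. This is a toy illustration, not the actual manifest code:

```python
import hashlib


def combined_md5(names):
    """Fold an ordered sequence of names into a single MD5 digest."""
    md5 = hashlib.md5()
    for name in names:
        md5.update(name.encode())
    return md5.hexdigest()


chunks = ["0.0", "0.1", "1.0"]    # same set of chunk keys...
shuffled = ["1.0", "0.0", "0.1"]  # ...encountered in a different walk order

# Order leaks into the checksum; sorting first makes it deterministic.
assert combined_md5(chunks) != combined_md5(shuffled)
assert combined_md5(sorted(chunks)) == combined_md5(sorted(shuffled))
```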