Links between groups #297
Hi Matt, I'm not quite grokking. Could you give an example?
Here is my understanding of the problem. Take some zarr store,

@mrocklin Please let me know if I have misunderstood your proposal.
Yes, I think that's more or less equivalent. I'll try summarizing it from the other direction. Let's say that an automated process wants to dump a data file into a directory every day. They've chosen to store that data as Zarr, but in order to avoid mucking about with metadata they've chosen to just dump a new file every day.
However, some of our scientific users don't want to manage this as many small zarr datasets; they are willing to make a metadata file around this data after the fact to represent it as one giant dataset. They create a new logical zarr dataset that contains only metadata. That metadata points to the pre-existing data contained in the other files:
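For example, something along these lines (the paths and shapes below are made up, and the metadata-only logical dataset is only sketched in comments, since nothing in zarr today supports it):

```python
import numpy as np
import zarr

# Hypothetical daily dumps: each day the automated process writes a new,
# self-contained zarr array next to the previous ones.
for day in ["2019-01-01", "2019-01-02", "2019-01-03"]:
    z = zarr.open(f"daily/{day}", mode="w",
                  shape=(24, 1000), chunks=(1, 1000), dtype="f8")
    z[:] = np.random.random((24, 1000))

# The logical dataset would then be a separate location containing only
# metadata, e.g. a single .zarray describing shape (3, 24, 1000), whose
# chunk keys resolve to the chunks already written under daily/2019-01-01,
# daily/2019-01-02 and daily/2019-01-03 rather than to its own copies.
# (That pointer mechanism is the proposal; it is not current zarr behaviour.)
```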
Similarly, a workload might want a zarr dataset that is also just a metadata file, but whose metadata points to the latest day of data.
Thanks Matt. Here's a gist with some possibilities. In a nutshell, to be able to view all the data together, a user could (option 1) open the parent directory via zarr, effectively turning the parent directory into a group, or (option 2) use file system (hard) links. To get a "latest" dataset you could use hard links. You can't use symbolic links currently, as the zarr DirectoryStore does not dereference them, although this could probably be changed.

Note that these solutions are specific to using a zarr DirectoryStore; they may not apply to other types of store. Re the suggestion to include links within the zarr metadata, this is probably harder to do as it would need to be generalised to account for different types of store, i.e., it could not assume DirectoryStore.

Note that these types of features sound very similar to what HDF5 provides via links. I believe there are "hard", "soft" and "external" links; see the h5py docs on links.

I had been trying to avoid implementing links within zarr, just to keep things simple, and because this requirement can be achieved at the file system level (if using DirectoryStore). But happy to discuss if the file system solution is not sufficient.
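A rough sketch of the two options (the gist linked above is the real example; paths here are hypothetical, and it assumes each day's data is a flat DirectoryStore array under a common parent directory):

```python
import os
import zarr

# Option 1: treat the parent directory itself as a zarr group. Opening it
# with mode="a" writes a .zgroup file into the parent, after which each
# daily store shows up as a member of that group.
parent = zarr.open_group("daily", mode="a")
print(list(parent))  # e.g. ['2019-01-01', '2019-01-02', '2019-01-03']

# Option 2: expose a "latest" dataset via file system hard links (symbolic
# links won't work, as noted above, because DirectoryStore does not
# dereference them). Hard-link every file of the newest store into a
# "latest" directory; re-running this each day re-points it.
src = "daily/2019-01-03"
dst = "daily/latest"
os.makedirs(dst, exist_ok=True)
for name in os.listdir(src):
    link = os.path.join(dst, name)
    if os.path.lexists(link):
        os.remove(link)
    os.link(os.path.join(src, name), link)
```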
Thanks @onalant. I've added some more examples to this gist to show how your example could be done with hard links. Again, not saying this is a perfect solution, just illustrating a possibility.
P.S. @mrocklin do you mind if I rename this issue to something like "links between groups"?
Fine by me.
It seems like this is more of an extension to the zarr spec as opposed to something we'd want to implement just in the python implementation, so I'll move this over to zarr-specs. |
Culturally, some groups like to organize datasets day by day. This makes it easy to append new data by just dropping it into a directory.
Culturally, other groups like to organize datasets as large monoliths. This makes it easy to manage large logical collections simply.
Is there a way to do both by having separate metadata files that both point to the same collections of bytes?
Similarly, I might want a logical dataset that points to the most recent day of data. Ideally I could have a single metadata file in one location that would contain a relative path that could change day by day.
I suspect that the answer to these questions today is "no, you cannot do this; Zarr expects blocks to be in a certain location". However, this might be doable if we were to extend metadata entries with an optional relative path to prepend to data key locations.
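Purely as a sketch of what such an extension might look like (the "path" key below is made up and is not part of the zarr spec; the key name, semantics, and file locations would all be up for discussion):

```python
import json
import os

# Hypothetical array metadata for a "logical" dataset that holds no chunks
# of its own. The made-up "path" key would be prepended to every chunk key,
# so chunk "0.0" would resolve to "../../daily/2019-01-03/0.0". Re-pointing
# the "latest" view would then just mean rewriting this one small file.
logical_zarray = {
    "zarr_format": 2,
    "shape": [24, 1000],
    "chunks": [1, 1000],
    "dtype": "<f8",
    "compressor": {"id": "blosc", "cname": "lz4", "clevel": 5, "shuffle": 1},
    "fill_value": None,
    "order": "C",
    "filters": None,
    "path": "../../daily/2019-01-03",  # proposed addition, not in the current spec
}

os.makedirs("logical/latest", exist_ok=True)
with open("logical/latest/.zarray", "w") as f:
    json.dump(logical_zarray, f, indent=4)
```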