Links between groups #297

Open
mrocklin opened this issue Aug 31, 2018 · 8 comments
Labels
enhancement New features or improvements

Comments

@mrocklin
Contributor

Culturally some groups like to organize datasets day-by-day. This makes it easy to append new data by just dropping it into a directory.

Culturally other groups like to organize datasets as large monoliths. This makes it easy to manage large logical collections simply.

Is there a way to do both by having separate metadata files that both point to the same collections of bytes?

Similarly, I might want a logical dataset that points to the most recent day of data. Ideally I could have a single metadata file in one location that would contain a relative path that could change day by day.

I suspect that the answer to these questions today is "no, you cannot do this. Zarr expects blocks to be in a certain location". However, I suspect that this might be doable if we were to extend metadata entries with an optional relative path to prepend to data key locations.
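
To make the "prepend a relative path to data keys" idea concrete, here is a rough sketch of what such a lookup might look like. It is not zarr API: `PrefixedStore` is a hypothetical helper, a plain dict stands in for the underlying key/value store, and the key layout is assumed for illustration only.

```python
from collections.abc import Mapping


class PrefixedStore(Mapping):
    """Read-only view of `base` in which every key is looked up under `prefix`.

    Hypothetical helper for illustration; not part of zarr.
    """

    def __init__(self, base, prefix):
        self.base = base
        # Normalise so the prefix always ends in exactly one slash.
        self.prefix = prefix.strip("/") + "/"

    def __getitem__(self, key):
        # The caller asks for "0"; the bytes actually live at "<prefix>/0".
        return self.base[self.prefix + key]

    def __iter__(self):
        n = len(self.prefix)
        return (k[n:] for k in self.base if k.startswith(self.prefix))

    def __len__(self):
        return sum(1 for _ in self)


# A plain dict stands in for the real store (assumed layout).
base = {
    "2018-01-03/.zarray": b"{}",
    "2018-01-03/0": b"old chunk",
    "2018-01-04/.zarray": b"{}",
    "2018-01-04/0": b"new chunk",
}

# "latest" is nothing more than a prefix; changing it repoints the whole view.
latest = PrefixedStore(base, "2018-01-04")
print(latest["0"])
```

The point of the sketch is that no bytes are copied: two views with different prefixes can expose the same underlying keys.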

@alimanfoo
Member

alimanfoo commented Aug 31, 2018 via email

@ghost

ghost commented Aug 31, 2018

Here is my understanding of the problem. Take some zarr store, a.zarr.
Every day, some application writes some data to a.zarr. However, it
groups the data together by the date on which it was written. We may
have groups like /2018/08/30, for example. What @mrocklin seems
to be proposing is having multiple metadata files that "transmute" the
user-facing appearance of a.zarr. Suppose we also had b.zarr and
c.zarr, two stores that refer to a.zarr for data. However, b.zarr
specifies in its metadata that it shows the "latest" data entries
(/2018/08/30, e.g.), while c.zarr "flattens" all of the data in
a.zarr to appear as though everything were under the root group.

@mrocklin Please let me know if I have misunderstood your proposal.

@mrocklin
Contributor Author

Yes, I think that that's more or less equivalent. I'll try summarizing it from the other direction.

Let's say that an automated process wants to dump a data file into a directory every day. They've chosen to store that as Zarr. However, in order to avoid mucking about with metadata they've chosen to just dump a new file every day.

$ ls *.zarr
2018-01-01.zarr
2018-01-02.zarr
2018-01-03.zarr
2018-01-04.zarr

However, some of our scientific users don't want to manage this as many small zarr datasets. They are willing to make a metadata file around this data after the fact to represent it as one giant dataset. They create a new logical zarr dataset that contains only metadata. That metadata points to the pre-existing data contained in the other files:

$ ls *.zarr
2018-01-01.zarr
2018-01-02.zarr
2018-01-03.zarr
2018-01-04.zarr
all.zarr

$ ls all.zarr
.meta.json

$ cat all.zarr/.meta.json
{ ...
  { 
    ...
    relative_path: '../2018-01-01.zarr/'
  } 
  { 
    ...
    relative_path: '../2018-01-02.zarr/'
  }
  ...
}
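
As a rough illustration of how a reader might turn those `relative_path` entries into real locations, here is a small sketch. Only the `relative_path` field comes from the listing above; the `parts` list, the `resolve` function and the chunk key `0.0` are assumptions made up for the example.

```python
import posixpath

# Hypothetical metadata for all.zarr. Only "relative_path" appears in the
# proposal; the surrounding "parts" list is an assumed layout.
all_meta = {
    "parts": [
        {"relative_path": "../2018-01-01.zarr/"},
        {"relative_path": "../2018-01-02.zarr/"},
    ]
}


def resolve(dataset_path, meta, part, key):
    """Return the location of `key` inside part number `part`.

    The relative path is interpreted relative to the metadata-only dataset.
    """
    rel = meta["parts"][part]["relative_path"]
    return posixpath.normpath(posixpath.join(dataset_path, rel, key))


# Chunk "0.0" of the second day, as seen through all.zarr.
print(resolve("all.zarr", all_meta, 1, "0.0"))  # 2018-01-02.zarr/0.0
```

This only normalises path strings. A store that is not a file system would need its own notion of what "relative" means.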

Similarly, a workload wants a zarr dataset that is also just a metadata file, but that metadata file points to the latest day of data:

$ ls *.zarr
2018-01-01.zarr
2018-01-02.zarr
2018-01-03.zarr
2018-01-04.zarr
all.zarr
latest.zarr

@alimanfoo
Member

Thanks Matt. Here's a gist with some possibilities. In a nutshell, to be able to view all the data together, a user could (option 1) open the parent directory via zarr, effectively turning the parent directory into a group, or (option 2) use file system (hard) links. To get a "latest" dataset you could use hard links. You can't use symbolic links currently as zarr DirectoryStore does not dereference them, although this could probably be changed. Note that these solutions are specific to using a zarr DirectoryStore, they may not apply to other types of store.
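
For readers without the gist to hand, the hard-link approach to a "latest" dataset might look roughly like the sketch below. It uses only the Python standard library and plain files standing in for chunks; `hardlink_tree` and the file names are hypothetical, and this is not the code from the gist. Directories themselves cannot be hard-linked on most file systems, so the tree is recreated and each file is linked individually.

```python
import os
import tempfile


def hardlink_tree(src, dst):
    """Recreate the directory tree `src` at `dst`, hard-linking every file."""
    for dirpath, _dirnames, filenames in os.walk(src):
        rel = os.path.relpath(dirpath, src)
        target_dir = os.path.normpath(os.path.join(dst, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            os.link(os.path.join(dirpath, name), os.path.join(target_dir, name))


# Build two fake daily stores; one file per store stands in for a chunk.
root = tempfile.mkdtemp()
for day in ("2018-01-03", "2018-01-04"):
    store = os.path.join(root, day + ".zarr")
    os.makedirs(store)
    with open(os.path.join(store, "0"), "wb") as f:
        f.write(day.encode())

# ISO dates sort chronologically, so the last name is the newest day.
days = sorted(
    n for n in os.listdir(root) if n.endswith(".zarr") and n != "latest.zarr"
)
newest = days[-1]
hardlink_tree(os.path.join(root, newest), os.path.join(root, "latest.zarr"))
```

Two caveats with this sketch: `latest.zarr` has to be rebuilt whenever a new day arrives, and a hard link keeps pointing at the old bytes if a writer replaces a file instead of modifying it in place.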

Re the suggestion to include links within the zarr metadata, this is probably harder to do as it would need to be generalised to account for different types of store. I.e., could not assume DirectoryStore.

Note that these types of features sound very similar to what HDF5 provides via links. I believe there are "hard", "soft" and "external" links, see h5py docs on links. I had been trying to avoid implementing links within zarr, just to keep things simple, and because this requirement can be achieved at the file system level (if using DirectoryStore). But happy to discuss if the file system solution is not sufficient.

@alimanfoo
Member

alimanfoo commented Sep 3, 2018

Thanks @onalant. I've added some more examples to this gist to show how your example could be done with hard links. Again not saying this is a perfect solution, just illustrating a possibility.

@alimanfoo
Member

P.S. @mrocklin do you mind if I rename this issue something like "links between groups"?

@mrocklin
Contributor Author

mrocklin commented Sep 4, 2018

Fine by me

@dstansby
Contributor

It seems like this is more of an extension to the zarr spec as opposed to something we'd want to implement just in the python implementation, so I'll move this over to zarr-specs.
