Consolidate zarr metadata into single key #268

martindurant · 2018-06-26T20:47:31Z

A simple possible way of scanning all the metadata keys ('.zgroup'...) in a dataset and copying them into a single key, so that on systems where there is a substantial overhead to reading small files, everything can be grabbed in a single read. This is important in the context of xarray, which traverses all groups when opening the dataset, to find the various sub-groups and arrays.

The test shows how you could use the generated key. We could contemplate automatically looking for the metadata key when opening.
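
As a rough illustration of the idea (not the code in this PR), scanning a dict-like store for metadata keys and packing them under one key might look like the sketch below. The helper name `consolidate_sketch`, the output key '.zmetadata' (borrowed from later in this thread) and the JSON wrapper are all assumptions made for the example.

```python
import json

# Suffixes that mark zarr v2 metadata documents.
METADATA_SUFFIXES = ('.zarray', '.zgroup', '.zattrs')


def consolidate_sketch(store, out_key='.zmetadata'):
    """Copy every metadata document in ``store`` into one JSON blob."""
    # Hypothetical helper: `store` is any dict-like mapping of str -> bytes.
    collected = {
        key: store[key].decode()
        for key in store
        if key.endswith(METADATA_SUFFIXES)
    }
    # A single write; readers can later fetch all metadata in a single read.
    store[out_key] = json.dumps(collected).encode()
    return collected


store = {
    '.zgroup': b'{"zarr_format": 2}',
    'a/.zarray': b'{"shape": [10]}',
    'a/0': b'\x00' * 80,  # chunk data, not metadata
}
consolidated = consolidate_sketch(store)
```

Chunk keys such as 'a/0' are left alone; only the small metadata documents are gathered.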

REF: pangeo-data/pangeo#309

TODO:

alimanfoo · 2018-06-26T21:39:52Z

Nice proof of concept, simple and elegant.

jakirkham · 2018-06-27T03:45:27Z

Couple initial questions that come to mind:

What happens if the metadata changes?
What if multiple pieces of metadata change at the same time?

martindurant · 2018-06-27T13:24:04Z

@jakirkham, there is no mechanism here to update the consolidated metadata; it would need to be rescanned from the rest of the metadata files and rewritten. But the use case here is meant for write-once only. Of course, my approach is very simple.

alimanfoo · 2018-06-28T16:48:30Z

Yes the use case is for static data, once you know it's complete then you consolidate all metadata into a single file.

Two small thoughts...

I had wondered about compressing the consolidated metadata file. But then thought in a cloud setting this is unlikely to make a difference unless the total size of consolidated metadata is above 10 MB, which is unlikely unless people are cramming lots into user attributes. Typical size of a .zarray file is ~400 bytes.
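
To put the two figures above side by side, here is a back-of-envelope calculation. It assumes the threshold means ten megabytes (decimal) and counts only .zarray documents, ignoring .zgroup and .zattrs.

```python
bytes_per_zarray = 400          # "typical" size quoted above
threshold_bytes = 10 * 10**6    # assumption: ten megabytes, decimal

# Number of arrays needed before the consolidated file reaches the threshold.
arrays_to_reach_threshold = threshold_bytes // bytes_per_zarray
print(arrays_to_reach_threshold)  # 25000
```

So a hierarchy would need on the order of tens of thousands of arrays before compression is likely to matter, unless user attributes are large.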

Ultimately we'd need to think of some way that users are prevented from attempting to modify a group when using consolidated metadata. Under the current design further modifications would be permitted because the consolidated metadata has been read into a dict which allows modification, but these would obviously not get written back to the metadata file.

martindurant · 2018-06-28T16:56:09Z

There seem to be several possibilities, here are some thoughts.
It could be argued that implicitly loading from a metadata file should instantiate a dict store that doesn't allow mutation at all (but data would still be writable), and that there would be a flag at load time on whether to consider consolidated metadata or not. Or it could be a metadata store that needs to have sync called, to persist both itself and also update the individual files.
Changes to the underlying directory structures would not, as things stand, get written to the consolidated store, and would be out of sync until consolidate is explicitly called.
The presence of consolidated metadata could be used to indicate that the whole dataset hierarchy is read-only. Maybe then, opening can only be possible with some force parameter that deletes the consolidated metadata up front.
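
One way to picture the "dict store that doesn't allow mutation" option is a read-only view over the parsed metadata. This is only a sketch using the standard library's `types.MappingProxyType`; the function name `load_readonly_metadata` and the assumed JSON layout are illustrative, not part of zarr.

```python
import json
from types import MappingProxyType


def load_readonly_metadata(raw):
    """Parse consolidated metadata and wrap it so it cannot be mutated."""
    # `raw` is assumed to be JSON bytes mapping metadata keys to documents.
    return MappingProxyType(json.loads(raw.decode()))


meta = load_readonly_metadata(b'{".zgroup": {"zarr_format": 2}}')

try:
    meta['new/.zgroup'] = {'zarr_format': 2}
    mutation_blocked = False
except TypeError:
    # MappingProxyType rejects item assignment with TypeError.
    mutation_blocked = True
```

Note that the proxy only guards the top-level mapping; nested dicts would still be mutable, so a real implementation would need to go further.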

jakirkham · 2018-06-28T17:17:15Z

Having some read-only option(s) at some level(s) makes sense.

Knowing a bit more about where you plan to use this would be helpful.

alimanfoo · 2018-06-28T21:24:18Z

@jakirkham the original motivation for this comes from pangeo-data/pangeo#309. I think this is likely to be a common issue for other pangeo use cases, where existing data are converted to zarr format on some local system then uploaded to cloud object storage for broader (read-only) consumption. It affects pangeo particularly because they use xarray, and xarray needs to read in all metadata for a given hierarchy up front. The latency in listing directories and reading all metadata files from cloud object storage is causing considerable delays.

alimanfoo · 2018-06-28T21:27:29Z

...so the proposed solution is that data are converted to zarr on a local system as before, then an additional command is run to consolidate all metadata into a single file, then everything is uploaded into cloud storage, then pangeo users somehow configure their code to read metadata from the consolidated metadata object to speed up opening a dataset via xarray.

alimanfoo · 2018-06-28T21:49:29Z

@martindurant thanks for the thoughts. I don't have a concrete suggestion at the moment, but as we discuss options I think it could be useful to have in mind one of the design goals for zarr, which is that in general everything should work well in a setting where multiple concurrent threads or processes may be making modifications simultaneously.

I think this is basically the point @jakirkham was making when he asked "What if multiple pieces of metadata change at the same time?" I think the answer is, when using consolidated metadata, we raise some kind of exception on any attempt to make metadata changes.

jakirkham · 2018-06-29T14:50:03Z

Thanks for the context, @alimanfoo. Will think about this a bit more.

martindurant · 2018-07-02T16:10:28Z

@jakirkham, eager to hear your thoughts. This kind of metadata shortcut could be put on the read-only path only, I suppose, or be explicitly opt-in.
It'd be most convenient from zarr's point of view if this can be something optional, but it'd be most convenient from xarray/user's point of view if it's automatic and no extra code elsewhere is needed.

Again, this is for example only, not intended final structure

martindurant · 2018-07-30T15:01:59Z

After some time has passed, the conversation here has run dry. A brief summary.

The situation remains that it would be convenient to be able to store zarr metadata in a single top-level entity within a directory structure, to avoid expensive metadata lookups when investigating a zarr group's structure - a problem for xarray data on cloud services. The scenario here is the write-once, read-many situation, although the prospect of having to re-sync metadata following changes to the data structure is one to consider.

In the ongoing conversations around conventions and metadata, I feel there is a wish to make any changes optional and so compatible. An extra file, as in this WIP, would work, but feels very ad hoc. Adding to the attrs would work very similarly, but the metadata of group contents doesn't feel like an attr. Adding something to the .zgroup would break compatibility. None of these by themselves would solve the sync problem.

The sync problem can be partially solved by simple means, such as: checking for and reading from consolidated metadata can only happen when read-only, and opening in any write mode deletes the metadata. This does not prevent changes to lower levels in the hierarchy, though, since zarr can access them directly; xarray cannot do that, and so there is an argument that this logic belongs in xarray.

alimanfoo · 2018-07-30T17:10:58Z

Thanks Martin. FWIW I think there are or will be people wanting to use zarr in the cloud but not via xarray, so something to consider. (E.g., we've just today got our own pangeo-but-for-malaria-genomics up on GKE, we use zarr and dask but not currently xarray and expect to hit the metadata issue at some point.)

What about something like the following...

Zarr implements a function to consolidate metadata and store, pretty much just as you have implemented. E.g., calling:

zarr.consolidate_metadata(store, key='.zmetadata', path=None)

...will consolidate all zarr metadata found in store, optionally under path, and put the consolidated metadata back into the store under the given key.

Zarr then implements a store class that understands consolidated metadata. E.g.:

base_store = zarr.DirectoryStore('/path/to/data')  # or could be any underlying mapping class
store = zarr.StoreWithConsolidatedMetadata(base_store, key='.zmetadata', path=None)

(Class name is obviously horribly too long, but just a placeholder for the moment.)

...then uses this to open a group, e.g.:

root = zarr.Group(store=store)

I.e., all the logic of handling the consolidated metadata is encapsulated within the StoreWithConsolidatedMetadata class. Internally it could load the consolidated metadata, then implement some kind of fall back whereby keys are first looked up in consolidated metadata, but if not found are then attempted to be looked up in the underlying base store.

If a package like xarray wants to make this even easier for the user, they could implement some check for presence of .zmetadata key and do this setup for the user. But the basic functionality is available without xarray.

Also just to note IMO this solution does not require any change to the storage spec, as the storage spec only requires that a key/value (i.e., mapping) interface is presented to zarr. The details of how keys and values are stored behind the mapping interface is entirely up to the implementation. I.e., if using a file system, keys and values do not have to correspond to file names and file contents. Similarly for cloud storage.
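
The fall-back lookup described above could be sketched roughly as follows. This is not the class that was eventually merged; the name `ConsolidatedFallbackStore` and the assumed layout of the consolidated blob (a JSON object mapping keys to JSON strings) are made up for illustration.

```python
import json
from collections.abc import Mapping


class ConsolidatedFallbackStore(Mapping):
    """Read-only view: consolidated metadata first, base store second."""

    def __init__(self, base_store, key='.zmetadata'):
        self.base_store = base_store
        self.key = key
        # One read fetches all metadata; values are kept as bytes here.
        packed = json.loads(base_store[key].decode())
        self.meta = {k: v.encode() for k, v in packed.items()}

    def __getitem__(self, item):
        try:
            return self.meta[item]
        except KeyError:
            # Not a consolidated key (e.g. chunk data): ask the base store.
            return self.base_store[item]

    def __iter__(self):
        seen = set(self.meta)
        yield from self.meta
        yield from (k for k in self.base_store if k not in seen)

    def __len__(self):
        return len(set(self.meta) | set(self.base_store))


base = {
    '.zmetadata': json.dumps({'.zgroup': '{"zarr_format": 2}'}).encode(),
    'x/0': b'chunk-bytes',
}
store = ConsolidatedFallbackStore(base)
```

Here `store['.zgroup']` is answered from the consolidated copy without touching the base store, while `store['x/0']` falls through to it.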

martindurant · 2018-07-30T23:28:20Z

Are you suggesting that the new Consolidated class should live in zarr, and that calling an open function with some keyword would activate its usage? I think that makes sense.
I wouldn't like to go the other way and to have to force external libraries to choose how they will do their mapping, or for something like the gcsfs mapper to have to subclass from zarr to get the right behaviour (currently they only need to be MutableMappings or any dict). I'm still not sure any of this solves the sync problem, because you can always open a directory without the consolidated class in the way, or simply point to a lower level in the structure. Caveat emptor, I suppose... we don't want to be putting datestamps or checksums in here!

alimanfoo · 2018-07-31T08:07:02Z

I haven't worked this fully through yet. But I was thinking something like the following (naming of new functions, classes and arguments subject to discussion) ...

User A wants to use zarr in the cloud in a write-once-read-many style. They are not using xarray. First they create the data in the cloud, e.g.:

base_store = gcsfs.GCSMap(...)
root = zarr.group(store=base_store)
# create sub-groups and arrays under root, put data into arrays, etc.

When they're finished writing to root, they consolidate the metadata with an explicit call, e.g.:

zarr.consolidate_metadata(base_store, key='.zmetadata')

Later, when they want to read the data, they do e.g.:

base_store = gcsfs.GCSMap(...)
store = zarr.StoreWithConsolidatedMetadata(base_store, key='.zmetadata')
root = zarr.Group(store=store)
baz = root['foo/bar/baz']
# read data etc.

In practice the key='.zmetadata' argument could be omitted by the user because it is the default, but showing here to be explicit.

The outer store could be read-only for simplicity, or could allow writes which are passed through to the underlying base_store, but without updating the consolidated metadata (which would require another explicit call to consolidate_metadata()).

User B is using xarray, and is copying data from some NetCDF4 files into zarr on local disk first, then copying files up to the cloud, then using xarray to read the data. E.g., first copy from NetCDF4 to zarr locally:

root = xarray.open_dataset('/local/path/to/data.nc')
root.to_zarr('/local/path/to/data.zarr', consolidate=True, metadata_key='.zmetadata')

...then copy files up to GCS, then to read from GCS do:

store = gcsfs.GCSMap(...)
root = xarray.open_zarr(store, consolidated=True, metadata_key='.zmetadata')

Again the metadata_key='.zmetadata' could be omitted because it is default, but showing for completeness.

There's probably also a use-case to account for involving user making dask API calls using from_zarr() and to_zarr(), haven't thought that through yet.

What do you think about the basic approach?

alimanfoo · 2018-07-31T09:10:33Z

On second thoughts, what if the zarr public API is just like this. One function to explicitly consolidate metadata:

zarr.consolidate_metadata(store=base_store, key='.zmetadata')

...and one function to open a group with consolidated metadata:

root = zarr.open_consolidated(store=base_store, key='.zmetadata')

All other details of how consolidation is handled are hidden, i.e., not part of the public API.

Is root (return value of zarr.open_consolidated()) read-only? I suggest no. I.e., changes can be made, including modifying data in arrays, and creating new groups and arrays. All changes are written through to the base store. However, changes to metadata (e.g., creating new groups and arrays) require an explicit call to zarr.consolidate_metadata() to update the consolidated metadata.
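
The behaviour proposed above (writes go through, but the consolidated view stays stale until an explicit re-consolidation) can be mimicked with plain dicts. The `consolidate` function below is a stand-in written for this sketch, not the real `zarr.consolidate_metadata()`.

```python
import json

SUFFIXES = ('.zarray', '.zgroup', '.zattrs')


def consolidate(store, key='.zmetadata'):
    # Stand-in for the proposed consolidation call; not the real code.
    meta = {k: store[k].decode() for k in store if k.endswith(SUFFIXES)}
    store[key] = json.dumps(meta).encode()


def consolidated_keys(store, key='.zmetadata'):
    # Keys currently visible through the consolidated metadata.
    return set(json.loads(store[key].decode()))


base_store = {'.zgroup': b'{"zarr_format": 2}'}
consolidate(base_store)

# A new group is written straight through to the base store...
base_store['g1/.zgroup'] = b'{"zarr_format": 2}'
stale = 'g1/.zgroup' not in consolidated_keys(base_store)

# ...and only shows up in the consolidated view after an explicit re-run.
consolidate(base_store)
fresh = 'g1/.zgroup' in consolidated_keys(base_store)
```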

martindurant · 2018-08-01T20:48:01Z

Perhaps even simpler?

root = zarr.open(store=base_store, consolidated_key='.zmetadata')

Then again, if a change is required in xarray (and elsewhere) to use the consolidated store, then we could as well have the separate function. However, we would want some way to "use consolidated if available", and I'm assuming you wouldn't want to pile extra keywords into the base open function. Some of my code has suffered from this (fastparquet.write).

For the implementation as far as the wrapper is concerned and the read-only question, I think I agree with you.

martindurant · 2018-08-02T16:28:17Z

@alimanfoo, I implemented your suggested over-layer class. This is optional.
There still could be a flag in open() to attempt to use this, or it could be left to external libraries like xarray to opt in.

martindurant · 2018-08-02T17:26:14Z

In addition, I could imagine enabling writing in the class, by starting with an empty dict if the metadata key doesn't exist yet, have metadata writes affect both that dict and the backend store, and having some "flush" method to write the current state of the metadata dict. Then, maybe you wouldn't need to call the consolidate function explicitly.
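
A sketch of what such a write-enabled wrapper might look like, under the assumptions stated in the comment above: start from an empty dict when the key is missing, mirror metadata writes, and persist on an explicit flush. The class name `FlushingMetadataStore` and its JSON layout are hypothetical.

```python
import json
from collections.abc import MutableMapping

SUFFIXES = ('.zarray', '.zgroup', '.zattrs')


class FlushingMetadataStore(MutableMapping):
    """Hypothetical wrapper: mirrors metadata writes, persists on flush()."""

    def __init__(self, base_store, key='.zmetadata'):
        self.base_store = base_store
        self.key = key
        # Start from an empty dict if nothing has been consolidated yet.
        raw = base_store.get(key)
        self.meta = json.loads(raw.decode()) if raw is not None else {}

    def __getitem__(self, item):
        return self.base_store[item]

    def __setitem__(self, item, value):
        # Every write reaches the backend; metadata is also mirrored locally.
        self.base_store[item] = value
        if item.endswith(SUFFIXES):
            self.meta[item] = value.decode()

    def __delitem__(self, item):
        del self.base_store[item]
        self.meta.pop(item, None)

    def __iter__(self):
        return iter(self.base_store)

    def __len__(self):
        return len(self.base_store)

    def flush(self):
        # Write the current state of the metadata dict under the single key.
        self.base_store[self.key] = json.dumps(self.meta).encode()


base = {}
store = FlushingMetadataStore(base)
store['.zgroup'] = b'{"zarr_format": 2}'
store['x/0'] = b'chunk'
store.flush()
```

This does nothing for writers that bypass the wrapper, which is the sync concern raised earlier in the thread.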

alimanfoo · 2018-08-03T16:45:03Z

Thanks @martindurant for moving this forward. Unfortunately I'm offline now for 3 weeks and have run out of time to give any feedback, but hopefully others can comment, and I'll be very happy to push on getting this into master when I'm back.

alimanfoo · 2018-11-01T22:47:52Z

OK, this looks all good to me. Any objections to merging?

jakirkham · 2018-11-01T22:58:36Z

Not from me.

It would be nice if at least one of the big users of this functionality gave it a quick look. Perhaps @rabernat or @jhamman?

martindurant · 2018-11-02T00:08:37Z

green!

mrocklin · 2018-11-02T16:45:39Z

cc @jacobtomlinson as well

rabernat · 2018-11-02T20:11:26Z

I just tried consolidate_metadata on one of my existing stores and it appeared to work. Then I tried open_consolidated and it also worked.

The one weird thing is that the .zmetadata file was encoded with escaped newline characters \n, so that it is all one single really long line. This makes it hard to view and edit with a text editor. This did not affect the actual functionality, but I feel it should be fixed to preserve human readability of the metadata.

alimanfoo · 2018-11-02T22:09:15Z

> The one weird thing is that the .zmetadata file was encoded with escaped newline characters \n, so that it is all one single really long line. This makes it hard to view and edit with a text editor. This did not affect the actual functionality, but I feel it should be fixed to preserve human readability of the metadata.

Really interesting you brought that up. I had thought the same, e.g., the consolidated metadata file could include the consolidated files as JSON objects, rather than as strings of escaped JSON (hope that makes sense, happy to clarify if not). The only reason I didn't immediately suggest that was a quirk of the way the architecture is currently structured would mean that the JSON objects would need to go through an extra serialisation and deserialisation, which is somewhat anathema to the spirit of efficiency and might have a performance impact (although could be negligible, haven't measured it).

Bottom line, if you and others think it would be valuable to make the consolidated metadata file more human readable/editable - which is very much in the spirit of zarr data being very "hackable" - I'd be happy to unpack what I said above and explore ways of working around current architecture limitations.
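
The difference between the two encodings can be seen with the standard `json` module alone. This is a generic illustration, not the code path used in zarr.

```python
import json

zgroup_doc = {'zarr_format': 2}

# Variant 1: each metadata document stored as a string of JSON.
as_strings = json.dumps({'.zgroup': json.dumps(zgroup_doc)})

# Variant 2: each metadata document embedded as a JSON object.
as_objects = json.dumps({'.zgroup': zgroup_doc})

print(as_strings)  # {".zgroup": "{\"zarr_format\": 2}"}
print(as_objects)  # {".zgroup": {"zarr_format": 2}}

# Reading variant 1 back needs a second json.loads per document.
roundtrip = json.loads(json.loads(as_strings)['.zgroup'])
```

Variant 1 is what produces the escaped quotes and newlines reported above; variant 2 reads cleanly but requires the documents to be parsed before being embedded.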

alimanfoo · 2018-11-03T21:14:04Z

Thinking a bit more about @rabernat's comment, I realised there was a fairly straightforward way to workaround the technical issues I mentioned above and implement a format for the consolidated metadata that's a bit easier to read/edit. I've pushed the changes in commit 9c0c621 but very happy to discuss and revert if anything looks off. Here's an example of the new format:

>>> import zarr
>>> store = dict()
>>> z = zarr.group(store)
>>> z.create_group('g1')
<zarr.hierarchy.Group '/g1'>
>>> g2 = z.create_group('g2')
>>> g2.attrs['hello'] = 'world'
>>> arr = g2.create_dataset('arr', shape=(20, 20), chunks=(5, 5), dtype='f8')
>>> arr.attrs['data'] = 1
>>> arr[:] = 1.0
>>> zarr.consolidate_metadata(store)
<zarr.hierarchy.Group '/'>
>>> print(store['.zmetadata'].decode())
{
    "metadata": {
        ".zgroup": {
            "zarr_format": 2
        },
        "g1/.zgroup": {
            "zarr_format": 2
        },
        "g2/.zattrs": {
            "hello": "world"
        },
        "g2/.zgroup": {
            "zarr_format": 2
        },
        "g2/arr/.zarray": {
            "chunks": [
                5,
                5
            ],
            "compressor": {
                "blocksize": 0,
                "clevel": 5,
                "cname": "lz4",
                "id": "blosc",
                "shuffle": 1
            },
            "dtype": "<f8",
            "fill_value": 0.0,
            "filters": null,
            "order": "C",
            "shape": [
                20,
                20
            ],
            "zarr_format": 2
        },
        "g2/arr/.zattrs": {
            "data": 1
        }
    },
    "zarr_consolidated_format": 1
}
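
For readers who want to inspect such a file by hand, a minimal reader using only the standard library might look like this. It assumes the layout shown in the example above (a "metadata" object plus "zarr_consolidated_format": 1); the abbreviated sample document and the error message are illustrative.

```python
import json

# Abbreviated sample in the layout shown above.
raw = b'''{
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        "g2/.zattrs": {"hello": "world"},
        "g2/.zgroup": {"zarr_format": 2}
    },
    "zarr_consolidated_format": 1
}'''

doc = json.loads(raw.decode())
if doc.get('zarr_consolidated_format') != 1:
    raise ValueError('unsupported consolidated metadata format')

metadata = doc['metadata']
# Group paths are the keys ending in '.zgroup', minus that suffix.
groups = sorted(
    k[:-len('.zgroup')].rstrip('/') for k in metadata if k.endswith('.zgroup')
)
```

The root group appears as the empty string, and each metadata document is already a parsed object rather than an escaped string.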

martindurant · 2018-11-03T21:27:01Z

I'm in favour, it looks nice.

jakirkham · 2018-11-05T19:45:12Z

zarr/convenience.py

+        return (key.endswith('.zarray') or key.endswith('.zgroup') or
+                key.endswith('.zattrs'))
+
+#    out = {key: store[key].decode() for key in store if is_zarr_key(key)}


Should we drop this line?

Yep, thanks for the catch.

rabernat · 2018-11-13T19:55:39Z

lgtm?

alimanfoo · 2018-11-14T13:03:26Z

Alright, this one is going in!

rabernat · 2018-11-18T06:37:50Z

Does anyone have an example of using this new feature with xarray? I'm not able to get it to work.

What I'm doing. (Not sure if this is the intended usage.)

import xarray as xr
import gcsfs
import zarr
path = 'pangeo-data/newman-met-ensemble'
store = gcsfs.GCSMap(path)
zs_cons = zarr.storage.ConsolidatedMetadataStore(store)
ds_orig = xr.open_zarr(store)
ds_cons = xr.open_zarr(zs_cons, decode_times=False)

The array data in the consolidated metadata is mangled compared to the original. Also, possibly related, store['time/.zarray'] returns bytes while zs_cons['time/.zarray'] returns a dict.

What is the recommended way to open my newly consolidated store from xarray?

alimanfoo · 2018-11-19T09:55:40Z

I think xarray.open_zarr() would need some minor modification to support consolidated metadata. At least a few options I could see...

(1) Low level solution. Add support for a chunk_store argument in xarray.open_zarr(). Then user could do:

store = gcsfs.GCSMap(path)
cons = zarr.storage.ConsolidatedMetadataStore(store)
ds = xr.open_zarr(store=cons, chunk_store=store)

(2) Higher-level solution. Add a consolidated=False argument to xarray.open_zarr(). If False get current behaviour, delegate to zarr.open_group(). If True delegate to zarr.open_consolidated(). So then user would do:

store = gcsfs.GCSMap(path)
ds = xr.open_zarr(store=store, consolidated=True)

(3) Auto-detect. Don't change xarray.open_zarr() signature. Instead look for presence of '.zmetadata' key in store, if present delegate to zarr.open_consolidated(). No change to current user code.
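
Options (2) and (3) differ only in who makes the decision. A sketch of the dispatch logic, with a hypothetical helper `choose_opener` that returns a label instead of calling zarr, so it stays self-contained:

```python
def choose_opener(store, consolidated=None, metadata_key='.zmetadata'):
    """Return which open path to take: 'consolidated' or 'plain'."""
    # consolidated=None means auto-detect (option 3);
    # True/False is the explicit user choice (option 2).
    if consolidated is None:
        consolidated = metadata_key in store
    if consolidated and metadata_key not in store:
        raise KeyError(f'{metadata_key!r} not found in store')
    return 'consolidated' if consolidated else 'plain'


auto = choose_opener({'.zmetadata': b'{}'})
forced_plain = choose_opener({'.zmetadata': b'{}'}, consolidated=False)
```

On a remote store the `in` check is itself a request, which is one cost of auto-detection that the explicit flag avoids.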

Also not sure how this all interacts with possibility to add support for consolidated metadata in intake, @martindurant?

martindurant · 2018-11-19T14:40:53Z

I prefer scenario (2), where it is user choice (or an argument in an intake catalog), since this is still an experimental feature, but no extra lines of code.

rabernat · 2018-11-19T14:44:41Z

See pydata/xarray#2558

rabernat · 2018-11-28T16:08:53Z

FYI, the consolidated API for xarray is being discussed here: pydata/xarray#2559 (comment)

Would welcome input.

jakirkham · 2018-12-10T18:22:32Z

zarr/core.py

@@ -165,6 +165,9 @@ def _load_metadata_nosync(self):
            if config is None:
                self._compressor = None
            else:
+                # temporary workaround for
+                # https://github.com/zarr-developers/numcodecs/issues/78
+                config = dict(config)


Reverting in PR ( #361 ) as this was fixed in Numcodecs 0.6.0 with PR ( zarr-developers/numcodecs#79 ). As we now require Numcodecs 0.6.0+ in Zarr, we get the fix and thus no longer need the workaround.
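
The workaround in the diff is a defensive shallow copy. The sketch below shows why such a copy helps, assuming (not confirmed here) that the callee removes entries from the dict it is given; `build_codec` is a made-up stand-in, not the numcodecs function.

```python
def build_codec(config):
    # Hypothetical callee that consumes the 'id' entry of its argument.
    codec_id = config.pop('id')
    return codec_id, config


shared_meta = {'compressor': {'id': 'blosc', 'clevel': 5}}

# Passing a shallow copy leaves the shared metadata dict untouched.
codec_id, _ = build_codec(dict(shared_meta['compressor']))
still_intact = 'id' in shared_meta['compressor']
```

Without the copy, the shared metadata would lose its 'id' entry after the first call.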

POC of making a single file out of zarr dot files

8301fa6

martindurant mentioned this pull request Jun 28, 2018

xesmf included in pangeo.pydata.org ? pangeo-data/pangeo#309

Closed

(WIP) include simple code that would load metadata

be6d706

Again, this is for example only, not intended final structure

martindurant mentioned this pull request Jul 27, 2018

add gmet zarr dataset to intake cat pangeo-data/pangeo#341

Merged

martindurant mentioned this pull request Aug 2, 2018

AZURE deployment pangeo-data/pangeo#82

Closed

Implement ConsolidatedMetadataStore

f1128ff

fix for py34 py35

6666391

Martin Durant added 2 commits August 2, 2018 13:34

improve coverage; data write in consolidated store

a369073

coverage

96e1fb0

dazzag24 mentioned this pull request Aug 3, 2018

Quick fix for dir_path and getsize for Azure Blob tjcrone/zarr#1

Closed

alimanfoo added 3 commits November 1, 2018 17:15

fix requirements

8acf83a

skip consolidate doctests; minor edits

2f89535

fix refs [ci skip]

c8ed0f6

make consolidated metadata human-readable

9c0c621

jakirkham reviewed Nov 5, 2018

alimanfoo mentioned this pull request Nov 6, 2018

Zarr/GCS potential optimisation to reduce latency pangeo-data/pangeo#381

Closed

comments [ci skip]

ccef26c

alimanfoo merged commit d193a78 into zarr-developers:master Nov 14, 2018

alimanfoo added this to the v2.3 milestone Nov 14, 2018

alimanfoo mentioned this pull request Nov 14, 2018

Rework requirements for pyup.io #326

Merged

4 tasks

jakirkham mentioned this pull request Dec 10, 2018

Drop temporary workaround for get_codec #361

Merged

7 tasks

jakirkham reviewed Dec 10, 2018

ryan-williams mentioned this pull request Feb 27, 2020

Protocol extensions zarr-developers/zarr-specs#49

Open

DennisHeimbigner mentioned this pull request Apr 20, 2021

.zmetadata clarification #720

Closed

rabernat mentioned this pull request May 13, 2021

NCZarr - Netcdf Support for Zarr zarr-developers/zarr-specs#41

Open
