
Initialise zarr metadata without computing dask graph #6084

Open
dougiesquire opened this issue Dec 17, 2021 · 6 comments
Labels
topic-backends topic-zarr Related to zarr storage library

Comments

@dougiesquire

Is your feature request related to a problem? Please describe.
On writing large zarr stores, the xarray docs recommend first creating an initial Zarr store without writing all of its array data. The recommended approach is to first create a dummy dask-backed Dataset, and then call to_zarr with compute=False to write only metadata to Zarr. This works great.

It seems that in one common use case for this approach (including the example in the above docs), the entire dataset to be written to Zarr is already represented in a Dataset (let's call this ds). Thus, rather than creating a dummy Dataset with exactly the same metadata as ds, it is more convenient to initialise the Zarr store with ds.to_zarr(..., compute=False). See for example:

https://discourse.pangeo.io/t/many-netcdf-to-single-zarr-store-using-concurrent-futures/2029
https://discourse.pangeo.io/t/map-blocks-and-to-zarr-region/2019
https://discourse.pangeo.io/t/netcdf-to-zarr-best-practices/1119/12
https://discourse.pangeo.io/t/best-practice-for-memory-management-to-iteratively-write-a-large-dataset-with-xarray/1989

However, calling to_zarr with compute=False still constructs the full dask graph for writing the Zarr store. That graph is never executed in this use case, but building it can take a very long time for large datasets.

Describe the solution you'd like
Is there scope to add an option to to_zarr to initialise the store without computing the dask graph? Or perhaps an initialise_zarr method would be cleaner?

@TomNicholas TomNicholas added the topic-zarr Related to zarr storage library label Dec 19, 2021
@shoyer
Member

shoyer commented Dec 21, 2021

The challenge is that Xarray needs some way to represent the "schema" for the desired entire dataset. I'm very open to alternatives, but so far, the most convenient way to do this has been to load Dask arrays into an xarray.Dataset.

It's worth noting that any dask arrays with the desired chunking scheme will do -- you don't need to use the same dask arrays that you want to compute. When I do this sort of thing, I will often use xarray.zeros_like() to create low overhead versions of dask arrays, e.g., in this example from Xarray-Beam:
https://github.com/google/xarray-beam/blob/0.2.0/examples/era5_climatology.py#L61-L68

@dcherian
Contributor

What metadata is being determined by computing the whole array?

@dougiesquire
Author

Thanks @shoyer. I understand the need for the schema, but is there a need to actually generate the dask graph when all the user wants to do is initialise an empty zarr store? E.g., I think skipping this line would save some of the users in my original post a lot of time.

Regardless, your suggestion to just create a low-overhead version of the array being initialised is probably better/cleaner than adding a specific option or method. Would it be worth adding the xarray.zeros_like(ds) recommendation to the docs?

@shoyer
Member

shoyer commented Jan 12, 2022

E.g., I think skipping this line would save some of the users in my original post a lot of time.

I don't think that line adds any measurable overhead. It's just telling dask to delay computation of a single function.

For sure this would be worth elaborating on in the Xarray docs! I wrote a little bit about this in the docs for Xarray-Beam: see "One recommended pattern" in https://xarray-beam.readthedocs.io/en/latest/read-write.html#writing-data-to-zarr

@sehoffmann

sehoffmann commented Apr 3, 2024

As a user, I find this topic very unclear and would hope that there will be a clear and concise way to do this in the future.
In essence, I should just be expected to pass in my coords, dims, shapes, and chunks and let xarray handle the rest, similar to how the xr.Dataset or xr.DataArray constructor already needs this meta information from me. This could also be an extra function such as xr.create_zarr(shapes, dims, coords, chunks), for instance.

In general, incrementally writing a zarr array with xarray seems to be very convoluted in my opinion, especially if compared with the actual python zarr API.

@dougiesquire

Regardless, your suggestion to just create a low-overhead version of the array being initialised is probably better/cleaner than adding a specific option or method. Would it be worth adding the xarray.zeros_like(ds) recommendation to the docs?

zeros_like() allocates memory, doesn't it? Same with empty_like. If you are in the business of incrementally writing data to disk, it's usually because memory is no longer big enough. So this is unfortunately not an option for me.

@max-sixty
Collaborator

FYI I think #8460 should solve most of this. Or would anything remain?

zeros_like() allocates memory, doesn't it? Same with empty_like. If you are in the business of incrementally writing data to disk, it's usually because memory is no longer big enough. So this is unfortunately not an option for me.

If we create the array with chunks, then it doesn't allocate memory! There's more context in the linked PR / some links from there...
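To illustrate this point with a hypothetical sketch (names and sizes invented): building the template on top of lazy dask zeros records only a small task graph, so no element buffers are allocated regardless of the array's nominal size.

```python
import dask.array as da
import xarray as xr

# da.zeros is lazy, so this "1000 x 1000" float64 array costs only a
# handful of graph nodes, not ~8 MB of memory (and the same holds at
# terabyte scale).
template = xr.Dataset(
    {"big": (("time", "x"), da.zeros((1_000, 1_000), chunks=(100, 1_000)))}
)

# The underlying data is a dask array, not a materialised numpy buffer;
# template.to_zarr(path, mode="w", compute=False) would then write
# metadata only.
print(type(template["big"].data))
```

The same applies to `xr.zeros_like(ds)` when `ds` is itself dask-backed: the result stays lazy.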
